Event ID: 5014, 5004 The DFS Replication Service is stopping communication with partner / Error 1726 (The remote procedure call failed.)

Question

Wednesday, October 15, 2014 5:39 PM

I'm replicating between two servers in two sites (Server A: Server 2012 R2 Standard, Server B: Server 2008 R2) over a VPN (SonicWall firewall).  Although the initial replication seems to be happening, it is very slow (the folder in question is less than 3 GB).  I'm seeing these events in the Event Viewer every few minutes:

The DFS Replication service is stopping communication with partner PPIFTC for replication group FTC due to an error. The service will retry the connection periodically.

Additional Information:

Error: 1726 (The remote procedure call failed.)

and then....

The DFS Replication service successfully established an inbound connection with partner PPIFTC for replication group FTC.

Here are all my troubleshooting steps (keep in mind that our VPN goes through a SonicWall, on which I increased the TCP timeout to 24 hours):

-Increased TCP Timeout to 24 hours 

-Added the following values on both sending and receiving members and rebooted server

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Value = DisableTaskOffload, Type = DWORD, Data = 1

Value = EnableTCPChimney, Type = DWORD, Data = 0

Value = EnableTCPA, Type = DWORD, Data = 0

Value = EnableRSS, Type = DWORD, Data = 0

more troubleshooting

-Disabled AntiVirus on both members

-Made sure DFSR TCP ports 135 & 5722 are open

-Installed all hotfixes for 2008 R2 (http://support.microsoft.com/kb/968429) and rebooted

-Ran NETSTAT -ANOBP TCP; the results for the DFS Replication executable (DFSRs.exe) are listed below:

Sending Member:

[DFSRs.exe]

  TCP    10.x.x.x:53            0.0.0.0:0              LISTENING       1692

[DFSRs.exe]

  TCP    10.x.x.x:54669         10.x.x.x:5722          TIME_WAIT       0

  TCP    10.x.x.x:54673         10.x.x.x:5722          ESTABLISHED     1656

 [DFSRs.exe]

  TCP    10.x.x.x:64773         10.x.x.x:389           ESTABLISHED     1692

[DFSRs.exe]

  TCP    10.x.x.x:64787         10.x.x.x:389           ESTABLISHED     1656

 [DFSRs.exe]

  TCP    10.x.x.x:64795         10.x.x.x:389           ESTABLISHED     2104

Receiving Member:

[DFSRs.exe]

  TCP    10.x.x.x:56683         10.x.x.x:389           ESTABLISHED     7472

 [DFSRs.exe]

  TCP    10.x.x.x:57625         10.x.x.x:54886         ESTABLISHED     2808

[DFSRs.exe]

  TCP    10.x.x.x:61759         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61760         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61763         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61764         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61770         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61771         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61774         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61775         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61776         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61777         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61778         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61779         10.x.x.x:57625         TIME_WAIT       0

  TCP    10.x.x.x:61784         10.x.x.x:52757         ESTABLISHED     7472

[DFSRs.exe]

  TCP    10.x.x.x:63661         10.x.x.x:63781         ESTABLISHED     4880
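The port check mentioned above (TCP 135 and 5722) can be repeated from either member with a few lines of Python. This is only a minimal sketch: the helper names `tcp_port_open` and `check_ports` are made up for illustration, and the port list is a parameter because the receiving member's listing above shows connections to port 57625 rather than 5722, which may mean a different port is in use in that direction. A successful connect only shows that the port is reachable at that moment; it says nothing about long-lived RPC sessions being dropped by a firewall.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host: str, ports=(135, 5722)) -> dict:
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: tcp_port_open(host, port) for port in ports}


# Example call (the partner name is a placeholder, not taken from the thread):
# print(check_ports("partner.example.local"))
```

Run it from each member against the other, since a firewall rule may allow the connection in one direction only.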

more troubleshooting

-Increased Staging to 32GB

-Opened the ADSIedit.msc console to verify the "Authenticated Users" is set with the default READ permission on the following object:

a. The computer object of the DFS server

b. The DFSR-LocalSettings object under the DFS server computer object

-Ran ping 10.x.x.x -f -l 1472 and got replies back from both servers

-AD replication is successful on all partners

-Nslookup is working so DNS is working

-Updated NIC drivers on both servers

- I ran the following to set the Primary Member:

dfsradmin Membership Set /RGName:<replication group name> /RFName:<replicated folder name> /MemName:<primary member> /IsPrimary:True

Then Dfsrdiag Pollad /Member:<member name>
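For anyone wondering where the 1472 in the ping test above comes from: it is the largest ICMP payload that fits in a standard 1500-byte packet once the IPv4 and ICMP headers are added, and the -f flag forbids fragmentation. The sketch below just does that arithmetic. The 1400-byte value is a hypothetical tunnel MTU chosen for illustration, not something measured on this VPN.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def packet_size(payload: int) -> int:
    """Total IPv4 packet size produced by `ping -l <payload>`."""
    return payload + IPV4_HEADER + ICMP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -f -l` payload that fits in one packet of size mtu."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(packet_size(1472))       # 1500
print(max_ping_payload(1500))  # 1472
print(max_ping_payload(1400))  # 1372 (hypothetical tunnel MTU)
```

Since replies came back at 1472 bytes with fragmentation disabled, the path appears to carry full 1500-byte packets, which makes an MTU problem on the tunnel less likely as the cause.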

I'm seeing these errors in the dfsr logs:

20141014 19:28:17.746 9116 SRTR   957 [WARN] SERVER_EstablishSession Failed to establish a replicated folder session. connId:{45C8C309-4EDD-459A-A0BB-4C5FACD97D44} csId:{7AC7917F-F96F-411B-A4D8-6BB303B3C813} Error:
+ [Error:9051(0x235b) UpstreamTransport::EstablishSession upstreamtransport.cpp:808 9116 C The content set is not ready]
+ [Error:9051(0x235b) OutConnection::EstablishSession outconnection.cpp:532 9116 C The content set is not ready]
+ [Error:9051(0x235b) OutConnection::EstablishSession outconnection.cpp:471 9116 C The content set is not ready]
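When the debug logs are large, it can help to tally the bracketed error entries instead of reading them one by one. The following is a rough sketch that assumes every entry looks like the three lines above; the regular expression and the field names (`func`, `src`, `msg`) are my own guesses at the layout, not a documented log format.

```python
import re
from collections import Counter

# Matches entries such as:
# [Error:9051(0x235b) OutConnection::EstablishSession outconnection.cpp:532 9116 C The content set is not ready]
ERROR_RE = re.compile(
    r"\[Error:(?P<code>\d+)\(0x(?P<hex>[0-9a-fA-F]+)\)\s+"
    r"(?P<func>\S+)\s+(?P<src>\S+:\d+)\s+\d+\s+\S+\s+(?P<msg>[^\]]*)\]"
)


def parse_errors(text: str) -> list:
    """Return one dict per bracketed error entry found in text."""
    entries = []
    for m in ERROR_RE.finditer(text):
        code = int(m.group("code"))
        # Skip entries whose decimal and hex codes disagree (likely mis-parsed).
        if code != int(m.group("hex"), 16):
            continue
        entries.append({
            "code": code,
            "func": m.group("func"),
            "src": m.group("src"),
            "msg": m.group("msg").strip(),
        })
    return entries


def tally(text: str) -> Counter:
    """Count occurrences of each (error code, message) pair."""
    return Counter((e["code"], e["msg"]) for e in parse_errors(text))


SAMPLE = """\
+ [Error:9051(0x235b) UpstreamTransport::EstablishSession upstreamtransport.cpp:808 9116 C The content set is not ready]
+ [Error:9051(0x235b) OutConnection::EstablishSession outconnection.cpp:532 9116 C The content set is not ready]
+ [Error:9051(0x235b) OutConnection::EstablishSession outconnection.cpp:471 9116 C The content set is not ready]
"""

for (code, msg), count in tally(SAMPLE).items():
    print(f"{count} x error {code}: {msg}")
```

On the three lines quoted above this reports a single error, 9051 ("The content set is not ready"), seen three times, so the stack is one failure reported at three call sites rather than three separate problems.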

more troubleshooting

I've done a lot of research on the Internet and most of it is pointing to the same stuff I've tried.  Does anyone have any other suggestions?  Maybe I need to look somewhere else on the server side or firewall side? 

I tried replicating from a 2012 R2 server to another 2012 server and am getting the same events in the event log so maybe it's not a server issue. 

Some other things I'm wondering:

-Could it be the speed of the NICs?  Server A is a 2012 server with Hyper-V installed.  NIC teaming was set up initially, and because Hyper-V is installed the NIC is a "vEthernet (Microsoft Network Adapter Multiplexor Driver Virtual Switch)" running at 10.0 Gbps, whereas Server B is running a single NIC at 1.0 Gbps.

-Could occasional ping timeouts cause the issue?  From time to time I get a timeout, but not as often as the events I'm seeing.  I'm getting 53 ms pings.  The folder is only 3 GB, so it shouldn't take that long to replicate, but it's been days.  The replication schedule covers most of the day except for our backup window, which runs from 11pm to 5am.  For the rest of the time I have the bandwidth set anywhere from 64 Kbps to 4 Mbps.  Server A is on a 5 Mbps circuit and Server B is on a 10 Mbps circuit.
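A quick back-of-the-envelope calculation shows how much the bandwidth throttle alone could explain. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Kbps = 1000 bit/s), a folder of exactly 3 GB, an 18-hour daily replication window (24 hours minus the 11pm to 5am backup window), and no compression, protocol overhead or retries. All of those are simplifications, so treat the numbers as rough bounds.

```python
def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    """Seconds needed to move size_bytes at rate_bps (bits per second)."""
    return size_bytes * 8 / rate_bps


def calendar_days(seconds: float, window_hours_per_day: float = 18) -> float:
    """Days needed when replication only runs window_hours_per_day each day."""
    return seconds / (window_hours_per_day * 3600)


FOLDER_BYTES = 3 * 10**9  # assumed size: 3 GB, decimal

for label, rate in (("64 Kbps", 64_000), ("4 Mbps", 4_000_000)):
    secs = transfer_seconds(FOLDER_BYTES, rate)
    print(f"{label}: {secs / 3600:.1f} hours of transfer, "
          f"about {calendar_days(secs):.1f} calendar days at 18 h/day")
```

Under these assumptions the 64 Kbps setting needs roughly 104 hours of transfer time, close to six calendar days with the backup window excluded, while 4 Mbps needs under two hours. If most of the schedule sits near the low end of that range, "days" is about what the throttle alone would produce, independent of the 1726 errors.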

All replies (7)

Friday, October 17, 2014 9:34 AM

Hi,

I am trying to involve someone familiar with this topic to further look at this issue. There might be some time delay. Appreciate your patience.

Thank you for your understanding and support.

Regards.

Vivian Wang


Wednesday, October 22, 2014 2:58 PM

Hi,

Please try turning off the SNP (Scalable Networking Pack) features on both servers and check whether that helps:

netsh interface tcp set global autotuning=disabled 
netsh interface tcp set global chimney=disabled 
netsh interface tcp set global rss=disabled

Here are two related articles for your reference.
Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008
http://support.microsoft.com/kb/951037 
Update to turn off SNP features for Windows Server 2003 and Windows SBS 2003
http://support.microsoft.com/kb/948496 

Thanks.


Wednesday, October 22, 2014 4:57 PM

I've done this already:

netsh interface tcp set global autotuning=disabled 
netsh interface tcp set global chimney=disabled 
netsh interface tcp set global rss=disabled

What SNP feature are you referring to?  How do I turn it off?  Thanks for your suggestions. 


Thursday, October 30, 2014 3:29 AM

Hi,

SNP was built to improve the performance of the networking stack, and it does so for newer network adapters that support it. With older network adapters, however, it can cause significant delays that lead to timeouts and connection problems. The commands I gave you disable the SNP features.

http://blogs.msdn.com/b/pcreehan/archive/2007/10/16/ooh-snp.aspx

Thanks.


Thursday, October 30, 2014 12:06 PM

If you read my original post I already tried these registry keys and they did not work.  Thank you for the suggestion anyway. 


Friday, January 16, 2015 7:12 PM

I'm seeing the same errors. All servers are running 2008 R2 x64, across multiple sites, and the VPN is steady and reliable.

185 events from 12:28:21 to 12:49:25

Events are for all five servers (one per office, five total offices, no two in the same city, across three states).

Events are not limited to one replication group. I have quite a few replication groups, so I don't know for sure but I'm running under the reasonable assumption that none are spared.

Reminder from original post (and also, yes, same for me), the error is: Error: 1726 (The remote procedure call failed.)

Some way to figure out what code triggers an Event ID 5014, and what code therein specifies an Error 1726, would be extremely helpful. Trying random command line/registry changes on live servers is exceptionally unappealing.

Side note, 1726 is referenced here:

https://support.microsoft.com/kb/976442?wa=wsignin1.0

But it says, "This RPC connection problem may be caused by an unstable WAN connection." I don't believe this is the case for my system.

It also says...

For most RPC connection problems, the DFS Replication service will try to obtain the files again without logging a warning or an error in the DFS Replication log. You can capture the network trace to determine whether the cause of the problem is at the network layer. To examine the TCP ports that the DFS Replication service is using on replication partners, run the following command in a Command Prompt window:

NETSTAT -ANOBP TCP

This returns all open TCP connections. The ones of interest belong to "DFSRs.exe", but the command has no option to filter by executable.

Instead, I used the NETSTAT command as advertised, dumping output to info.txt:

NETSTAT -ANOBP TCP >> X:\info.txt

Then I opened the .TXT file from within Excel, which starts the text import wizard. I chose fixed-width fields based on the first row of each result, and then added a column:

=IF(A3="Can not", "Can not obtain ownership information", IF(LEFT(A3,1) = "[", A3&B3&C3, ""))

Dragging this down through the entire file let me see that column (column F) as the file name. Some anomalies were present, but none affected the DFSRs.exe results.

Finally, you can sort/filter (I sorted because I like being able to see everything, should I choose to) to get just the results you need, with the partial rows removed from the result set, or bumped to the end.

My server had 125 connections open.

That is a staggering number of connections to review, and I feel like I'm looking for a needle in a haystack.

I'll see if I can find anything useful out, but a better solution would be most wonderful.
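As an alternative to the Excel route described above, the saved netstat output can be grouped by owning executable with a short script. This is a sketch under one important assumption: that in NETSTAT -b output the bracketed executable name appears on a line after the connection it owns, and applies only to that one connection. If your output is laid out differently, the attribution logic needs to change. The sample text and its addresses are invented for illustration and are not taken from any server in this thread.

```python
import re
from collections import Counter, defaultdict

CONN_RE = re.compile(
    r"^\s*TCP\s+(?P<local>\S+)\s+(?P<remote>\S+)\s+"
    r"(?P<state>[A-Z_0-9]+)\s+(?P<pid>\d+)\s*$"
)
OWNER_RE = re.compile(r"^\s*\[(?P<exe>[^\]]+)\]\s*$")
NO_OWNER = "(no owner listed)"


def group_by_owner(text: str) -> dict:
    """Group TCP connection lines by the executable named after each one."""
    owners = defaultdict(list)
    last = None  # most recent connection still waiting for an owner line
    for line in text.splitlines():
        conn = CONN_RE.match(line)
        if conn:
            if last is not None:
                owners[NO_OWNER].append(last)  # previous line had no owner
            last = conn.groupdict()
            continue
        owner = OWNER_RE.match(line)
        if owner and last is not None:
            owners[owner.group("exe").lower()].append(last)
            last = None
        elif line.strip().lower().startswith("can not obtain") and last:
            owners["(unknown)"].append(last)
            last = None
        # Any other line (headers, service names) is ignored.
    if last is not None:
        owners[NO_OWNER].append(last)
    return dict(owners)


SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       884
  RpcSs
 [svchost.exe]
  TCP    10.0.0.5:54669         10.0.1.5:5722          TIME_WAIT       0
  TCP    10.0.0.5:54673         10.0.1.5:5722          ESTABLISHED     1656
 [DFSRs.exe]
  TCP    10.0.0.5:64787         10.0.0.2:389           ESTABLISHED     1656
 [DFSRs.exe]
  TCP    10.0.0.5:445           10.0.0.9:50123         ESTABLISHED     4
 Can not obtain ownership information
"""

groups = group_by_owner(SAMPLE)
for conn in groups.get("dfsrs.exe", []):
    print(conn["local"], "->", conn["remote"], conn["state"])
print(Counter(c["state"] for c in groups.get("dfsrs.exe", [])))
```

To use it on a real capture, read the file written by the redirect (for example the info.txt above) and pass its contents to `group_by_owner`. Note that TIME_WAIT entries with PID 0 end up under "(no owner listed)", so they have to be matched to DFSR by port rather than by name.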


Thursday, May 31, 2018 2:25 PM

Ever resolved?