Slow user logons

We experienced a problem this weekend with one of our domain controllers taking longer than usual to boot and log in. After some quick troubleshooting I was more than a bit confused as to why this was happening: there were virtually no errors in the logs, only a couple of warnings which at the time did not make much sense. Since this was a virtual DC, and I had spent quite some time looking over logs and running the usual tests (dcdiag, nltest, etc.) with no errors, I decided, what the heck, let's demote it, let it sit for an hour, and then promote it again.

Nice idea, but that is when the first really odd error came up: the DC reported that it could not talk to the other DC to offload its info. For some reason I can't recall now, we ended up rebooting the other DC, and, even odder, I then had no problems demoting the affected DC or promoting it again some time later. It also replicated fine to all 15 of my other DCs. Everything seemed to be working, with still no errors and no visible issues. It was called good and chalked up to moon spots, black cats jumping on the server, or who knows what.

On Monday everything still seemed OK. Then on Tuesday several users complained that it was taking 10 to 30 minutes to log in. This only seemed to happen when they got one specific DC in the site, the same one we had issues with before. The logon process was hanging at "Applying your personal settings".

I noticed this warning showing up in the logs:

Log Name:      System
Source:        LsaSrv
Date:          9/23/2011 12:45:24 AM
Event ID:      40960
Task Category: None
Level:         Warning
Keywords:     
User:          SYSTEM
Computer:     DC1-2
Description:
The Security System detected an authentication error for the server LDAP/DC1.MYDOMAIN.local/MYDOMAIN. The failure code from authentication protocol Kerberos was "No authority could be contacted for authentication.
(0x80090311)".

Also, on a client that got the slow DC, I found that running gpupdate /force took a long time to come back and then resulted in the message below.

H:\>gpupdate /force
Refreshing Policy...

User Policy Refresh has not completed in the expected time. Exiting...
User Policy Refresh has completed.
Computer Policy Refresh has completed

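Next I tried pinging the server with a packet size of 1472 and the don't-fragment flag set (ping <servername> -f -l 1472). A 1472-byte payload plus 28 bytes of IP and ICMP headers makes a full 1500-byte packet, the standard Ethernet MTU. This failed with a request timed out. From the same client I was able to ping with a packet size of 1450.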
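
The relationship between the ping payload size and the MTU is simple arithmetic, and the largest payload that gets through can be found with a binary search instead of guessing. The following is only a sketch: the helper names are made up for illustration, and `probe` is a hypothetical callback that you would implement yourself (for example by running `ping -f -l <size>` and checking whether a reply came back). It assumes IPv4 with a 20-byte IP header and an 8-byte ICMP header, and that any payload smaller than a working one also works.

```python
from typing import Callable

IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo header


def payload_to_mtu(payload: int) -> int:
    """Total packet size on the wire for a given ping payload size."""
    return payload + IP_HEADER + ICMP_HEADER


def mtu_to_payload(mtu: int) -> int:
    """Largest ping payload that fits in a given MTU without fragmenting."""
    return mtu - IP_HEADER - ICMP_HEADER


def largest_working_payload(probe: Callable[[int], bool],
                            low: int = 0, high: int = 1472) -> int:
    """Binary search for the largest payload size for which probe() is True.

    probe is a hypothetical callback: it should send one don't-fragment
    ping of the given size and return True if a reply was received.
    """
    if not probe(low):
        raise ValueError("even the smallest payload failed; check connectivity")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid works, so the answer is mid or larger
        else:
            high = mid - 1   # mid fails, so the answer is smaller
    return low
```

With the numbers from this post, `payload_to_mtu(1472)` is 1500 and `payload_to_mtu(1450)` is 1478, so a working 1450-byte ping means the path carried packets of at least 1478 bytes.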

While troubleshooting, I had also tried to RDP from the DC with the long logon times to the other DC in the same site, the one that was working fine. When I did this it would connect, show a black screen, and then give the following error.

Remote Desktop Disconnected. Your Remote Desktop session has ended. The connection to the remote computer was lost, possibly due to network connectivity problems. Try connecting to the remote computer again. If the problem continues, contact your network administrator or technical support. 

So we were apparently having a network problem: full-size packets (a 1472-byte ping payload) could not get through to this DC, while slightly smaller ones could.

So on the 2 DCs in this site we changed the MTU size in the registry under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<AdapterID>

Create a new DWORD value called MTU and set its decimal value to the MTU size you want. I used 1450 since that size worked in the test ping.
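
If you would rather script the change than click through regedit, the same value can be written with the built-in `reg add` command. The sketch below only builds the command line. The function name and the 576 to 1500 sanity range are my own assumptions for illustration, and the adapter ID is a placeholder you have to replace with the real interface GUID from your own registry.

```python
from typing import List

# Registry path from this post; the adapter ID is appended per interface.
INTERFACES_KEY = (
    "HKLM\\SYSTEM\\CurrentControlSet\\Services\\Tcpip\\Parameters\\Interfaces"
)


def build_mtu_reg_command(adapter_id: str, mtu: int) -> List[str]:
    """Build a 'reg add' argument list that sets the MTU DWORD for one adapter.

    adapter_id is the interface's key name under Interfaces (a GUID in braces).
    The range check is an illustrative safety net, not an official limit.
    """
    if not 576 <= mtu <= 1500:
        raise ValueError("MTU %d is outside the assumed 576-1500 range" % mtu)
    key = INTERFACES_KEY + "\\" + adapter_id
    return ["reg", "add", key,
            "/v", "MTU",          # value name
            "/t", "REG_DWORD",    # value type
            "/d", str(mtu),       # value data, decimal
            "/f"]                 # overwrite without prompting
```

On the DC itself, the resulting list could be passed to `subprocess.run(...)` from an elevated prompt. Nothing in the sketch touches the registry on its own.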

We rebooted both DCs and the client machines, and I am now able to log in at normal speed against both DCs and to run gpupdate from the clients.

Obviously I have glossed over a ton of troubleshooting steps we took. It actually took quite some time to nail down what was going on, which is why I felt this was an important article to share. I have included what I feel are the important details.

3 comments:

  1. Well, this post put me in the right direction after a lot of similar aggravation, but the solution was the reverse of yours. In my case I had a little site-to-site VPN to connect a branch office to an employee's home office. A ping -f in one direction (home office to branch) maxed out at about 1470 bytes even though the routers at both ends of the tunnel capped MTU at 1420. Ping from branch to home was 1392 as expected (1420 MTU - 28 bytes for packet overhead). I wound up setting the MTU right on the client's home PC network interface (as opposed to the DC in your case) to 1420. That fixed all of the slow / intermittent network logons and drive mapping issues.

  2. Thank you for this post. I had the same trouble with my IPsec connection (HQ office to Azure), and I resolved it by changing the MTU to 1350 on my router (Firewall/Mangle rules).

  3. You saved my life with this article!
