So recently the company I work for decided to rebuild a large SCCM (or ConfigMgr for those of you in MS land) primary site onto new hardware. To give you a brief glimpse into the site configuration, this site has ~20,000+ clients assigned to it and the server houses the management point, software update point, reporting point, and distribution point. There are also quite a few secondary sites below it.
We did the recovery on a Friday, and early Saturday morning we started to see the following in our event log:
The server was unable to find a free connection 148 times in the last 60 seconds. This indicates a spike in network traffic. If this is happening frequently, you should consider increasing the minimum number of free connections to add headroom. To do that, modify the MinFreeConnections and MaxFreeConnections for the LanmanServer in the registry.
This basically killed all file shares on the box (the Server service had died). With the despoolr.box\receive share dead, our central site couldn't send down to this newly upgraded child primary. Restarting the server resolved the issue, but only for about 10 hours before it started happening again.
We spent most of Monday on the phone with Microsoft diagnosing the issue, but with no real fix. We made some modifications under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters, adjusting or adding the MaxFreeConnections, MinFreeConnections, MaxWorkItems, MaxRawWorkItems, InitWorkItems, and MaxMpxCt values. I won't list what we changed these to, as they didn't solve the issue, but suffice it to say we were able to run for about 14 hours before the shares died again.
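If you want to see what your own server currently has for these values before touching anything, a small read-only script can dump them. This is only a sketch: it assumes Python is available on the server, the function name `read_lanman_parameters` is made up for illustration, and it deliberately writes nothing, since none of our changes here ended up fixing the problem.

```python
import sys

# Registry location and value names as listed in the post. Read-only.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
PARAM_NAMES = [
    "MaxFreeConnections",
    "MinFreeConnections",
    "MaxWorkItems",
    "MaxRawWorkItems",
    "InitWorkItems",
    "MaxMpxCt",
]


def read_lanman_parameters(names=PARAM_NAMES):
    """Return {name: value or None} for each LanmanServer parameter.

    None means the value is not set explicitly in the registry.
    On non-Windows systems an empty dict is returned.
    """
    if not sys.platform.startswith("win"):
        return {}

    import winreg  # standard library, Windows only

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        for name in names:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # value not present under the key
            result[name] = value
    return result


if __name__ == "__main__":
    for name, value in read_lanman_parameters().items():
        print(f"{name} = {value if value is not None else '(not set)'}")
```

Capturing the output before and after a change makes it easier to roll back if a tweak doesn't help.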
In doing my research, I found that Symantec could be the source of the problem, as referenced here and here. However, the articles listed refer to Symantec Endpoint Protection, which doesn't exist in our environment. We do use Symantec, just not the Endpoint client.
So in another round of research I found that some other people were disabling Symantec. While we haven't decided to take that step just yet, we've made some modifications to the way we deal with Symantec and its Auto-Protect scanning settings. Once testing is complete, I'll post exactly what we did in hopes of helping someone else who has this same problem. Stay tuned.
Update: So it appears we may have found the solution to our problem. It's a two-fold issue. The first was the shares being killed by Symantec. Yes, Symantec was killing our shares. If you're using a version of Symantec earlier than 10.2.3.3000, you'll want to update to maintenance release 3 (MR3). Doing this update should fix the issues with Auto-Protect making network shares unresponsive.
Our other issue wasn't as easy to figure out, at least initially. What we were experiencing was 100% CPU utilization (across 16 cores, mind you). Multiple instances of the IIS worker process, w3wp.exe, were running, and together they were eating all of our CPU just after reboot. I found this odd, especially in the evening, because most of our users were out of the office and most of the machines should have been asleep or off. So initially I assumed this was not a load issue but a configuration issue (i.e. maybe IIS 7 needed some additional configuration).
Long story short, after getting MS on the phone and talking with performance engineers and such, we noticed in policyagent.log that there was an errant policy body on many of the machines we had at our disposal that were assigned to this newly rebuilt site. Because processing of this policy was failing, every client (~20,000) was trying to re-download it every 15 minutes! Apparently 4 quad-core processors (sorry, I don't have the processor models handy right now; 4 quad-core AMD Opterons running at 2.3 GHz, I believe) will melt when IIS is processing MP requests for policy downloads every 15 minutes.
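To put a rough number on that, here is a back-of-the-envelope calculation. It assumes exactly one request per client per retry cycle, spread evenly over the interval, which almost certainly understates the real load since a policy download can involve more than one request to the MP. The 60-minute figure used for comparison is an assumption (a typical policy polling interval), not something we measured.

```python
def requests_per_second(clients, interval_minutes):
    """Average request rate if every client makes one request per interval."""
    return clients / (interval_minutes * 60)


CLIENTS = 20_000          # approximate client count from the post
RETRY_MINUTES = 15        # retry interval we observed
ASSUMED_POLL_MINUTES = 60 # assumed normal polling interval, for comparison

retry_rate = requests_per_second(CLIENTS, RETRY_MINUTES)
normal_rate = requests_per_second(CLIENTS, ASSUMED_POLL_MINUTES)
retries_per_hour = CLIENTS * (60 // RETRY_MINUTES)

print(f"Retry storm:      {retry_rate:.1f} requests/sec")   # 22.2
print(f"Assumed baseline: {normal_rate:.1f} requests/sec")  # 5.6
print(f"Retries per hour: {retries_per_hour}")              # 80000
```

Even under these generous assumptions, that is a constant stream of failing policy requests around the clock, regardless of whether users are in the office.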
Looking closer at the policy body, we were able to determine via a lookup in the DB that the policy referred to our network access account, which was causing the snafu. By changing the password on that account, we got the machines to download the newly updated policy, and all was well. At least I hope...
More to come...hopefully not :)