Wednesday, April 2, 2008

Installation of SQL Server 2005 x64 / Windows 2003 in cluster on HP ProLiant DL380 G5 servers

We recently had to install a brand new 2-node cluster made of HP ProLiant DL380 G5 servers (2 x Intel Xeon X5450 3.0 GHz quad-core + 8 GB RAM). As there are many things to keep in mind before starting the installation (and a few problems too), I made a quick summary of things to remember in case I have to do it again some day (and in case it can help you, reader of this post).
  • Configure the basics of the cluster. SQL Server will need 2 important things: an MS DTC resource (don't forget to enable network DTC access in Windows Components/Application Server, cf. screenshot) and a resource group with the drive(s) you want to use with SQL Server. You don't need to create IP addresses and names; SQL Server setup will create them during the installation.

  • Configure security options of DTC, in Administrative Tools / Component Services. This is done on the cluster node having the DTC resource at the moment. In 'My Computer', select the 'MSDTC' tab and click on 'Security Configuration'. In the dialog box that appears, tick all option boxes.

  • Care should be taken when configuring the network card used for intra-cluster communication. An MSDN article explains everything clearly. Additional steps, suggested by Microsoft Support, include changing the priority of the cards so that the card used between the cluster nodes has a higher priority than the one connected to the network.

So far, this would be sufficient on a typical Windows Server 2003. In our case, there was an additional hurdle, due to the specific NIC in our machines, the NC373i. For no apparent reason, the setup would crash with the plain "There was an unexpected failure during the setup wizard. You may review the setup logs and/or click the help button for more information". And sometimes it would just crash silently, leaving behind a MiniDump I couldn't do anything with... The only helpful message in the log files was:
Failed to find property "ComputerList" {"SqlComputers", "", ""} in cache
Source File Name: datastore\clusterinfocollector.cpp
Compiler Timestamp: Fri Sep 16 13:20:12 2005
Function Name: ClusterInfoCollector::collectProperty
Source Line Number: 182
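If you want to check quickly whether your own setup logs contain the same signature, here is a minimal sketch in Python. It only looks for the "Failed to find property" line quoted above; the function name and the sample text are made up for illustration, and reading the actual log files (location, encoding) is left to you.

```python
import re

# Matches the error line quoted in the post and captures the property name.
SIGNATURE = re.compile(r'Failed to find property "(?P<prop>[^"]+)"')

def find_property_failures(log_text):
    """Return the property names from 'Failed to find property' lines."""
    return [match.group("prop") for match in SIGNATURE.finditer(log_text)]

# Sample text copied from the log excerpt in this post.
SAMPLE = r'''Failed to find property "ComputerList" {"SqlComputers", "", ""} in cache
Source File Name: datastore\clusterinfocollector.cpp
Function Name: ClusterInfoCollector::collectProperty'''

if __name__ == "__main__":
    print(find_property_failures(SAMPLE))
```

An empty list simply means the signature was not found in the text you passed in; it says nothing about other causes of a setup crash.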
Microsoft Support helped me on this one and led me to the source of the problem: the advanced networking features of Windows 2003 SP2, the Scalable Networking Pack. Used with compatible hardware (like our NC373i NIC), it can increase networking performance greatly. But in the present case, it can cause some clustering features to stop working. MS issued a patch to disable these features (accessible here: KB 948496), but this was not enough.
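
For reference, a minimal sketch of how one might check whether the Scalable Networking Pack features are switched off in the registry. It assumes (this is not stated in the post, so verify it against KB 948496) that the relevant values are EnableTCPChimney, EnableRSS and EnableTCPA under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, with 0 meaning disabled. The function names are hypothetical.

```python
import re

# Assumed location and value names; check them against KB 948496.
TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
SNP_VALUES = ("EnableTCPChimney", "EnableRSS", "EnableTCPA")

def reg_query_commands():
    """Build one 'reg query' command line per assumed SNP value."""
    return ['reg query "%s" /v %s' % (TCPIP_PARAMS, name) for name in SNP_VALUES]

def parse_reg_dword(output):
    """Return the DWORD found in 'reg query' output, or None if absent."""
    match = re.search(r"REG_DWORD\s+0x([0-9a-fA-F]+)", output)
    return int(match.group(1), 16) if match else None

if __name__ == "__main__":
    # Print the commands to run by hand on each node.
    for command in reg_query_commands():
        print(command)
    # Sample line in the shape 'reg query' prints for a DWORD value.
    print(parse_reg_dword("    EnableRSS    REG_DWORD    0x0"))
```

The sketch only builds the command lines and parses their output; it does not touch the registry itself. A result of None just means no DWORD was found in the text passed in.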

To disable these options at the NIC level, you need to use the HP Network Configuration Utility (cf. screenshot below), and disable all options with 'offload' at the end of their name. I had also disabled RSS (Receive-Side Scaling) as per MS Support recommendations, but I see that after several updates and subsequent reboots, the option has magically been re-enabled, without negative effects, so I left it enabled. I only changed these settings on the private card (used for intra-cluster communication); the card connected to the network was left alone.

We lost around 4 weeks on this issue, not knowing where to look. We hope this post will make you lose less time on this problem ;-)

Update: when we finally got a scheduled maintenance day, we launched the setup to install the new cluster in production, and guess what, same error!! Something had happened since then... But it was certainly related to the TCP Offload Engine and Receive Side Scaling stuff, so I Googled a bit and found this article:

What worked for us this time is: reset everything to default in the HP Network Configuration Utility (even TOE and RSS), but disable these features using the NETSH command:

Netsh int ip set chimney DISABLED

After a reboot, setup performed its duty as expected. What a mess...
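
Since the command has to be run on every node, here is a minimal sketch of a small Python wrapper around it. It only wraps the netsh command quoted above; the function name and the dry-run default are assumptions for illustration, and the reboot still has to be done by hand.

```python
import subprocess

# The command quoted in the post, split into an argument list.
CHIMNEY_OFF = ["netsh", "int", "ip", "set", "chimney", "DISABLED"]

def disable_chimney(dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    command_line = " ".join(CHIMNEY_OFF)
    if not dry_run:
        # check_call raises CalledProcessError on a non-zero exit code.
        subprocess.check_call(CHIMNEY_OFF)
    return command_line

if __name__ == "__main__":
    # Dry run: just show what would be executed on this node.
    print(disable_chimney(dry_run=True))
```

Call disable_chimney(dry_run=False) from an administrative session on each node to actually apply the setting, then reboot that node.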


Rob said...

Did you make the NIC changes to all nodes of the cluster?

Frédéric Mauroy said...

Yes, I did it on each node, and for all NICs, as this command is global. I also verified that all NICs had their settings back to their default values in the HP utility.

pestsmitta said...

Excellent entry, disabling offload functionality and RSS did the trick after two weeks of frustration on the same hardware with pretty much the same error messages.