- Configure the basics of the cluster. SQL will need 2 important things: an MS DTC resource (don't forget to enable network DTC access in Windows Components/Application Server, see screenshot) and a resource group with the drive(s) you'll want to use with SQL. You don't need to create IP addresses and names; SQL will create them during the installation.
- Configure security options of DTC, in Administrative Tools / Component Services. This is done on the cluster node having the DTC resource at the moment. In 'My Computer', select the 'MSDTC' tab and click on 'Security Configuration'. In the dialog box that appears, tick all option boxes.
- Care should be taken when configuring the network card used for intra-cluster communication. A Microsoft Knowledge Base article explains everything clearly: http://support.microsoft.com/kb/258750. Additional steps, suggested by Microsoft Support, include changing the priority of the cards so that the card used between the cluster nodes has a higher priority than the one connected to the network.
So far, this would be sufficient on a typical Windows Server 2003 installation. In our case, there was an additional hurdle because of the specific NIC in our machines, the NC373i. For no apparent reason, the setup would crash with the plain "There was an unexpected failure during the setup wizard. You may review the setup logs and/or click the help button for more information". And sometimes it would just crash silently, leaving behind a MiniDump I couldn't do anything with... The only helpful message in the log files was:
Failed to find property "ComputerList" {"SqlComputers", "", ""} in cache
Source File Name: datastore\clusterinfocollector.cpp
Compiler Timestamp: Fri Sep 16 13:20:12 2005
Function Name: ClusterInfoCollector::collectProperty
Source Line Number: 182
Microsoft Support helped me on this one and led me to the source of the problem: the advanced networking features of Windows 2003 SP2, the Scalable Networking Pack. Used with compatible hardware (like our NC373i NIC), it can increase networking performance greatly. But in the present case, it can cause some clustering features to stop working. MS issued a patch to disable these features (accessible here: KB 948496), but this was not enough.
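If you are not sure whether you are hitting the same failure, you can search the setup logs for the error message quoted above. The following is only a rough sketch in Python, not something we used at the time: the function name and the idea of scanning every *.log file in a folder are my own assumptions, and you have to pass the folder where your setup logs actually live. It also assumes the logs can be read as plain ASCII/UTF-8 text.

```python
import re
import sys
from pathlib import Path

# Matches the error quoted in this post and captures the property name.
ERROR_PATTERN = re.compile(r'Failed to find property "(?P<prop>[^"]+)"')


def find_property_errors(log_dir):
    """Return (file name, line number, property name) for each matching line.

    Hypothetical helper: scans every *.log file directly inside log_dir.
    """
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # Assumes ASCII/UTF-8 text; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            match = ERROR_PATTERN.search(line)
            if match:
                hits.append((path.name, number, match.group("prop")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_errors.py <folder containing the setup logs>
    for name, number, prop in find_property_errors(sys.argv[1]):
        print(f"{name}:{number}: missing property {prop}")
```

A hit on the "ComputerList" property is only a hint that you may be in the same situation; it does not prove that the network offload features are the cause.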
To disable these options at the NIC level, you need to use the HP Network Configuration Utility (see screenshot below) and disable all options with 'offload' at the end of their name. I had also disabled RSS (Receive Side Scaling) as per MS Support's recommendations, but I see that after several updates and subsequent reboots, the option has magically been re-enabled, without negative effects. I left it activated. I only changed these settings on the private card (used for intra-cluster communication); the card connected to the network was left alone.
We lost around 4 weeks on this issue, not knowing where to look. We hope this post will make you lose less time on this problem ;-)
Update: when we finally got a scheduled maintenance day, we launched the setup to install the new cluster in production, and guess what, same error!! Something had changed since then... But it was certainly related to the TCP Offload Engine and Receive Side Scaling stuff, so I Googled a bit and found this article: http://forums12.itrc.hp.com/service/forums/bizsupport/questionanswer.do?admit=109447627+1209987026791+28353475&threadId=1153566
What worked for us this time is: reset everything to default in the HP Network Configuration Utility (even TOE and RSS), but disable these features using the NETSH command:
Netsh int ip set chimney DISABLED
After a reboot, setup performed its duty as expected. What a mess...
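Since the command has to be run on every node, it can help to wrap it in a small script so that exactly the same thing is executed everywhere. This is just a sketch in Python under my own assumptions: the function name and the dry-run switch are made up for illustration, and the only real command in it is the NETSH line quoted above. It does not reboot the machine, so you still have to do that yourself.

```python
import subprocess
import sys

# The exact command quoted in this post, split into an argument list.
CHIMNEY_DISABLE = ["netsh", "int", "ip", "set", "chimney", "DISABLED"]


def disable_chimney(dry_run=True):
    """Print the command; run it only when dry_run is False and on Windows.

    Hypothetical helper: returns None when nothing was executed,
    otherwise the exit code of netsh.
    """
    print(" ".join(CHIMNEY_DISABLE))
    if dry_run or sys.platform != "win32":
        return None
    # check=True raises CalledProcessError if netsh reports a failure.
    return subprocess.run(CHIMNEY_DISABLE, check=True).returncode


if __name__ == "__main__":
    # Pass --apply to really run the command (needs an administrator prompt).
    disable_chimney(dry_run="--apply" not in sys.argv)
```

By default the script only prints what it would do, which is the safer choice on a production cluster node.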
3 comments:
Did you make the NIC changes to all nodes of the cluster?
Yes, I did it on each node, and it applies to all NICs, as this command is global. I also verified in the HP Utility that all NICs had their settings back to their default values.
Excellent entry, disabling offload functionality and RSS did the trick after two weeks of frustration on the same hardware with pretty much the same error messages.