Cluster Server FAQ: Overview of Microsoft Cluster Server
Last updated on May 4, 1999

Cluster Overview


Cluster Basics
Intro to Microsoft Cluster Server
High Availability
Manageability
Scalability
Application and Service Support
Microsoft Cluster Server and Windows NT Load Balancing Service

Cluster Basics


What is a server "cluster"?
A server cluster is a group of independent servers managed as a single system for higher availability, easier manageability, and greater scalability.

What does it take to create a server cluster?
The minimum requirements for a server cluster are (a) two servers connected by a network, (b) a method for each server to access the other's disk data, and (c) special cluster software like Microsoft® Cluster Server (MSCS). The special software provides services such as failure detection, recovery, and the ability to manage the servers as a single system.

What are the benefits of server clustering?
There are three primary benefits to server clustering: improved availability, easier manageability, and more cost-effective scalability. Using Microsoft Cluster Server as an example:

What are clusters used for?
Customer surveys indicate that MSCS clusters will be used as highly available multipurpose platforms, mirroring the current uses of the Microsoft Windows NT® Server operating system. Surveyed customers suggested that the most common uses of MSCS clusters will be mission-critical database management, file/intranet data sharing, messaging, and general business applications.

When a cluster is recovering from a server failure, how does the surviving server get access to the failed server's disk data?
There are basically three techniques that clusters use to make disk data available to more than one server:

Intro to Microsoft Cluster Server


What is "Wolfpack"?
"Wolfpack" was the code name for Microsoft Cluster Server.

What is Microsoft Cluster Server (MSCS)?
MSCS is a built-in feature of Windows NT Server, Enterprise Edition. It is software that supports the connection of two servers into a "cluster" for higher availability and easier manageability of data and applications. MSCS can automatically detect and recover from server or application failures. It can be used to move server workload to balance utilization and to provide for planned maintenance without downtime. And, over time, MSCS will also become a platform for highly scalable, cluster-aware applications.

How many servers can be in an MSCS cluster?
The initial release of MSCS supports clusters with two servers. A future version referred to as MSCS "Phase 2" will support larger clusters, and will include enhanced services to simplify the creation of highly scalable, cluster-aware applications.

When will MSCS be available?
MSCS is shipping as part of Windows NT Server 4.0, Enterprise Edition.
Significant enhancements to Windows NT Server, Enterprise Edition are planned for Windows 2000 Server Enterprise Edition including the following key improvements:

Most of these features are visible and supported in Beta 2 of Windows 2000 Server Enterprise Edition, which shipped in August 1998.

What other companies were involved in the development of MSCS?
Microsoft worked closely with leading hardware vendors, software vendors, and customers in the specification and development of MSCS and its API. These other companies participated through five different programs:

In what languages will MSCS be available?
Microsoft Windows NT Server, Enterprise Edition 4.0, which included MSCS 1.0, is available in English, French, German, Japanese, and Spanish.

Through what channels is Windows NT Server, Enterprise Edition available?
Microsoft Windows NT Server, Enterprise Edition is available to customers through all standard channels: reseller, retail, OEM, and the Microsoft Select licensing program.

What versions of Windows NT Server does MSCS support?
MSCS software is only available as a built-in feature of Windows NT Server 4.0, Enterprise Edition.

Will MSCS be extended beyond Windows NT Server to Windows NT Workstation?
There is currently no plan to extend cluster support to Windows NT Workstation. MSCS software has been designed and written to closely integrate with the architecture and features of Windows NT Server, including its server-oriented networking and directory services capabilities.

What clients can connect to an MSCS cluster?
Any client that can connect to Windows NT Server through TCP/IP will work with MSCS. This includes Microsoft MS-DOS®, Microsoft Windows® 3.x, Windows 95, Windows NT, Apple Macintosh, and UNIX. MSCS does not require any special software on the client for transparent recovery of services that connect to clients through standard IP protocols.

High Availability


How does MSCS provide high availability?
MSCS uses software "heartbeats" to detect failed applications or servers. In the event of a server failure, it employs a "shared nothing" clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this—from detection to restart—typically takes under a minute. If an individual application fails (but the server does not), MSCS will typically try to restart the application on the same server; if that fails, it moves the application's resources and restarts it on the other server. The cluster administrator can use a graphical console to set various recovery policies, such as dependencies between applications, whether or not to restart an application on the same server, and whether or not to automatically "failback" (rebalance) workloads when a failed server comes back online.
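
The detection and recovery sequence described above can be sketched roughly as follows. This is an illustrative sketch only, not MSCS code: the `Node` class, the five-second timeout, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time the last heartbeat was received, in seconds
    groups: list = field(default_factory=list)


def is_alive(node, now, timeout=5.0):
    """A node counts as alive if a heartbeat arrived within the timeout (assumed value)."""
    return now - node.last_heartbeat <= timeout


def fail_over(failed, survivor):
    """Transfer ownership of every group from the failed node to the survivor."""
    moved = list(failed.groups)
    survivor.groups.extend(moved)
    failed.groups.clear()
    return moved


def check_cluster(a, b, now):
    """Run one detection pass over a two-node cluster; return the groups moved, if any."""
    if is_alive(a, now) and not is_alive(b, now):
        return fail_over(b, a)
    if is_alive(b, now) and not is_alive(a, now):
        return fail_over(a, b)
    return []


def recover_application(restart_locally, max_local_restarts=1):
    """Try restarting on the same server first; fall back to the other server."""
    for _ in range(max_local_restarts):
        if restart_locally():
            return "restarted-on-same-server"
    return "fail-over-to-other-server"


node_a = Node("A", last_heartbeat=100.0, groups=["sql-group"])
node_b = Node("B", last_heartbeat=92.0, groups=["file-share-group"])
moved = check_cluster(node_a, node_b, now=101.0)  # B has been silent for 9 seconds
```

In this run, Server B has missed its heartbeat window, so its group is handed to Server A. The `recover_application` helper mirrors the "restart locally, then move" policy for a single failed application.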

Can MSCS provide "zero downtime"?
No. MSCS can dramatically reduce planned and unplanned downtime. However, even with MSCS, a server could still experience downtime from the following events:

Microsoft recommends that clusters be used as one element in customers' overall programs to provide high integrity and high availability for their mission-critical server-based data and applications.

Is MSCS failover transparent to users?
MSCS does not require any special software on client computers, so the user experience during failover depends on the nature of the client side of their client-server application. Client reconnection is often transparent, because MSCS has restarted the applications, file shares, and so on, at exactly the same IP address.

If a client is using "stateless" connections such as a standard browser connection, then it would be unaware of a failover if it occurred between server requests. If a failure occurs while a client is connected to the failed resource, then the client will receive whatever standard notification is provided by the client side of the application in use when the server side becomes unavailable. This might be, for example, the standard "Abort, Retry, or Cancel?" prompt you get when using Windows Explorer to download a file at the time a server or network goes down. In this case, client reconnection is not automatic (the user must choose "Retry"), but the user is fully informed of what's happening and has a simple, well-understood method of reestablishing contact with the server. Of course, in the meantime, MSCS is busily restarting the service or application so that, when the user chooses "Retry," it reappears as if it never went away.

For client-side applications that have "stateful" connections to the server, a new logon is typically required following a server failure. In many cases, this approach is required for security purposes. For example, this is how SAP R/3 works: if the server connection is lost, the user is prompted to log on again to make sure it's the same user accessing the application.

Even with stateful connections, it's possible for an application to automatically reconnect following a failover. For example, when Microsoft demonstrated SAP R/3 failover at Microsoft Scalability Day in New York City on May 20, it was accessed through an Active browser application that had automatically (and securely) cached the user's ID and password from the initial logon. Thus, when the server connection was momentarily lost during the failover demo, the client application automatically logged on again using the cached ID and password. This was done using standard IP connections, running a simple Microsoft Visual Basic® development system program within an HTML document through the Microsoft ActiveX® technology.
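
A client-side reconnect of this kind might look something like the sketch below. It is not the demo's Visual Basic code: `FlakyServer` is a hypothetical stand-in for a service that is briefly unavailable during failover, and the retry count and delay are assumed values.

```python
import time


class FlakyServer:
    """Hypothetical service that rejects the first few logons, as during a failover."""

    def __init__(self, failures_before_recovery):
        self.remaining_failures = failures_before_recovery

    def logon(self, user, password):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("server unavailable")
        return f"session-for-{user}"


def reconnect(server, cached_credentials, max_attempts=5, delay=0.0):
    """Log on again with cached credentials until the service is back or attempts run out."""
    user, password = cached_credentials
    for attempt in range(1, max_attempts + 1):
        try:
            return server.logon(user, password), attempt
        except ConnectionError:
            time.sleep(delay)  # give the cluster time to restart the service
    raise ConnectionError(f"gave up after {max_attempts} attempts")


server = FlakyServer(failures_before_recovery=2)
session, attempts = reconnect(server, ("alice", "secret"))
```

Because the restarted service reappears at the same address, the client only needs to retry against the same endpoint; no cluster-specific client software is involved.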

When a server comes back online following a failure, is there any human intervention required to get it back "up and running," or is the heartbeat enough for the other server to include it once again?
No manual intervention is required. When a server running Microsoft Cluster Server, say "Server A," boots, it starts the MSCS service automatically. MSCS in turn checks the interconnect (and network if necessary) to find the other server in its cluster, say "Server B." If Server A finds Server B, then Server A rejoins the cluster and Server B updates it with current cluster status info. Server A then initiates "failback," moving back failed-over workload from Server B to Server A at an appropriate time.

What is "failback," and how does it work in MSCS?
"Failback" is the ability to automatically rebalance the workload in a cluster when a failed server comes back online. This is a standard feature of MSCS. For example, say "Server A" has crashed and its workload failed-over to "Server B." When Server A reboots, it automatically finds Server B and rejoins the cluster. It then checks to see if any of the cluster groups running on Server B would "prefer" to be running on Server A. If so, it automatically moves those groups from Server B to Server A as soon as the time is right. Failback properties—that is, which groups can failback, which is their preferred server, and during what hours the time is "right" for failback—are all set from the cluster administration console.

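The failback decision described in the preceding answer can be sketched as a simple policy check. This is an assumption-laden illustration, not the MSCS implementation: the `Group` fields and the hour-based window are hypothetical names for the properties set from the administration console.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    name: str
    current_owner: str
    preferred_owner: Optional[str] = None
    allow_failback: bool = False
    window_start: int = 0  # hour of day, inclusive
    window_end: int = 24   # hour of day, exclusive


def in_window(hour, start, end):
    """True if hour falls in [start, end), allowing windows that wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def groups_to_fail_back(groups, rejoined_node, hour):
    """Names of groups that should move back to the node that just rejoined."""
    return [
        g.name
        for g in groups
        if g.allow_failback
        and g.preferred_owner == rejoined_node
        and g.current_owner != rejoined_node
        and in_window(hour, g.window_start, g.window_end)
    ]


groups = [
    Group("sql", current_owner="B", preferred_owner="A", allow_failback=True,
          window_start=22, window_end=4),
    Group("files", current_owner="B", preferred_owner="B", allow_failback=True),
    Group("mail", current_owner="B", preferred_owner="A", allow_failback=False),
]
late_night = groups_to_fail_back(groups, "A", hour=23)
midday = groups_to_fail_back(groups, "A", hour=12)
```

Here only the "sql" group prefers Server A and permits failback, and it moves only when the current hour falls inside its configured window.
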
Can the servers in an MSCS cluster be located at separate locations for recovery from site disasters?
Not at this time. All of the cluster configurations currently being considered for validation use SCSI connections to storage resources, which limits the distance between clustered servers to the distance supported by standard SCSI. This is typically no more than 25 meters, though there are SCSI extender technologies that can potentially stretch the connection up to 1,000 meters.

Note that Windows NT Server customers already have several choices for software that can mirror data to remote disaster recovery sites, including solutions from N.S.I., Octopus, Veritas, and Vinca. Most of these vendors have already announced that their disaster site mirroring solutions will also work with MSCS clusters.

Can MSCS restore registry keys for an application from one server to the other when doing failover?
Yes. Recovery of an application's registry information is a configurable feature that is available to the Generic Application and Generic Service resource types. Basically, you tell it what registry keys to log and recover, and that's all there is to it. This capability should be used if the application or service stores volatile information in specific registry keys. If this is done, when the resource comes online on another node, it will have the same registry information as the previously online resource.
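
Conceptually, the feature behaves like the following sketch, which models each node's registry as a plain dictionary. The key paths and function names are invented for illustration; the real feature is configured on the resource, not called through an API like this.

```python
def checkpoint_keys(registry, watched_keys):
    """Copy the watched keys, and everything beneath them, out of a registry."""
    return {
        path: value
        for path, value in registry.items()
        if any(path == key or path.startswith(key + "\\") for key in watched_keys)
    }


def restore_keys(registry, saved):
    """Overwrite the target node's copies with the checkpointed values."""
    registry.update(saved)


# Hypothetical registry contents on the two nodes.
node1 = {
    "SOFTWARE\\ExampleApp\\Port": 8080,
    "SOFTWARE\\ExampleApp\\Cache\\Size": 64,
    "SOFTWARE\\OtherApp\\Mode": "local",
}
node2 = {"SOFTWARE\\ExampleApp\\Port": 80}

saved = checkpoint_keys(node1, ["SOFTWARE\\ExampleApp"])
restore_keys(node2, saved)  # done before the resource comes online on node2
```

Only the keys listed for the resource are carried over; unrelated keys on either node are left alone.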

When an application restarts on another server following a failure, does it restart from a copy of the application?
No. The new server (say, "Server 2") would start the application from the same physical disks as Server 1, since ownership of the application's disks on the shared SCSI bus had been moved from Server 1 to Server 2 as one of the first steps in the failover process. This approach assures that the application always restarts from its last known state, as recorded on its disk drives (and, if you use the available option, as recorded in its registry keys).

Can MSCS restore an application's "state" at the time of its failure rather than requiring a complete restart?
MSCS can restore the state of an application's registry keys, but any other state information must be managed and restored by the application. Applications need to provide some model for persistence to ensure that state can be recaptured. For example, Microsoft SQL Server™ uses transaction logs to provide this assurance. If a server running Microsoft SQL Server crashes, upon restart the application uses its transaction logs to bring the database back to a known state. With a cluster, just as with a single server, good application design and the use of ACID (Atomic, Consistent, Isolated, and Durable) transaction properties are important.
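
The idea of recovering to a known state from a log can be illustrated with a toy example. This is a simplified sketch of log replay in general, not SQL Server's actual recovery algorithm; the log format is an assumption made for the example.

```python
def recover(log):
    """Rebuild state from a log, applying only the writes of committed transactions."""
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    state = {}
    for entry in log:
        if entry[0] == "write" and entry[1] in committed:
            _, _txn, key, value = entry
            state[key] = value
    return state


# Entries are ("write", txn_id, key, value) or ("commit", txn_id).
log = [
    ("write", 1, "balance:alice", 90),
    ("write", 1, "balance:bob", 110),
    ("commit", 1),
    ("write", 2, "balance:alice", 40),  # the server crashed before txn 2 committed
]
state = recover(log)
```

Because the log lives on the disks that move to the surviving server, the same replay can run there after failover, and the uncommitted write from transaction 2 is discarded.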

What is the granularity of resource failover?
MSCS supports failover of "virtual servers," which usually correspond to applications, Web sites, print queues, or file shares (including their disk spindles, files, IP addresses, and so on). MSCS also provides cluster-wide services that are simultaneously available on all servers in the cluster, including cluster administration, performance monitoring, event viewing, a cluster name, and cluster time synchronization.
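
One way to picture a virtual server is as a group of resources that must come online in dependency order on whichever node owns the group. The sketch below assumes a hypothetical file-share group and hypothetical dependency edges; it uses Python's standard `graphlib` module (Python 3.9 or later) to compute a valid order.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on (hypothetical example).
file_share_group = {
    "disk": set(),
    "ip-address": set(),
    "network-name": {"ip-address"},
    "file-share": {"disk", "network-name"},
}

# Dependencies come first, so the share is started only after its disk and name.
online_order = list(TopologicalSorter(file_share_group).static_order())
```

On failover, the whole group moves as a unit and the same order is followed on the surviving server.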

What is a "quorum disk" and how does it help MSCS provide high availability?
It's a disk spindle that MSCS uses to determine whether another server is up or down. Technically, it's a resource that can only be owned by one server at a time, and for which servers can negotiate for ownership. Negotiating for the quorum drive allows MSCS to avoid "split brain" situations where both servers are active and think the other server is down. (This can happen when, for example, the cluster interconnect is lost and network response time is problematic.) The use of a quorum resource is one of the sophisticated algorithms that Microsoft adopted by working with pioneers in clustering such as Digital and Tandem.
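
The arbitration idea can be sketched with a resource that only one node can own at a time. In this illustrative example an in-memory lock stands in for the shared quorum disk; the class and function names are assumptions, not MSCS interfaces.

```python
import threading


class QuorumResource:
    """A resource that at most one node can own at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        """Claim the resource if it is free; report whether this node owns it."""
        with self._lock:
            if self.owner is None:
                self.owner = node
            return self.owner == node

    def release(self, node):
        with self._lock:
            if self.owner == node:
                self.owner = None


def survivors_after_partition(quorum, nodes):
    """Each node that lost contact with its peer arbitrates for the quorum.
    Only the winner keeps running cluster workloads."""
    return [node for node in nodes if quorum.try_acquire(node)]


quorum = QuorumResource()
active = survivors_after_partition(quorum, ["A", "B"])
```

Even though both nodes believe the other is down, only one of them wins ownership, so the two servers never run the same workload at once.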

Manageability


How does MSCS improve the manageability of servers?
MSCS gives administrators a graphical console from which they can monitor and manage all of the resources in a cluster as if it were a single system. Using the familiar standards of a Microsoft Windows graphical user interface, an administrator can use the cluster console to:

The ability to graphically move workload from one server to another with only a momentary pause in service (typically less than a minute) means administrators can easily unload servers for planned maintenance without taking important data and applications offline for long periods of time.

Does MSCS provide administrators with a "single system image"?
Yes. MSCS provides administrators a single graphical console to manage all of the applications and resources in a cluster. The MSCS console presents cluster resources by physical server, and by "virtual server" (or "cluster group"). This allows administrators to centrally manage the cluster as a collection of virtual application-oriented servers, or as a collection of physical resources when appropriate.

Can MSCS be remotely managed?
Yes. An authorized user can run the MSCS administration console from any Windows NT Workstation or Windows NT Server on the network. In the version of MSCS accompanying Windows 2000 Server, Enterprise Edition, the cluster administration console will be a "snap-in" to the Microsoft Management Console, providing scriptable, remoteable access, including access through Internet protocols from a browser.

How does MSCS help administrators do "rolling upgrades" of their servers?
With MSCS, server administrators no longer have to do all their maintenance within those rare windows of opportunity when no users are online. Instead, they can simply wait until a convenient off-peak time when one of the servers in the cluster has enough horsepower for all of the cluster workload. They then point-and-click to move all the workload onto one server, and they're ready to perform maintenance on the unloaded server. Once the maintenance is complete and tested, they bring that server back online and it automatically rejoins the cluster, ready for work. When convenient, the administrator repeats the process to perform maintenance on the other server in the cluster. This ability to keep applications and data online while performing server maintenance is often referred to as doing "rolling upgrades" to your servers.
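
The procedure can be sketched as a loop over the two servers. This is a simulation under stated assumptions, not an administration script: the cluster is modeled as a dictionary of node names to groups, `upgrade` is any caller-supplied maintenance step, and moving groups back afterward is an assumption of the example.

```python
def rolling_upgrade(cluster, upgrade):
    """Upgrade each node in turn while its peer hosts all of the workload.

    cluster maps node name -> list of groups; upgrade is called once per node.
    """
    steps = []
    nodes = list(cluster)
    for node in nodes:
        peer = next(n for n in nodes if n != node)
        moved = cluster[node]
        cluster[peer] = cluster[peer] + moved      # move workload to the peer
        cluster[node] = []
        steps.append(("drain", node))
        upgrade(node)                              # maintenance on the idle node
        steps.append(("upgrade", node))
        cluster[node] = moved                      # move the groups back
        cluster[peer] = [g for g in cluster[peer] if g not in moved]
        steps.append(("restore", node))
    return steps


cluster = {"A": ["sql"], "B": ["files"]}
online_during_upgrade = []


def record_upgrade(node):
    # Record which groups are still being served while this node is offline.
    hosted = sorted(g for groups in cluster.values() for g in groups)
    online_during_upgrade.append((node, hosted))


steps = rolling_upgrade(cluster, record_upgrade)
```

The recorded snapshots show that every group stays hosted somewhere while each server is being maintained.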

Will Microsoft support "rolling upgrades" of future server products using MSCS clusters?
It is Microsoft's goal to support "rolling upgrades" between releases of Microsoft server software using MSCS clusters. However, we cannot commit to this for all releases of all products. Persistent storage formats must occasionally change to accommodate new capabilities, and changes in persistent storage occasionally require applications to be taken offline while storage or indices are restructured. Microsoft will commit to always providing smooth upgrades between releases of all our products, and we'll use MSCS to provide seamless rolling upgrades whenever possible.

Scalability


How will MSCS enhance server scalability?
The manageability benefits of Windows NT Server Enterprise Edition 4.0 simplify many of the processes currently used to improve scalability, such as upgrading server hardware and installing new versions of applications. A post-Windows 2000 Server Enterprise Edition version of MSCS will support clusters containing larger numbers of servers, and will provide enhanced abilities that simplify the creation of highly scalable, cluster-aware applications.

Today, however, there are significant scalability advantages to clustering. For example:

Assuming that a customer is running cluster-aware versions of the appropriate software products (the Enterprise Editions of SQL Server and Exchange) on both nodes, in each of the examples two clustered systems will provide scalability as well as availability advantages relative to a single, non-clustered system.

The Microsoft cluster strategy White Paper said MSCS is already architected for multiple nodes. Has MSCS been tested on multinode clusters? If so, why is Microsoft waiting to deliver multinode support?
Yes, Microsoft and other vendors have tested MSCS clusters with more than two servers. These clusters "work" in that they are stable and the administrator's console provides basic management for the multiserver environment. However, the algorithms and features in the current software must be extended and thoroughly tested on larger clusters before customers can reliably use a multinode MSCS cluster for production work, or gain enhanced cluster benefits. In addition, Microsoft will have to extend the cluster hardware validation procedures to accommodate the additional requirements of multinode clusters.

Microsoft has architected MSCS for multinode support in preparation for the coming "Phase 2" version. Today's multinode tests have proven the architecture is correct. However, there are two key reasons Microsoft is limiting the initial release to two-server clusters:

  1. Customer surveys show that 80 percent of the demand for clusters is to improve the availability of mission-critical data and applications. Two-server clusters satisfy this overwhelming customer requirement. Focusing on this customer requirement allowed Microsoft to focus its efforts, and the efforts of other vendors, on delivering very high-quality, high-availability clustering solutions in the initial release.
  2. One of the key requirements for developing scalable, cluster-aware applications is a globally accessible, programmable naming service that clients use to locate cluster resources. The enhanced Directory Services of Windows 2000 Server will be an excellent cluster naming service, so it was decided to develop MSCS "Phase 2" support for large, scalable clusters using the Active Directory of Windows 2000 Server.

Is it possible to add hard drives to an MSCS cluster without rebooting?
It depends on whether the drive cabinet supports this, since Windows NT will not do so until the Windows 2000 release. There are examples of RAID cabinets validated for Windows NT that support changing volumes on the fly (with RAID parity).

How will MSCS help do load balancing?
"Load balancing" is the ability to move work from a very busy server to a less-busy server. MSCS will support load balancing in four ways over time:

  1. Manual load balancing: With the initial release of MSCS, the person administering a cluster will be able to use the cluster console to move whole cluster groups (that is, sets of related applications and resources) from a loaded server to a less-loaded server by point-and-click. Administrators can easily determine when server loads justify load balancing using the built-in Performance Monitor of Windows NT Server.
  2. Automatic cluster group load balancing: A future release of MSCS will allow administrators to specify performance-related failover policies for cluster groups, using the graphical cluster administration console. This would be similar, for example, to the way "fail back" policies are set in the initial release of MSCS. The administrator will go to a "Load Balancing" tab in the Properties window for a cluster group, and use point-and-click and fill-in-the-blank actions to specify which values of which Performance Monitor counters should trigger load-balancing failover.
  3. Automatic workload balancing in "cluster aware" applications: Over time, some software vendors will use the evolving services of MSCS to create a new generation of cluster-aware applications that automatically spread their workload over multiple servers in a cluster to achieve higher scalability. Examples that have already been publicly discussed include future versions of Microsoft SQL Server, Oracle Parallel Server, and Tandem NonStop SQL/MX.
  4. Automatic transaction load balancing through Microsoft Transaction Server: Today's Microsoft Transaction Server (MTS), part of Windows NT Server, provides multithreading as a "free" service that automatically improves the scalability of component-based applications running on single servers. A future release of MTS will similarly provide automatic distribution of transaction processing loads across the servers in a cluster as a "free" service. This will be the easiest way for corporate developers and application vendors to achieve cluster-enhanced scalability.
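
A threshold-driven policy of the kind described in item 2 could be sketched as follows. The feature is described above only as a future direction, so everything here is hypothetical: the CPU counter, the 80 percent threshold, and the rule of moving the first group on the busiest node.

```python
def pick_group_to_move(cpu_by_node, groups_by_node, threshold=80.0):
    """Suggest (group, from_node, to_node) when the busiest node exceeds the threshold."""
    busiest = max(cpu_by_node, key=cpu_by_node.get)
    idlest = min(cpu_by_node, key=cpu_by_node.get)
    if busiest == idlest or cpu_by_node[busiest] < threshold:
        return None
    if not groups_by_node[busiest]:
        return None
    return groups_by_node[busiest][0], busiest, idlest


groups_by_node = {"A": ["sql", "web"], "B": ["files"]}
overloaded = pick_group_to_move({"A": 92.0, "B": 35.0}, groups_by_node)
balanced = pick_group_to_move({"A": 60.0, "B": 35.0}, groups_by_node)
```

With manual load balancing (item 1), the administrator makes the same judgment by reading Performance Monitor and moving the group from the console.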

Should cluster-aware applications developed for MSCS use a shared-disk or shared-nothing architecture for greatest scalability?
Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources. In theory, MSCS can support either type of application. However, Microsoft has no plans at this time to include a DLM in the MSCS cluster services, so vendors would have to develop or license a DLM to implement a shared-disk application on MSCS. Microsoft has chosen to use the shared-nothing architecture for future versions of Microsoft BackOffice® family applications because of that architecture's greater potential for cluster-enabled scalability.
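
The partitioning approach can be illustrated with a small sketch in which every key has exactly one owning server, so no lock manager is needed to mediate access. The hash function, server names, and keys are assumptions chosen for the example.

```python
import zlib


def owner_of(key, servers):
    """Map a key to the single server that owns its partition."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


servers = ["node-a", "node-b"]
keys = ["customer:1", "customer:2", "customer:3", "customer:4"]

# Group the keys by owning server; each server works only on its own list.
placement = {}
for key in keys:
    placement.setdefault(owner_of(key, servers), []).append(key)
```

A shared-disk design would instead let both servers read and write all four records, coordinating through a DLM.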

Will MSCS ever have a Distributed Lock Manager (DLM)?
Microsoft will not include a distributed lock manager in the first release of MSCS. Enhancements in future releases will be determined based on customer requirements.

When will Microsoft offer a parallel version of Microsoft SQL Server that runs on multiple servers at the same time for automatic load balancing and scalability?
The next major release after Microsoft SQL Server 7.0 is planned to offer cluster-enabled scalability on MSCS clusters. It will use a scalable "shared nothing" architecture to spread a single database across multiple servers. A White Paper on the strategy for Microsoft SQL Server on clusters can be downloaded from http://www.microsoft.com/sql. Although this is an important direction for Microsoft SQL Server, it must be kept in perspective: it will be needed by only a small percentage of customers. Cluster-enabled scalability will be needed only by extremely large enterprise applications that (a) are too large to run on a single high-end SMP server (for example, eight-processor SMP with 4 GB of RAM), and (b) cannot be partitioned to run on a distributed network using MTS.

What are Microsoft's plans for supporting Distributed Message Passing (DMP)?
Distributed Message Passing is one of the intracluster communications techniques that are planned for Phase 2 of MSCS. (Another is I/O shipping.) Applications will be able to access MSCS DMP services through extensions to the Cluster API. MSCS in turn will host the DMP services over a variety of interconnect technologies including new low-latency drivers based on the Virtual Interface (VI) architecture. The result will be a standard infrastructure for supporting a new generation of scalable, cluster-aware applications.

Application and Service Support


What types of applications and services will benefit from MSCS clustering?
There are three types of server applications that will benefit from MSCS clusters:

What software vendors will offer cluster-aware applications for MSCS?
Software vendors that have already announced plans to offer products for MSCS clusters include Baan, Cheyenne, Computer Associates (CA/Unicenter TNG), HP (ClusterView), IBM (DB2), NetIQ, Octopus, Oracle (Oracle 7 Failsafe), SAP, Vinca, and, of course, Microsoft (Microsoft SQL Server, Enterprise Edition, and Exchange Server, Enterprise Edition). For an up-to-date list of announced products that support MSCS, refer to the Microsoft Windows NT Server, Enterprise Edition Solutions Directory.

Will Microsoft validate or logo software products that work with MSCS?
Microsoft will not have a validation program for MSCS-based software products at first. It is expected that once MSCS clusters are deployed in volume and there are sufficient examples of cluster-aware application products to evaluate, Microsoft will extend its Microsoft BackOffice logo program to include, at a minimum, validation of support for basic failover operation on an MSCS cluster.

What are Microsoft's plans for supporting Microsoft SQL Server on MSCS clusters?
Microsoft SQL Server, Enterprise Edition version 6.5 is available now and provides "active/active" cluster support (that is, both servers can be running SQL Server, with each server supporting its own databases). Microsoft SQL Server 7.0, currently in beta test, will include additional cluster-aware enhancements that provide for faster recovery in the event of a server or application failure. The version of Microsoft SQL Server that follows release 7.0 will include new features for shared-nothing scalability on MSCS clusters (for example, a single database will be able to span multiple servers).

What are Microsoft's plans for supporting Microsoft Exchange Server on MSCS clusters?
Microsoft Exchange Server Enterprise Edition 5.5 supports cluster failover and is shipping today.

Can the standard versions of Microsoft SQL Server 6.5 or Exchange Server 5.5 be set up for failover on a cluster using the "generic application" capability of MSCS?
Technically proficient customers who want to test Microsoft SQL Server 6.5 or Exchange Server 5.5 on a cluster may do so using the generic application capability of MSCS. However, the setup can be complex, and will not be supported by Microsoft support services. Therefore, customers should only do so for testing purposes, not for production deployments. Microsoft SQL Server, Enterprise Edition version 6.5, and Exchange Server, Enterprise Edition 5.5 feature a simplified cluster setup procedure, and are fully supported for failover on MSCS clusters.

Will Microsoft SNA Server benefit from MSCS?
No, because Microsoft SNA Server already provides a hot failover capability independent of MSCS.

Will Microsoft Proxy Server benefit from MSCS?
No, because the current version of Microsoft Proxy Server has its own capability for chaining together multiple servers for high availability and scalability.

Will Microsoft Systems Management Server benefit from MSCS?
No, MSCS will not provide high availability for the current release of Microsoft Systems Management Server. Microsoft intends to provide cluster-enabled high availability for Systems Management Server in a future release.

Can MSCS failover a Windows NT Server Directory (Domain) Controller?
No, because it is already possible to have backup directory service controllers for high availability. Servers in an MSCS cluster may be either primary or backup directory controllers for Windows NT Directory Services.

Can MSCS failover a WINS (Windows Internet Name Service) server?
No, because it is already possible to have backup WINS servers for high availability.

Can MSCS failover Remote Access Services (RAS)?
Remote Access Services cannot benefit from MSCS at this time since there is no standard method for doing software failover of modem connections. For higher reliability of dial-up connections, you can use the RAS Multi-Link capability first introduced in Windows NT Server 4.0.

Can MSCS failover Microsoft Distributed File System (Dfs) directories?
Not in Windows NT Server, Enterprise Edition 4.0. The version of Dfs in Windows 2000 Server will provide directory replication for fault tolerance. When used on the Enterprise Edition of Windows 2000 Server, Dfs will also work with MSCS failover for fast recovery from server crashes.

What versions of Oracle will benefit from MSCS clusters?
Oracle has announced that Oracle Failsafe 2.0 is available for Oracle7 customers at no extra cost. It provides "active/active" database failover on MSCS clusters (that is, databases can run on both servers at the same time, and either server can fail over to the other in the event of an application or server failure). For more information, refer to http://ntsolutions.oracle.com/index.htm.

Does Tandem NonStop SQL/MX use MSCS?
Tandem NonStop SQL/MX uses MSCS clustering services when running on a two-server cluster. NonStop SQL/MX uses its own single-application clustering services when running on a cluster with more than two servers. Customers who want high availability plus database scalability up to the performance provided by two high-end SMP servers will benefit by running NonStop SQL/MX on MSCS, gaining the additional benefits of high availability for other services and applications on the cluster. Customers who require additional scalability would use the built-in single-application cluster services of NonStop SQL/MX, trading off general availability services for the ability to scale on more than two servers.

Microsoft Cluster Server and Windows NT Load Balancing Service


How does Microsoft Cluster Server work with Windows NT Load Balancing Service?
Windows NT Load Balancing Service is fully complementary to Microsoft Cluster Server. Microsoft Cluster Server provides a nonstop, reliable platform for database, messaging, and related application services through failover clustering for two nodes. Windows NT Load Balancing Service balances and distributes client connections (TCP/IP connections) over multiple servers. In a three-tier model, MSCS handles the application layer and the data layer, while Windows NT Load Balancing Service (formerly known as Convoy) handles the front-end connections. When used together, Microsoft Cluster Server and Windows NT Load Balancing Service provide customers with a highly scalable, reliable, and available system. This is an industry-leading way to combine transactional systems with a Web-based front end, and to deliver the scale, availability, and robustness demanded by enterprise-class customers.