Information is an important asset for an enterprise in the 1990s. Today's computer systems must provide reliable and timely information and services to assist personnel in making
informed decisions crucial to the daily operations of the modern enterprise.
Despite the rapid evolution in all aspects of computer technology, both the computer hardware and
software are prone to numerous failure conditions. Employing methods to minimize exposure to as many failure conditions as possible will significantly increase the use of a company's resources and
become a direct indicator of successful business operations.
Enterprises today are demanding their new computer systems be "right-sized" for quicker deployment, cost reductions
in ownership, decrease in support and maintenance expenses, and increase in support for both homogeneous and heterogeneous distributed client/server environments. The concepts of "data
warehouses" and "replication services" provide the mechanisms to ensure the correct information is always available to the appropriate personnel so they can make the important
business decisions in time-critical situations.
This new form of information service must be constantly monitored and tuned to provide reliable and accurate information delivery. Hardware
failure of these "central repositories" could prove harmful in today's competitive environment.
DEFINITION OF SYSTEM AVAILABILITY:
Availability of critical information services is affected by both scheduled and unscheduled system downtime. Although scheduled downtime for system maintenance and upgrades is
inevitable, it is fatal to information services that are considered non-interruptible. In addition, unscheduled downtime is unpredictable and should be avoided. Human errors, operating system
failures, computer hardware failures, and network failures are usually the cause of most unscheduled downtime.
In order to describe the methods for preventing these failures, we
must first understand the definitions of the various levels of availability. They are normal availability, high availability, and fault tolerance.
Normal Availability Systems (NAS):
Normal availability systems are defined as general-purpose computer hardware and software systems that
have no hardware redundancy or software enhancement to provide fault-processing recovery. They require manual, human intervention to identify and correct/repair the failed component(s) and
restart the system before resuming normal operations.
High Availability System:
High availability systems are defined as loosely
coupled NAS with redundant hardware components managed by software that provides fault detection and correction procedures to maximize the availability of the critical services and applications
provided by that system. These systems require no manual, human intervention to identify a failed component, execute a procedure to avert a system failure, and notice the averted failure. This
configuration minimizes the possibility of immediate data loss and service interruption.
High Availability Models:
There are two distinct
High Availability Models for client-server architectures. They are the Replicated Services Model and the Failover Model.
Replicated Services Model:
This model utilizes distributed applications and distributed databases on multiple servers in the LAN/WAN environment where the data is replicated to some
or all of the servers. When a server failure occurs, the data and applications are accessible from an alternate server.
Failover Model:
This model utilizes duplicate server hardware configurations in which one server has the role of an active server for data and application services, and the other is a backup server that
monitors the state of the active server. When the backup server detects a hardware or software failure that has occurred on the active server, it takes over the role and identity of the active server.
Fault Tolerance:
Fault tolerant systems consist of proprietary, expensive, and tightly coupled duplicated systems. Fault handling
capabilities are integrated into and become a function of the operating system. These systems have spontaneous and fully automatic response to system failures and provide uninterrupted services.
THE UNIQUE FEATURES IN H.A. TECHNICAL SOLUTIONS' HA
It is H.A. Technical Solutions' goal to deliver a product that maintains open systems compatibility with the platform architectures we support and their operating systems. Our
product provides reliable accessibility to applications and data through automatic failover processing, GUI administration, and alerting facilities. It is flexible enough to be adapted to differences in
individual implementation requirements and future scalability.
Open Systems Compatibility:
H.A. Technical Solutions HA (HATS HA) retains the performance, cost effectiveness, and technology
advantages of Open Systems Architecture. It is compatible with existing services native to Solaris (e.g., NFS, Telnet, ftp, etc.) and applications such as Sybase and other common hardware and
software products that are available through Sun Microsystems and other third party vendors.
Reliability and Accessibility:
When a failure
event occurs, it typically takes 5 seconds for fault detection and 10 to 120 seconds for failover to initiate in an RDBMS environment. The failover process for Internet applications can take much
less time (often under 1 second) by using additional software that journals the entries on the storage system. The failover processing occurs automatically without rebooting the backup server.
This provides the highest data and service availability in a distributed SPARC-architecture, client/server environment. There is no single point of failure to prevent accessibility to the data
and application services provided by the system.
HATS HA can manage redundant servers, network communication links, network adapters, shared disk subsystems, and SCSI disk
adapters to achieve high availability. A standard configuration consists of two SPARC servers or PCs, each with two SCSI interfaces, two Ethernet interfaces, and an internal disk. The servers
are connected to an external disk subsystem that could be a single disk or a RAID disk array. One of the networks is a "private" connection shared between the two servers, and the other is
the "public" network providing connection to the client workstations for services, data, and applications. (See Figure 1.)
An Active Server is defined as the computer
system that provides critical services, data, and/or applications to the client workstations.
A Backup Server is defined as a computer system that is configured for resuming the
functionality of the Active Server. A Backup Server can be dedicated or non-dedicated. It can also be an Active Server at the same time.
As a dedicated Backup Server, its function is
simply to wait for a failover event and take over the role of the Active Server. When configured as a non-dedicated Backup Server, it can be providing services, data, and/or applications to
clients as well as waiting for a failover event to occur. Multiple non-dedicated Backup Servers can be identified and configured to divide up the workload of an Active Server whenever a
failover event occurs. These multiples, non-dedicated Backup Servers can then take over the workload of a failed Active Server in a pre-defined scheme that is configured by the System
Administrator. Additional redundancy can thus be configured into the routine. For example: If one of the Backup Servers fails to react properly to a failover event, another Backup Server can
then detect this failure and take over the workload for the "failed" Backup Server. This definition allows for the Backup Server to also have the role of an Active Server itself.
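The pre-defined takeover scheme described above can be pictured as an ordered list of candidate Backup Servers per service, where the next candidate steps in if an earlier one is unavailable. The following is a minimal sketch under that assumption; the function name, the server names, and the data layout are hypothetical and are not part of the HATS HA product.

```python
def plan_failover(scheme, healthy_servers):
    """Pick a backup for each service of a failed Active Server.

    scheme: maps a service name to an ordered list of candidate backup
            servers (hypothetical layout, chosen by the administrator).
    healthy_servers: set of backup servers that are currently reachable.
    Returns a map of service -> chosen backup, or None if none is available.
    """
    plan = {}
    for service, candidates in scheme.items():
        # Take the first candidate that is still healthy; later entries
        # act as the additional layer of redundancy.
        plan[service] = next((s for s in candidates if s in healthy_servers), None)
    return plan


scheme = {"nfs": ["backup1", "backup2"], "rdbms": ["backup2", "backup1"]}
print(plan_failover(scheme, {"backup1", "backup2"}))
print(plan_failover(scheme, {"backup2"}))
```

With both backups healthy the workload is divided between them; if backup1 does not respond, backup2 takes over both services.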
After the system bootstrap process is completed:
HATS HA Manager is the first initiated daemon process on each server.
HA Manager is the HATS HA Kernel.
The HA Manager initializes the necessary processes and configures the server for failover processing as defined in the HATS HA configuration.
Failover Detection Process:
HATS HA Agents are daemon processes that monitor and manage the defined critical services that are provided by
the Active Server. These agents provide status signals for these critical services to the HA Manager in the form of "electronic" heartbeats. While the HA Manager on the active server is
receiving an "alive" or "healthy" heartbeat signal from all of its agent processes, it sends a heartbeat to the HA Manager on the backup server. This HA Manager to HA Manager
heartbeat function informs the backup server that the active server is currently in good "health" and operating properly. When this active server to backup server heartbeat is absent, the
backup server assumes that the active server has failed and initiates the defined failover processes.
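The backup-side rule described above (no heartbeat within a time limit means the active server is presumed failed) can be sketched as follows. This is an illustration only: the class name, the injectable clock, and the 5-second value are assumptions made for the example and do not describe the actual HA Manager implementation.

```python
import time


class BackupMonitor:
    """Tracks the last heartbeat seen from the active server's HA Manager."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical time-out setting
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def on_heartbeat(self):
        # Called whenever a manager-to-manager heartbeat arrives.
        self.last_seen = self.clock()

    def active_presumed_failed(self):
        # True once the heartbeat has been absent longer than the time-out.
        return self.clock() - self.last_seen > self.timeout_s


monitor = BackupMonitor(timeout_s=5.0)
monitor.on_heartbeat()
print(monitor.active_presumed_failed())
```

A monotonic clock is used so that adjustments to the system time cannot trigger a false failover.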
If a critical service on the active server fails, the agent will send a
"fail" heartbeat to the HA Manager on the active server.
If an agent on the active server itself fails, the HA Manager on that server detects the absence of the agent's
heartbeats and, after a configurable time-out, performs designated tasks to restart the applications.
Any critical service on the active server can be monitored by more than one
agent process. Each agent is designed to monitor a specific or unique aspect of that service. The service is considered to be available and "healthy" as long as the HA Manager is
receiving at least one heartbeat function from the individual agents that are monitoring that service. Thus, if three agents are monitoring one service and one or two of them detect a failure, the HA
Manager will not initiate failover processing for that service until the third heartbeat function also signals a failure.
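The rule in the preceding paragraph reduces to a logical OR over the agents that watch one service. A minimal sketch, assuming each agent's latest report is reduced to a boolean (the function and agent names are hypothetical):

```python
def service_is_available(agent_reports):
    """agent_reports maps agent name -> True for a healthy heartbeat,
    False for a "fail" heartbeat or a timed-out agent."""
    # The service counts as healthy while at least one agent still reports it so.
    return any(agent_reports.values())


# Two of three agents have detected a failure: no failover yet.
print(service_is_available({"agent_a": False, "agent_b": False, "agent_c": True}))
# The third agent also signals failure: failover processing may begin.
print(service_is_available({"agent_a": False, "agent_b": False, "agent_c": False}))
```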
The agents monitor services that include communication,
file, disk, network, NFS, NIS, DNS, and RDBMS. End users may also define and develop agents that are customized for special application services that may be deployed.
HATS HA uses
standard RPC calls to exchange information for the HA Manager-to-agent and HA Manager-to-HA Manager heartbeat functions. This implementation scheme is beneficial for the end user since these
standard mechanisms allow for upgrades to new communication media without impact on the integrity of the HATS HA design.
HATS HA issues immediate and automatic actions against specific faults. Following are some possible user-configured responses to selected failed services:
The failure of the service is ignored
A hardware device has failed and an alternate is configured and available
A software service such as NFS or DNS has failed
The system providing the service is shut down
A software application has failed
A number of attempts to restart the service on the active server are triggered
If the service fails to restart, the user may choose to:
IGNORE the failure of the service
HALT processing of the service
FAILOVER the service to the designated backup server
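The restart-then-escalate sequence above can be sketched as a small function: try to restart the service a configured number of times, then apply the user's chosen action. The names used here (OnFailure, handle_service_failure, the restart callback) are hypothetical and only illustrate the control flow.

```python
from enum import Enum


class OnFailure(Enum):
    IGNORE = "ignore"
    HALT = "halt"
    FAILOVER = "failover"


def handle_service_failure(restart, max_attempts, on_failure):
    """Try restart() up to max_attempts times.

    restart: callable returning True when the service came back up.
    Returns "restarted", or the name of the configured fallback action.
    """
    for _ in range(max_attempts):
        if restart():
            return "restarted"
    # All restart attempts failed: fall back to the user-selected response.
    return on_failure.value


print(handle_service_failure(lambda: False, 3, OnFailure.FAILOVER))
```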
When a failover event occurs, there will be a brief interruption in services for the failure recognition and
failover process to initialize the services on the backup server. This process will occur automatically without human interaction. Once the failover processing is completed, the services provided by the
active server will be operating on the backup server.
Either the active server or the backup server can initiate a failover process, depending on the kind of heartbeat loss and the user-defined procedure to follow
once that heartbeat function has failed.
A failover process involves transferring to the backup server the network identity of the active server (its
TCP/IP address and MAC address), the shared disk subsystem(s) (which can be a mirrored disk set or disk array), and the designated services provided by the active server.
The failover of
stateless applications such as NFS, NIS, and DNS is transparent to the end user. Stateful applications such as FTP, Remote Login, and Telnet must re-establish the connection to the server
application after the failover event. Other processes such as client/server database applications can be programmed to acquire the status of the application/service provided by the active server and
then resume or reconnect to the service that is now provided by the backup server. This programming technique would allow these stateful applications to appear as stateless
applications to the end user.
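One way a client application might implement this technique is a retry wrapper that reconnects and reissues the request when the connection drops. Because the backup server assumes the active server's network identity, the client can retry against the same address. The sketch below is an assumption about how such a client could be written; the function names, the retry count, and the delay are hypothetical.

```python
import time


def call_with_reconnect(connect, request, retries=5, delay_s=1.0, sleep=time.sleep):
    """Run request(conn), reconnecting if the server is unreachable.

    connect: callable that opens a connection (raises ConnectionError on failure).
    request: callable that performs the work on an open connection.
    """
    last_error = None
    for _ in range(retries):
        try:
            conn = connect()
            return request(conn)
        except ConnectionError as exc:
            # The server may be mid-failover: wait briefly, then try again.
            last_error = exc
            sleep(delay_s)
    raise ConnectionError("service unavailable after retries") from last_error
```

The delay and retry count would be chosen to cover the expected failover window for the service in question.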
Unfortunately, terminals that are directly connected to serial ports on the active server will be rendered unusable due to the nature of serial port
interfaces. These interfaces cannot be "failed over" to another server. Sessions from network terminal servers and client terminal processes such as X-Windows, Telnet, and Remote Login
will also be terminated, but the users can then reestablish connections to the backup server and continue to access the applications/services that have been resumed on that server.
When the failed active server recovers or has been repaired, HA Manager may be reconfigured to allow it to now play the role of the backup server, if desired. Otherwise, it can be
configured to reclaim its original role as the active server and restore its network identity, resources, applications and services from the backup server and return to active server status.
H.A. TECHNICAL SOLUTIONS HA CUSTOMIZATION
Server Configuration:
User configurable parameters are included in the HATS HA product. They are configured to nominal default settings,
but can be customized to the enterprise requirements as needed. The following is a partial list of some popular configurable parameters:
The server(s) and the defined role(s) in the HA configuration
Network configuration information for the private network and public network
The backup server(s) designated for specific active server(s)
Script names to execute to initiate specific services
Critical services and their corresponding HA Agents.
HA Agent process names and their heartbeat failure time-out limits and failure processing/actions
HA Manager failure time-out limits and failure processing/actions
The maximum number of restart attempts for a failed service before failover processing begins
Prerequisite services that must be operating before failover processing is initiated
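To make the parameter list above concrete, the sketch below models a configuration as a Python dictionary and runs two simple consistency checks. Every key name, server name, interface name, and script path is hypothetical; the actual HATS HA configuration file format is not shown in this document.

```python
# Hypothetical configuration mirroring the parameters listed above.
HA_CONFIG = {
    "servers": {"alpha": "active", "beta": "backup"},
    "networks": {"private": "le1", "public": "le0"},
    "backups": {"alpha": ["beta"]},
    "services": {
        "disk": {"start_script": "/opt/ha/start_disk", "max_restarts": 1},
        "nfs": {
            "start_script": "/opt/ha/start_nfs",
            "agents": {"nfs_agent": {"timeout_s": 5, "on_timeout": "restart"}},
            "max_restarts": 3,
            "requires": ["disk"],  # prerequisite services
        },
    },
}


def validate(config):
    """Return a list of human-readable problems found in the configuration."""
    errors = []
    servers = config["servers"]
    for active, backups in config["backups"].items():
        for name in [active, *backups]:
            if name not in servers:
                errors.append(f"unknown server: {name}")
    for service, spec in config["services"].items():
        for dep in spec.get("requires", []):
            if dep not in config["services"]:
                errors.append(f"{service} requires undefined service: {dep}")
    return errors


print(validate(HA_CONFIG))
```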
User defined shell scripts can add functionality, reliable logging of events and notification processing in response to
processing of a failed service. User defined shell scripts can perform many tasks, some of which are listed below.
Start and stop various services
Define follow-up procedures for the failed service(s)
Send messages to the system console
Write log file information for troubleshooting purposes
Write a message to the system logger
Notify support personnel via pager
Notify help desk personnel via e-mail or other system management software
Broadcast messages to all users
HATS HA provides both the API and HA agent templates for user-defined agents specific for the user's desired
requirements. In order to program these templates, the user needs to have a working knowledge of the C programming language and the application or service that the new agent will monitor. Only the
component that will interact with the service needs to be programmed. The API and HA agent templates will provide the other functions.
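The division of labor described above, where the template supplies the heartbeat plumbing and the user supplies only the service check, can be sketched as follows. The real templates are written in C; this Python version is an illustration only, and the class and method names are hypothetical.

```python
class AgentTemplate:
    """Base class standing in for the vendor-supplied template."""

    def __init__(self, service_name, send_heartbeat):
        self.service_name = service_name
        self.send_heartbeat = send_heartbeat  # callable(service_name, status)

    def check_service(self):
        """The only part the user writes: return True if the service is healthy."""
        raise NotImplementedError

    def run_once(self):
        # A check that raises is treated the same as a failed check.
        try:
            healthy = bool(self.check_service())
        except Exception:
            healthy = False
        status = "healthy" if healthy else "fail"
        self.send_heartbeat(self.service_name, status)
        return status


class QueueDepthAgent(AgentTemplate):
    """Example user-defined agent for a hypothetical in-house application."""

    def __init__(self, service_name, send_heartbeat, get_depth, limit):
        super().__init__(service_name, send_heartbeat)
        self.get_depth = get_depth
        self.limit = limit

    def check_service(self):
        return self.get_depth() <= self.limit


agent = QueueDepthAgent("orders", lambda name, status: print(name, status), lambda: 12, 100)
agent.run_once()
```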
System administration utilities include support for checking configurations, installing HATS HA onto a new server, and configuring and managing the HATS HA
environment. All these functions can be performed from a single node on the network or from a system console.
HATS HA provides a graphical user interface (GUI) for easy system
administration and HA Agent/Manager monitoring from any character based or X-Windows terminal. The system administrator can issue commands such as:
Start and stop the HA Manager
Start and stop specific HA Agents
Start and stop specific services
Force a failover process to occur from either the active or backup server
Monitor and query the status of servers, networks, services and agents
Verify HA server configuration(s)
Install H.A. Technical Solutions HA software on another server
Configure and manage the enterprise H.A. Technical Solutions HA environment
The first initiated process of HATS HA is a daemon process called HA manager. Each server that starts HATS HA
is configured by HA manager according to the settings defined in the HA configuration file. The HA configuration file of each server is the same. All the services and their corresponding agents (the
processes that monitor and manage the services) are all specified in the HA configuration file. A server will only initiate the services and the corresponding agents that the server is designated
Agents are processes that monitor and manage critical hardware and software services. Agents inform the HA manager
of the current status of the services with agent heartbeats. If a service should fail, the agent will stop sending heartbeats to the HA manager and a predefined failover procedure will be carried out. Each
server handles only those services and corresponding agents that are initiated by its HA manager.
A service can be monitored by more than one agent. Each agent monitors different
aspects of the service. The service is considered available as long as all of its agents send a heartbeat to the HA manager.
A service can also be agentless. There will be no agent
to watch over the availability of the service. The service is considered available as long as the service is properly started; a failover will only occur when the active server itself has failed.
In addition to agents for communication services, file services, disk services, NFS, RDBMS, etc., users may create agents for their own developed applications.
When all the
services managed by the HA manager are considered available, a server heartbeat will be broadcast to the related backup server(s). The loss of a server heartbeat from an active server will cause a
failover of all of its services to the corresponding backup server(s).
HATS HA utilizes the UDP protocol for exchanging the heartbeat messages between HA managers and HA agents.
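A heartbeat exchange over UDP can be sketched with standard datagram sockets as shown below. The message format ("name:status" in ASCII), the buffer size, and the function names are assumptions made for this example; the actual HATS HA wire format is not described here.

```python
import socket


def send_heartbeat(sock, address, server_name, status="healthy"):
    # One small datagram per heartbeat; UDP gives no delivery guarantee,
    # which is why the receiver relies on a time-out rather than an ACK.
    sock.sendto(f"{server_name}:{status}".encode("ascii"), address)


def receive_heartbeat(sock, timeout_s):
    """Return (server_name, status), or None if nothing arrives in time."""
    sock.settimeout(timeout_s)
    try:
        data, _sender = sock.recvfrom(1024)
    except socket.timeout:
        return None
    name, _sep, status = data.decode("ascii").partition(":")
    return name, status


if __name__ == "__main__":
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # let the OS choose a free port
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_heartbeat(sender, receiver.getsockname(), "alpha")
    print(receive_heartbeat(receiver, 1.0))
    sender.close()
    receiver.close()
```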
Client workstations are network nodes and/or terminals that access applications and/or services provided by the
active servers. They can be Intel-based PCs using NFS and RDBMS application services, X-Terminals, other UNIX servers utilizing network time-sync services, DNS, NIS or NFS mounted file systems,
network print devices, network modem pools, etc.
The mode of communication between client and server processes is TCP/IP.
The user may define the physical delivery medium that best fits the enterprise requirements. These links include Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), FDDI or Asynchronous SLIP and X.25
connections. HATS HA requires two network links between the active and backup servers: a private network and a public network.
The private network is used for a dedicated communication link between the active and backup servers for exchanging heartbeat messages that inform the machines of each other's HA Agent and Manager status. This
private network can be an asynchronous communication link (serial port) if the heartbeat link is point-to-point, or a network link as described above.
The public network is used for
clients to access the applications and/or services provided by the active server. This is the "normal" path used for the general access point to these applications and/or services.
Storage Device Configuration:
The storage devices used most often in the computer industry today are SCSI and SCSI-2 hard
disks. They provide good price/performance and price/storage capacity ratios. There are several configurations that can be implemented with HATS HA. They are internal disk drives, external disk
drives, mirrored disk drives, and Redundant Array of Inexpensive Disks (RAID) subsystems.
Internal disk drives are used for storing the operating system, temporary spool areas, applications and data that is not required to be accessible when a service failover
External (unshared) disk drives fall into the same category/condition as stated for internal disk drives.
Mirrored disk drives allow for special redundancy and special processing on the active server. These devices provide the "first line of defense" in the event of a
single failed disk device. When the HA Agent detects a failure of the primary disk device, it can respond by accessing the respective mirror disk device before initiating a full system
RAID devices are usually external devices that can be shared between two or more servers. When connected as multi-hosted devices, they can be simultaneously connected to both the
active and backup servers. This provides the backup server a direct physical data path to access the disk partition(s) or device(s) from which to launch the critical applications and/or
access the critical data after a failover event has occurred. A RAID subsystem can have several different configurations, the most popular being RAID-1, RAID-3, and RAID-5.
RAID-1 provides hardware disk mirroring. The definition of this architecture allows for the RAID hardware to detect a failed disk device and process the failure automatically
without notification to the Operating System. However, the HA Agent can be configured to query the RAID controller, detect the failure and initiate a user defined process accordingly.
RAID-3 or RAID-5 configurations allow for the failure of any one disk device without causing the failure of the entire disk subsystem. Due to the architecture of these
RAID levels, the data is divided and written to multiple disk units with parity (checksum) information associated with each write. When any one disk fails, the missing data can then be
reconstructed from the parity information. As in RAID-1, the RAID controller manages this automatically. Again, the HA Agent can be configured to query the RAID controller, detect the
failure and initiate a user defined process accordingly.
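The reconstruction step can be illustrated with XOR parity, the redundancy scheme normally used by RAID-3 and RAID-5: the parity block is the XOR of the data blocks, so any one missing block equals the XOR of all the surviving blocks. The sketch below works on in-memory byte strings of equal length; the block contents and the function name are made up for the example, and a real controller does this in hardware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" and one parity "disk".
data = [b"ACCT", b"0042", b"OPEN"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild it from the survivors.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```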
Hot Standby - Two Active Servers:
This configuration designates two servers as mission critical systems and the third server as a hot standby
server for the two mission critical servers. Having a hot standby server to take over if one of the mission critical servers fails prevents the costly down time associated with
failures on a mission critical server. Additionally, you can use the hot standby server for other non-mission critical applications.
To deploy this configuration, all servers must be
connected to the same public network and a multi-hosted external disk subsystem. The backup server can be configured to achieve specific levels of performance. For example, by planning for the
possibility of both active servers failing simultaneously, the backup server can be configured to resume all services from one of the active servers at a time. In the event the second server fails, the
backup server can be configured to run in a degraded state, or failover only the most important, mission critical services. Conversely, the backup server can be fitted with the physical capacity to
resume all services from both active servers at any time and operate within expected parameters.
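The degraded-state option described above amounts to choosing which services to resume when the backup server cannot carry everything. One possible policy, sketched here as an assumption, is to admit services in priority order until the backup's capacity is used up. The function name, the priority numbers, and the load units are hypothetical.

```python
def select_services(services, capacity):
    """Choose which services the backup server should resume.

    services: list of (name, priority, load); a lower priority number
              means more mission critical.
    capacity: load the backup server can carry, in the same units as load.
    """
    chosen, used = [], 0
    for name, _priority, load in sorted(services, key=lambda s: s[1]):
        if used + load <= capacity:
            chosen.append(name)
            used += load
    return chosen


services = [("reports", 3, 40), ("rdbms", 1, 50), ("nfs", 2, 30)]
print(select_services(services, 100))  # degraded: lowest priority is left out
print(select_services(services, 120))  # full capacity: everything resumes
```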
This configuration must meet the following minimal requirements:
Two designated active servers
One designated backup server
One private network interface in each server
Two (minimum) SCSI / SCSI-2 interfaces in each system
One (minimum) external SCSI / SCSI-2 multi-hosted disk subsystem
One public network interface in each server
Warm Standby - Two Active Servers:
In this configuration, two mission critical systems are configured as a mutual backup to one another.
This strategy achieves 100% utilization of the hardware investment since the servers are both defined as mission critical. In the event of a failed service, the other server will resume the failed
service(s). If one of the servers were to fail entirely, the backup server may operate in a degraded state, depending upon how the server is physically configured for memory, CPU, etc.
To deploy this configuration, both servers must be connected to the same public network and an external shared disk subsystem. This configuration must meet the following minimal requirements:
Two designated active servers.
Two public network interfaces in each server.
Two (minimum) SCSI / SCSI-2 interfaces in each system.
One (minimum) external SCSI / SCSI-2 dual-hosted disk subsystem.
- SPARC-compliant computer system or PCs running Solaris x86.
- 1 MB of RAM for H.A. Technical Solutions HA software.
- 16 MB recommended to run applications.
- 3.3 MB of internal disk space for H.A. Technical Solutions HA software.
- UNIX services such as NFS, NIS, DNS, etc.
- Database (RDBMS) services such as Sybase, Ingres, etc.
- User defined client/server applications in fields such as banking, crucial network services, government services, etc.