High Availability Whitepaper

Product Overview of HATS  High Availability


    Information is an important asset for an enterprise in  the 1990s. Today's computer systems must provide reliable and timely  information and services to assist personnel in making informed  decisions crucial to the daily operations of the modern  enterprise.

    Despite the rapid evolution in all aspects of computer  technology, both the computer hardware and software are prone to  numerous failure conditions. Employing methods to minimize exposure to  as many failure conditions as possible will significantly increase the  use of a company's resources and become a direct indicator of successful  business operations.

    Enterprises today are demanding their new computer  systems be "right-sized" for quicker deployment, cost reductions in  ownership, decrease in support and maintenance expenses, and increase in  support for both homogeneous and heterogeneous distributed client/server  environments. The concepts of "data warehouses" and "replication  services" provide the mechanisms to ensure the correct information is  always available to the appropriate personnel so they can make the  important business decisions in time-critical situations.

    This new form of information service must be constantly monitored and tuned to provide reliable and accurate information delivery. Hardware failure of these "central repositories" could prove harmful in today's competitive environment.



    Availability of critical information services is affected by both scheduled and unscheduled system downtime. Although scheduled downtime for system maintenance and upgrades is inevitable, it is fatal to information services that are considered non-interruptible. Unscheduled downtime, by contrast, is unpredictable and should be avoided. Human errors, operating system failures, computer hardware failures, and network failures cause most unscheduled downtime.

    In order to describe the methods for preventing these  failures, we must first understand the definitions of the various levels  of availability. They are normal availability, high availability, and  fault tolerance.

      Normal Availability Systems  (NAS):
      Normal availability systems are defined  as general-purpose computer hardware and software systems that have no  hardware redundancy or software enhancement to provide  fault-processing recovery. They require manual, human intervention to  identify and correct/repair the failed component(s) and restart the  system before resuming normal operations.

      High Availability  System:
      High availability systems are defined as loosely coupled NAS with redundant hardware components managed by software that provides fault detection and correction procedures to maximize the availability of the critical services and applications provided by that system. These systems require no manual, human intervention to identify a failed component, execute a procedure to avert a system failure, and give notice of the averted failure. This configuration minimizes the possibility of immediate data loss and service interruption.

      High Availability  Models:
      There are two distinct High  Availability Models for client-server architectures. They are the  Replicated Services Model and the Failover Model.

      Replicated Services  Model:
      This model utilizes distributed applications  and distributed databases on multiple servers in the LAN/WAN  environment where the data is replicated to some or all of the  servers. When a server failure occurs, the data and applications are  accessible from an alternate server.

      Failover  Model:
      This model utilizes duplicate server hardware  configurations in which one server has the role of an active server  for data and application services, and the other is a backup server  that monitors the state of the active server. When the backup server  detects a hardware or software failure that has occurred on the active  server, it takes over the role and identity of the active server.

      Fault  Tolerance:
      Fault-tolerant systems are proprietary, expensive, tightly coupled duplicated systems. Fault-handling capabilities are integrated into, and become a function of, the operating system. These systems respond to system failures spontaneously and fully automatically, and provide uninterrupted services.



    It is H.A. Technical Solutions' goal to deliver a product that maintains open systems compatibility with the platform architectures we support and their operating systems. Our product provides reliable accessibility to applications and data through automatic failover processing, GUI administration, and alerting facilities. It is flexible enough to be adapted to differences in individual implementation requirements and to future scalability needs.

      Open Systems  Compatibility:
      H.A. Technical Solutions HA (HATS HA)  retains the performance, cost effectiveness, and technology advantages  of Open Systems Architecture. It is compatible with existing services  native to Solaris (e.g., NFS, Telnet, ftp, etc.) and applications such  as Sybase and other common hardware and software products that are  available through Sun Microsystems and other third party vendors.

      Reliability and  Accessibility:
      When a failure event occurs, it typically takes 5 seconds for fault detection and 10 to 120 seconds for failover to initiate in an RDBMS environment. The failover process for Internet applications can take much less time (often under 1 second) by using additional software that journals the entries on the storage system. The failover processing occurs automatically without rebooting the backup server. This provides the highest data and service availability in a distributed SPARC-architecture, client/server environment. There is no single point of failure to prevent accessibility to the data and application services provided by the system.

      HATS HA can manage redundant servers, network communication links, network adapters, shared disk subsystems, and SCSI disk adapters to achieve high availability. A standard configuration consists of two SPARC servers or PCs, each with two SCSI interfaces, two Ethernet interfaces, and an internal disk. The servers are connected to an external disk subsystem that could be a single disk or a RAID disk array. One of the networks is a "private" connection shared between the two servers, and the other is the "public" network providing connection to the client workstations for services, data, and applications. (See Figure 1.)

      An Active Server is defined as the computer system  that provides critical services, data, and/or applications to the  client workstations.

      A Backup Server is defined as a computer system that  is configured for resuming the functionality of the Active Server. A  Backup Server can be dedicated or non-dedicated. It can also be an  Active Server at the same time.

      As a dedicated Backup Server, its function is simply to wait for a failover event and take over the role of the Active Server. When configured as a non-dedicated Backup Server, it can provide services, data, and/or applications to clients while also waiting for a failover event to occur. Multiple non-dedicated Backup Servers can be identified and configured to divide up the workload of an Active Server whenever a failover event occurs. These multiple non-dedicated Backup Servers can then take over the workload of a failed Active Server in a pre-defined scheme that is configured by the System Administrator. Additional redundancy can thus be configured into the routine. For example, if one of the Backup Servers fails to react properly to a failover event, another Backup Server can detect this failure and take over the workload for the "failed" Backup Server. This definition allows the Backup Server to also have the role of an Active Server itself.
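
      The pre-defined takeover scheme described above could be modeled roughly as follows. This is a minimal sketch, not the product's mechanism: the server names, service names, and the `assign_services` helper are all hypothetical, and the ordering rule (first healthy Backup Server wins) is an assumption.

```python
# Hypothetical pre-defined takeover scheme: each service of a failed Active
# Server lists its candidate Backup Servers in order of preference.
TAKEOVER_SCHEME = {
    "nfs": ["backup1", "backup2"],
    "rdbms": ["backup2", "backup1"],
    "dns": ["backup1", "backup2"],
}


def assign_services(scheme, healthy_servers):
    """Map each service to the first healthy Backup Server in its list.

    Services with no healthy Backup Server are mapped to None.
    """
    assignment = {}
    for service, candidates in scheme.items():
        assignment[service] = next(
            (server for server in candidates if server in healthy_servers),
            None,
        )
    return assignment


# Both Backup Servers respond: the workload is divided between them.
both_up = assign_services(TAKEOVER_SCHEME, {"backup1", "backup2"})

# backup1 fails to react: backup2 picks up its share as well.
one_down = assign_services(TAKEOVER_SCHEME, {"backup2"})
```

      Listing more than one candidate per service is what provides the additional redundancy: a Backup Server that does not react simply drops out of the healthy set.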

      HATS  HA's Initialization:

    After the system bootstrap process is completed:

    • HATS HA Manager is the first initiated daemon process  on each server.
    • HA Manager is the HATS HA Kernel.
    • The HA Manager initializes the necessary processes  and configures the server for failover processing as defined in the  HATS HA configuration.

    Failover Detection  Process:
    HATS HA Agents are daemon processes that  monitor and manage the defined critical services that are provided by  the Active Server. These agents provide status signals for these  critical services to the HA Manager in the form of "electronic"  heartbeats. While the HA Manager on the active server is receiving an  "alive" or "healthy" heartbeat signal from all of its agent processes,  it sends a heartbeat to the HA Manager on the backup server. This HA  Manager to HA Manager heartbeat function informs the backup server that  the active server is currently in good "health" and operating properly.  When this active server to backup server heartbeat is absent, the backup  server assumes that the active server has failed and initiates the  defined failover processes.

    If a critical service on the active server fails, the  agent will send a "fail" heartbeat to the HA Manager on the active  server.

    If an agent on the active server itself fails, the HA  Manager on that server detects the absence of the agent's heartbeats  and, after a configurable time-out, performs designated tasks to restart  the applications.

    Any critical service on the active server can be  monitored by more than one agent process. Each agent is designed to  monitor a specific or unique aspect of that service. The service is  considered to be available and "healthy" as long as the HA Manager is  receiving at least one heartbeat function from the individual agents  that are monitoring that service. Thus, if three agents are monitoring  one service and one or two of them detect a failure, the HA Manager will  not initiate failover processing for that service until the third  heartbeat function also signals a failure.
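
    The rule in the preceding paragraphs (a service stays "healthy" while at least one of its agents still reports an "alive" heartbeat, and a silent agent counts as absent after a time-out) can be sketched as follows. The `ServiceMonitor` class and its method names are hypothetical, and timestamps are passed in explicitly so the logic can be followed without a real clock.

```python
class ServiceMonitor:
    """Tracks the latest heartbeat from each agent watching one service."""

    def __init__(self, agent_names, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        # None means no heartbeat has been received from that agent yet.
        self.last_report = {name: None for name in agent_names}

    def record_heartbeat(self, agent, status, now):
        """Store an agent's heartbeat; status is "alive" or "fail"."""
        self.last_report[agent] = (status, now)

    def agent_is_healthy(self, agent, now):
        report = self.last_report[agent]
        if report is None:
            return False
        status, received_at = report
        # A heartbeat older than the time-out counts as absent.
        return status == "alive" and now - received_at <= self.timeout_seconds

    def service_is_healthy(self, now):
        """Healthy while at least one agent still reports "alive"."""
        return any(self.agent_is_healthy(a, now) for a in self.last_report)


monitor = ServiceMonitor(["agent_a", "agent_b", "agent_c"], timeout_seconds=5.0)
for name in ("agent_a", "agent_b", "agent_c"):
    monitor.record_heartbeat(name, "alive", now=0.0)

# Two of the three agents detect a failure: no failover yet.
monitor.record_heartbeat("agent_a", "fail", now=2.0)
monitor.record_heartbeat("agent_b", "fail", now=2.0)
still_healthy = monitor.service_is_healthy(now=3.0)

# The third agent also signals a failure: the service is now considered failed.
monitor.record_heartbeat("agent_c", "fail", now=4.0)
healthy_after_third = monitor.service_is_healthy(now=4.5)
```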

    The agents monitor services that include communication,  file, disk, network, NFS, NIS, DNS, and RDBMS. End users may also define  and develop agents that are customized for special application services  that may be deployed.

    HATS HA uses standard RPC calls for both the HA Manager-to-agent and the HA Manager-to-HA Manager heartbeat functions. This implementation scheme benefits the end user, since these standard mechanisms allow upgrades to new communication media without affecting the integrity of the HATS HA design.

    Failure  Processing:
    HATS HA issues immediate and automatic  actions against specific faults. Following are some possible  user-configured responses to selected failed services:

    • The failure of the service is ignored
    • A hardware device has failed and an alternate is  configured and available
      • Failover to an alternate device is  initiated
    • A software service such as NFS or DNS has  failed
      • Failover is initiated immediately to resume the  failed service
    • The system providing the service is shut down
    • A software application has failed
      • A number of attempts to restart the service on the  active server are triggered
      • If the service fails to restart, the user may  choose to:
        • IGNORE the failure of the service
        • HALT processing of the service
        • FAILOVER the service to the designated backup server
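
    The software-application case in the list above (a bounded number of restart attempts, followed by a user-chosen IGNORE, HALT, or FAILOVER) might look like the sketch below. The function names and the returned tuple are hypothetical; only the three action names come from the list.

```python
def handle_application_failure(restart, max_attempts, final_action):
    """Try to restart a failed application, then apply the configured action.

    restart: callable returning True when the restart succeeded.
    final_action: "IGNORE", "HALT" or "FAILOVER", used if every attempt fails.
    Returns an (outcome, attempts_made) tuple.
    """
    if final_action not in ("IGNORE", "HALT", "FAILOVER"):
        raise ValueError(f"unknown action: {final_action}")
    for attempt in range(1, max_attempts + 1):
        if restart():
            return ("RESTARTED", attempt)
    return (final_action, max_attempts)


def flaky_restart(succeed_on):
    """Build a fake restart routine that succeeds on the given attempt."""
    calls = {"count": 0}

    def restart():
        calls["count"] += 1
        return calls["count"] >= succeed_on

    return restart


# The second restart attempt works, so no failover is needed.
recovered = handle_application_failure(flaky_restart(2), 3, "FAILOVER")

# Every attempt fails, so the configured action is applied.
gave_up = handle_application_failure(flaky_restart(99), 3, "FAILOVER")
```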

    Failover  Processing:
    When a failover event occurs, there will be a brief interruption in services while the failure is recognized and the failover process initializes the services on the backup server. This process will occur automatically without human interaction. Once the failover processing is completed, the services provided by the active server will be operating on the backup server.

    Either the active server or the backup server can initiate a failover process, depending on the kind of heartbeat loss and the user-defined procedure to follow once that heartbeat function has failed.

    A failover process involves transferring to the backup server the network identity of the active server (its TCP/IP address and MAC address), the shared disk subsystem(s) (which can be a mirrored disk set or disk array), and the designated services provided by the active server.

    The failover of stateless applications such as NFS, NIS, and DNS is transparent to the end user. Stateful applications such as FTP, Remote Login, and Telnet must re-establish the connection to the server application after the failover event. Other processes, such as client/server database applications, can be programmed to track the status of the application/service provided by the active server, so that they can resume or reconnect to the service once it is provided by the backup server. This programming technique allows these stateful applications to appear as stateless applications to the end user.
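
    One way a client application could hide a failover from its user is to retry a request on a fresh connection when the old one is lost. The sketch below is an assumption about how such client code might be written, not part of HATS HA; `call_with_reconnect`, `connect`, and `request` are hypothetical names. Because the backup server takes over the active server's network identity, the client reconnects to the same address.

```python
import time


def call_with_reconnect(connect, request, retries=5, delay_seconds=0.0):
    """Run request(connection); on ConnectionError, reconnect and retry.

    connect: callable returning a new connection object.
    request: callable taking a connection and returning the reply.
    Raises the last ConnectionError if all retries are used up.
    """
    connection = connect()
    for attempt in range(retries + 1):
        try:
            return request(connection)
        except ConnectionError:
            if attempt == retries:
                raise
            # Give the backup server time to take over, then reconnect.
            time.sleep(delay_seconds)
            connection = connect()


# Fake server: the first two connections are dropped, as if a failover
# were in progress; the third one reaches the backup server.
state = {"connects": 0}


def fake_connect():
    state["connects"] += 1
    return state["connects"]


def fake_request(connection):
    if connection < 3:
        raise ConnectionError("server went away")
    return f"ok via connection {connection}"


reply = call_with_reconnect(fake_connect, fake_request)
```

    In real client code the delay would be non-zero and the request would need to be safe to repeat.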

    Unfortunately, terminals that are directly connected to serial ports on the active server will be rendered unusable, because serial port interfaces cannot be "failed over" to another server. Sessions from network terminal servers and client terminal processes such as X-Windows, Telnet, and Remote Login will also be terminated, but those users can then reestablish connections to the backup server and continue to access the applications/services that have been resumed on that server.

    When the failed active server recovers or has been  repaired, HA Manager may be reconfigured to allow it to now play the  role of the backup server, if desired. Otherwise, it can be configured  to reclaim its original role as the active server and restore its  network identity, resources, applications and services from the backup  server and return to active server status.



    Server  Configuration:
    User configurable parameters are included  in the HATS HA product. They are configured to nominal default settings,  but can be customized to the enterprise requirements as needed. The  following is a partial list of some popular configurable parameters:

    • The server(s) and the defined role(s) in the HA configuration
    • Network configuration information for the private network and public network
    • The backup server(s) designated for each specific active server
    • Script names to execute to initiate specific services
    • Critical services and their corresponding HA Agents
    • HA Agent process names and their heartbeat failure time-out limits and failure processing/actions
    • HA Manager failure time-out limits and failure processing/actions
    • The maximum number of restart attempts for a failed service before failover processing begins
    • Prerequisite services that must be operating before failover processing is initiated
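
    As an illustration only, the parameters listed above could be grouped as in the sketch below. The actual HATS HA configuration file format is not shown in this paper; every key, server name, script name, and value here is hypothetical, as is the `prerequisites_met` helper.

```python
# Hypothetical configuration; all key names and values are illustrative only.
HA_CONFIG = {
    "servers": {"alpha": "active", "beta": "backup"},
    "backup_for": {"alpha": "beta"},
    "networks": {"private": "hb0", "public": "le0"},
    "manager": {"timeout_seconds": 15, "on_failure": "FAILOVER"},
    "services": {
        "nfs": {
            "start_script": "start_nfs.sh",
            "agents": {"nfs_agent": {"timeout_seconds": 10}},
            "max_restart_attempts": 3,
            "prerequisites": ["disk", "network"],
        },
    },
}


def prerequisites_met(config, service, running_services):
    """True when every prerequisite of `service` is currently running."""
    required = config["services"][service]["prerequisites"]
    return all(name in running_services for name in required)


ready = prerequisites_met(HA_CONFIG, "nfs", {"disk", "network"})
not_ready = prerequisites_met(HA_CONFIG, "nfs", {"disk"})
```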

    Shell  Scripts:
    User defined shell scripts can add  functionality, reliable logging of events and notification processing in  response to processing of a failed service. User defined shell scripts  can perform many tasks, some of which are listed below.

    • Start and stop various services
    • Define follow-up procedures for the failed service(s)
    • Send messages to the system console
    • Write log file information for troubleshooting purposes
    • Write a message to the system logger
    • Notify support personnel via pager
    • Notify help desk personnel via e-mail or other system management software
    • Broadcast messages to all users

    User-Defined  Agents:
    HATS HA provides both the API and HA agent  templates for user-defined agents specific for the user's desired  requirements. In order to program these templates, the user needs to  have a working knowledge of the C programming language and the  application or service that the new agent will monitor. Only the  component that will interact with the service needs to be programmed.  The API and HA agent templates will provide the other functions.
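
    The division of labor described above, in which the template supplies the common functions and the user writes only the service check, can be illustrated as follows. The real templates and API are in C and are not reproduced here; this Python sketch uses hypothetical class and method names purely to show the shape of the idea.

```python
import os


class AgentTemplate:
    """Stand-in for the template: turns a check result into a heartbeat."""

    service_name = "unnamed"

    def check_service(self):
        """The only method a user-defined agent must supply."""
        raise NotImplementedError

    def heartbeat(self):
        # Any error while checking the service is reported as a failure.
        try:
            healthy = self.check_service()
        except Exception:
            healthy = False
        return {
            "service": self.service_name,
            "status": "alive" if healthy else "fail",
        }


class FileExistsAgent(AgentTemplate):
    """User-defined agent: the service is healthy while a path exists."""

    service_name = "lockfile"

    def __init__(self, path):
        self.path = path

    def check_service(self):
        return os.path.exists(self.path)


present = FileExistsAgent(os.getcwd()).heartbeat()
missing = FileExistsAgent(
    os.path.join(os.getcwd(), "no-such-file-hats-ha")
).heartbeat()
```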

    System  Administration:
    System administration utilities include  support for checking configurations, installing HATS HA onto a new  server and configuration and management of the HATS HA environment. All  these functions can be performed from a single node on the network or  from a system console.

    HATS HA provides a graphical user interface (GUI) for  easy system administration and HA Agent/Manager monitoring from any  character based or X-Windows terminal. The system administrator can  issue commands such as:

    • Start and stop the HA Manager
    • Start and stop specific HA Agents
    • Start and stop specific services
    • Force a failover process to occur from either the active or backup server
    • Monitor and query the status of servers, networks, services and agents
    • Verify HA server configuration(s)
    • Install H.A. Technical Solutions HA software on another server
    • Configure and manage the enterprise H.A. Technical Solutions HA environment

    HA  Initiation:
    The first initiated process of HATS HA is a daemon process called HA manager. Each server that starts HATS HA is configured by HA manager according to the settings defined in the HA configuration file. The HA configuration file is the same on every server. All the services and their corresponding agents (the processes that monitor and manage the services) are specified in the HA configuration file. A server will only initiate the services and the corresponding agents that the server is designated to run.

    Failure  Detection:
    Agents are processes that monitor and manage critical hardware and software services. Agents inform the HA manager of the current status of the services with agent heartbeats. If a service should fail, the agent will stop sending heartbeats to the HA manager and a predefined failover procedure will be carried out. Each server handles only those services and corresponding agents that are initiated by its HA manager.

    A service can be monitored by more than one agent. Each  agent monitors different aspects of the service. The service is  considered available as long as all of its agents send a heartbeat to  the HA manager.

    A service can also be agentless. There will be no agent  to watch over the availability of the service. The service is considered  available as long as the service is properly started; a failover will  only occur when the active server itself has failed.

    In addition to agents for communication services, file  services, disk services, NFS, RDBMS, etc., users may create agents for  their own developed applications.

    When all the services managed by the HA manager are  considered available, a server heartbeat will be broadcast to the  related backup server(s). The loss of a server heartbeat from an active  server will cause a failover of all of its services to the corresponding  backup server(s).

    HATS HA utilizes the UDP protocol for exchanging the  heartbeat messages between HA managers and HA agents.
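
    A heartbeat exchanged over UDP could be encoded and sent roughly as shown below. The wire format (sender name, status byte, timestamp) is an assumption made for this sketch and is not the HATS HA message format; the function names are likewise hypothetical. The socket calls are standard Python library calls.

```python
import socket
import struct
import time

# Hypothetical wire format, network byte order:
# 16-byte sender name, 1-byte status (1 = healthy), 8-byte timestamp.
HEARTBEAT_FORMAT = "!16sBd"


def encode_heartbeat(sender, healthy, timestamp):
    name = sender.encode("ascii")[:16]
    return struct.pack(HEARTBEAT_FORMAT, name, 1 if healthy else 0, timestamp)


def decode_heartbeat(payload):
    name, status, timestamp = struct.unpack(HEARTBEAT_FORMAT, payload)
    # struct pads short names with NUL bytes; strip them again.
    return name.rstrip(b"\0").decode("ascii"), status == 1, timestamp


def loopback_demo():
    """Send one heartbeat to ourselves over UDP on the loopback interface."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)  # a missing heartbeat raises socket.timeout
        sender.sendto(
            encode_heartbeat("active1", True, time.time()),
            receiver.getsockname(),
        )
        payload, _address = receiver.recvfrom(1024)
        return decode_heartbeat(payload)
    finally:
        sender.close()
        receiver.close()


message = encode_heartbeat("active1", True, 12.5)
decoded = decode_heartbeat(message)
```

    UDP delivers no acknowledgement, so the receiving side has to treat silence beyond a time-out as the failure signal, which matches the heartbeat-loss behavior described above.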

    Client workstations are network nodes and/or terminals that access applications and/or services provided by the active servers. They can be Intel-based PCs using NFS and RDBMS application services, X-Terminals, other UNIX servers utilizing network time-sync services, DNS, NIS or NFS mounted file systems, network print devices, network modem pools, etc.

    Server  Interconnection:
    The mode of communication between client and server processes is TCP/IP. The user may define the physical delivery medium that best fits the enterprise requirements. These links include Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), FDDI or Asynchronous SLIP and X.25 connections. HATS HA requires two network links between the active and backup servers: a private network and a public network.

    The private network is used for a dedicated communication link between the active and backup servers for exchanging heartbeat messages that inform the machines of each other's HA Agent and Manager status. This private network can be an asynchronous communication link (serial port) if the heartbeat link is point-to-point, or a network link as described above.

    The public network is used for clients to access the  applications and/or services provided by the active server. This is the  "normal" path used for the general access point to these applications  and/or services.

    Storage Device  Configuration:
    The storage devices used most often in the computer industry today are SCSI and SCSI-2 hard disks. They provide good price/performance and price/storage capacity ratios. There are several configurations that can be implemented with HATS HA. They are internal disk drives, external disk drives, mirrored disk drives, and Redundant Array of Inexpensive Disks (RAID) subsystems.

    • Internal disk drives are used for storing the  operating system, temporary spool areas, applications and data that is  not required to be accessible when a service failover occurs.
    • External (unshared) disk drives fall into the same  category/condition as stated for internal disk drives.
    • Mirrored disk drives allow for special redundancy and  special processing on the active server. These devices provide the  "first line of defense" in the event of a single failed disk device.  When the HA Agent detects a failure of the primary disk device, it can  respond by accessing the respective mirror disk device before  initiating a full system failover event.
    • RAID devices are usually external devices that can be shared between two or more servers. When connected as multi-hosted devices, they can be simultaneously connected to both the active and backup servers. This provides the backup server a direct physical data path to access the disk partition(s) or device(s) from which to launch the critical applications and/or access the critical data after a failover event has occurred. A RAID subsystem can have several different configurations, the most popular being RAID-1, RAID-3, and RAID-5.
      • RAID-1 provides hardware disk mirroring. This architecture allows the RAID hardware to detect a failed disk device and process the failure automatically without notification to the Operating System. However, the HA Agent can be configured to query the RAID controller, detect the failure and initiate a user defined process accordingly.
      • RAID-3 or RAID-5 configurations allow for the failure of any one disk device without causing the failure of the entire disk subsystem. At these RAID levels, the data is divided and written to multiple disk units, with parity information computed for each write. When any one disk fails, the missing data can then be reconstructed from the remaining data and the parity information. As in RAID-1, the RAID controller manages this automatically. Again, the HA Agent can be configured to query the RAID controller, detect the failure and initiate a user defined process accordingly.
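
    The reconstruction of a failed disk's data from the surviving disks and the redundancy information can be illustrated with XOR parity, the scheme used by RAID-3 and RAID-5. This is a toy sketch on three short byte strings standing in for disk blocks; the `xor_blocks` helper is hypothetical and real controllers do this in hardware.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per disk, plus the parity block written with them.
data = [b"HATS", b"HA  ", b"disk"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuild its block from the
# two surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

    XOR-ing the parity with the surviving blocks cancels them out and leaves exactly the missing block, which is why these levels tolerate the loss of any one disk but not two.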



    Hot Standby - One Active  Server
    The Hot Standby configuration defines one server as a mission critical system and the backup server as the active server's "immediate replacement." In other words, the backup server's only function is to monitor the heartbeat functions of the active server and wait for a failure event to process. Both servers are connected to the private network, the public network, and a shared external disk subsystem. This configuration can achieve consistent response time after failover processing, but the resources of the backup server are underutilized. This configuration must meet the following minimal requirements:

    • One designated active server.
    • One designated backup server (MUST have the identical  internal configuration of the active server for memory, etc.).
    • One private network interface in each server.
    • One public network interface in each server.
    • Two (minimum) SCSI / SCSI-2 interfaces in each  system.
    • One (minimum) external SCSI / SCSI-2 dual-hosted disk  subsystem.

    Hot Standby - Two Active  Servers
    This configuration designates two servers as mission critical systems and the third server as a hot standby server for the two mission critical servers. Having a hot standby server take over if one of the mission critical servers fails prevents the costly downtime associated with failures on a mission critical server. Additionally, you can use the hot standby server for other non-mission-critical applications.

    To deploy this configuration, all servers must be connected to the same public network and a multi-hosted external disk subsystem. The backup server can be configured to achieve specific levels of performance. For example, when planning for the possibility of both active servers failing simultaneously, the backup server can be configured to resume all services from one of the active servers at a time. In the event the second server fails, the backup server can be configured to run in a degraded state, or to fail over only the most important, mission critical services. Alternatively, the backup server can be fitted with the physical capacity to resume all services from both active servers at any time and operate within expected parameters.

    This configuration must meet the following minimal  requirements:

    • Two designated active servers
    • One designated backup server
    • One private network interface in each server
    • Two (minimum) SCSI / SCSI-2 interfaces in each  system
    • One (minimum) external SCSI / SCSI-2 multi-hosted  disk subsystem
    • One public network interface in each server.

    Warm Standby - Two Active  Servers:
    In this configuration, two mission critical systems are configured as a mutual backup to one another. This strategy achieves 100% utilization of the hardware investment, since both servers are defined as mission critical. In the event of a failed service, the other server will resume the failed service(s). If one of the servers were to fail entirely, the remaining server may operate in a degraded state, depending upon how that server is physically configured for memory, CPU, etc.

    To deploy this configuration, both servers must be  connected to the same public network and an external shared disk  subsystem. This configuration must meet the following minimal  requirements:

    • Two designated active servers.
    • Two public network interfaces in each server.
    • Two (minimum) SCSI / SCSI-2 interfaces in each  system.
    • One (minimum) external SCSI / SCSI-2 dual-hosted disk  subsystem.




    • SPARC-compliant computer systems or PCs running Solaris x86.
    • SGI
    • HP


    • 1 MB of RAM for H.A. Technical Solutions HA software.
    • 16 MB recommended to run applications.

    Disk Capacity

    • 3.3 MB of internal disk space for H.A. Technical Solutions HA  software.

    Supported  Applications

    • UNIX services such as NFS, NIS, DNS, etc.
    • Database (RDBMS) services such as Sybase, Ingres, etc.
    • User defined client/server applications in fields such as banking, crucial network services, government services, etc.


