

Technology & Integration Practice Newsletter
Volume 3 Issue 1, Page 3

Epic Systems Disaster Recovery & High Availability
By Hadeer Aburumuh

Disaster Recovery (DR) and failover for High Availability (HA) are key requirements for Epic system operations. DR plays a key role in achieving the optimal installation status required by Epic. Both are designed for system availability and reliability, and they give Cache system managers and leadership peace of mind. This article describes, at a high level, how the DR and failover mechanisms work.

Disaster Recovery - In this realm, Epic uses a DR Shadow server, which is based solely on Epic’s own system architecture. A process copies the Cache database journal files (which record changes and updates) generated on the production server to a remote DR Shadow server. Once the journal files are applied to the DR Shadow server’s Cache database, the changes made in production are available there as well. Other RDBMS vendors refer to this process of applying or reapplying journal files as “replication.”

During a scheduled or unscheduled downtime, the DR Shadow server can be promoted to take control of production operations. Technically, the process uses a network socket to transfer the journal files from the production database, and an internal Cache process replays (applies) them on the DR Shadow database to bring it up to date with production. In shadowing, the production and DR Shadow databases are kept in sync within a fraction of a millisecond of each other.

Periodically, journaling must be stopped, the production database copied over to the DR Shadow server, and journaling started again. This process is known as a database refresh, and Epic support recommends performing one every three months. The refresh serves two purposes: it ensures the integrity of the DR Shadow data, and it guards against data loss from the cumulative effect of database updates applied over a long period.

When a DR situation is declared, users of the Hyperspace and text-mode applications access them from their client PCs through a preconfigured Epic Shadow icon on the desktop, logging in with the same ID and password they use in the Production environment.
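To make the journal shipping idea concrete, here is a minimal sketch in Python. It is a toy model only: `ToyDatabase`, `ToyShadow`, and `JournalRecord` are hypothetical names invented for illustration, and nothing here is an Epic or Cache API. It assumes each journal record is a single key/value update with a sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class JournalRecord:
    """One hypothetical journal entry: a single key/value update."""
    sequence: int
    key: str
    value: str


@dataclass
class ToyDatabase:
    """Stand-in for the production database (not a Cache API)."""
    data: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)

    def write(self, key, value):
        # Every production update is also recorded in the journal.
        record = JournalRecord(len(self.journal) + 1, key, value)
        self.journal.append(record)
        self.data[key] = value
        return record


@dataclass
class ToyShadow:
    """Stand-in for the DR Shadow: replays journal records in order."""
    data: dict = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, records):
        for record in sorted(records, key=lambda r: r.sequence):
            if record.sequence <= self.last_applied:
                continue  # already replayed, so reapplying is harmless
            self.data[record.key] = record.value
            self.last_applied = record.sequence

    def refresh_from(self, production):
        # Periodic "database refresh": replace the shadow with a full copy.
        self.data = dict(production.data)
        self.last_applied = len(production.journal)


production = ToyDatabase()
shadow = ToyShadow()
production.write("patient:1", "admitted")
production.write("patient:1", "discharged")
shadow.apply(production.journal)  # ship and replay the journal
in_sync = shadow.data == production.data
```

After the replay, `in_sync` is true, and applying the same journal a second time changes nothing because already-applied sequence numbers are skipped.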

There are two options available in DR Shadow implementations: Read Only mode (SRO) or Read Write mode. The catch is that if Read Write mode is chosen, updates made to the Shadow server database during a disaster situation are not written to the production database by default. In other words, an additional manual step is required to transfer the Shadow server database updates to the Production server after the disaster situation has ended and before switching back to normal production operation.

This step can be accomplished by reversing the direction of the journal file copy, so that it runs from DR Shadow to Production. However, if DR Shadow SRO mode is chosen, no additional work is required on the production side. In SRO, users are limited to looking up information, and the switch back to normal production is transparent because no updates occurred. The objective of DR Shadow is to ensure that information is safe and available from a remote location. Epic’s DR Shadow implementation is a sound solution that gives customers a way to access information at a moment’s notice when system failures occur.
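The difference between the two modes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Epic's implementation: `ShadowMode`, `ToyShadowServer`, and `switch_back` are hypothetical names, and the "reverse journal copy" is modeled simply as replaying the shadow's own updates onto a copy of the production data.

```python
from enum import Enum


class ShadowMode(Enum):
    READ_ONLY = "SRO"
    READ_WRITE = "RW"


class ToyShadowServer:
    """Hypothetical shadow server that is serving users during a disaster."""

    def __init__(self, mode, data):
        self.mode = mode
        self.data = dict(data)
        self.dr_journal = []  # updates made while acting as production

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        if self.mode is ShadowMode.READ_ONLY:
            raise PermissionError("shadow is in read-only (SRO) mode")
        self.data[key] = value
        self.dr_journal.append((key, value))


def switch_back(production_data, shadow):
    """Return the production data to use once the disaster has ended."""
    restored = dict(production_data)
    # A read-only shadow has an empty journal, so nothing is replayed.
    for key, value in shadow.dr_journal:
        restored[key] = value
    return restored


prod = {"patient:1": "admitted"}
rw_shadow = ToyShadowServer(ShadowMode.READ_WRITE, prod)
rw_shadow.write("patient:2", "admitted")
restored = switch_back(prod, rw_shadow)
```

In this model, skipping `switch_back` for a Read Write shadow would silently drop the `patient:2` update, which is exactly the risk the manual step is meant to remove.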

Failover - Now that we have discussed the Disaster Recovery aspect of Epic, let’s take some time to discuss another critical aspect of a smooth-running implementation: failover. A failover implementation gives organizations double data protection, the ability to fail over automatically in an error situation, and the ability to fall back to the original configuration without having to synchronize databases.

In a failover situation, the primary production server encounters a problem, and the secondary server is promoted to take over production processing. This configuration relies heavily on redundant TCP/IP networks and a heartbeat sensor in a clustered environment. Servers in this setup run failure detection and monitoring programs in the background. On HP-UX this software is called Service Guard. Although the primary and secondary servers are physically independent of each other and each has its own IP address and host name, they share a single “virtual” IP address. DNS resolves the production server’s host name to this virtual IP address.

While the primary and secondary servers each run their own copy of Service Guard, they are preconfigured to share the disk volumes where the Epic Cache databases reside. Client PCs running Epic applications, whether on Terminal Services (e.g., Citrix or Microsoft) or standalone, use the network DNS configuration entries to resolve the Epic virtual IP address. Service Guard and Cluster Management (CM) software constantly monitor the cluster, determine which host is up, and route client connections to the appropriate host accordingly.
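The monitoring decision described above can be pictured with a small Python sketch. It assumes, purely for illustration, that each node reports a heartbeat timestamp and that a node is considered down after 5 seconds of silence. The host names, the timeout, and the function names are hypothetical and are not taken from Service Guard.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # seconds, on a clock shared by both nodes


def is_alive(node, now, timeout=5.0):
    # A node counts as up if its last heartbeat is recent enough.
    return (now - node.last_heartbeat) <= timeout


def owner_of_virtual_ip(primary, secondary, now, timeout=5.0):
    """Pick the node that should answer on the shared virtual IP."""
    if is_alive(primary, now, timeout):
        return primary.name
    if is_alive(secondary, now, timeout):
        return secondary.name
    return None  # neither node is reachable


primary = Node("epic-prd-1", last_heartbeat=100.0)
secondary = Node("epic-prd-2", last_heartbeat=100.0)
before = owner_of_virtual_ip(primary, secondary, now=102.0)

# The secondary keeps sending heartbeats while the primary goes quiet.
secondary.last_heartbeat = 109.0
after = owner_of_virtual_ip(primary, secondary, now=110.0)
```

Because clients only ever resolve the virtual IP address, they do not need to know which of the two names ends up in `after`.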

If the primary server encounters a hardware, software, or network failure, Service Guard/CM automatically transfers operations (the package) to the secondary server; this is the failover. To accomplish this task, Service Guard on the primary server executes a package containing customized shutdown scripts that gracefully shut down the database on the primary server before the transfer occurs. In turn, the secondary server executes its own set of database startup scripts to bring up the production database on the failover server. The basic concept is to allow the failover server to import the production disk volumes containing the database, which can easily be accomplished using a SAN disk array. Other Epic processes, such as scheduled batch jobs and interfaces, need to be preconfigured in Epic so that they start automatically once the database is up. The failover process involves an outage of only 45 to 90 seconds, which is what provides the high availability.
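The order of operations matters here: the database must be down on one node before the volumes move and the database starts on the other. The following Python sketch models that ordering only. `ToyCluster` and its step descriptions are hypothetical, and the real work is done by Service Guard packages and site-specific scripts that are not shown.

```python
class ToyCluster:
    """Hypothetical two-node cluster sharing one set of SAN volumes."""

    def __init__(self, primary, secondary):
        self.nodes = (primary, secondary)
        self.volume_owner = primary  # node with the shared volumes imported
        self.database_on = primary   # node where the database is running
        self.steps = []

    def fail_over(self, target):
        source = self.database_on
        if target not in self.nodes or target == source:
            raise ValueError("target must be the other cluster node")
        # 1. Shutdown scripts stop the database on the current node.
        self.database_on = None
        self.steps.append(f"stop database on {source}")
        # 2. The shared volumes are imported on the target node.
        self.volume_owner = target
        self.steps.append(f"import volumes on {target}")
        # 3. Startup scripts bring the same database up on the target.
        self.database_on = target
        self.steps.append(f"start database on {target}")
        # 4. Preconfigured background work starts once the database is up.
        self.steps.append(f"auto start batch jobs and interfaces on {target}")
        return target


cluster = ToyCluster("primary", "secondary")
cluster.fail_over("secondary")
```

Running the same routine in the other direction, `cluster.fail_over("primary")`, models moving the package back to the original server once it has been repaired.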

Desktop users need to close any open Hyperspace sessions and reopen them after the failover is complete in order to access the information. Once the issue with the primary server is rectified, a fallback can follow. This can be done manually by running the shutdown scripts on the failover server and bringing the database back up on the production server. The same manual procedure comes in handy when scheduled or unscheduled maintenance of the production server is required.

Implementing Epic’s DR Shadow in conjunction with failover is a well-rounded solution for your Epic implementation, but it could prove to be costly. However, the benefits received during a single downtime situation make it well worth the effort. Remember, there is no substitute for a well-proven database backup and restore policy. If you have any questions related to this article, or are interested in assistance with the implementation of either or both of these scenarios, please contact me at haburumuh@getvitalized.com.