6 Key Steps to Designing an Epic Disaster Recovery Strategy in the Public Cloud

Disasters are unpredictable, but your recovery strategy shouldn’t be. Whether you’re moving all of Epic to the cloud or starting with Alternate Production, having a well-architected disaster recovery plan is critical to maintaining hospital operations when the unexpected happens.

Epic health systems and payers typically adopt public cloud for Epic in one of two ways – a full migration inclusive of all Production, Alternate Production, Training and Build environments, or a phased approach that includes Alternate Production or Build environments in the first phase. Regardless of the approach that best fits your organization, it is critical to have the ability to fully support Production operations in the event of a disaster that compromises your Production site.

The beauty – and challenge – of public cloud is that there are many ways to architect an Alternate Production environment. So, here is a checklist to help you plan, architect and implement an Alternate Production environment on public cloud that best suits the requirements of your organization.

Step 1: Identify Critical Environments for Disaster Recovery

It is important to first understand which functions of Epic are critical to your organization’s success when a disaster event is declared. A disaster could be a data center power outage that lasts long enough to warrant a full regional failover but allows failback within a matter of days, or it could be a natural disaster that requires an indefinite failover to the alternate region. Assuming the worst, what does your organization need to run in an alternate region to maintain day-to-day hospital operations?

Typically, Epic recommends an Alternate Production copy of the Production Operational Database in a geographically dispersed (DR) region. Epic does not typically recommend alternate copies of SUP, RPT, ACE, Clarity or Caboodle. However, some organizations deem it necessary to have an alternate copy of those environments based on their business needs. For example, large research hospitals often elect to deploy an alternate copy of Clarity and Caboodle.

Step 2: Balance Cost vs. Value

Once your technical teams and business owners have determined what environments are critical to have running during a disaster recovery event, you need to understand the cost of the infrastructure required to support these environments in the alternate region.

It is important to determine the role an environment plays in providing patient care. Ask yourself, is this environment required for clinicians to deliver care to our patients? Or is the environment only needed for education or training?

Another way to look at it is in terms of how much revenue an environment may generate for the health system and compare that to the monthly cost of running that environment in the alternate region. For example, if a specific Clarity environment generates $25k for the health system in a month, but it costs $50k per month to run that environment in the alternate region, you may want to reconsider what you defined in Step 1.
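The comparison above can be sketched as a simple calculation. This is a minimal illustration rather than a costing model: the function names are hypothetical, and the only figures used are the $25k and $50k from the Clarity example.

```python
def monthly_net_value(monthly_revenue: float, monthly_dr_cost: float) -> float:
    """Return revenue minus DR running cost for one environment, per month."""
    return monthly_revenue - monthly_dr_cost


def worth_reconsidering(monthly_revenue: float, monthly_dr_cost: float) -> bool:
    """Flag an environment whose alternate-region copy costs more than it generates."""
    return monthly_net_value(monthly_revenue, monthly_dr_cost) < 0


# Figures from the Clarity example in the text: $25k revenue, $50k DR cost.
clarity_net = monthly_net_value(25_000, 50_000)
clarity_flagged = worth_reconsidering(25_000, 50_000)
```

A check like this only captures direct revenue. It says nothing about an environment's role in patient care, so it should inform the decision made in Step 1 rather than make it.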

There are soft costs to consider as well. For example, what is the impact on clinician efficiency if an environment or workflow isn’t available? Does a previously automated process become manual and take minutes when it used to take seconds? These impacts should be considered when rationalizing environments.

Step 3: Pick Your Savings Plan

Public cloud providers offer different mechanisms to optimize infrastructure for cost and/or availability. For instance, you can pay a premium to ensure that your infrastructure is available during a disaster. You can pay a higher premium for running infrastructure paired with capacity reservations to ensure it is always running and available when needed. An important item to note: a virtual machine that runs as a reserved instance but does not have an associated capacity reservation may not be available when it is restarted, because the capacity of that SKU is not “reserved” and the VM is subject to the availability of the SKU at that moment.
Conversely, you can pay a fraction of the cost to keep infrastructure turned off and accept the risk that a certain VM type may not be available when it is needed.

There is no “one size fits all” answer. The relationship of cost vs. risk is specific to each individual health system. This requires a detailed conversation with technical, business, and clinical stakeholders, and your trusted cloud partners to achieve the right balance.
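One way to frame that conversation is to list the candidate postures side by side and filter them by risk tolerance. The sketch below is only an illustration: the `DrPosture` type, the posture names and every dollar figure are hypothetical, and real pricing varies by provider, region and SKU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPosture:
    name: str
    monthly_cost: float        # hypothetical monthly spend in the alternate region
    capacity_guaranteed: bool  # True only when a capacity reservation backs the VMs


def cheapest_posture(postures, require_guaranteed_capacity):
    """Pick the lowest-cost posture that meets the stated risk tolerance."""
    eligible = [
        p for p in postures
        if p.capacity_guaranteed or not require_guaranteed_capacity
    ]
    return min(eligible, key=lambda p: p.monthly_cost)


# Hypothetical numbers for illustration only, not provider pricing.
POSTURES = [
    DrPosture("always running, reserved capacity", 100_000, True),
    DrPosture("reserved capacity", 80_000, True),
    DrPosture("powered off, no reservation", 15_000, False),
]

low_risk_choice = cheapest_posture(POSTURES, require_guaranteed_capacity=True)
low_cost_choice = cheapest_posture(POSTURES, require_guaranteed_capacity=False)
```

The single boolean is a deliberate simplification. In practice the tolerance for unavailable capacity is likely to differ per environment, which is why the stakeholders named above need to be part of the decision.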

Step 4: Leverage Automation Wherever Possible

Automation tools, such as Terraform, reduce the time a DR activation takes and so lower your Recovery Time Objective (RTO), which is the time it takes from the moment a disaster is declared to the moment users can access the system in the alternate region.

Automation can enable cost savings by allowing organizations to keep a subset of infrastructure running in the alternate region and rapidly build out the rest of the infrastructure in the alternate region during a disaster.

Building from Step 3, assess where you can reduce the number of VMs that are running in the alternate region and supplement that with automation.
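As a rough sketch of that assessment, the inventory can be split into VMs that stay running and VMs that automation would build at activation time. The `VmSpec` type, the `always_on` flag and the VM names below are hypothetical placeholders, and the provisioning step itself (for example a Terraform run) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    name: str
    always_on: bool  # True if the VM stays running in the alternate region


def activation_plan(inventory):
    """Split the inventory into VMs already running and VMs to build at activation."""
    running = [vm.name for vm in inventory if vm.always_on]
    to_build = [vm.name for vm in inventory if not vm.always_on]
    return running, to_build


# Hypothetical inventory; the names are placeholders, not recommended tiers.
INVENTORY = [
    VmSpec("odb-alt-prod", always_on=True),
    VmSpec("web-01", always_on=False),
    VmSpec("web-02", always_on=False),
]

running, to_build = activation_plan(INVENTORY)
```

The `to_build` list is what the automation would be asked to provision when a disaster is declared. Keeping it explicit makes it easier to see how much of the RTO depends on build time and on SKU availability.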

Step 5: Keep Alternate Production Environments Patched and Up to Date

As we’ve learned through previous steps, the cloud offers the benefit of agility and elasticity, allowing organizations to reduce the number of VMs that are powered on in the alternate region. For the VMs that are powered off, it is important to turn them on regularly to keep patches up to date. Patch cycles may run on a monthly, quarterly, semi-annual, or annual basis depending on the VM type and what it is used for. Ensure patches are up to date so that when it is time to use these VMs, they work effectively.
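A patch schedule for powered-off VMs can be tracked with a small check like the one below. It is a sketch under stated assumptions: the day counts per cycle are approximations chosen for illustration, and the VM names and dates are hypothetical.

```python
from datetime import date, timedelta

# Assumed cycle lengths in days; the cadences come from the text, the day counts do not.
CYCLE_DAYS = {"monthly": 30, "quarterly": 91, "semi-annual": 182, "annual": 365}


def overdue_for_patching(last_patched: date, cycle: str, today: date) -> bool:
    """Return True if a VM's last patch date is older than its patch cycle."""
    return today - last_patched > timedelta(days=CYCLE_DAYS[cycle])


def vms_to_power_on(vms, today):
    """List names of powered-off VMs that should be started for patching."""
    return [
        name for name, (last_patched, cycle) in vms.items()
        if overdue_for_patching(last_patched, cycle, today)
    ]


# Hypothetical powered-off VMs: name -> (last patch date, patch cycle).
POWERED_OFF = {
    "web-01": (date(2024, 1, 10), "monthly"),
    "rpt-01": (date(2024, 1, 10), "quarterly"),
}

due = vms_to_power_on(POWERED_OFF, today=date(2024, 3, 1))
```

Here only `web-01` is due, because 51 days have passed since its last patch and its cycle is monthly. The resulting list could feed whatever job powers the VMs on, patches them and shuts them down again.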

Step 6: Test and Document Your DR Plan Regularly

You can never predict when a disaster might occur. So, it is important to regularly test your DR environment.

Epic requires a full DR activation once per year as a minimum requirement for Honor Roll 7+ and an annual review of Business Continuity practices. Epic defines a DR activation as “switch and stay” meaning that your Alternate Production environment is now running as your Production environment with end users accessing those systems for more than 24 hours on a non-holiday business day. Executing these tests starts with proper documentation.
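When planning a test window, it can help to check the schedule against the “switch and stay” definition quoted above. The function below is a sketch under one assumed reading of that definition: the activation must last more than 24 hours, and the window must include at least one weekday that is not a holiday. The function name and the holiday set are hypothetical, and Epic’s own guidance should be treated as authoritative.

```python
from datetime import date, datetime, timedelta


def qualifies_as_switch_and_stay(start: datetime, end: datetime, holidays: set) -> bool:
    """Check a planned DR window against an assumed reading of 'switch and stay'."""
    if end - start <= timedelta(hours=24):
        return False  # must run for more than 24 hours
    day = start.date()
    while day <= end.date():
        # weekday() < 5 means Monday through Friday
        if day.weekday() < 5 and day not in holidays:
            return True
        day += timedelta(days=1)
    return False


# Hypothetical window: Tuesday 06:00 to Wednesday 08:00, no holidays.
planned_ok = qualifies_as_switch_and_stay(
    datetime(2024, 3, 5, 6, 0), datetime(2024, 3, 6, 8, 0), holidays=set()
)
```

A window that runs only over a weekend, or one that ends within 24 hours, would return `False` under this reading.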

It is important to define and document operational runbooks and RACIs, so all parties know their responsibilities during a disaster. It is equally important to prove out these runbooks and the architecture, so that when disaster strikes your teams have the muscle memory to activate Alternate Production environments on top of a proven, stress-tested architecture.

Implementing your Alternate Production environment on public cloud lends itself to the adoption of a wide array of technologies not available in the data center that can help you optimize your cloud cost and RTO. There are many ways to take advantage of these technologies. Lately, health systems and payers are benefitting from working with partners who have led these journeys and can lean on prior experience to assist with planning, architecting and implementing Alternate Production environments with these new technologies.
