When the Cloud Goes Down: How Health IT Leaders Can Prepare for the Inevitable

by Charles Knight on October 24, 2025

When AWS experienced a widespread outage this week, many organizations across industries felt it, but in healthcare, the impact can be far more serious.

Inside a hospital, that kind of disruption doesn’t just slow down workflows. It can disrupt scheduling, cause trauma diversions, delay surgeries, and temporarily send teams back to paper. It can expose how dependent we’ve all become on systems that “just work” – until they don’t.

What Happens Inside a Hospital During an Outage

When an outage strikes, whether it’s an on-prem hardware failure, a public cloud incident, or a regional network issue, the cascade of effects is immediate.

From experience, I’ve seen the risk curve look like a hockey stick – the longer systems stay down, the faster the risk escalates. Even short outages can ripple outward, impacting patient care, staff confidence, and community trust.

Outages Are Rare – But Inevitable

The truth is, these events don’t only happen in the cloud.

They can originate anywhere – a power failure, network misconfiguration, cyberattack, or even a local hardware issue. It doesn’t matter if you’re on-prem, hybrid, or public cloud. Something’s going to happen at some point – either within or outside your control.

Public cloud isn’t inherently more fragile; it’s just more visible when something goes wrong. Outages at scale make headlines, but smaller on-prem disruptions happen every day – they’re just less publicized.

As my colleague Luke Yerkovich often reminds our clients, disaster recovery is still disaster recovery – whether you’re on-prem or in the cloud. The documentation and the process matter more than where it’s hosted.

The takeaway: cloud platforms aren’t the problem. Lack of preparedness is.

What to Do Before an Outage

Outages can’t always be prevented, but their impact can be drastically reduced with the right preparation. Here are practical steps every CIO, CTO, and CMIO should be driving now:

  1. Validate your business-continuity devices monthly. Downtime computers and printers should be tested regularly to confirm that reports generate correctly, data is current, and printers still work. Don’t wait until an outage to discover someone unplugged a USB cable to charge their phone.
  2. Educate your clinical teams. Staff should know exactly how to access downtime reports, print documentation, and resume workflows offline. These processes should be documented in detail. If the process isn’t second nature, it’s not ready.
  3. Clarify decision-making authority. Document who makes the call to fail over, escalate, or communicate externally. When every minute counts, uncertainty can cost hours.
  4. Run realistic DR tests – not checkbox exercises. Annual “honor roll” tests may meet compliance requirements, but they rarely simulate actual failure conditions. Test what happens when you lose both production and DR simultaneously.
  5. Quantify the financial risk. One health system EHC worked with found that each hour of Epic downtime cost roughly $2 million in lost revenue – not including the long-term ripple effects on reputation and patient trust.
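
To make step 5 concrete, here is a minimal sketch of the arithmetic. It assumes lost revenue scales linearly with downtime, which likely understates longer outages given how quickly risk escalates. The function name is hypothetical, and the $2 million default is only the figure quoted above for one health system; substitute your own number.

```python
def estimate_downtime_loss(hours_down: float, loss_per_hour: float = 2_000_000) -> float:
    """Rough lost-revenue estimate for an outage.

    Assumption: loss grows linearly with time. The default hourly figure is
    the one quoted in this post for a single health system, not a benchmark.
    """
    if hours_down < 0 or loss_per_hour < 0:
        raise ValueError("hours_down and loss_per_hour must be non-negative")
    return hours_down * loss_per_hour


# Example: a 3.5-hour outage at the quoted hourly rate.
print(f"${estimate_downtime_loss(3.5):,.0f}")
```

Even a rough figure like this gives leadership a baseline for weighing the cost of DR testing and redundancy against the cost of an outage.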

Preparedness is an operational discipline, not an IT project. The more routine it becomes, the less disruptive a real incident will be.

What to Do During an Outage

When an outage occurs, panic spreads quickly. The most effective leaders don’t rush to fix everything at once – they communicate, prioritize, and execute.

  1. Start with clear communication. Use a simple, templated message: describe the issue, note when it began, share any known resolution estimates, and commit to an update cadence. This builds trust and sets expectations.
  2. Stay close to both technical and clinical leadership. During an incident, I advise CIOs to monitor two channels – technical escalations and clinical risk reports. Staying responsive to both prevents technical fixes from outpacing patient-safety needs.
  3. Project calm and confidence. When leadership communicates clearly, it reassures staff and the broader community that care will continue safely. Silence, on the other hand, creates uncertainty – which can damage confidence long after systems recover.
  4. Focus on continuity, not blame. In healthcare, patients and their families don’t care whether the outage originated in AWS, Azure, or a hospital data center. They care whether their care team can still deliver. Keep your messaging centered on continuity and safety.
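
The templated message from step 1 can be sketched as a small helper. This is an illustration only: the function name, field labels, and wording are assumptions, not a standard format. It covers the four elements named above: the issue, when it began, any known resolution estimate, and the update cadence.

```python
from datetime import datetime, timedelta
from typing import Optional


def build_status_update(
    issue: str,
    started_at: datetime,
    sent_at: datetime,
    update_every_minutes: int,
    resolution_estimate: Optional[str] = None,
) -> str:
    """Fill in a simple outage message: issue, start time, estimate, cadence."""
    # Commit to a specific time for the next update, not just an interval.
    next_update = sent_at + timedelta(minutes=update_every_minutes)
    estimate = resolution_estimate or "No resolution estimate is available yet."
    return "\n".join([
        f"Issue: {issue}",
        f"Began: {started_at:%Y-%m-%d %H:%M}",
        f"Resolution estimate: {estimate}",
        f"Next update: {next_update:%H:%M} (every {update_every_minutes} minutes)",
    ])


# Example values are made up for illustration.
message = build_status_update(
    issue="EHR is unavailable; downtime procedures are in effect.",
    started_at=datetime(2025, 1, 1, 9, 0),
    sent_at=datetime(2025, 1, 1, 9, 15),
    update_every_minutes=30,
)
print(message)
```

Stating the time of the next update, even when no resolution estimate exists, is what lets staff stop checking for news and get back to patient care.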

The Reality: Resilience Is a Shared Responsibility

Cloud adoption isn’t a gamble; it’s a strategy. But resilience doesn’t come from architecture alone – it comes from process, practice, and people.

The cloud doesn’t remove risk; it redistributes it. The question isn’t if something goes wrong – it’s how ready you are when it does.

Outages like this are rare, but when they happen, they’re impactful. The organizations that do well aren’t the lucky ones – they’re the ones that practiced.

Key Takeaways for Health System Leaders

Resilience in health IT isn’t about avoiding failure; it’s about responding well when it happens.
