It’s not your UPS. It’s not your generator. So why is your data centre still going dark?

15 June

Every year, Uptime Institute's global surveys give us one of the clearest views of what's keeping data centre operators up at night. Last month, we unpacked five key takeaways from Uptime Institute's 2026 Resiliency Survey, focusing on where the industry is investing - monitoring and analytics (53%), electrical upgrades (49%) - and on a growing compliance imperative shaping how operators justify every dollar spent on resilience.

But the Resiliency Survey tells us what operators are doing. The newly released Annual Outage Analysis 2026 tells us what is actually failing - and the picture it paints is both reassuring and deeply uncomfortable.

Here's what every engineer, operator and decision-maker needs to know about where outages are coming from now, and where they are headed next.

Reliability gains are real - but we're hitting a plateau

For the fifth consecutive year, outage frequency on a per-site basis has declined. That is genuine progress, reflecting years of sustained investment in infrastructure, monitoring and operational maturity. Only half of all operators now report an impactful outage in the past three years - down from 74% in 2020.

But here's the problem: the pace of improvement has slowed significantly. Around one in ten operators still report that their most recent outage had serious or severe impacts. And as Andy Lawrence, Executive Director of Uptime Intelligence, put it: "Further resiliency gains are becoming harder to achieve."

We're seeing diminishing returns from traditional resilience approaches. The low-hanging fruit has been picked.

The threat is moving outside the fence line

Perhaps the most significant shift in this year's analysis is the rising prominence of external failures. When Uptime tracks publicly reported outages - the headline-makers that affect real customers - a clear pattern emerges.

Fibre and connectivity-related incidents are now the leading cause of major public outages, and they showed the largest year-over-year increase among all categories. These incidents also tend to last longer, with the share of outages exceeding 48 hours increasing for the second consecutive year.

Why does this matter? Because these assets - subsea cables, terrestrial fibre routes, third-party network infrastructure - are largely outside an operator's direct control. They're vulnerable to accidental damage, sabotage, theft, and extreme weather. And when they fail, the blast radius can be enormous, affecting multiple services and regions simultaneously.

At the same time, telecommunications providers saw a sharp rise in outages in 2025, well above the five-year average. Cloud and internet giants, by contrast, saw their share of outages decline - a testament to ongoing investments in distributed resilience.

Takeaway for operators: Your resilience assessment probably focuses heavily on your own electrical and mechanical systems. But if you're not evaluating third-party dependencies - network providers, cloud services, even the utility grid - you're only looking at half the risk picture. Only about 40% of operators currently assess external and systemic risks. That number needs to climb.
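
One way to start that assessment is a simple inventory of external dependencies that flags any category with only one provider or one physical path behind it. The Python sketch below is a minimal illustration only: the `Dependency` structure, the category names and the sample entries are hypothetical and are not drawn from Uptime's report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One external dependency of a site (hypothetical structure)."""
    category: str  # e.g. "network", "cloud", "utility"
    provider: str  # who supplies it
    path: str      # physical route, substation or region it relies on


def single_points_of_failure(deps):
    """Return categories that rely on only one provider or only one path."""
    providers = defaultdict(set)
    paths = defaultdict(set)
    for dep in deps:
        providers[dep.category].add(dep.provider)
        paths[dep.category].add(dep.path)
    return sorted(
        category
        for category in providers
        if len(providers[category]) < 2 or len(paths[category]) < 2
    )


# Illustrative data only, not real providers or routes.
site_dependencies = [
    Dependency("network", "Carrier A", "north duct"),
    Dependency("network", "Carrier B", "north duct"),  # two carriers, one duct
    Dependency("cloud", "Provider X", "region-1"),
    Dependency("utility", "Grid feed 1", "substation east"),
    Dependency("utility", "Grid feed 2", "substation west"),
]

flagged = single_points_of_failure(site_dependencies)
print(flagged)
```

In this made-up example the network category is flagged even though it has two carriers, because both share the same duct. A real assessment would need far more detail, but even a list this crude can surface shared paths that a provider count alone would hide.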

Power remains the heavyweight champion

Despite all the talk about network outages and external threats, power failures remain the single largest cause of impactful data centre outages, accounting for 45% of events in 2025.

Inside the facility, the usual suspects dominate: UPS failures, transfer switch failures, and generator issues. But there are new pressure points emerging. Worsening grid constraints, high-density AI workloads, and operators running much closer to capacity limits are all contributing to a more fragile operating environment.

There's also a less-discussed factor: critical equipment shortages. Transformers, generators, switchgear and UPS systems are in short supply globally, forcing some operators to rely on substitute components or second-user equipment. Uptime believes this has already contributed to several failures.

Takeaway for operators: Don't take your power chain for granted. The fundamentals - UPS health, transfer switch logic, generator start sequences - remain the bedrock of resilience. But now you also need to think about grid exposure, load variability from AI workloads, and the provenance of any substitute components in your system.

Human error: the unavoidable factor

Here's a number that should give every operations manager pause: 92% of respondents say human error contributed to their most recent impactful outage - with 31% calling it a major contributor.

The most common failure modes haven't changed: failure to follow established procedures, inconsistent or unclear processes, and installation or in-service errors. And here's the kicker: 87% of operators who experienced an impactful outage in the past three years say it could have been avoided with better management, processes or configuration.

That is not a technology problem. That's a people-and-process problem.

Takeaway for operators: Training matters. Change management discipline matters. Clear, consistent procedures matter. Before you invest another dollar in redundant hardware, ask yourself whether your operational practices are creating the very risks you are trying to engineer away.

The cost of failure keeps climbing

For the second year running, one in five operators say their most recent impactful outage cost more than USD 1 million. More than half (57%) reported costs exceeding USD 100,000.

These figures reflect persistent inflation in labour and hardware costs, SLA penalties, and longer recovery times. But they also reflect something more fundamental: the increasing number of services and businesses that depend, directly or indirectly, on a single data centre or availability zone. When you fail, you're not just taking yourself offline - you're taking your customers offline too.

The cloud reliability paradox

Despite the industry's heavy shift toward cloud and outsourced services, enterprise confidence in public cloud resiliency appears to be weakening. The share of respondents who say public cloud is not resilient enough for any mission-critical workloads increased by six percentage points year-over-year.

Why? High-profile cloud provider failures in 2025 - including a major AWS outage that generated over 17 million Downdetector reports and lasted more than 15 hours - have reminded everyone that "the cloud" is just someone else's data centre, with its own single points of failure and complex interdependencies.

Nearly one in five organisations now insure against outages involving cloud providers - a figure that has been rising gradually year over year.

Takeaway for operators: If you're relying on a single cloud provider or a single region for mission-critical workloads, you're carrying more risk than you may realise. Multi-cloud strategies, active-active architectures, and clear failover plans aren't optional anymore - they are the baseline.
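
The arithmetic behind that advice can be sketched in a few lines. The Python example below assumes each region fails independently, which is optimistic given the interdependencies noted above, and it uses a hypothetical availability figure rather than any provider's published number. Under those assumptions, an active-active pair is unavailable only when both regions are down at the same time.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def combined_availability(availabilities):
    """Availability of an active-active set, ASSUMING independent failures."""
    unavailability = 1.0
    for availability in availabilities:
        unavailability *= 1.0 - availability
    return 1.0 - unavailability


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single_region = 0.999  # hypothetical figure, for illustration only
active_active = combined_availability([single_region, single_region])

print(f"Single region: {expected_downtime_hours(single_region):.2f} h/year")
print(f"Active-active: {expected_downtime_hours(active_active):.4f} h/year")
```

With these illustrative inputs, expected downtime falls from roughly 8.8 hours a year to well under a minute. Shared control planes, common software or a single network path break the independence assumption, so real-world gains are likely to be smaller than the formula suggests.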

The AI wildcard

The rapid build-out of AI training facilities introduces new and poorly understood risks. These sites operate at massive scale and energy intensity, and their model runs repeatedly stop and restart because of IT component failures. Major power or cooling failures at an AI training facility could prove extraordinarily expensive due to lost work, even if end-user customers are unaffected.

There is also a broader systemic risk: very large data centres are starting to strain regional power grids. There have already been near-miss incidents that could have resulted in major grid outages.

Takeaway for operators: If you are involved in AI infrastructure, do not assume that "training does not need resilience." The cost of failure is higher than you think, and the ripple effects may extend well beyond your own operations.

What this means for you

Uptime's 2026 outage analysis tells a clear story: the industry has made genuine progress, but the easy gains are behind us. Future reliability will depend less on adding more redundancy and more on managing complexity, extending visibility beyond the facility fence line, and getting the fundamentals of people and process right.

For Australian operators, these global trends land in a specific local context - ageing electrical infrastructure in some facilities, growing grid constraints in data centre hubs, and a regulatory environment that is moving steadily toward tighter resilience requirements.

The question is not whether you will face an outage. The question is whether you will see it coming before it takes you offline.

ECANET provides specialist engineering, monitoring and compliance services across the full data centre lifecycle - electrical, mechanical, automation and facility management. From HV infrastructure and power chain auditing to real‑time monitoring platforms, we help clients across Australia and Asia‑Pacific see risk before it becomes reality. Contact us to strengthen your resilience posture.

Albert Wong