5 critical stages for building resilient, scalable, and efficient AI facilities

AI data centres are not traditional facilities with more powerful servers. They are fundamentally different. AI workloads, especially training, demand extreme power densities, hybrid liquid-air cooling, faster failure responses, and entirely new operational models.

Based on Uptime Institute’s five-part AI Infrastructure Advisory series, this article condenses the essential insights across the full lifecycle: design, vendor selection, construction, commissioning, and operations.

1. Design Development & Review

The foundation: high density, liquid cooling, and resiliency

Conventional data centres run at 5–15 kW per rack. AI training clusters routinely reach 40–130 kW per rack, and higher. A single rack of NVIDIA Blackwell GPUs can draw over 100 kW.

At these densities, air cooling fails. Direct liquid cooling (DLC) becomes essential, typically handling 70–80% of heat, with air cooling covering the rest. But DLC introduces new risks: a cooling failure can trigger GPU shutdown in under one second. Therefore, Uptime Institute recommends continuous cooling (UPS-backed) for all DLC facilities, even those designed for concurrent maintainability.
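To see why sub-second failure responses are plausible, here is a rough back-of-the-envelope sketch in Python. It is not taken from the Uptime papers. It assumes a simple lumped model in which stagnant coolant absorbs all of the rack heat, and the coolant mass, allowed temperature rise, and water-like specific heat are all hypothetical inputs.

```python
def ride_through_seconds(
    coolant_kg: float,
    heat_kw: float,
    allowed_rise_k: float,
    specific_heat_kj_per_kg_k: float = 4.186,  # assumption: water-like coolant
) -> float:
    """Seconds until stagnant coolant warms by allowed_rise_k while absorbing heat_kw.

    Lumped model: t = m * c * dT / P. It ignores the thermal mass of cold
    plates, piping, and chips, so treat the result as an order of magnitude.
    """
    if coolant_kg <= 0 or heat_kw <= 0 or allowed_rise_k <= 0:
        raise ValueError("all inputs must be positive")
    energy_kj = coolant_kg * specific_heat_kj_per_kg_k * allowed_rise_k
    return energy_kj / heat_kw  # kJ / (kJ/s) = s


# Hypothetical case: 2 kg of coolant in contact with the cold plates
# of a 100 kW rack, with a 10 K temperature margin.
print(f"{ride_through_seconds(2.0, 100.0, 10.0):.2f} s")  # prints "0.84 s"
```

Under these assumed numbers the thermal buffer lasts less than a second, which is the intuition behind keeping coolant pumps on UPS power rather than waiting for generators to start.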

Space requirements change too. Racks weigh over 2,000 kg, reach 52U or taller, and require more gray space for coolant distribution units (CDUs), pumps, and power gear.

Key outcome: A scalable, resilient design that balances performance, cost, and future adaptability.

2. Technical Vendor Requirements & Evaluation

Choosing the right cooling and power technologies

Not all liquid cooling is equal. Direct-to-chip cooling, the cold-plate form of DLC, is preferred for most AI facilities because it works with standard racks, supports mixed densities, and aligns with Tier resiliency standards. Immersion cooling is less common, heavier, and less flexible.

Even with DLC, 20–30% of heat (from networking and storage) still requires air cooling. A 130 kW rack might need more than 100 kW of liquid cooling and about 30 kW of air cooling.
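The split is simple arithmetic, and a small Python sketch makes it easy to rerun for other rack sizes. The function name is illustrative, and the 70% and 80% liquid fractions are the range quoted earlier in this article, not measured values for any specific product.

```python
def split_rack_heat(rack_kw: float, liquid_fraction: float) -> tuple[float, float]:
    """Return (liquid_kw, air_kw) for a rack, given the share of heat removed by DLC."""
    if rack_kw < 0:
        raise ValueError("rack_kw must be non-negative")
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    liquid_kw = rack_kw * liquid_fraction
    return liquid_kw, rack_kw - liquid_kw


# Bracket a 130 kW rack with the 70-80% liquid share quoted in the article.
for fraction in (0.70, 0.80):
    liquid_kw, air_kw = split_rack_heat(130.0, fraction)
    print(f"{fraction:.0%} liquid: {liquid_kw:.0f} kW liquid, {air_kw:.0f} kW air")
```

For a 130 kW rack this gives roughly 91 to 104 kW on the liquid side and 26 to 39 kW on the air side, which brackets the example above. The air-side figure is the one that sizes the remaining air handling.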

Power systems must handle the rapid load fluctuations typical of GPUs. Grid constraints are common; some facilities deploy on-site gas turbines as primary power until grid capacity arrives. Higher-voltage distribution (up to 800 VDC) may be needed to manage cable costs.

Vendor evaluation best practices:

Evaluate at least three vendors per equipment type (UPS, generators, chillers, CDUs).
Include delivery timelines and penalties in RFPs.
Maintain owner control over all final decisions.
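One way to keep such an evaluation disciplined is a weighted scoring matrix. The Python sketch below is a generic illustration, not Uptime Institute's method: the `Bid` class, the criteria names, the weights, and the 0 to 10 scores are all hypothetical. It enforces the three-vendor minimum and returns a ranking for the owner to review rather than a decision.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    vendor: str
    scores: dict[str, float]  # criterion name -> score on a 0-10 scale (hypothetical)


def rank_bids(bids: list[Bid], weights: dict[str, float], min_bids: int = 3) -> list[tuple[str, float]]:
    """Rank bids by weighted average score, highest first."""
    if len(bids) < min_bids:
        raise ValueError(f"need at least {min_bids} bids, got {len(bids)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    def weighted_score(bid: Bid) -> float:
        return sum(weights[c] * bid.scores[c] for c in weights) / total_weight

    ranked = [(bid.vendor, round(weighted_score(bid), 2)) for bid in bids]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical CDU bids and weights, for illustration only.
weights = {"technical_fit": 0.4, "delivery": 0.3, "cost": 0.3}
bids = [
    Bid("Vendor A", {"technical_fit": 8, "delivery": 6, "cost": 7}),
    Bid("Vendor B", {"technical_fit": 9, "delivery": 4, "cost": 6}),
    Bid("Vendor C", {"technical_fit": 6, "delivery": 9, "cost": 8}),
]
for vendor, score in rank_bids(bids, weights):
    print(vendor, score)
```

Giving delivery its own weighted criterion reflects the advice to put timelines and penalties in the RFP. The final choice still sits with the owner.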

Key outcome: A future-ready, vendor-neutral technology strategy with clear RFP and evaluation criteria.

3. Construction Oversight & Validation

Preventing design-to-build drift

Fast timelines and rapid AI evolution create a serious risk: what gets built often drifts from what was designed. Catching deviations late multiplies remediation costs.

AI facilities demand stronger buildings:

Higher ceilings (for taller racks and more cabling/piping)
Reinforced floors (racks >2,000 kg)
Wider corridors and loading areas
Pre-fabricated openings for multi-story connections (to reduce latency between GPUs)

Phased construction is critical. Uptime recommends building in modular halls (e.g., 5 MW each), allowing each phase to incorporate evolving best practices. Large plant items like chillers should not be locked in too early.
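As a rough planning aid, the phasing arithmetic can be sketched in a few lines of Python. The 5 MW hall size comes from the example above. The 20 MW campus target and the uniform 130 kW rack density are hypothetical, and the sketch ignores non-rack IT load and mixed densities.

```python
import math


def plan_halls(total_it_mw: float, hall_mw: float = 5.0, rack_kw: float = 130.0) -> tuple[int, int]:
    """Return (number_of_halls, racks_per_hall) for a phased build-out.

    Assumes every hall has the same IT capacity and every rack draws rack_kw.
    """
    if total_it_mw <= 0 or hall_mw <= 0 or rack_kw <= 0:
        raise ValueError("all inputs must be positive")
    halls = math.ceil(total_it_mw / hall_mw)
    racks_per_hall = math.floor(hall_mw * 1000 / rack_kw)
    return halls, racks_per_hall


# Hypothetical 20 MW campus built in 5 MW halls of 130 kW racks.
print(plan_halls(20.0))  # prints (4, 38)
```

Rerunning the calculation for each phase with that generation's rack density shows how quickly hall layouts change, which is one reason to defer large plant decisions.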

Retrofit warning: Converting an existing data centre for AI training is rarely practical. It typically requires doubling power capacity, adding gray space, and moving traditional workloads.

Key outcome: Independent oversight ensures the as-built facility matches design intent, preserving resiliency and schedule.

4. Level 4 & 5 Commissioning

Testing AI workloads, not just heaters

Standard commissioning uses load banks that act like electric heaters. AI loads behave differently: GPUs generate volatile heat and draw power in rapid, unpredictable cycles. Specialized load banks that simulate real GPU power and thermal profiles are required.
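To make the contrast with a constant heater load concrete, the Python sketch below generates a simple stepped setpoint sequence. It is only an illustration: real AI-aware load banks have vendor-specific control interfaces, real GPU profiles are less regular than a square wave, and the function name and every parameter value here are hypothetical.

```python
def gpu_like_load_profile(
    peak_kw: float,
    idle_fraction: float,
    period_s: float,
    duty_cycle: float,
    duration_s: float,
    step_s: float = 0.1,
) -> list[tuple[float, float]]:
    """Return (time_s, load_kw) samples of a square wave between peak and idle load.

    Uses integer step counts internally to avoid floating-point drift at the
    switching points.
    """
    if not 0.0 <= idle_fraction <= 1.0 or not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("idle_fraction and duty_cycle must be between 0 and 1")
    if min(peak_kw, period_s, duration_s, step_s) <= 0:
        raise ValueError("peak_kw, period_s, duration_s and step_s must be positive")
    steps_per_period = max(1, round(period_s / step_s))
    on_steps = round(duty_cycle * steps_per_period)
    total_steps = round(duration_s / step_s)
    samples = []
    for i in range(total_steps):
        at_peak = (i % steps_per_period) < on_steps
        load_kw = peak_kw if at_peak else peak_kw * idle_fraction
        samples.append((round(i * step_s, 3), load_kw))
    return samples


# Hypothetical profile: 100 kW peak, 25% idle, 2 s period, half the time at peak.
profile = gpu_like_load_profile(100.0, 0.25, 2.0, 0.5, duration_s=4.0, step_s=0.5)
for time_s, load_kw in profile:
    print(time_s, load_kw)
```

A conventional heater-style test would hold one flat value for the whole run. Here the load swings by 75 kW every second, which is the kind of step change the power and cooling plant has to ride through during testing.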

Liquid cooling adds complexity:

Testing pressure and flow across multiple coolant loops
Simulating planned and unplanned outages
Ensuring fluid cleanliness (contaminants clog cold plates)
Coordinating multiple suppliers in colocation settings

Level 4 (functional performance testing) and Level 5 (integrated system testing) must include DLC-specific scripts and AI-aware load banks. Uptime recommends third-party witnessing of Level 5 commissioning.

Decommissioning should also be considered early: heavy AI equipment and specialized coolants require unique removal procedures.

Key outcome: A fully tested facility that can handle the real-world volatility of AI workloads.

5. Operations & Management Strategy

Where long-term value is made or lost

Design and construction are finite. Operations are not. And AI facilities have unforgiving failure modes: if coolant flow stops, a GPU can burn out in 30 seconds, compared with minutes for air-cooled servers.

Staffing and training are critical. Experienced personnel are scarce. Operators must define roles early, build training programs from scratch or adapt carefully, and plan for retention.

Demarcation between IT and facilities must be redefined. In air-cooled sites, facilities handle chilled water. In DLC, coolant circulates inside IT racks. Who manages contamination? Leak response? Clear boundaries are essential.

Safety risks increase with larger currents and a potential move to higher-voltage distribution (800 VDC). Lockout/tagout, PPE, and no-work-on-energized-circuits policies must be reinforced.

Hardware lifecycles are shorter. GPUs last about three years, compared with up to 10 years for CPUs. High failure rates are normal: Meta reports one failure every three hours in a 16,384-GPU cluster. Facilities must support frequent refresh cycles and changing cooling/power requirements per generation.
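The cluster-level figure is easier to plan around once it is converted to per-GPU terms. The Python sketch below does that arithmetic. It assumes failures are spread evenly across GPUs and treats every reported failure as if it were attributable to a single GPU, which likely overstates pure GPU hardware failures. The function and key names are illustrative.

```python
HOURS_PER_YEAR = 8760


def cluster_failure_stats(gpus: int, hours_between_failures: float) -> dict[str, float]:
    """Convert a cluster-wide failure interval into per-day and per-GPU figures."""
    if gpus <= 0 or hours_between_failures <= 0:
        raise ValueError("inputs must be positive")
    gpu_hours_per_failure = gpus * hours_between_failures
    return {
        "failures_per_day": 24 / hours_between_failures,
        "gpu_hours_per_failure": gpu_hours_per_failure,
        # Assumption: failures are attributed evenly across all GPUs.
        "annual_failures_per_gpu": HOURS_PER_YEAR / gpu_hours_per_failure,
    }


stats = cluster_failure_stats(gpus=16_384, hours_between_failures=3)
print(stats["failures_per_day"])                   # prints 8.0
print(stats["gpu_hours_per_failure"])              # prints 49152
print(round(stats["annual_failures_per_gpu"], 3))  # prints 0.178
```

On these assumptions, the cluster sees about eight failures a day, while any individual GPU is implicated in a failure roughly once every five to six years of running time. Operations teams need spares, procedures, and staffing sized for the cluster-level number.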

Required operational documentation includes SOPs, MOPs, EOPs, maintenance management systems, and formal qualification tracking.

Key outcome: A robust operations strategy that sustains reliability, safety, and efficiency over the facility’s entire life.

Putting It All Together

Building an AI data centre is not simply scaling up a traditional design. It requires a holistic, stage-by-stage approach:

Design for high density, liquid cooling, and continuous resilience.
Select vendors with future-proof technologies and disciplined evaluation.
Construct with phased modularity and independent oversight to prevent drift.
Commission with AI-aware load banks and DLC-specific testing.
Operate with trained staff, clear demarcation, and lifecycle management.

Each stage builds on the last. Skipping or shortchanging any one of them risks costly failures, delayed deployments, or underperforming infrastructure.

In upcoming posts, we will expand each of these five parts into detailed, actionable guides. For organizations racing to deploy AI while protecting reliability and investment, this framework provides a proven roadmap.

This article is based on Uptime Institute’s AI Infrastructure Advisory white papers (1–5). For more information, visit uptimeinstitute.com.

From Design to Operations: A Complete Guide to AI Data Centre Infrastructure