From Air to Liquid Fire: Building the AI Factory - Why Old-School Data Centres Just Lost Their Cool

25 May

The cloud era built an empire on chilled air and redundancy. The AI era demands a factory - one that runs on superheated liquid, dense power, and brutal efficiency. Let’s tear down the blueprint and rebuild it.

The Quick Spec Check: Power, Cooling & Network

Before we talk construction, know the physics:

Power: Traditional racks: 5-10 kW. Cloud racks: 15-20 kW. AI racks: 40-120 kW+. This density breaks every rule.
Cooling: Cloud = Air conditioning. AI = Direct-to-Chip liquid cooling or immersion.
Network: AI needs fat-tree, lossless fabrics (400G/800G fibre). Even a small amount of packet loss can stall the collective operations a training run depends on.
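
To make the density gap concrete, here is a minimal Python sketch that converts the rack figures above into racks per megawatt of IT load. The function name and the 1 MW reference load are illustrative choices, not taken from any vendor tool.

```python
# Rack densities in kW per rack, as quoted in the spec check above.
# The AI entry uses the 40-120 kW range and ignores the open-ended "+".
RACK_DENSITY_KW = {
    "traditional": (5, 10),
    "cloud": (15, 20),
    "ai": (40, 120),
}


def racks_per_megawatt(rack_kw: float) -> float:
    """Number of racks needed to draw 1 MW (1000 kW) of IT load."""
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    return 1000.0 / rack_kw


for era, (low_kw, high_kw) in RACK_DENSITY_KW.items():
    fewest = racks_per_megawatt(high_kw)
    most = racks_per_megawatt(low_kw)
    print(f"{era}: {fewest:.1f} to {most:.1f} racks per MW")
```

The same megawatt that once spread across a hundred or more racks now lands in a handful, which is why the power and cooling have to be concentrated at the rack.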

But the real battle is in the build - how you pour, prefab, and power this beast.

The Build Process: Vendor Expertise, Lock-In, and Liquid Compatibility

Building an AI data centre is not a construction project - it’s a systems integration nightmare that begins months before breaking ground.

Vendor Expertise & Lock-In

Traditional cloud builds let you mix-and-match: any vendor’s rack, any cooling unit, any switch. AI factories do not have that luxury. The GPU vendors (NVIDIA, AMD, etc.) certify specific liquid cooling loop pressure, flow rates, and connector types. Go outside that spec, and you void your hardware warranty – or worse, watch a row of H100s fry.

This creates vendor lock-in by necessity. Your cooling manifold supplier must match your GPU vendor’s internal manifold design. Your power train units (PTUs) must integrate with the GPU’s power sequencing. The builder becomes a translator between five proprietary ecosystems. If you choose the wrong partner early, you cannot swap them out later without redesigning the entire thermal and electrical architecture.

Component Compatibility - The Liquid Cooling Mess

Liquid cooling sounds simple: pump water to the chip. In reality, it's a compatibility maze:

Coolant type: dielectric fluid vs. deionized water vs. proprietary mixtures - not interchangeable.
Connectors: blind-mate quick disconnects vary by manufacturer. Mixing brands causes leaks or flow restriction.
Materials: copper vs. aluminium in the loop? Mixed metals cause galvanic corrosion – a slow death for your GPU fleet.

Every component - from the rack-level CDU to the facility-wide dry cooler - must be compatibility-tested as a system. You cannot spec them separately.
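
The idea of testing the loop as a system can be sketched as a simple consistency check across every component. This is a deliberately simplified illustration: the class, field names, and connector labels ("QD-A", "QD-B") are hypothetical, and real compatibility decisions follow vendor specifications rather than a three-rule script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopComponent:
    name: str
    coolant: str       # fluid the part is rated for
    connector: str     # quick-disconnect family (hypothetical label)
    wetted_metal: str  # metal in contact with the coolant


def check_loop(components):
    """Return a list of compatibility issues for the loop (empty if none)."""
    issues = []
    coolants = sorted({c.coolant for c in components})
    connectors = sorted({c.connector for c in components})
    metals = sorted({c.wetted_metal for c in components})
    if len(coolants) > 1:
        issues.append("mixed coolant ratings: " + ", ".join(coolants))
    if len(connectors) > 1:
        issues.append("mixed quick-disconnect families: " + ", ".join(connectors))
    if len(metals) > 1:
        issues.append("mixed wetted metals (galvanic corrosion risk): " + ", ".join(metals))
    return issues


# Example loop: the facility manifold breaks two of the three rules.
loop = [
    LoopComponent("cold plate", "deionized_water", "QD-A", "copper"),
    LoopComponent("rack CDU", "deionized_water", "QD-A", "copper"),
    LoopComponent("facility manifold", "deionized_water", "QD-B", "aluminium"),
]
for issue in check_loop(loop):
    print(issue)
```

The point of the sketch is the shape of the check: it takes the whole loop as input, so a part that looks fine on its own datasheet still fails if it clashes with its neighbours.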

Equipment Specification & Performance – Higher Spec, Higher Pain

AI demands higher spec across the board:

Power Train Units (PTUs): The modular skids that contain medium-voltage transformers, switchgear, UPS modules, and distribution panels. Unlike traditional electrical rooms built piecemeal on-site, a PTU arrives fully assembled, tested, and ready to bolt down. For AI, PTUs must support 2x to 3x the fault current rating of cloud equivalents because GPU power supplies have massive inrush current.
Liquid pumps: Need variable frequency drives with 10-20% headroom beyond peak flow – because AI workloads pulse unpredictably.
Fibre optics: Not OM4 multimode. You need single-mode OS2 with low-loss connectors - and every splice must be tested to <0.2dB loss. A failed link means training jobs fail.

Specifying below these thresholds means ripping out your electrical busway six months after go-live. There is no "upgrade later" - you build for peak AI or you build twice.
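
The fibre threshold above feeds into a link loss budget. Below is a minimal sketch of that arithmetic, using the article's 0.2 dB per splice as a worst case. The fibre attenuation (0.4 dB/km) and connector loss (0.3 dB) defaults are assumptions for illustration, and the allowed budget must come from the transceiver datasheet, not from this script.

```python
def link_loss_db(length_km, connectors, splices,
                 fibre_db_per_km=0.4,  # assumed single-mode attenuation
                 connector_db=0.3,     # assumed loss per mated connector pair
                 splice_db=0.2):       # worst case allowed by the splice test
    """Estimated end-to-end optical loss of one link, in dB."""
    return (length_km * fibre_db_per_km
            + connectors * connector_db
            + splices * splice_db)


def within_budget(loss_db, budget_db):
    """True if the estimated loss fits inside the allowed budget."""
    return loss_db <= budget_db


# Hypothetical 500 m link with two connector pairs and two splices.
loss = link_loss_db(length_km=0.5, connectors=2, splices=2)
print(f"estimated loss: {loss:.2f} dB")
print("fits a 3.0 dB budget:", within_budget(loss, budget_db=3.0))
```

On short in-building runs the connectors and splices dominate the total, which is why per-splice testing matters more than the fibre length itself.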

CFD Challenges - The Hidden Showstopper

Computational Fluid Dynamics (CFD) is standard for traditional data centres to model air paths. For AI factories, CFD becomes a weapon.

Air-liquid hybrid zones: Racks generate 120kW of heat - 80% removed by liquid, 20% still dumped into the room air. That remaining 20% is still 24kW per rack – enough to melt standard ceiling tiles.
Hot spots: CFD modelling reveals that a single misaligned liquid manifold creates a recirculation vortex that overheats the top GPUs in a rack.
Leak propagation: Liquid cooling loops run at 2–3 bar pressure. A pinhole leak sprays atomised coolant that can short adjacent racks. CFD with particle tracing is now mandatory to map where that spray goes.
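
The 80/20 split in the hybrid zone, and the coolant flow it implies, follow from the heat balance Q = m x cp x dT. Here is a minimal sketch, assuming plain water as the coolant (specific heat about 4186 J/(kg K), density about 1 kg/L) and a hypothetical 10 K temperature rise across the rack. Other coolants and design temperatures will give different numbers.

```python
WATER_CP_J_PER_KG_K = 4186.0  # assumed: specific heat of plain water
WATER_KG_PER_LITRE = 1.0      # assumed: approximate density of water


def split_heat(rack_kw, liquid_fraction):
    """Return (heat to liquid, heat to room air) in kW."""
    liquid_kw = rack_kw * liquid_fraction
    return liquid_kw, rack_kw - liquid_kw


def coolant_flow_lpm(heat_kw, delta_t_k,
                     cp=WATER_CP_J_PER_KG_K, density=WATER_KG_PER_LITRE):
    """Coolant flow in litres per minute needed to carry heat_kw at delta_t_k."""
    mass_flow_kg_s = (heat_kw * 1000.0) / (cp * delta_t_k)
    return mass_flow_kg_s / density * 60.0


liquid_kw, air_kw = split_heat(rack_kw=120.0, liquid_fraction=0.8)
flow = coolant_flow_lpm(liquid_kw, delta_t_k=10.0)
print(f"liquid: {liquid_kw:.0f} kW, air: {air_kw:.0f} kW")
print(f"coolant flow at 10 K rise: {flow:.1f} L/min per rack")
```

Halving the allowed temperature rise doubles the required flow, which is one reason pipe routing and pump sizing are settled only after the thermal model is stable.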

Without detailed CFD, you are gambling with millions of dollars in GPUs. Most AI factories run three CFD iterations before finalising pipe routing.

The Prefab Imperative: PTUs and Factory-Built Speed

You cannot stick-build an AI factory fast enough. Prefabrication is the only answer.

Power Train Units (PTUs): These factory-assembled electrical vaults arrive on a flatbed, are craned into place, and are operational within 48 hours. They contain all switchgear, transformers, and distribution – pre-tested at the factory. On-site electrical work drops from 4 months to 4 days.
Prefabricated Thermal Units (also PTUs): In cooling, PTUs mean pump skids and coolant distribution units. Built off-site, tested under load, shipped as modules.
Overhead liquid and busway: Prefabricated manifolds and electrical busways hang from the ceiling grid. No welding, no cutting, no guesswork.

The result: 50% faster deployment and near-zero on-site leaks.

Rethinking Tier Design: It’s Not About “Availability Nines”

The Uptime Institute’s Tier Standard (I-IV) is widely misunderstood. Tier does not mean "higher availability" in the simplistic sense. Instead, Tier defines functional capabilities:

Concurrent Maintainability (Tier III): You can maintain any component without shutting down IT.
Fault Tolerance (Tier IV): A single component failure (electrical or mechanical) does not affect IT.

Traditional cloud demands Tier III/IV because any downtime loses revenue. AI factories face a different reality.

Why Tier still matters - but differently:
AI training runs rely on checkpoints. Losing power at hour 23 wipes everything since the last checkpoint - potentially a multi-million dollar loss. So you need concurrent maintainability (Tier III) to service pumps or PTUs without stopping training. But do you need fault tolerance (Tier IV - 2N cooling, 2N power)? Many operators say no. Fault tolerance doubles your liquid cooling pipes and switchgear - which doubles the leak surface area and cost. Instead, AI factories use distributed redundancy (N+1) for cooling and block-level redundancy for power - functionally equivalent to Tier III but designed for liquid density.
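
To put a rough number on that exposure, here is a minimal sketch, assuming checkpoints are written at fixed intervals from the start of the run. The GPU count and the cost per GPU-hour are hypothetical inputs, and the model ignores restart overhead and checkpoint write time.

```python
def hours_lost(failure_hour, checkpoint_interval_h):
    """Hours of training since the most recent checkpoint when power fails."""
    if checkpoint_interval_h <= 0:
        raise ValueError("checkpoint_interval_h must be positive")
    return failure_hour % checkpoint_interval_h


def compute_cost_lost(failure_hour, checkpoint_interval_h, gpus, cost_per_gpu_hour):
    """Value of the GPU time that has to be repeated after the failure."""
    return hours_lost(failure_hour, checkpoint_interval_h) * gpus * cost_per_gpu_hour


# Hypothetical cluster: 1000 GPUs at 2.0 currency units per GPU-hour.
for interval_h in (24, 4):
    lost = compute_cost_lost(23, interval_h, gpus=1000, cost_per_gpu_hour=2.0)
    print(f"checkpoint every {interval_h} h, failure at hour 23: {lost:.0f} lost")
```

The facility design and the checkpoint interval pull on the same lever: the less often the run checkpoints, the more an unplanned power loss costs.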

The emerging compromise

Pursue Tier III for the power path to secure concurrent maintainability - the ability to service any component without shutting down the IT load. This functional capability is critical for AI factories, where a full power shutdown to repair a failed UPS or generator would interrupt a multi-day training run, potentially wasting millions in GPU compute. Avoid the capital cost of full Tier IV (typically 30–50% higher), which adds fault tolerance (2N redundancy) to every component.

For cooling, adopt a custom topology with N+1 redundancy rather than 2N. This reduces cooling-specific capital expenditure by roughly 20–35% and cuts cooling energy use by an estimated 15-20%, as idle redundant pumps and dry coolers are eliminated. The operational trade-off: a single pump or CDU failure may require a controlled GPU power throttle or a brief training pause, but with N+1, repair can be scheduled during the next maintenance window. Pure Tier IV with fully duplicated cooling loops adds significantly more piping, valves, and leak points - justified only for the most critical financial AI workloads where even a momentary thermal event is unacceptable. For the vast majority of AI factories, the quantified cost-benefit decisively favours this hybrid approach.
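
The difference between N+1 and 2N is easiest to see as a unit count. The sketch below is a minimal illustration with a hypothetical requirement of four pumps. It counts equipment only and does not attempt to reproduce the cost or energy percentages quoted above, which depend on far more than the number of units.

```python
def units_required(n_needed, scheme):
    """Installed units for a given redundancy scheme.

    n_needed is the number of units required to carry the full load (N).
    """
    if n_needed < 1:
        raise ValueError("n_needed must be at least 1")
    if scheme == "N+1":
        return n_needed + 1
    if scheme == "2N":
        return 2 * n_needed
    raise ValueError(f"unknown scheme: {scheme}")


def spare_units(n_needed, scheme):
    """Units installed beyond what the load needs."""
    return units_required(n_needed, scheme) - n_needed


n = 4  # hypothetical: four pumps carry the full cooling load
for scheme in ("N+1", "2N"):
    total = units_required(n, scheme)
    print(f"{scheme}: {total} pumps installed, {spare_units(n, scheme)} spare")
```

Every spare unit in a liquid loop brings its own valves and connections, so the gap between the two schemes widens as N grows.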

Post-Build: Operations – MOP, Maintenance, and Mindset

Once the concrete cures and the PTUs hum, the real work begins. Operational discipline separates a working AI factory from a flaming money pit.

MOP - Method of Procedure

In a traditional data centre, a simple filter change does not require a novel. In an AI factory, every action needs a detailed MOP:

Isolating a liquid loop: Step-by-step valve sequencing to avoid pressure surges that pop quick disconnects.
Replacing a GPU: The MOP must cover draining the local loop, purging air, refilling, and leak-checking – a four-hour procedure.
Powering down a PTU: You cannot just flip a circuit breaker. The MOP must sequence downstream GPUs into safe low-power mode to avoid inrush current spikes when power restores.

Without rigid MOPs, human error creates disasters. One technician closing the wrong valve can deadhead a pump and burst a hose.
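
One way to picture a rigid MOP is as an ordered checklist that refuses out-of-sequence steps. The sketch below is an illustration only: the class and method names are hypothetical, and the step list simply mirrors the GPU replacement sequence described above rather than any vendor procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Steps taken from the GPU replacement example above, in the order given.
GPU_SWAP_STEPS = ["drain local loop", "purge air", "refill loop", "leak check"]


@dataclass
class Mop:
    title: str
    steps: List[str]
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """The step that must be done next, or None when the MOP is finished."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Record a step, rejecting anything done out of sequence."""
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("MOP already complete")
        if step != expected:
            raise ValueError(f"expected '{expected}', got '{step}'")
        self.completed.append(step)


mop = Mop("Replace a GPU", list(GPU_SWAP_STEPS))
mop.complete("drain local loop")
print("next:", mop.next_step())
```

Calling `mop.complete("refill loop")` at this point raises a `ValueError`, which is the software equivalent of stopping a technician before the wrong valve is touched.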

Programmed Maintenance – Shorter Intervals, Higher Stakes

AI factories run hotter, wetter, and harder. Maintenance cycles shrink:

Liquid filters: Changed monthly (not annually) – debris from pipe corrosion accumulates fast.
Quick disconnects: Inspected every 3 months – O-rings dry out from coolant chemistry.
Busway torque checks: Quarterly – thermal cycling loosens connections.
Full system drain and flush: Annually – non-negotiable. Biofilm in liquid loops is a real threat.

Traditional cloud facilities schedule maintenance windows every 6–12 months. AI factories schedule them every 2–3 weeks, often at odd hours to align with training checkpoints.
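
The intervals above can be turned into a simple calendar. Here is a minimal sketch, assuming a month is approximated as 30 days and a quarter as 90 days. The start date is hypothetical, and a real schedule would also align each date with training checkpoints.

```python
from datetime import date, timedelta

# Intervals from the list above; day counts are approximations (assumption).
INTERVAL_DAYS = {
    "liquid filters": 30,           # monthly
    "quick disconnects": 90,        # every 3 months
    "busway torque checks": 90,     # quarterly
    "full drain and flush": 365,    # annually
}


def due_dates(task, start, horizon_days):
    """Dates on which a task falls due within horizon_days after start."""
    interval = timedelta(days=INTERVAL_DAYS[task])
    end = start + timedelta(days=horizon_days)
    dates = []
    current = start + interval
    while current <= end:
        dates.append(current)
        current += interval
    return dates


go_live = date(2025, 1, 1)  # hypothetical go-live date
for task in INTERVAL_DAYS:
    dates = due_dates(task, go_live, horizon_days=365)
    print(f"{task}: {len(dates)} visits, first on {dates[0].isoformat()}")
```

Summing the visits across all four tasks shows how quickly the calendar fills compared with a facility that opens a maintenance window once or twice a year.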

Personnel Skillset Requirement - The New Hybrid Engineer

Your old data centre technician with a CompTIA cert is not qualified for an AI factory. The new role demands:

Liquid cooling literacy: Understanding pressure differentials, flow rates, pump curves, and corrosion chemistry.
Electrical safety at higher voltages: Many AI facilities run at 415V or higher to the rack - that is electrician territory, not IT.
Fibre optic diagnostics: OTDR testing and loss budgeting - because a dirty connector crashes training jobs.
MOP authorship: Not just following procedures, but writing and revising them as equipment changes.

Operators are now hiring mechanical engineers and power systems technicians and cross-training them on GPUs. The "IT guy" is being made redundant.

The Verdict

The shift from traditional cloud to AI factory is not evolution; it is a complete platform reset. You cannot retrofit a legacy data centre for 120kW racks and liquid cooling - you must build fresh, with prefabricated PTUs, vendor-aligned compatibility, and rigorous CFD validation.

But the prize is enormous: AI factories are the refineries of the digital age. They convert megawatts and litres per minute into intelligence. Those who embrace the liquid, the lock-in, and the operational rigour will lead. Those who cling to air-cooled, Tier-IV-everywhere dogma will be left behind.

Your Next Step: Engineering That Knows AI

You cannot build an AI factory with general contractors and legacy data centre playbooks. You need specialists who understand:

Vendor-specific power train integration - PTUs matched to GPU power curves.
Liquid cooling compatibility - from quick disconnects to dry coolers.
CFD for high-density hybrid cooling - air + liquid + particle tracing.
Tier-based functional design - concurrent maintainability without overbuilding.

Ecanet Engineers delivers exactly that:

👉 Explore our core engineering expertise.

👉 Ready for prefabrication? Our modular data centre solutions cut deployment time in half.

👉 Upskill your team. We are proud to offer the Accredited Operations Specialist (AOS) course in partnership with Uptime Institute – the global standard for data centre operations.

Don't build yesterday's data centre for tomorrow's AI. Talk to us.

Albert Wong