Level 4 & 5 Commissioning: Testing AI Facilities for Real-World Workloads
Uptime Institute’s AI Infrastructure Advisory – Part 4 of 5
Commissioning is the final quality assurance gate before a data centre goes live. According to Uptime Institute, traditional commissioning protocols are insufficient for AI facilities. The reason: AI workloads behave nothing like conventional IT.
GPUs generate intense, variable heat and draw power in rapid, unpredictable cycles. Standard load banks – essentially giant electric heaters – cannot simulate this behaviour. Moreover, direct liquid cooling (DLC) introduces new components, including coolant distribution units (CDUs), manifolds, and cold plates, that must be tested under realistic failure scenarios.
This post, based on Uptime Institute’s Paper 4, covers the five levels of commissioning, with deep dives into AI‑specific challenges for Level 4 (functional performance testing) and Level 5 (integrated system testing).
Commissioning Levels: A Quick Refresher
Uptime defines five standard levels, adapted for AI infrastructure:
| Level | Name | What It Verifies |
|---|---|---|
| 1 | Factory Acceptance (or Witness) Test (FAT) | Equipment meets specifications before delivery (generators, UPS, chillers, CDUs). |
| 2 | Site Acceptance Test (SAT) | Equipment shipped and installed correctly, no transit damage. |
| 3 | Pre-Function Testing | Individual components function in isolation (e.g., a single CDU pump starts and stops). |
| 4 | Functional Performance Testing | Subsystems work together – e.g., a CDU plus its connected cold plates under load. |
| 5 | Integrated System Testing (IST) | Entire facility operates under partial and full load, including failure scenarios. |
For AI facilities, Levels 4 and 5 are where conventional methods break down. The rest of this post focuses on those levels.
The Load Bank Problem: AI Is Not a Heater
Conventional load banks are resistive heaters. They draw steady power and reject steady heat. AI training clusters do neither.
AI workload characteristics (per Uptime):
Volatile power draw – GPU clusters cycle between near‑idle and full load in milliseconds, repeatedly.
Variable heat output – Heat flux changes as rapidly as power, stressing cooling systems in ways steady loads cannot.
High peak densities – A 130 kW rack may draw 130 kW for seconds, drop to 30 kW, then spike again.
Consequence:
If you commission with resistive load banks alone, the tests cannot tell you whether your UPS, generators, and cooling systems can handle real AI loads. The facility might pass every test and then fail catastrophically on day one.
AI‑Aware Load Banks
Uptime Institute recommends specialised load banks that simulate the power profile and thermal output of actual GPU racks. These load banks must:
Connect to both power (electrical) and cooling (liquid and air) systems.
Support variable power profiles (scriptable power vs. time curves).
For DLC, include rack‑level simulators with connectors for facility coolant loops.
Allow mixing of liquid and air cooling loads (e.g., 100 kW liquid + 30 kW air per rack).
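A scriptable power vs. time curve can be as simple as a list of timed setpoints. The sketch below is a minimal illustration only: the `Setpoint` class, the `cycling_profile` helper, the 5-second dwell time, and the assumption that the liquid/air split stays constant at low load are all hypothetical. The 130 kW and 30 kW figures and the 100 kW liquid + 30 kW air split come from the examples in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setpoint:
    t_s: float        # time since the start of the script, in seconds
    liquid_kw: float  # heat rejected to the facility coolant loop
    air_kw: float     # heat rejected to room air

    @property
    def total_kw(self) -> float:
        return self.liquid_kw + self.air_kw


def cycling_profile(high_kw, low_kw, liquid_fraction, dwell_s, cycles):
    """Build a square-wave profile alternating between high and low load.

    Assumes (for illustration) that the liquid/air split is the same
    fraction at every load level.
    """
    points = []
    t = 0.0
    for _ in range(cycles):
        for total in (high_kw, low_kw):
            liquid = total * liquid_fraction
            points.append(Setpoint(t, liquid, total - liquid))
            t += dwell_s
    return points


# 130 kW peak and 30 kW trough per rack; 100 kW liquid + 30 kW air at peak.
profile = cycling_profile(high_kw=130.0, low_kw=30.0,
                          liquid_fraction=100.0 / 130.0,
                          dwell_s=5.0, cycles=3)
for p in profile:
    print(f"t={p.t_s:5.1f}s  liquid={p.liquid_kw:6.1f} kW  air={p.air_kw:5.1f} kW")
```

A real load bank controller would have its own script format; the point here is only that the profile is data that can be reviewed alongside the commissioning plan.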
Technical detail:
Some operators choose to retain in‑house control of critical load bank components – such as cooling manifolds – to ensure cleanliness and compatibility. Uptime advises that load bank evaluation should be part of the commissioning plan review.
Integrated System Testing (Level 5): Putting It All Together
Level 5 is the final, most comprehensive test. The entire facility – power, cooling, controls, and load banks – operates under simulated AI workloads. Uptime recommends third‑party witnessing for Level 5 to ensure impartial verification.
What to Test at Level 5
Full load steady state – Run all racks at 100% power and cooling load for an extended period (e.g., 8 hours). Verify that DLC holds the simulated cold‑plate temperatures within GPU operating limits and that air cooling handles residual heat.
Variable load cycling – Cycle load banks between low and high power (e.g., 20% → 100% → 20%) within seconds, repeatedly. Confirm that the UPS and generators follow the load without voltage dips or frequency excursions outside the acceptance criteria.
Cooling failure simulation – Stop a primary CDU or chiller so that the N+1 redundant unit has to take over. Validate that thermal storage and continuous cooling activate before GPU temperatures rise.
Power failure simulation – Drop utility feed and transfer to generators. Ensure all cooling equipment on UPS remains online and generators accept the step load from DLC pumps.
Concurrent maintenance – Isolate one power path or cooling loop for “maintenance” while the facility runs. Verify remaining paths handle full AI load without overload.
Fault tolerance (if required) – Cause a component failure (e.g., open a breaker). The facility must continue operating with no impact on the IT load. For Tier IV designs, confirm that compartmentalisation prevents failure spread.
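Before running the cooling failure and concurrent maintenance tests, it can help to check on paper that the remaining units cover the load. The following is a rough sketch of that arithmetic, not a method from the Uptime paper; the function name, the 500 kW CDU rating, and the rack counts are made-up values for illustration.

```python
def survives_loss_of_one(unit_capacities_kw, load_kw):
    """Return True if the load is still covered after losing any single unit.

    Losing the largest unit is the worst case, so only that case is checked.
    """
    if len(unit_capacities_kw) < 2:
        return False  # a single unit has no redundancy at all
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= load_kw


# Hypothetical plant: four CDUs rated 500 kW each.
cdus_kw = [500.0, 500.0, 500.0, 500.0]

# 12 racks x 100 kW liquid load = 1,200 kW; 1,500 kW remains with one CDU out.
print(survives_loss_of_one(cdus_kw, 12 * 100.0))

# 16 racks x 100 kW = 1,600 kW exceeds the 1,500 kW that remains.
print(survives_loss_of_one(cdus_kw, 16 * 100.0))
```

The calculation only sets an expectation; the Level 5 test is what demonstrates it under a live load.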
Coordination Between IT and Facilities
In enterprise and hyperscale facilities, the same organisation owns both IT and infrastructure. Here, IT acceptance testing should be integrated with facility commissioning rather than run separately.
In colocation facilities, the tenant is responsible for IT. However, the facility commissioning must still demonstrate that it can support tenants’ AI workloads without undue stress. Uptime suggests:
Publish maximum power ramp rates and cooling capacity curves.
Test with load banks that mimic worst‑case tenant behaviour (e.g., aggressive power cycling).
Provide tenants with commissioning reports that include transient response data.
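A published ramp rate is only useful if it can be compared against measured traces. As a minimal sketch, assuming the load bank or power meter exports `(time_s, power_kw)` samples, the steepest ramp in a trace can be computed as below. The trace values and the 150 kW/s limit are invented for illustration and are not Uptime figures.

```python
def max_ramp_rate_kw_per_s(samples):
    """Largest absolute rate of change between consecutive (time_s, power_kw) samples."""
    worst = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        worst = max(worst, abs(p1 - p0) / (t1 - t0))
    return worst


# Hypothetical trace: 30 kW -> 130 kW in 0.5 s, hold, then back down over 1 s.
trace = [(0.0, 30.0), (0.5, 130.0), (5.0, 130.0), (6.0, 30.0)]
PUBLISHED_LIMIT_KW_PER_S = 150.0  # assumed limit, for illustration only

rate = max_ramp_rate_kw_per_s(trace)
within_limit = rate <= PUBLISHED_LIMIT_KW_PER_S
print(f"max ramp {rate:.1f} kW/s, within published limit: {within_limit}")
```

Here the upward step works out to 100 kW over 0.5 s, i.e. 200 kW/s, which exceeds the assumed limit. Note that the result depends on the sampling interval: a coarse log will understate fast ramps.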
Documentation and Handover
Commissioning is not complete until documentation is handed over to the operations team. Uptime’s Paper 4 stresses that AI facilities require more detailed records than conventional sites.
Required deliverables:
Commissioning plan – Scope, schedule, testing methods, acceptance criteria.
Commissioning scripts – Step‑by‑step procedures for Levels 4 and 5, including expected outcomes and pass/fail thresholds.
Test results – Raw data logs (pressure, flow, temperature, power, voltage) plus exception reports.
As‑built drawings – Reflecting any changes made during construction or commissioning.
SOPs/MOPs/EOPs – Updated based on commissioning findings (see Part 5 for operations).
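Pass/fail thresholds and exception reports fit together naturally when the thresholds are written down as data. The sketch below shows one possible shape, under stated assumptions: the `Criterion` class, the signal names, and the acceptance bands are hypothetical, and a real script would take its limits from the design documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    signal: str  # name of the logged quantity
    low: float   # lowest acceptable reading
    high: float  # highest acceptable reading


def exceptions(log, criteria):
    """Return (timestamp_s, signal, value) for every reading outside its band."""
    found = []
    for row in log:
        for c in criteria:
            value = row[c.signal]
            if not (c.low <= value <= c.high):
                found.append((row["t_s"], c.signal, value))
    return found


# Hypothetical acceptance bands and log rows, for illustration only.
criteria = [
    Criterion("voltage_v", 380.0, 420.0),
    Criterion("supply_temp_c", 20.0, 32.0),
]
log = [
    {"t_s": 0, "voltage_v": 400.0, "supply_temp_c": 25.0},
    {"t_s": 1, "voltage_v": 372.0, "supply_temp_c": 25.5},
    {"t_s": 2, "voltage_v": 401.0, "supply_temp_c": 33.1},
]

report = exceptions(log, criteria)
print("PASS" if not report else "FAIL", report)
```

Keeping the raw log and the derived exception list separate mirrors the deliverables above: operations staff get both the full data and a short list of what went out of band.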
Uptime’s recommendation:
Use commissioning as a training opportunity. Operations personnel should witness or participate in Level 5 tests, so they understand normal and abnormal system behaviour before the facility goes live.
Decommissioning: The Forgotten Phase
Uptime Institute is unusual in including decommissioning considerations in a commissioning paper. The reason: AI facilities contain heavy, hazardous, and specialised equipment that will eventually need removal.
Decommissioning challenges:
Liquid cooling systems – Loops must be drained, and coolants (glycol mixtures, dielectric fluids) may require specialised handling and disposal.
Heavy equipment – CDUs, UPS modules, and racks over 2,000 kg may require structural reinforcement of removal paths, or may even have to be cut into pieces.
High‑voltage systems – Medium‑voltage switchgear and 800 VDC distribution gear require careful discharge and lockout.
What to do during commissioning:
Document procedures for draining, disconnecting, and removing each major component. Include these in the operations manual, even if decommissioning is years away.
Summary: Technical Commissioning Must‑Haves for AI Facilities
Based on Uptime Institute’s Paper 4, your commissioning plan must include:
AI‑aware load banks – Capable of simulating GPU power profiles (rapid cycling) and mixed liquid/air heat loads.
DLC‑specific testing – Fluid cleanliness, pressure/flow validation, continuous cooling (UPS‑backed), leak detection, and thermal storage.
Level 5 integrated testing – Full load, variable load, cooling failures, power failures, concurrent maintenance, and (if required) fault tolerance.
Third‑party witnessing – Independent verification of Level 5 testing to ensure impartial results.
Documentation and training – Commissioning scripts, test data, as‑built drawings, and operations personnel participation.
Decommissioning planning – Procedures for draining coolants, removing heavy equipment, and handling high‑voltage systems.