Technical Vendor Requirements & Evaluation: Selecting Cooling and Power Systems for AI

12 June

Uptime Institute’s AI Infrastructure Advisory – Part 2 of 5

Once the design is locked (see Part 1), the next critical challenge is turning specifications into vendor contracts. In AI infrastructure, poor technology choices or ill‑defined RFPs can lead to years of operational headaches, stranded capacity, or unsafe conditions.

According to Uptime Institute’s Paper 2, a disciplined, vendor‑neutral evaluation process is essential. This post covers the technical considerations for cooling and power, plus a structured framework for vendor selection.

Cooling Technology: Direct‑to‑Chip vs. Immersion

Uptime Institute strongly recommends direct‑to‑chip liquid cooling (DLC) for the majority of high‑density AI deployments. Here is why, along with technical trade‑offs.

Direct‑to‑Chip (DLC)

How it works:
A water/glycol mix circulates through cold plates mounted directly on GPUs and other high‑heat components. The fluid removes 70–80% of the heat; the remaining 20–30% is handled by conventional air cooling (fan walls, CRAHs).
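As a rough illustration of the split described above, the sketch below divides a rack's heat load between the liquid loop and the air system. The 120 kW rack density is a hypothetical figure chosen for the example, not a number from Uptime's paper; only the 70–80% capture range comes from the text.

```python
def split_rack_heat(rack_kw: float, liquid_fraction: float) -> tuple[float, float]:
    """Split a rack's heat load into (liquid_kw, air_kw)."""
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    liquid_kw = rack_kw * liquid_fraction
    return liquid_kw, rack_kw - liquid_kw


if __name__ == "__main__":
    RACK_KW = 120.0  # hypothetical rack density, assumed for illustration
    for fraction in (0.70, 0.80):  # the 70-80% capture range quoted above
        liquid_kw, air_kw = split_rack_heat(RACK_KW, fraction)
        print(f"{fraction:.0%} to liquid: {liquid_kw:.1f} kW liquid, {air_kw:.1f} kW air")
```

Even at the high end of liquid capture, the air side of a dense rack still has to reject tens of kilowatts, which is why the residual air cooling needs its own specification.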

Technical advantages per Uptime:

Standard rack form factor – DLC components fit into 19‑inch racks (or with minor modifications). No need for custom tanks.
Mixed density support – A single row can contain DLC‑cooled GPU racks alongside air‑cooled storage or networking racks.
Retrofit possible – Existing data halls can be upgraded to DLC if floor loading and CDU space are available.
Resiliency compatible – DLC topologies can achieve concurrent maintainability (Tier III) or fault tolerance (Tier IV) with redundant CDUs, manifolds, and cooling loops.

Challenges:

Continuous cooling required – Loss of coolant flow can trigger GPU shutdown in <1 second. This demands UPS‑backed pumps and thermal storage.
Fluid cleanliness – Impurities can clog cold plates. The design must include filtration, testing, and maintenance protocols.
Leak detection – Dripless quick connectors and solenoid valves are necessary but not sufficient; leak detection and containment must also be built into every rack.
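To make the continuous cooling requirement more concrete, here is a minimal sketch of a ride-through estimate for a chilled-water buffer tank, using the standard relation stored energy = mass × specific heat × allowable temperature rise. The tank size, temperature rise, and heat load are hypothetical, and the water properties are approximations; a real design would also account for mixing, stratification, and any glycol content, which lowers the specific heat.

```python
WATER_DENSITY_KG_PER_M3 = 997.0  # approximate value for water near 25 degrees C
WATER_CP_KJ_PER_KG_K = 4.18      # approximate specific heat of water


def ride_through_seconds(tank_m3: float, allowable_rise_k: float, heat_load_kw: float) -> float:
    """Estimate how long a fully mixed tank can absorb a constant heat load."""
    if heat_load_kw <= 0:
        raise ValueError("heat_load_kw must be positive")
    stored_kj = tank_m3 * WATER_DENSITY_KG_PER_M3 * WATER_CP_KJ_PER_KG_K * allowable_rise_k
    return stored_kj / heat_load_kw  # kJ divided by kJ/s gives seconds


if __name__ == "__main__":
    # Hypothetical case: 20 m3 tank, 6 K allowable rise, 1 MW liquid heat load.
    seconds = ride_through_seconds(tank_m3=20.0, allowable_rise_k=6.0, heat_load_kw=1000.0)
    print(f"Estimated ride-through: {seconds / 60:.1f} minutes")
```

The estimate only holds if the pumps keep running, which is the reason the pumps themselves need UPS backing.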

Immersion Cooling

How it works:
Servers are submerged in a dielectric fluid that transfers heat to facility water via heat exchangers.

When to consider (per Uptime):

Very high densities (projected beyond 200 kW/rack)
Long‑term commitment to a single form factor
Existing high‑performance computing (HPC) environments with immersion experience

Drawbacks:

Heavier infrastructure – Immersion tanks are larger and heavier than DLC racks, often requiring reinforced flooring and dedicated space.
Operational complexity – Staff must follow different procedures for hardware installation, removal, and maintenance.
Less flexible – Mixing immersion with air‑cooled or DLC racks in the same hall is difficult.
Limited vendor ecosystem – Fewer suppliers and less mature standards compared to DLC.

Uptime’s advice:
Immersion may be appropriate for niche cases, but for most organizations, DLC is the safer, more flexible, and more scalable choice.

Cooling Subsystems: What You Must Specify

Your RFP for cooling should include detailed requirements for:

  
Component – Technical specifications to include

Coolant Distribution Units (CDUs) – Redundancy (N+1, 2N), flow rate (L/min), pressure, secondary loop fluid type (water/glycol), leak detection, isolation valves.
Cold plates – Compatibility with target GPUs (e.g., NVIDIA Blackwell), thermal resistance, mounting mechanism, quick‑disconnect fittings.
Manifolds (A/B) – Redundant paths, dripless quick‑tee connectors, solenoid valves per port, pressure/temperature sensors.
Air cooling (fan walls, CRAHs) – Capacity for the residual 20–30% heat load, integration with DLC, variable speed drives for efficiency.
Primary cooling plant – Chillers (air‑cooled or water‑cooled), dry coolers, pumps, thermal storage tanks (TST), continuous cooling capability.
Fluid management – Filtration (micron rating), chemical treatment, monitoring (conductivity, pH), cleaning procedures.
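The CDU flow rate in an RFP can be sanity-checked against the heat load with the basic relation flow = heat / (density × specific heat × temperature rise). The sketch below is an illustration under stated assumptions: the default fluid properties are approximate values for plain water (a water/glycol mix will differ), and the heat load, temperature rise, and per-CDU capacity in the example are hypothetical.

```python
import math


def required_flow_l_per_min(
    heat_kw: float,
    delta_t_k: float,
    density_kg_per_l: float = 0.997,  # approximate, plain water
    cp_kj_per_kg_k: float = 4.18,     # approximate, plain water
) -> float:
    """Coolant flow needed to carry heat_kw at a given supply/return temperature rise."""
    if heat_kw < 0 or delta_t_k <= 0:
        raise ValueError("heat_kw must be >= 0 and delta_t_k must be > 0")
    litres_per_second = heat_kw / (density_kg_per_l * cp_kj_per_kg_k * delta_t_k)
    return litres_per_second * 60.0


def cdus_for_n_plus_1(total_flow_l_per_min: float, cdu_flow_l_per_min: float) -> int:
    """Number of CDUs needed to meet the flow (N) plus one redundant unit."""
    if cdu_flow_l_per_min <= 0:
        raise ValueError("cdu_flow_l_per_min must be positive")
    return math.ceil(total_flow_l_per_min / cdu_flow_l_per_min) + 1


if __name__ == "__main__":
    # Hypothetical row: 10 racks, 90 kW to liquid each, 10 K temperature rise.
    flow = required_flow_l_per_min(heat_kw=10 * 90.0, delta_t_k=10.0)
    print(f"Required flow: {flow:.0f} L/min")
    print(f"CDUs at 600 L/min each, N+1: {cdus_for_n_plus_1(flow, 600.0)}")
```

Running the same calculation with each vendor's stated flow and temperature rise is a quick way to spot specifications that do not add up.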

Power Systems: Volatility, Voltage, and Resilience

AI power requirements differ from traditional IT in two critical ways:

1. Rapid load fluctuations

GPU‑based training clusters can swing from near‑idle to full draw in milliseconds, and they do so repeatedly. This can stress generators, UPS systems, and even the grid.

What to specify:

UPS systems rated for high di/dt (a rapid rate of change of current)
Generator response times for step loads (often <10 seconds to full power)
Power distribution units (PDUs) with sufficient headroom for peaks
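One simple way to frame these requirements for vendors is to express the expected load swing as a fraction of the source rating and to state PDU headroom explicitly. The sketch below does only that arithmetic; all load and rating figures are hypothetical, and the acceptable step size for any given generator or UPS has to come from the manufacturer's data, not from this calculation.

```python
def step_load_fraction(idle_kw: float, peak_kw: float, source_rating_kw: float) -> float:
    """Size of an idle-to-peak swing as a fraction of the generator or UPS rating."""
    if source_rating_kw <= 0:
        raise ValueError("source_rating_kw must be positive")
    return (peak_kw - idle_kw) / source_rating_kw


def pdu_headroom_fraction(peak_kw: float, pdu_rating_kw: float) -> float:
    """Unused share of PDU capacity when the load is at its peak."""
    if pdu_rating_kw <= 0:
        raise ValueError("pdu_rating_kw must be positive")
    return 1.0 - peak_kw / pdu_rating_kw


if __name__ == "__main__":
    # Hypothetical cluster: 300 kW idle, 1,500 kW peak, on a 2,500 kW generator.
    step = step_load_fraction(idle_kw=300.0, peak_kw=1500.0, source_rating_kw=2500.0)
    print(f"Step load: {step:.0%} of generator rating")
    # Hypothetical rack feed: 130 kW peak on a 160 kW PDU.
    print(f"PDU headroom at peak: {pdu_headroom_fraction(130.0, 160.0):.1%}")
```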

2. Voltage and scale

For large AI facilities (e.g., 60 MW or 500 MW), distributing power at 415 V or 480 V requires massive, costly copper, because the current for a given load rises as the voltage falls. Uptime notes that many projects are moving to higher‑voltage distribution, including medium voltage and DC at 800 V or higher.

Implications:

Equipment (switchgear, breakers, cabling) is less common and more expensive.
Safety training and lockout/tagout procedures must be updated.
On‑site primary power (gas turbines, and potentially nuclear in the future) may be deployed where grid capacity is insufficient.
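The copper argument follows from the current needed to deliver a given power. For a balanced three-phase AC feed, I = P / (√3 × V × power factor); for DC, I = P / V. The sketch below compares currents for a 1 MW block. The 13.8 kV figure is an assumed example of a medium-voltage level, not a value from Uptime's paper, and the power factor is set to 1.0 for simplicity.

```python
import math


def three_phase_current_a(power_w: float, line_voltage_v: float, power_factor: float = 1.0) -> float:
    """Line current for a balanced three-phase load."""
    if line_voltage_v <= 0 or not 0 < power_factor <= 1:
        raise ValueError("voltage must be positive and power factor in (0, 1]")
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


def dc_current_a(power_w: float, voltage_v: float) -> float:
    """Current for a DC load."""
    if voltage_v <= 0:
        raise ValueError("voltage must be positive")
    return power_w / voltage_v


if __name__ == "__main__":
    POWER_W = 1_000_000  # 1 MW block, chosen for illustration
    for volts in (415, 480, 13_800):  # 13.8 kV is an assumed medium-voltage example
        print(f"{volts:>6} V AC: {three_phase_current_a(POWER_W, volts):8.1f} A")
    print(f"   800 V DC: {dc_current_a(POWER_W, 800):8.1f} A")
```

Conductor cross-section scales with current, so the drop in amperes at higher voltage translates directly into less copper per megawatt.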

Vendor Evaluation: A Structured Approach

Uptime Institute emphasises a vendor‑neutral, disciplined process to avoid lock‑in and ensure long‑term flexibility.

Step 1 – Pre‑qualification

Screen vendors based on:

Technology supported (DLC, immersion, air, hybrid)
Experience with AI‑scale deployments (e.g., >10 MW)
Geographic presence and service capabilities
Delivery track record (on‑time, on‑spec)

Uptime recommends evaluating at least three vendors per equipment type: UPS, generators, chillers, CDUs, etc.
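The screening criteria and the three-vendor minimum can be tracked with a very small script. The sketch below is one possible shape for that, not a prescribed tool: the `Vendor` fields, the 90% on-time threshold, and the sample data are all hypothetical, and only the ">10 MW" experience example and the minimum of three vendors come from the text above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    equipment_type: str              # e.g. "UPS", "CDU"
    largest_ai_deployment_mw: float
    has_local_service: bool
    on_time_delivery_rate: float     # 0.0 to 1.0


def prequalify(vendors, min_mw=10.0, min_on_time=0.9):
    """Keep vendors that pass the screening criteria (thresholds are assumptions)."""
    return [
        v for v in vendors
        if v.largest_ai_deployment_mw > min_mw
        and v.has_local_service
        and v.on_time_delivery_rate >= min_on_time
    ]


def shortlist_gaps(qualified, equipment_types, minimum=3):
    """Return equipment types that have fewer than `minimum` qualified vendors."""
    counts = Counter(v.equipment_type for v in qualified)
    return {t: counts[t] for t in equipment_types if counts[t] < minimum}


if __name__ == "__main__":
    vendors = [
        Vendor("A", "UPS", 25.0, True, 0.95),
        Vendor("B", "UPS", 5.0, True, 0.99),   # fails the >10 MW screen
        Vendor("C", "UPS", 40.0, True, 0.92),
        Vendor("D", "CDU", 12.0, True, 0.91),
    ]
    print(shortlist_gaps(prequalify(vendors), ["UPS", "CDU"]))
```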

Step 2 – Develop the RFP

Your RFP must include:

Technical requirements (from design specifications)
Delivery timelines – with financial consequences for delays
Future‑proofing – ability to upgrade, add capacity, or redeploy equipment
Service and support – onsite spares, response times, training
Testing and acceptance – factory witness tests (Level 1), site acceptance (Level 2), and integration testing (Levels 4–5)

Before final approval, Uptime recommends evaluating the completed design against three feasibility criteria:

Constructability: Can it be built on schedule at the intended location, given local labour, materials, and grid constraints?
Resiliency: Do the proposed topologies (5/4N, 4/3N, continuous cooling, thermal storage) meet the business’s availability targets?
Efficiency: Are cooling and power systems right‑sized to avoid over‑design? Is waste‑heat recovery feasible?
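For the resiliency check, it helps to see what the distributed redundant ratios imply for module sizing. The sketch below assumes that "5/4N" means five modules installed where any four can carry the full load, and likewise "4/3N" means four installed with three required; that reading and the 12 MW critical load are assumptions made for illustration.

```python
from fractions import Fraction


def max_normal_utilization(installed: int, required: int) -> Fraction:
    """Highest per-module loading that still survives the loss of spare modules."""
    if not 0 < required < installed:
        raise ValueError("need 0 < required < installed")
    return Fraction(required, installed)


def installed_capacity_kw(critical_load_kw: float, installed: int, required: int) -> float:
    """Total capacity to buy so that `required` modules alone can carry the load."""
    if not 0 < required < installed:
        raise ValueError("need 0 < required < installed")
    module_rating_kw = critical_load_kw / required
    return module_rating_kw * installed


if __name__ == "__main__":
    LOAD_KW = 12_000.0  # hypothetical critical load
    for installed, required in ((5, 4), (4, 3)):
        util = max_normal_utilization(installed, required)
        total = installed_capacity_kw(LOAD_KW, installed, required)
        print(f"{installed}/{required}N: max {float(util):.0%} per module, {total:,.0f} kW installed")
```

Under these assumptions, 5/4N needs less installed capacity than 4/3N for the same load, at the cost of more modules to coordinate.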

Step 3 – Bid evaluation

Create a standardised evaluation template aligned with business objectives. Criteria typically include:

Technical compliance (30%) – Meets power/cooling specs, redundancy, voltage, flow rates
Delivery & schedule (20%) – Lead time, ability to phase deliveries, penalty clauses
Cost (TCO) (25%) – Capital cost plus 10‑year maintenance, energy, and refresh
Vendor stability (10%) – Financial health, references, AI project experience
Service & support (10%) – Spares availability, mean time to repair (MTTR), local support
Future flexibility (5%) – Upgrade paths, compatibility with next-gen GPUs
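The weighted criteria above translate directly into a scoring sheet. The sketch below shows one way to compute a weighted score per bid, assuming each criterion is scored from 0 to 10 by the evaluation team; the weights are the ones listed above, while the scoring scale, the dictionary keys, and the two sample bids are hypothetical.

```python
# Weights in percent, taken from the criteria list above.
WEIGHTS_PCT = {
    "technical_compliance": 30,
    "delivery_schedule": 20,
    "cost_tco": 25,
    "vendor_stability": 10,
    "service_support": 10,
    "future_flexibility": 5,
}


def weighted_score(scores: dict) -> float:
    """Combine 0-10 criterion scores into a single 0-10 weighted score."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the weighted criteria")
    if any(not 0 <= value <= 10 for value in scores.values()):
        raise ValueError("each score must be between 0 and 10")
    return sum(WEIGHTS_PCT[name] * scores[name] for name in WEIGHTS_PCT) / 100.0


if __name__ == "__main__":
    # Hypothetical bids, scored 0-10 per criterion.
    bids = {
        "Vendor A": dict(technical_compliance=8, delivery_schedule=6, cost_tco=7,
                         vendor_stability=9, service_support=7, future_flexibility=5),
        "Vendor B": dict(technical_compliance=7, delivery_schedule=9, cost_tco=6,
                         vendor_stability=7, service_support=8, future_flexibility=8),
    }
    for name, scores in sorted(bids.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
        print(f"{name}: {weighted_score(scores):.2f}")
```

Fixing the weights before bids arrive, and rejecting score sheets that omit a criterion, keeps the comparison consistent across vendors.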

Critical Operational Considerations During Vendor Selection

Uptime’s Paper 2 also highlights operational factors that are often missed:

Demarcation of responsibility – For DLC, who owns the coolant loop up to the rack manifold? The facility team? The IT team? The vendor? This must be contractually clear.
Training – Vendor‑provided training on DLC maintenance, leak response, and fluid handling is non‑negotiable.
Spare parts – Long lead times for DLC components (e.g., CDUs, cold plates) mean you need a spares strategy upfront (see Lifecycle Intelligence).
Decommissioning – Equipment removal, fluid draining, and disposal should be part of the vendor agreement.

Summary: Technical Vendor Must‑Haves

Based on Uptime Institute’s Paper 2, your vendor evaluation process must deliver:

Cooling technology choice – DLC for most; immersion only for specific high‑density or HPC use cases.
Detailed cooling specifications – CDUs, manifolds, cold plates, air cooling, fluid management, continuous cooling capability.
Power system requirements – High‑performance UPS, generator step‑load response, medium‑voltage readiness, distributed redundant topologies (5/4N or 4/3N).
Structured RFP and evaluation – Three+ vendors per component, weighted criteria, delivery penalties, and future‑proofing clauses.
Service and training – Clear demarcation, spares, vendor‑provided operational training.
Owner control – Independent advisory (e.g., Uptime) to maintain neutrality, but final decisions rest with the owner.

Albert Wong