Hardware Operations Engineer
OpenAI
About the Team
OpenAI, in close collaboration with our capital partners, is building the world’s most advanced AI infrastructure ecosystem. Our Industrial Compute organization develops and deploys large-scale AI campuses designed to support the next generation of frontier model training and inference workloads.
The Hardware Operations team is responsible for ensuring the reliability, availability, and lifecycle health of OpenAI’s compute infrastructure. We partner closely with Data Center Operations, Fleet Health Engineering, Manufacturing, Network Infrastructure, Capacity Planning, and our infrastructure partners to maintain world-class operational performance across rapidly expanding AI environments.
As we scale globally, we are building the operational frameworks, reliability standards, and sustaining engineering practices required to support thousands of GPUs and servers across multiple campuses.
About the Role
We are seeking a Datacenter Hardware Technician Lead to serve as the senior on-site technical authority for hardware reliability and fleet health at one of OpenAI’s flagship AI campuses.
This role operates at the intersection of hardware operations, sustaining engineering, and fleet reliability. You will partner closely with Cloud Service Provider operations teams, OpenAI fleet-health engineers, hardware engineering teams, and OEM vendors to identify, diagnose, and resolve hardware issues affecting production systems.
Beyond day-to-day operational support, you will drive root cause investigations, reliability improvement initiatives, lifecycle management programs, and operational readiness efforts. You will help establish hardware maintenance standards, operational procedures, and best practices that scale across future OpenAI infrastructure deployments.
The ideal candidate combines deep hands-on datacenter hardware expertise with strong troubleshooting, failure analysis, and cross-functional leadership skills.
Candidates must be able to work onsite at our datacenters 5 days per week.
Key Responsibilities
Drive technical triage and resolution of complex hardware failures impacting production systems.
Partner with Fleet Health Engineering to investigate recurring hardware issues, identify failure patterns, and improve fleet reliability.
Lead root cause analysis (RCA) efforts for critical hardware incidents and develop corrective and preventive action plans.
C