GatherJob
Back to jobs
OpenAI

Software Engineer, Frontier Systems - Power Management

OpenAI
On-site Today

About the Team

The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world that OpenAI uses for its most cutting edge model training.

We take data center designs, turn them into real, working systems and build any software needed for running large-scale frontier model trainings.

Our mission is to bring up, stabilize and keep these hyperscale supercomputers reliable and efficient during the training of the frontier models.

About the Role

As a Software Engineer on the Frontier Systems team focused on power management, you will work on critical infrastructure to support cutting-edge research. With large-scale supercomputers consuming substantial amounts of power, managing this efficiently is key to maximizing computational capacity.  This role is critical to ensuring that our cutting-edge research supercomputing infrastructure runs smoothly, while maintaining reliability and grid-level power stability.

Our team empowers strong engineers with a high degree of autonomy and ownership, as well as ability to effect change. This role will require a keen focus on system-level comprehensive investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale. 

In this role, you will:

  • Develop and implement system-level and software-level solutions to optimize power usage in large-scale supercomputers, ensuring efficient and reliable operations.

  • Build automation to monitor power consumption patterns during training workloads and design algorithms to stabilize these fluctuations, preventing issues with grid reliability.

  • Work with researchers and engineers to design tools for real-time monitoring, detection, and remediation of power-related hardware and system faults.

  • Collaborate cross-functionally to translate complex electrical system requirements into code, while driving continuous improvements in power management solutions.

  • Drive the development of power throttling mechanisms at the IT system level to dynamically adjust power usage based on workload demands and infrastructure limitations.

  • Collaborate with hardware design teams to integrate system-level power control requirements into IT hardware design, ensuring seamless coordination between software-driven power management and hardware capabilities.

Apply now

Opens the company's application page

About the company

OpenAI

OpenAI

AI research and deployment company.