Job Description
Description: Must have skills • PowerEdge Rack/Tower Experience, RHEL and Ubuntu OS, Experience working in a larger server environment. • PowerEdge XE server experience - Poweredge XE-series GPU servers and accelerated/HPC environment • Strong experience with HPC operations in production environments (scheduling, monitoring, troubleshooting, capacity management). • Hands on expertise with NVIDIA GPU infrastructure (preferably GB200/GB300 class racks, H100/B100 or similar generations) including firmware/driver stack, monitoring, and lifecycle management. • Familiarity with liquid cooled data center infrastructure: working safely around cold plate / rear door heat exchanger systems, understanding facility interfaces (CDU, manifolds, leak detection, etc.). • Solid understanding of Linux administration (RHEL/CentOS/Ubuntu) in HPC/AI clusters: OS provisioning, patching, performance tuning, and troubleshooting. • Experience with cluster management tools (e.g., Bright, OpenHPC, Slurm, PBS/LSF, Kubernetes based AI stacks, or equivalent schedulers and orchestration frameworks). • Strong Day 2 operations mindset: incident response, root-cause analysis, change management, and operational runbook creation. • Excellent customer facing communication skills and the ability to work on site, embedded with the customer's team and their operations staff. Nice to have skills (if any) • Prior exposure to liquid-cooled or rack-scale GPU platforms (Nvidia HGX, NVLink, etc.) is highly desirable • Prior experience operating or supporting Dell XE/XE HPC & AI solutions and related management tooling. • Background in HP/HPE HPC environments (familiar with how HP is typically deployed/managed in HPC shops, to ease comparison and migration). • Exposure to AI/ML and GenAI workloads running at scale on NVIDIA GPU infrastructure (MLOps pipelines, model training/inference operations). • Experience with data center facilities coordination (power and cooling planning, rack integration, cabling standards, change windows). • Scripting skills in Python/Bash/PowerShell for automation, reporting, and integration with monitoring/ITSM systems. • Familiarity with ITIL aligned processes (incident, problem, and change management) and documenting runbooks/SOPs. • Strong scripting/automation skills, especially in Python and Bash, for automating operational workflows, health checks, and reporting across large GPU clusters. • Experience building or maintaining infrastructure automation (e.g., Ansible, Terraform, or similar) for repeatable deployment, configuration, and lifecycle management of HPC/AI nodes. • Some scripting skills in Python/Bash/PowerShell for automation, reporting, and integration with monitoring/ITSM systems. • Experience with DevOps / Git based workflows (GitLab/GitHub, CI/CD) to help with scripts, automation playbooks, and configuration as code.