Job Description
AI Infrastructure Engineer (Python)
Full Time, 5 Days Onsite in NYC

About the Role

We are seeking an AI Infrastructure Engineer (Python) to support, scale, and enhance a production AI and data platform. This role sits at the intersection of AI infrastructure, cloud engineering, and agent-based systems. You will be responsible for ensuring the reliability, scalability, and performance of AI-driven systems operating in production environments across multi-cloud platforms (Azure and GCP).

This is not a modeling or research role; it is focused on building and maintaining the infrastructure that powers AI systems at scale. This is an excellent opportunity for someone with strong foundational engineering skills who is eager to deepen their expertise in AI platforms and cloud-native systems.

What You'll Do

Systems Engineering & Agent Operations
- Develop, maintain, and optimize production-grade Python code supporting data pipelines, agent workflows, and platform tooling
- Own the full lifecycle of Python services (containerization, deployment, versioning, runtime management)
- Manage environment configurations, secrets injection, and dependency management across containerized services
- Build internal Python tooling and shared libraries to accelerate development workflows
- Troubleshoot production issues end-to-end across application and infrastructure layers

AI Platform & Scaling
- Operate and scale AI-driven agent systems in production environments
- Ensure high availability, performance, and resilience under load
- Support integrations between AI agents and data platforms
- Build observability tools (logging, monitoring, tracing, alerting)
- Implement auto-scaling strategies for containerized workloads
- Contribute to evaluation frameworks and quality standards for AI systems

Infrastructure & Cloud Operations
- Develop and manage infrastructure using Terraform across Azure and GCP
- Manage cloud services including container registries, identity systems, secrets management, and networking
- Deploy and maintain workflow orchestration tools (e.g., Prefect)
- Maintain CI/CD pipelines and release workflows
- Document systems, workflows, and data lineage with clear runbooks

What We're Looking For

Required
- 3-5 years of experience in Software Engineering, DevOps, or MLOps
- Strong Python skills with experience building production systems
- Experience with Docker and containerized applications in cloud environments (Azure and/or GCP)
- Hands-on experience with Terraform
- Experience with secrets management tools and secure configuration practices
- Familiarity with CI/CD pipelines and Git-based workflows
- Strong troubleshooting and systems-thinking mindset
- Interest in AI systems and infrastructure

Preferred
- Experience with Azure services (Container Apps, ACR, Key Vault, Managed Identities, VNets)
- Experience with GCP services (Cloud Run, GKE, Vertex AI, IAM, Secret Manager)
- Familiarity with workflow orchestration tools (e.g., Prefect)
- Exposure to AI/agent frameworks (e.g., LangChain, MCP)
- Experience with observability tools (e.g., MLflow, Langfuse)
- Experience with data tools such as dbt or Snowflake
- Familiarity with multi-cloud environments