ICONSTAFF
Cambridge, Massachusetts
04/24/2026
Full time
Job Description

Senior Platform/Infrastructure Engineer
Location: Fully remote (HQ Cambridge, MA)
Hours: 9-5 EST, with 2-day on-site visits every 6 weeks

You'll be responsible for designing, scaling, and maintaining the infrastructure and internal developer platforms that power a real-time learning AI at a seed-stage startup. The role blends infrastructure ownership with platform engineering to enable AI/product teams to ship quickly and reliably.

Key Responsibilities - Infrastructure
Maintain production health: performance, reliability, cost efficiency, and security.
Manage GCP Kubernetes clusters (GKE), networking, storage, and compute resources.
Handle scaling, resource allocation, and high availability for growing customer demand.
Refine observability: logs, traces, metrics, dashboards, and alerts.
Perform security hardening and cost optimization.

Key Responsibilities - Platform Engineering
Build internal tooling and abstractions for developer productivity.
Design CI/CD pipelines using GitHub Workflows and ArgoCD.
Provide self-service environments, internal portals, and deployment systems.

Collaboration & Communication
Work closely with AI and full-stack teams to optimize system architecture.
Explain technical concepts and trade-offs clearly to engineers and non-engineers.
Troubleshoot issues across multiple systems (Python, JavaScript, SQL).

Requirements
5+ years in production cloud environments at scale.
Strong familiarity with GCP (primary) and some AWS experience.
Experience with Kubernetes (GKE), node pools, and memory-intensive jobs.
Working knowledge of CI/CD systems (GitHub Workflows + ArgoCD).
Exposure to observability tools (Datadog), databases (Cloud SQL, ClickHouse, Bigtable), and cloud services.

Skills & Qualities
Strong analytical and problem-solving ability.
Clear, collaborative communication.
Curiosity and ownership mentality.
Fluent in reading/debugging code across Python, JavaScript, SQL.

Technical Stack
Cloud: GCP (primary), AWS (secondary)
Kubernetes: GKE, multiple node pools
CI/CD: GitHub Workflows + ArgoCD
Data: Cloud SQL, ClickHouse, Bigtable, GCS, Dataflow
Networking: Cloudflare Workers, Durable Objects, WebSocket communication
Monitoring: Datadog
Environments: Production, Staging, Integration, Development