Staff Engineer, Distributed Storage and HPC & AI Infrastructure

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world's largest AI training and inference workloads. You'll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving % savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day to day, you'll optimize end-to-end data paths for GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model weight distribution, and tune parallel filesystems for AI workloads.

Responsibilities

Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (% savings via tiering, lifecycle policies, right-sizing).
Design and optimize RDMA, InfiniBand, and 400GbE networks; tune for maximum throughput and minimum latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
Build Kubernetes storage operators and controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, and quotas; create reusable Helm/Terraform patterns.
Deliver GB/s per GPU node; optimize caching (weights, datasets, checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model weight distribution; implement smart prefetching and eviction.
Implement monitoring, alerting, and SLOs; design DR and backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive, automated remediation.
Partner with ML and SRE teams; mentor on storage best practices; contribute to open source; write docs, postmortems, and public learnings.

Requirements

8+ years in storage engineering, with 3+ years managing distributed storage at multi-petabyte scale.
Proven track record deploying and operating high-performance storage for GPU/HPC clusters.
Deep Kubernetes and cloud-native storage experience in production environments.
Strong coding skills in Go and Python, with demonstrated ability to build production-grade tools.
BS/MS in Computer Science, Engineering, or equivalent practical experience.
History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost.
Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale.
Object Storage: Production experience with S3, MinIO, Ceph, or R2, including performance optimization and cost management.
Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers.
Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput).
Programming: Go and Python for automation, operators, and tooling.
Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD).
Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations.
Observability: Prometheus, Grafana, Thanos architecture and operations.

Nice to Have Skills

GPUDirect Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE).
Storage snapshots, cloning, and thin provisioning.
Backup and disaster recovery (Velero, Restic, cross-region replication).
Storage encryption (at rest and in transit), security and compliance.
Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $260,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
Together AI is looking for an ML Engineer who will develop systems and APIs that enable our customers to perform inference and fine-tune LLMs. Relevant experience includes implementing runtime systems that perform inference at scale using AI/ML models, from simple models up to the largest LLMs.

Requirements

5+ years of experience writing high-performance, well-tested, production-quality code
Bachelor's degree in computer science or equivalent industry experience
Familiarity with the LLM inference ecosystem, including frameworks and engines (e.g. vLLM, SGLang, TRT, )
Demonstrated experience in building large-scale, fault-tolerant, distributed systems like storage, search, and computation
Expert-level programmer in one or more of Python, Go, Rust, or C/C++
Experience implementing runtime inference services at scale or similar

Responsibilities

Design and build the production systems that power the Together Cloud inference and fine-tuning APIs, enabling reliability and performance at scale
Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
Analyze and improve efficiency, scalability, and stability of various system resources
Conduct design and code reviews
Create testing frameworks for robustness and fault-tolerance
Participate in an on-call rotation to respond to critical incidents as needed

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $220,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
Overview

As a Site Reliability Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of a pragmatic operator and a software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

Qualifications

5+ years of professional SRE or related experience
Bachelor's degree in Computer Science or a related field, or equivalent work experience
Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
Proficiency in programming/scripting languages
Direct experience in monitoring and observability practices
Advanced knowledge of cloud services
Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Responsibilities

Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability
Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
Build monitoring systems to ensure the highest quality service for our customers
Design and implement operational processes (such as deployments and upgrades)
Debug production issues across all services and levels of the stack
Identify improvements for the product architecture from the reliability, performance, and availability perspectives
Plan the growth of Together AI's infrastructure

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at
04/02/2026
Full time
Overview

Solutions Architect
Location: San Francisco, CA (Hybrid)

About the role: As a Solutions Architect at Together AI, you will work with customers and prospects to create business value through Generative AI applications. Solutions Architects at Together are trusted advisors to our customers who evaluate, identify, and demonstrate how Together can solve their AI needs. As key contributors to our sales organization, Solutions Architects add tremendous value to the customer journey and directly impact company growth and revenue. This is an exciting opportunity for a deeply technical professional passionate about AI and customer success to make a significant impact in a fast-paced, innovative environment.

Responsibilities

Act as a technical advisor to our most strategic customers, deeply embedding with them to support the ideation and development of innovative applications using OSS models on Together AI
Run complex demonstrations and POCs of Together's entire stack, including both hardware and software solutions
Collaborate with sales to qualify new prospects and support existing customers along their journey to build cutting-edge Generative AI solutions
Build and maintain strong relationships with customer leadership and stakeholders, ensuring the successful deployment and scaling of their applications
Deliver high-value feedback to our Product, Engineering, and Research teams, ensuring our platform continues to evolve to meet customer needs
Build educational content and tooling for both internal and external use around Together's solutions (e.g., playbooks, blogs, demos)

Qualifications

5+ years of experience in a customer-facing technical role, with at least 2 years in a pre-sales function
Excellent communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.
Ability to consult with new and existing customers to map business needs to technical solutions
Strong technical background, with knowledge of AI, ML, GPU technologies and their integration into high-performance computing (HPC) environments.
Strong understanding of training, fine-tuning, and inference in the context of open-source LLMs
Proficiency in Python and JavaScript, with experience building and delivering prototypes on API platforms.
Familiarity with infrastructure services (e.g., Kubernetes, SLURM), infrastructure as code solutions (e.g., Ansible), container infrastructure (Docker), and scripting and programming languages (Python, JavaScript)
Strong sense of ownership and willingness to learn new skills to ensure both team and customer success.
Ability to operate in dynamic environments, adept at managing multiple projects, and comfortable with frequent context switching and prioritization.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $180-260K + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our Privacy Policy at
04/02/2026
Full time
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. As a Senior Developer Productivity Engineer at Together AI, you'll own the systems and tooling that empower engineers to ship high-quality software faster. You'll optimize workflows, enhance testing, enable reliable and reusable CI/CD, and work with developers to build out stable local environments. Your work will directly impact release velocity, developer happiness, and cross-cutting enablement ensuring engineers spend less time churning on infrastructure and more time building. Requirements Bachelor's degree in Computer Science, Engineering, or related field or 5+ years of industry experience in DevOps/SRE, developer tooling, or infrastructure engineering. Strong experience with CI/CD tools (GitHub Actions, ArgoCD, Gitops methodology) for building scalable, reusable pipelines. Experience with local dev environment tooling (e.g., Skaffold) and containerized development workflows Experience with creating starter templates in coordination with engineering for enabling rapid spin up of new services Strong ownership and a builder mindset, you love creating tools others rely on. Problem-solving rigor and attention to detail (e.g., diagnosing flaky tests, build latency). Nice to have Kubernetes expertise (EKS, K3s) and optimizing containerized builds. Infrastructure as Code (Terraform, Ansible, Pulumi). 
Front-end tooling familiarity (e.g., React, Next.js, Jest) to optimize FE dev workflows. Monitoring/observability (Prometheus, Grafana, Honeycomb) to debug bottlenecks. Responsibilities Create smart pipelines encouraging reusable workflows and simplicity. Streamline build/test/deploy workflows. Build shared tooling (CLIs, codegen, IDE plugins) to accelerate teams. Reduce friction (e.g., faster builds, hot-reload, test tooling). Collaborate with developers to identify pain points and streamline workflows. Champion best practices through documentation. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $150,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at
04/02/2026
Full time
About the Role Together AI is building the Inference Platform that powers the world's most advanced generative AI models. Your role will be a critical bridge between cutting-edge research and real-world applications, focusing on translating our internal model training research into production-ready deployments for our customers. This involves a deep commitment to data-centric development, meticulous hyperparameter tuning, and rigorous checkpoint evaluation before models ever hit production. This role will involve understanding customer-specific needs and fine-tuning models on our internal data recipe and their proprietary data. The goal is to transform general-purpose models into highly performant, specialized tools that solve real business problems. You will not be training foundation models from scratch but rather focusing on creating highly efficient, specialized models by working with dedicated GPU clusters. Responsibilities Design and iterate on novel speculator algorithms, combining architectural innovations with carefully curated data to push the frontier of accuracy-efficiency tradeoffs. Be the critical link between raw data and a production-ready model, seeing your work directly impact our customers' success. Work in a fast-paced, high-impact role at the cutting edge of generative AI. Collaborate with a team of experts dedicated to solving real-world, high-performance challenges. You'll collaborate directly with customers to understand their needs, and work closely with our core inference and Applied ML research teams to integrate your work into the production platform. A culture of deep technical ownership where you are empowered to take on and solve challenging problems. Requirements A genuine love for data curation and processing, with meticulous attention to detail. You believe that great models start with great data.
Demonstrated ability to perform effective hyperparameter searches and understand the trade-offs involved in tuning models for specific tasks. Experience working with and building on top of existing training codebases. You are comfortable navigating complex code and contributing to its improvement. Strong attention to detail in evaluating model checkpoints to ensure they meet strict quality, performance, and reliability standards. Experience with Python and PyTorch. Familiarity with SLURM and/or Kubernetes clusters and experience submitting and managing jobs in a high-performance computing environment. Familiarity with modern LLMs and generative models. Basic understanding of distributed training frameworks (e.g., FSDP, DeepSpeed). Bachelor's, Master's degree, or Ph.D. in Computer Science, Computer Engineering, or a related field, or equivalent practical experience. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, ATLAS, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $190,000 - $270,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
As a Senior Network Engineer at Together, you are responsible for designing, implementing, and maintaining our network infrastructure to ensure seamless connectivity and optimal performance for all user-facing services and production systems. As both a strategic planner and a hands-on engineer, you apply sound networking principles, operational discipline, and advanced automation to our network environments. You specialize in networking systems, including routing, switching, network security, and protocols, and you implement best practices for availability, reliability, and scalability. You have a keen interest in network design, optimization, and emerging technologies in HPC-based data center networking. Outstanding problem-solving abilities and a comprehensive understanding of fundamental network theory are also critical to your success. Requirements 8+ years of professional experience building, managing, and supporting large-scale hybrid data center networks (excluding enterprise networks). High level of proficiency with TCP/IP networking architecture and technologies such as BGP, OSPF, VXLAN, EVPN, and QoS. Experience developing network automation pipelines using Python, Ansible, or other languages/tools utilized in infrastructure automation. Proficient in using tools such as Wireshark, tcpdump, nmap, MTR, and curl to identify connectivity issues, latency problems, and network bottlenecks. Experience designing and supporting multi-tenant networks. Hands-on experience deploying and supporting network devices from Cisco, Arista, Juniper, and Mellanox. Experience working with cloud networks such as AWS, GCP, and Azure. Solid experience working in and troubleshooting within a Linux environment. Responsibilities Design, deploy, manage and maintain global multi-vendor, multi-protocol high-performance compute networks. Analyze data to diagnose and identify the root causes of network issues to minimize downtime.
Evaluate and recommend network technologies, hardware, and software solutions. Participate in design reviews to ensure the proposed network architecture aligns with business needs and is optimized for performance, scalability, and reliability. Manage relationships with external vendors and partners to test and verify hardware and software selections. Develop and deploy systems and tools to keep all networks running reliably and efficiently. Establish and implement industry best practices and contribute to the design of new scalable network solutions. Ensure compliance with IT governance standards and best practices. Lead projects to address complex technical challenges, contribute directly to roadmaps, and partner with the best engineers in the industry to develop world-class solutions. Preferred Knowledge of RoCE and InfiniBand protocols a plus. Experience with Docker, Kubernetes, or Slurm a plus. Understanding of AI training workloads and the demands they exert on networks a plus. About Together AI Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure. Compensation We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $190,000 - $250,000 + equity + benefits. Our salary ranges are determined by location, level and role.
Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
Senior Software Engineer - Together Cloud Infrastructure Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and enables state-of-the-art ML practitioners with self-serve AI cloud services, such as on-demand + managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world. Some of what you'll work on: Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning. Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs. Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining. Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining. To be successful, you'll need to be deeply technical and possess excellent communication, collaboration, and diplomacy skills. You have strong fundamental software development skills. In addition, you have strong systems knowledge and troubleshooting abilities.
Requirements
5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
5+ years experience writing high-performance, well-tested, production quality code
Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
Excellent communication skills - able to write clear design docs and work effectively with both technical and non-technical team members
Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches thereon or Kubernetes itself
Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIE passthrough, Kubevirt, SR-IOV
Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
Experience with Cluster API or similar a big plus
Experience working on high-performance compute, networking, and/or storage a big plus
Experience virtualizing GPUs and/or Infiniband a big plus
Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
Experience building IaaS or PaaS systems at scale a plus
Experience with DPUs/SmartNICs a plus
GPU programming, NCCL, CUDA knowledge a plus
Responsibilities
Perform architecture and research work for decentralized AI workloads
Work on the core, open-source Together AI platform
Create services, tools, and developer documentation
Create testing frameworks for robustness and fault-tolerance
About Together AI
Together AI is a research-driven artificial intelligence company.
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
Senior Software Engineer - Together Cloud Platform
Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior Backend Engineer, you will play a key role in building the next generation AI cloud platform - a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and enables state-of-the-art ML practitioners with self-serve AI cloud services, such as on-demand + managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.
Some of what you'll work on:
Work on a distributed GPU scheduling system for the on-demand clusters product, Instant Clusters.
Build out a global management plane for managing our data center compute, networking, and storage.
Design and build new customer-facing cloud platform services, delivering killer enterprise AI cloud features.
Required Qualifications
5+ years of demonstrated experience in building large scale, fault tolerant, distributed systems and API microservices
Experience designing, analyzing and improving efficiency, scalability, and stability of various system resources
Excellent communication skills - able to write clear design docs and work effectively with both technical and non-technical team members
Demonstrated experience with building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
Experience developing against and managing a relational database, such as PostgreSQL
Expert-level programmer in one or more programming languages (Golang preferred)
Proficiency in version control practices and integrating IaC with CI/CD pipelines
Experience with Kubernetes and containers preferred
Experience building and operating data infrastructure (Kinesis, Airflow, Kafka, etc.) a plus
Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience
Key Responsibilities
Identify, design, and develop foundational backend services that power Together's commerce platform
Analyze and improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure
Partner with product teams to understand functional requirements and deliver solutions that meet business needs
Write clear, well-tested, and maintainable software and IaC for both new and existing systems
Conduct design and code reviews, create developer documentation, and develop testing strategies for robustness and fault tolerance
Participate in an on-call rotation to address critical incidents when necessary
About Together AI
Together AI is a research-driven artificial intelligence company.
We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
04/02/2026
Full time
A leading AI company is seeking a Senior Backend Engineer to shape and scale its commerce platform. You will work on backend services that drive mission-critical commerce capabilities including billing and payment processing. The ideal candidate has over 5 years of experience in distributed systems and API microservices, and is proficient in languages such as Golang and Python. This full-time role offers competitive compensation ranging from $160,000 to $250,000, equity, and health benefits.
04/02/2026
Full time
A research-driven AI company is seeking a Senior Network Engineer responsible for designing and maintaining network infrastructure. The role requires a minimum of 8 years of experience with hybrid data center networks and proficiency in protocols such as BGP and OSPF. The ideal candidate will also have hands-on experience with networking tools and cloud networks. The position offers a competitive salary ranging from $190,000 to $250,000, equity options, and additional benefits, making it an excellent opportunity for networking experts.
04/02/2026
Full time
Role
As a Systems Research Engineer specialized in GPU Programming, you will play a crucial role in developing and optimizing GPU-accelerated kernels and algorithms for ML/AI applications. Working closely with the modeling and algorithm team, you will co-design GPU kernels and model architecture to enhance the performance and efficiency of our AI systems. Collaborating with the hardware and software teams, you will contribute to the co-design of efficient GPU architectures and programming models, leveraging your expertise in GPU programming and parallel computing. Your research skills will be vital in staying up-to-date with the latest advancements in GPU programming techniques, ensuring that our AI infrastructure remains at the forefront of innovation.
Requirements
Strong background in GPU programming and parallel computing, such as CUDA and/or Triton
Knowledge of ML/AI applications and models
Knowledge of performance profiling and optimization tools for GPU programming
Excellent problem-solving and analytical skills
Bachelor's, Master's, or Ph.D. degree in Computer Science, Electrical Engineering, or equivalent practical experience
Responsibilities
Optimize and fine-tune GPU code to achieve better performance and scalability
Collaborate with cross-functional teams to integrate GPU-accelerated solutions into existing software systems
Stay up-to-date with the latest advancements in GPU programming techniques and technologies
About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models.
We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at
04/02/2026
Full time