Overview

Sunnyvale, CA

We're looking forward to you joining us to collaborate, contribute, and revolutionize AI silicon and systems. We're seeking a world-class Machine Learning System Software Architect to join our SoC team at Baidu's Sunnyvale office. The successful candidate will be a motivated self-starter who thrives in a highly technical environment. You will help the team architect and create high-performance machine learning system software and build a distributed AI training system that connects thousands of Kunlun accelerators and servers.

Responsibilities
- Create differentiated architectural innovations for Baidu's Kunlun AI SoC roadmap.
- Architect, simulate, and design machine learning solutions for our AI products.
- Develop system-level ML architectures that push performance, power, and latency boundaries; collaborate with teammates to optimize hardware and software for maximum performance.
- Monitor industrial and academic trends in artificial intelligence and determine where they intersect with our roadmaps.
- Drive partnerships for access to advanced AI technologies.
- Evaluate the power, performance, and cost of prospective architectures and subsystems.
- Build scalable tools for modeling and performance evaluation.
- Engage with system and application software engineers to optimize the entire hardware/software stack.
- Work with SoC design, verification, and validation engineers to execute the architecture.

Qualifications
- Knowledge of the machine learning market: technological and business trends, the software ecosystem, and emerging applications.
- Proven track record of 5+ years architecting software solutions for machine learning acceleration and optimization, especially in large distributed training systems and HPC.
- Experience with deep learning frameworks such as TensorFlow, PyTorch, or PaddlePaddle.
- Strong track record of outreach to ML researchers and application developers.
- Experience with CPUs, GPUs, memory systems, and accelerators.
- Experience with performance simulation and modeling in C++.
- MS or PhD in Electrical or Computer Engineering.
- Excellent communication skills in both English and Chinese.

Culture Fit
- Mission alignment: We provide the best possible platform to accomplish this great mission.
- Self-directed: We work best with people who are driven, motivated, and aspire to greatness.
- Hungry to learn: We are eager to see you learn new skills and grow.
- Team orientation: We work in small, fast-moving teams and pursue big goals together.

Apply for this job: Interested in building your career at Baidu USA? Please apply through Baidu USA careers channels.
04/04/2026
Full time
NVIDIA is leading groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables groundbreaking creativity and discovery, and powers inventions once considered science fiction, from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We build communication libraries like NCCL, NVSHMEM, and UCX that are crucial for scaling Deep Learning and HPC. We're seeking a Senior Software Architect to help co-design next-generation data center platforms and scalable communications software. DL and HPC applications have huge compute demands and already run at scales of up to tens of thousands of GPUs. GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Efficient, fast communication between GPUs directly impacts end-to-end application performance, and this impact continues to grow with the increasing scale of next-generation systems. This is an outstanding opportunity to advance the state of the art, break performance barriers, and deliver platforms the world has never seen before. Are you ready to build the new and innovative technologies that will help realize NVIDIA's vision?

What you will be doing:
- Investigate opportunities to improve communication performance by identifying bottlenecks in today's systems.
- Design and implement new communication technologies to accelerate AI and HPC workloads.
- Explore innovative solutions in hardware and software for our next-generation platforms as part of co-design efforts involving GPU, networking, and software architects.
- Build proofs of concept, conduct experiments, and perform quantitative modeling to evaluate and drive new innovations.
- Use simulation to explore the performance of large GPU clusters (think scales of hundreds of thousands of GPUs).

What we need to see:
- M.S./Ph.D. degree in CS/CE or equivalent experience.
- 5+ years of relevant experience.
- Excellent C/C++ programming and debugging skills.
- Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC).
- Deep understanding of operating systems and computer and system architecture.
- Solid fundamentals of network architecture, topology, algorithms, and communication scaling relevant to AI and HPC workloads.
- Strong experience with Linux.
- Ability and flexibility to work and communicate effectively in a multinational, multi-time-zone corporate environment.

Ways to stand out from the crowd:
- Expertise in related technology and passion for what you do.
- Experience with CUDA programming and NVIDIA GPUs.
- Knowledge of high-performance networks like InfiniBand, RoCE, NVLink, etc.
- Experience with deep learning frameworks such as PyTorch, TensorFlow, etc.
- Knowledge of deep learning parallelisms and their mapping to the communication subsystem.
- Experience with HPC applications.
- Strong collaborative and interpersonal skills and a proven track record of effectively guiding and influencing within a dynamic, multi-functional environment.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4 and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until December 10, 2025. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
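The posting above centers on collective communication for scaling deep learning across GPUs. As a purely illustrative sketch (not NVIDIA's implementation), the following toy Python code simulates the ring all-reduce algorithm that libraries such as NCCL use: a reduce-scatter pass followed by an all-gather pass around a ring, so per-rank traffic stays roughly constant as the number of ranks grows.

```python
# Toy in-process simulation of ring all-reduce (reduce-scatter + all-gather).
# Illustrative only: real runtimes (NCCL, MPI) run this across GPUs and nodes.

def ring_allreduce(buffers):
    """Sum equal-length buffers across 'ranks' using the ring algorithm."""
    n = len(buffers)                     # number of ranks in the ring
    size = len(buffers[0])
    assert size % n == 0, "buffer length must split into n equal chunks"
    c = size // n                        # elements per chunk
    data = [list(b) for b in buffers]    # each rank's working buffer

    def chunk(r, i):
        return data[r][i * c:(i + 1) * c]

    def set_chunk(r, i, vals):
        data[r][i * c:(i + 1) * c] = vals

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n
    # to its right neighbor, which adds it into its own copy. After n - 1
    # steps, rank r holds the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        sends = [((r - s) % n, chunk(r, (r - s) % n)) for r in range(n)]
        for r in range(n):
            i, vals = sends[(r - 1) % n]          # received from left neighbor
            set_chunk(r, i, [a + b for a, b in zip(chunk(r, i), vals)])

    # Phase 2: all-gather. Reduced chunks circulate around the ring,
    # overwriting stale copies, until every rank has the full result.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, chunk(r, (r + 1 - s) % n)) for r in range(n)]
        for r in range(n):
            i, vals = sends[(r - 1) % n]
            set_chunk(r, i, vals)

    return data

result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
# every rank ends with the elementwise sum [11, 22, 33, 44]
```

Each rank sends roughly 2(n-1)/n times its buffer size in total, which is why ring all-reduce stays bandwidth-efficient even at large rank counts.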
04/02/2026
Full time
Job Title: AI/HPC System Architect, Principal Engineer
Office Location: San Jose, CA
Job Type: Full-Time
Work Model: Onsite

About SK hynix America
At SK hynix America, we're at the forefront of semiconductor innovation, developing advanced memory solutions that power everything from smartphones to data centers. As a global leader in DRAM and NAND flash technologies, we drive the evolution of mobile technology, empower cloud computing, and pioneer future technologies. Our cutting-edge memory technologies are essential in today's most advanced electronic devices and IT infrastructure, enabling enhanced performance and user experiences across the digital landscape. We're looking for innovative minds to join our mission of shaping the future of technology. At SK hynix America, you'll be part of a team that's pioneering breakthrough memory solutions while maintaining a strong commitment to sustainability. We're not just adapting to technological change, we're driving it, with significant investments in artificial intelligence, machine learning, and eco-friendly solutions and operational practices. As we continue to expand our market presence and push the boundaries of what's possible in semiconductor technology, we invite you to join our journey to create the next generation of memory solutions that will define the future of computing.

Job Summary:
As an AI/HPC System Architect, your primary responsibility will be to conceptualize the architecture of next-generation AI and HPC systems, leveraging advanced memory to create efficient, high-performance systems. Drawing on your extensive experience in AI and HPC infrastructure design, you will derive design requirements, guide the direction of advanced memory solutions, and continuously monitor industry trends. Additionally, you will foster collaborations with cloud service providers, OEMs, national labs, and other stakeholders to facilitate on-site proof-of-concept initiatives aligned with the proposed infrastructure architecture.

Responsibilities:
- Design next-generation AI and HPC infrastructure that uses advanced memory solutions to optimize performance and power efficiency.
- Derive design requirements for advanced memory solutions based on your extensive experience in AI and HPC infrastructure design.
- Continuously monitor industry trends and emerging technologies related to AI and HPC infrastructure from sources such as cloud service providers, OEMs, and national labs.
- Promote and establish collaborative partnerships with customers for on-site proof-of-concept initiatives involving advanced memory solutions, aligned with the proposed infrastructure architecture.
- Provide strategic direction and recommendations for the adoption and integration of advanced memory solutions within the infrastructure.

Qualifications:
- Ph.D. in Electrical and Computer Engineering or a related field, with 15+ years of experience in system architecture for large-scale computing; a specific AI focus is preferred.
- Strong understanding of compute, memory, and networking bottlenecks in AI systems.
- Deep expertise in AI system design, and knowledge of LLMs, GNNs, DLRMs, and other representative AI workloads.
- Background in hardware/software co-design for AI or HPC systems.
- Effective collaboration and partnership-building skills for engaging with customers and stakeholders.
- Experience in planning and executing on-site proof-of-concept initiatives to validate infrastructure solutions.
- Excellent communication skills to convey architectural concepts and influence stakeholders.

SKHYA is an Equal Employment Opportunity Employer. We provide equal employment opportunities to all qualified applicants and employees and prohibit discrimination and harassment of any type without regard to race, sex, pregnancy, sexual orientation, religion, age, gender identity, national origin, color, protected veteran or disability status, genetic information, or any other status protected under applicable federal, state, or local laws.

Compensation:
Our compensation reflects the cost of labor across several U.S. geographic markets, and we pay differently based on those defined markets. Pay within the provided range varies by work location and may also depend on job-related skills and experience. Your recruiter can share more about the specific salary range for the job location during the hiring process.

Pay Range: $180,000 - $240,000 USD
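The qualifications above call for understanding compute versus memory bottlenecks in AI systems. One standard back-of-envelope tool for that is the roofline model, sketched below in Python with hypothetical hardware numbers (the peak-compute and bandwidth figures are illustrative, not any specific SK hynix or accelerator product):

```python
# Roofline model sketch: attainable throughput is capped by the lower of
# peak compute and (arithmetic intensity x memory bandwidth).
# All hardware numbers below are hypothetical, for illustration only.

def attainable_tflops(ai, peak_tflops, mem_bw_tbs):
    """ai: arithmetic intensity in FLOPs per byte moved from memory.
    peak_tflops: peak compute in TFLOP/s.
    mem_bw_tbs: memory bandwidth in TB/s."""
    return min(peak_tflops, ai * mem_bw_tbs)

peak, bw = 100.0, 3.0  # hypothetical accelerator: 100 TFLOP/s, 3 TB/s HBM

# A matrix-vector product streams the matrix once and does ~2 FLOPs per
# fp32 element read, so its intensity is about 0.5 FLOP/byte: memory-bound.
gemv = attainable_tflops(0.5, peak, bw)     # capped at 0.5 * 3 = 1.5 TFLOP/s

# A large matrix-matrix multiply reuses operands heavily, so its intensity
# can reach hundreds of FLOPs per byte: compute-bound.
gemm = attainable_tflops(200.0, peak, bw)   # capped at peak, 100 TFLOP/s
```

The crossover point, peak compute divided by bandwidth (here about 33 FLOPs/byte), is what higher-bandwidth memory shifts: more bandwidth lets lower-intensity workloads reach peak compute.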
04/02/2026
Full time
A leading consulting firm is seeking an experienced professional to design and implement HPC and AI infrastructure solutions. The role involves optimizing performance and managing complex architecture across different environments. Candidates should possess significant experience in deploying XPU-based clusters and have a strong background in cloud platforms and automation. This position requires a Bachelor's degree or equivalent work experience and offers a competitive salary in California.
04/02/2026
Full time
A leading technology company is seeking a Principal Solutions Architect to enable large clusters for AI and HPC workloads. This role involves leading technical discovery, shaping reference architectures, and partnering with teams to build value models. Candidates should have deep hands-on experience with GPU-based infrastructure, strong communication skills, and a Bachelor's degree in a related field. The position is based in San Jose, CA, offering competitive benefits and an inclusive work environment.
04/02/2026
Full time
Staff/Principal Software Engineer, Control

Join to apply for the Staff/Principal Software Engineer, Control role at PsiQuantum.

PsiQuantum's mission is to build the first useful quantum computers: machines capable of delivering the breakthroughs the field has long promised. Since our founding in 2016, our singular focus has been to build and deploy million-qubit, fault-tolerant quantum systems. Quantum computers harness the laws of quantum mechanics to solve problems that even the most advanced supercomputers or AI systems will never reach. Their impact will span energy, pharmaceuticals, finance, agriculture, transportation, materials, and other foundational industries.

Our architecture and approach are based on silicon photonics. By leveraging the advanced semiconductor manufacturing industry, including partners like GlobalFoundries, we use the same high-volume processes that already produce billions of chips for telecom and consumer electronics. Photonics offers natural advantages for scale: photons don't feel heat, are immune to electromagnetic interference, and integrate with existing cryogenic cooling and standard fiber-optic infrastructure.

In 2024, PsiQuantum announced government-funded projects to support the build-out of our first utility-scale quantum computers in Brisbane, Australia, and Chicago, Illinois. These initiatives reflect a growing recognition that quantum computing will be strategically and economically defining, and that now is the time to scale. PsiQuantum also develops the algorithms and software needed to make these systems commercially valuable. Our application, software, and industry teams work directly with leading Fortune 500 companies, including Lockheed Martin, Mercedes-Benz, Boehringer Ingelheim, and Mitsubishi Chemical, to prepare quantum solutions for real-world impact. Quantum computing is not an extension of classical computing. It represents a fundamental shift, and a path to mastering challenges that cannot be solved any other way. The potential is enormous, and we have a clear path to make it real. Come join us to build the operating system for the world's first useful quantum computer!

Responsibilities
- Develop and implement software in Rust for control of photonic quantum computers.
- Design and architect system software.
- Participate in design and code reviews.
- Collaborate across software, hardware, and research teams at PsiQuantum.
- Test and maintain control software.
- Champion and serve as an exemplar of good software development practices at PsiQuantum.

Experience/Qualifications
- 15+ years of experience with Rust or C++ in a system software context.
- Architecture and implementation of system software for HPC, robotics, AI, quantum computing, semiconductor fabrication, or control systems.
- Proven track record of developing and shipping reliable, performant software in a high-scale or mission-critical context.
- Bachelor's degree in a technical discipline preferred, or equivalent experience.

Ways To Stand Out
- Experience in high performance computing.
- Experience in computer architecture.
- Experience in high performance networking.
- Experience building operating systems, compilers, or kernels.
- Experience interfacing with ASICs and/or FPGAs in complex, high-performance systems or an edge-computing context.
- Experience with Nix.
- Experience with highly scalable distributed systems in a reliability- and uptime-critical environment.
- Familiarity with quantum control.
- PhD in a related technical discipline is a plus, but not required.

We are open to adjusting compensation to experience level. PsiQuantum provides equal employment opportunity for all applicants and employees. PsiQuantum does not unlawfully discriminate on the basis of race, color, religion, sex (including pregnancy, childbirth, or related medical conditions), gender identity, gender expression, national origin, ancestry, citizenship, age, physical or mental disability, military or veteran status, marital status, domestic partner status, sexual orientation, genetic information, or any other basis protected by applicable laws.

Note: PsiQuantum will only reach out to you using an official PsiQuantum email address and will never ask you for bank account information as part of the interview process. Please report any suspicious activity to . We are not accepting unsolicited resumes from employment agencies.

U.S. Base Pay Range: $180,000-$205,000 USD. Bay Area Pay Range: $200,000-$230,000 USD.
04/02/2026
Full time
Staff/Principal Software Engineer, Control Join to apply for the Staff/Principal Software Engineer, Control role at PsiQuantum. PsiQuantum's mission is to build the first useful quantum computers-machines capable of delivering the breakthroughs the field has long promised. Since our founding in 2016, our singular focus has been to build and deploy million-qubit, fault-tolerant quantum systems. Quantum computers harness the laws of quantum mechanics to solve problems that even the most advanced supercomputers or AI systems will never reach. Their impact will span energy, pharmaceuticals, finance, agriculture, transportation, materials, and other foundational industries. Our architecture and approach is based on silicon photonics. By leveraging the advanced semiconductor manufacturing industry-including partners like GlobalFoundries-we use the same high-volume processes that already produce billions of chips for telecom and consumer electronics. Photonics offers natural advantages for scale: photons don't feel heat, are immune to electromagnetic interference, and integrate with existing cryogenic cooling and standard fiber-optic infrastructure. In 2024, PsiQuantum announced government-funded projects to support the build-out of our first utility-scale quantum computers in Brisbane, Australia, and Chicago, Illinois. These initiatives reflect a growing recognition that quantum computing will be strategically and economically defining-and that now is the time to scale. PsiQuantum also develops the algorithms and software needed to make these systems commercially valuable. Our application, software, and industry teams work directly with leading Fortune 500 companies-including Lockheed Martin, Mercedes Benz, Boehringer Ingelheim, and Mitsubishi Chemical-to prepare quantum solutions for real-world impact. Quantum computing is not an extension of classical computing. It represents a fundamental shift-and a path to mastering challenges that cannot be solved any other way. 
The potential is enormous, and we have a clear path to make it real. Come join us to build the operating system for the world's first useful quantum computer! Responsibilities Develop and implement software for control of photonic quantum computers in Rust. Design and architecture of system software. Participate in design and code reviews. Collaborate across software, hardware, and research teams at PsiQuantum. Testing and maintenance of control software. Champion and serve as an exemplar of good software development practices at PsiQuantum. Experience/Qualifications 15+ years experience with Rust or C++ in a system software context. Architecture and implementation of system software for HPC, robotics, AI, quantum computing, semiconductor fabrication, or control systems. Proven track record of developing and shipping reliable, performant software in a high scale or mission critical context. Bachelors degree in a technical discipline preferred, or equivalent experience. Ways To Stand Out Experience in high performance computing. Experience in computer architecture. Experience in high performance networking. Experience building operating systems, compilers, or kernels. Experience interfacing with ASICs and/or FPGAs in complex/high-performance systems or an edge-computing context. Experience with Nix. Experience with highly scalable distributed systems in a reliability and up-time critical environment. Familiarity with quantum control. PhD in a related technical discipline is a plus, but not required. We are open to adjusting the compensation to experience level. PsiQuantum provides equal employment opportunity for all applicants and employees. 
PsiQuantum does not unlawfully discriminate on the basis of race, color, religion, sex (including pregnancy, childbirth, or related medical conditions), gender identity, gender expression, national origin, ancestry, citizenship, age, physical or mental disability, military or veteran status, marital status, domestic partner status, sexual orientation, genetic information, or any other basis protected by applicable laws.

Note: PsiQuantum will only reach out to you using an official PsiQuantum email address and will never ask you for bank account information as part of the interview process. Please report any suspicious activity to . We are not accepting unsolicited resumes from employment agencies.

U.S. Base Pay Range: $180,000-$205,000 USD. Bay Area Pay Range: $200,000-$230,000 USD.
Storage Solution Architect - Supermicro

Job Req ID: 28037

About Supermicro
Supermicro is a top-tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the fastest-growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary
The world's largest AI and cloud platforms are being powered by next-generation storage, and this is your chance to be one of the architects behind it. As a Sr. Storage Solution Architect in Supermicro's FAE organization, you'll sit at the intersection of customers, engineering, and cutting-edge infrastructure, designing high-impact solutions that go straight into production at massive scale. If you love owning complex problems end to end, influencing real deployments, and working with teams that move fast and build boldly, this is the kind of role people wish they had joined a year earlier.
Essential Duties and Responsibilities
- Recruit, lead, and mentor a team of FAEs in support of revenue growth at all assigned accounts.
- Apply solution-architecture and presentation skills to help expand Fortune 1000 account penetration and loyalty.
- Work closely with the sales and extended teams to map end-user customer business and partner requirements and provide the most optimized technical solutions.
- Provide technical leadership in pre-sales and oversee implementation of such solutions.
- Effectively engage with C-level technical customers; gain and maintain trust.
- Act as a highly technical problem solver who understands system architecture and hardware/software interaction.
- Possess basic hardware debug skills.
- Manage and lead product hardware and software engineers during product development.
- Manage and lead teams during investigation and resolution of problems in the field.
- Maintain a good understanding of firmware, drivers, OS, and applications, and of the interactions among them that can cause system issues.
- Develop and review systems solutions, technical bid responses, and presentations.
- Work across departments to ensure customer satisfaction and timely resolution of issues.
- Build and demonstrate proofs of concept that meet or exceed customer business needs.
- Present the corporate brand, product messaging, and solutions to customers.
- Visit on-site facilities and operations to enable solution integration and issue resolution.
- Some travel required (up to 25%).

Qualifications
- Bachelor's or Master's degree in an engineering discipline; computer science, computer engineering, and electrical engineering preferred.
- Excellent communication skills and business sense required.
- Minimum 8 years of related experience with compute/storage servers, data centers, or IT infrastructure desired.
- Strong background and experience with x86-based server architecture.
- Solid hardware system expertise and diagnostic skills.
- Familiarity with firmware, Linux, Windows, and virtualization platforms is a plus.
- Strong technical communication and leadership skills to lead investigations with engineers of multiple disciplines.
- Must be able to work effectively in a high-pressure environment.

Salary Range
$150,000 - $185,000. The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.

Seniority Level: Mid-Senior level
Employment Type: Full time
Job Function: Information Technology
Industries: Computer Hardware Manufacturing; Appliances, Electrical, and Electronics Manufacturing; Computers and Electronics Manufacturing

EEO Statement
Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran or special disabled veteran status, marital status, pregnancy, genetic information, or any other legally protected status.
Principal Solutions Architect - AMD

This range is provided by AMD. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.

The Role
The AMD Datacenter GPU team is seeking an experienced Principal Solutions Architect to join our team focused on enabling large clusters for AI & HPC workloads.

The Person
The candidate will be a technical expert in datacenter infrastructure with deep knowledge of datacenter design, strong knowledge of compute (CPUs/GPUs), networking, and storage solutions, and experience partnering with customers to support RFP development. This role offers the opportunity to work at the cutting edge of AI & HPC infrastructure, solving complex technical challenges and helping customers implement transformative datacenter solutions at scale.

Key Responsibilities
- Lead customer technical discovery with data/ML, platform, and infrastructure stakeholders; map business goals to AI & HPC workloads and success metrics.
- Assess the current system state (GPUs/accelerators, storage, fabric, security), identify gaps and risks, and define required POCs.
- Shape reference architectures for large AI & HPC clusters (rack design, GPU topology, RoCE/InfiniBand, NVMe/parallel file systems) aligned to customer constraints (power, cooling, space); create high-level designs.
- Partner with the business development and product teams to build ROI/TCO models (CapEx/OpEx, $/token, $/inference) and craft the value story.
- Support drafting of technical sections of RFIs/RFPs; produce architecture diagrams, deployment plans, and implementation timelines.
- Partner with program and engineering teams to define POC success criteria, test plans, and exit reports.
- Collaborate with product management to foster product roadmap improvements.
- Design networks for high-throughput GPU clusters (scale-up / scale-out / OOB), including cabling.
- Design storage architectures optimized for AI data pipelines.
- Define datacenter layout strategies, power, and cooling, including rack power delivery and mechanicals.
Required Experience
- Deep hands-on experience designing and implementing large-scale GPU-based infrastructure solutions, including datacenter network and storage architectures.
- Proven track record of creating technical documentation and reference architectures.
- Excellent communication skills, with the ability to explain complex technical concepts.
- Experience working directly with customer technical teams.

Academic Credentials
Bachelor's degree or higher in Computer Science, Electrical Engineering, or a closely related field.

Location
San Jose, CA. This role is not eligible for visa sponsorship.

Benefits offered are described at AMD benefits at a glance.

Equal Employment Opportunity
AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process. AMD may use Artificial Intelligence to help screen, assess, or select applicants for this position. AMD's "Responsible AI Policy" is available here. This posting is for an existing vacancy.
Senior Solution Architect - AI / GPU Cloud

We are seeking a Senior Solution Architect to design GPU cloud and AI infrastructure solutions, lead PoCs and benchmarks, guide customers through deployment, and partner closely with engineering and operations teams at GMI Cloud.

About GMI Cloud
GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC. We operate hundreds of megawatts of AI-ready data center capacity across North America and a growing AI Factory footprint in Asia, delivering a full spectrum of services from GPU compute to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments.

Role Overview
As a Solution Architect, you will be the primary technical interface for our enterprise and hyperscaler accounts and help customers build AI without limits.

Key Responsibilities
- Serve as the primary technical point of contact for enterprise and hyperscaler customers.
- Deeply understand customer AI/ML/HPC workloads, scaling requirements, and deployment models.
- Architect GPU clusters, storage, networking, and orchestration solutions tailored to customer needs.
- Lead proofs of concept, benchmarks, and workshops demonstrating performance, reliability, and scalability.
- Produce technical proposals, architecture diagrams, capacity plans, and cost/performance recommendations.
- Translate complex technical issues into clear actions for both engineering and business stakeholders.
- Guide customers through onboarding, cluster setup, performance tuning, and scaling.
- Partner with internal infrastructure, DC ops, and engineering teams to ensure smooth delivery and implementation.
- Identify optimization opportunities in customer workloads (GPU utilization, networking, scheduling, cost).
- Act as a trusted advisor on GPU/AI infrastructure best practices, roadmap, and long-term planning.
- Maintain regular technical check-ins, capacity reviews, and performance reviews with customers.
- Gather customer feedback and collaborate with product/engineering to improve our platform.

Required Qualifications
Technical Background
- 5-10+ years in cloud infrastructure, GPU cloud, HPC, AI/ML infrastructure, or data center engineering.
- Strong understanding of distributed training and inference architectures; Kubernetes, Slurm, or other cluster/orchestration systems; the NVIDIA GPU stack (H100/H200/B200/GB200 or similar); InfiniBand/high-speed networking; and storage architectures for AI workloads.

Customer-Facing Skills
- Experience working directly with enterprise or hyperscaler technical teams.
- Ability to simplify complex infrastructure concepts for both technical and non-technical audiences.
- Strong communication, solution design, and project coordination skills.

Soft Skills
- Self-starter with an ownership mindset and excellent follow-through.
- Comfortable working in a fast-moving, high-growth environment.
- Strong problem solving and an "architect + advisor" mentality.

Preferred Qualifications (Nice to Have)
- Hands-on experience with large-scale GPU deployments (multi-node, multi-cluster).
- Exposure to hyperscaler capacity planning or AI infrastructure procurement teams.
- Experience with multi-region or global GPU deployments (US + APAC/Taiwan).

Why Join GMI Cloud
- Work directly with some of the world's most advanced AI organizations.
- Architect and deliver multi-MW GPU clusters at global scale.
- Influence the product roadmap and partner closely with NVIDIA and top-tier data center providers.
- High-impact role with significant ownership and career growth.
Overview
Solutions Architect
Location: San Francisco, CA (Hybrid)

About the role:
As a Solutions Architect at Together AI, you will work with customers and prospects to create business value through Generative AI applications. Solutions Architects at Together are trusted advisors to our customers who evaluate, identify, and demonstrate how Together can solve their AI needs. As key contributors to our sales organization, Solutions Architects add tremendous value to the customer journey and directly impact company growth and revenue. This is an exciting opportunity for a deeply technical professional passionate about AI and customer success to make a significant impact in a fast-paced, innovative environment.

Responsibilities
- Act as a technical advisor to our most strategic customers, deeply embedding with them to support the ideation and development of innovative applications using OSS models on Together AI.
- Run complex demonstrations and POCs of Together's entire stack, including both hardware and software solutions.
- Collaborate with sales to qualify new prospects and support existing customers along their journey to build cutting-edge Generative AI solutions.
- Build and maintain strong relationships with customer leadership and stakeholders, ensuring the successful deployment and scaling of their applications.
- Deliver high-value feedback to our Product, Engineering, and Research teams, ensuring our platform continues to evolve to meet customer needs.
- Build educational content and tooling for both internal and external use around Together's solutions (e.g., playbooks, blogs, demos).

Qualifications
- 5+ years of experience in a customer-facing technical role, with at least 2 years in a pre-sales function.
- Excellent communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.
- Ability to consult with new and existing customers to map business needs to technical solutions.
- Strong technical background, with knowledge of AI, ML, and GPU technologies and their integration into high-performance computing (HPC) environments.
- Strong understanding of training, fine-tuning, and inference in the context of open-source LLMs.
- Proficiency in Python and JavaScript, with experience building and delivering prototypes on API platforms.
- Familiarity with infrastructure services (e.g., Kubernetes, Slurm), infrastructure-as-code solutions (e.g., Ansible), container infrastructure (Docker), and scripting and programming languages (Python, JavaScript).
- Strong sense of ownership and willingness to learn new skills to ensure both team and customer success.
- Ability to operate in dynamic environments, adept at managing multiple projects, and comfortable with frequent context switching and prioritization.

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $180-260K + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our Privacy Policy at
Our Mission At Palo Alto Networks everything starts and ends with our mission: Being the cybersecurity partner of choice, protecting our digital way of life. Our vision is a world where each day is safer and more secure than the one before. We are a company built on the foundation of challenging and disrupting the way things are done, and we're looking for innovators who are as committed to shaping the future of cybersecurity as we are. Who We Are We take our mission of protecting the digital way of life seriously. We are relentless in protecting our customers, and we believe that the unique ideas of every member of our team contribute to our collective success. Our values were crowdsourced by employees and are brought to life through each of us every day - from disruptive innovation and collaboration to execution. From showing up for each other with integrity to creating an environment where we all feel included. As a member of our team, you will be shaping the future of cybersecurity. We work fast, value ongoing learning, and we respect each employee as a unique individual. Knowing we all have different needs, our development and personal wellbeing programs are designed to give you choice in how you are supported. This includes our FLEXBenefits wellbeing spending account with over 1,000 eligible items selected by employees, our mental and financial health resources, and our personalized learning opportunities - just to name a few! At Palo Alto Networks, we believe in the power of collaboration and value in-person interactions. This is why our employees generally work full time from our office, with flexibility offered where needed. This setup fosters casual conversations, problem-solving, and trusted relationships. 
Our goal is to create an environment where we all win with precision. Job Description Your Career Palo Alto Networks has built a world-class product management organization and continues to look for top-notch talent to tackle challenging security problems. As a Senior Product Manager for PAN-OS networking and VPN, you will play an instrumental role in managing the lifecycle of the core PAN-OS infrastructure, which powers our next-generation firewalls and multiple cloud-based as-a-service offerings. This role will be instrumental in driving the adoption of PAN-OS VPN connectivity technologies across our Strata Portfolio (NGFW, SWFW, Prisma Access, SD-WAN, Panorama, Strata Cloud Manager). This role will enable our customers to deploy our products across clouds, data centers, service providers, enterprises, and much more as we innovate with new security features, from threat prevention to GenAI, and hardware. This high-visibility position will drive the business of our core Strata network security platform and enable the company to succeed in its platformization strategy. Your Impact Establish the vision, product strategy, product roadmap, and technology requirements as part of the market and product requirements. Develop, manage, and communicate the product roadmap to customers, prospects, and all internal key stakeholders. Manage key features across our PAN-OS VPN and networking stack, including site-to-site VPN tunneling protocols such as IPsec, GRE, GENEVE, WireGuard, and MASQUE, and security protocols such as IKE, SSL, and TLS. Drive the strategic vision for the future of network connectivity. Conduct market research and competitive analysis to identify new opportunities and stay ahead of industry trends. Collaborate with sales and marketing teams to develop go-to-market strategies and support sales enablement. 
Monitor and analyze product performance metrics to inform ongoing product enhancements. Qualifications Your Experience As a Product Manager responsible for VPN and networking, you must have the following skills: Deep understanding of network security, routing protocols, and L2 switching. Experience with cryptography algorithms, post-quantum cryptography, and quantum computing. Deep expertise across the OSI model, L2-L7. Experience with highly scalable platforms with ASIC and FPGA implementations. Experience with compliance requirements across verticals such as finance, the public sector, and healthcare. Network design expertise with the following deployment architectures: cloud service providers, private and public cloud data centers, service provider data centers, large enterprise networks, branch/remote office networks, financial networks, and high-performance computing (HPC). GTM experience launching software features. Ability to write technical content (blogs, deployment guides, white papers, technical briefs, etc.) for both our internal sales teams and external customers. Experience building extensive GTM, technical, and strategic decks with business acumen and clarity. You will present in front of internal and external audiences. Collaborate with cross-functional teams across our portfolio (NGFW, SWFW, Prisma Access, SD-WAN, Panorama, Strata Cloud Manager) to drive networking features. Partner closely with engineering to develop proactive upgrade strategies. Collaborate with customer success and the field organization to implement those upgrade strategies. Develop insights into customer deployments and provide recommendations to accelerate feature adoption. BS/MS in electrical engineering, computer science, or computer engineering preferred. 
5-7 years of experience in a Product Manager, Technical Marketing, or engineering lead role (software development, release, or quality assurance), or expertise in managing upgrades of large network appliance systems. Any experience in product management is a plus. Hands-on operational (i.e., real-world) device and appliance management experience. Experience in routing, switching, security, or firewall technology is a plus, especially with PAN-OS NGFW firewalls. Strong understanding of networking RFCs, frameworks, and architectures. Strong communication skills and ability to interface cross-functionally as well as with external customers, field teams, and partners. Strong analytical and problem-solving skills. Experience with Agile methodologies and tools (e.g., Jira, Confluence). Ability to work in a fast-paced environment and manage multiple priorities. Proven track record of delivering successful products from concept to launch. Additional Information We're an innovative team focused on delivering best-in-class threat detection services. Our mission is to protect our customers' public cloud workloads with resilient, scalable, and always-on firewall solutions. As the leading FWaaS provider, we have successfully integrated with AWS, Azure, Google Cloud, and Oracle Cloud. Driven by a commitment to excellence, we set the standard for best-in-class security across even the most complex cloud infrastructures. We also happen to be one of the fastest-growing security products within Palo Alto Networks. Compensation Disclosure The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be between $0 - $0/YR. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here. 
Our Commitment We're problem solvers that take risks and challenge cybersecurity's status quo. It's simple: we can't accomplish our mission without diverse teams innovating, together. We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at . Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics. All your information will be kept confidential according to EEO guidelines.
04/02/2026
Full time
Overview About Fluidstack: At Fluidstack, we're building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises to unlock compute at the speed of light. We're working with urgency to make AGI a reality and are looking for motivated individuals who are committed to delivering world-class infrastructure. If you're motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next. Role As a System Engineer, GPU Fleet, you will manage, operate, and optimize hyperscale GPU compute infrastructure supporting AI/ML training and inference workloads. Ensure high availability, performance, and reliability of the GPU server fleet through automation, monitoring, troubleshooting, and collaboration with hardware engineering, platform teams, and datacenter operations. Responsibilities Operate and maintain a large-scale GPU server fleet (H100, B200, GB200) supporting AI/ML workloads; monitor system health, performance, and utilization to maximize uptime and ensure SLA compliance. Perform hands-on troubleshooting and root cause analysis of complex hardware, firmware, OS, and application issues across GPU clusters; coordinate with vendors and hardware teams to resolve systemic failures. Develop and maintain automation scripts for provisioning, configuration management, monitoring, and remediation at scale. Build and improve tooling for GPU health checks, performance diagnostics, driver validation, and automated recovery. Execute server provisioning, configuration, firmware updates, and OS installation using automation frameworks; manage lifecycle operations including deployment, maintenance, and decommissioning. Participate in 24x7 on-call rotation; respond to production incidents and coordinate resolution with cross-functional teams including datacenter operations, network engineering, and application teams. 
Lead post-incident reviews, document root causes, and drive continuous improvement initiatives focused on automation, reliability, monitoring, and operational efficiency. Basic Qualifications Bachelor's degree in Computer Science, Engineering, or related technical field (or equivalent practical experience). 3+ years (System Engineer) or 5+ years (Senior System Engineer) in Linux system administration, datacenter operations, or infrastructure engineering. Strong Linux/Unix fundamentals including system administration, shell scripting (Bash, Python), troubleshooting, and performance tuning. Experience with server hardware architecture, troubleshooting techniques, and understanding of compute, memory, storage, and networking components. Experience in automation and configuration management tools (Ansible, Puppet, Chef, Terraform). Strong analytical and problem-solving skills with ability to diagnose complex technical issues under pressure. Excellent communication and collaboration skills; ability to work effectively with cross-functional teams. Preferred Qualifications Experience managing large-scale GPU infrastructure (NVIDIA H100, A100, B200, GB200) in production environments supporting AI/ML workloads. Deep knowledge of GPU architecture, CUDA toolkit, GPU drivers, monitoring tools (nvidia-smi, DCGM). Experience with HPC cluster management, job schedulers (Slurm, PBS, LSF), and container orchestration (Kubernetes, Docker). Proficiency in out-of-band management protocols (IPMI, Redfish, BMC) and firmware management for server hardware. Experience with high-performance networking (InfiniBand, RoCE, RDMA) and network troubleshooting in GPU cluster environments. Familiarity with datacenter operations including rack installations, cabling, power management, and thermal considerations. Salary & Benefits Competitive total compensation package (salary + equity). Retirement or pension plan, in line with local norms. Health, dental, and vision insurance. 
Generous PTO policy, in line with local norms. The base salary range for this position is $200,000 - $300,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options. We are committed to pay equity and transparency. Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law. You will receive a confirmation email once your application has successfully been accepted. If there is an error with your submission and you did not receive a confirmation email, please email with your resume/CV, the role you've applied for, and the date you submitted your application. Someone from our recruiting team will be in touch.
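The automated GPU health checks described in the responsibilities above might be sketched as follows. The `nvidia-smi` query fields are real, but the thresholds, remediation policy, and sample data are illustrative assumptions.

```python
# Illustrative sketch of an automated GPU fleet health check; the
# nvidia-smi query fields are real, but thresholds are assumptions.
QUERY_FIELDS = "index,temperature.gpu,utilization.gpu,ecc.errors.uncorrected.volatile.total"
# In production this CSV would come from:
#   nvidia-smi --query-gpu=<QUERY_FIELDS> --format=csv,noheader,nounits

def parse_gpu_report(csv_text: str) -> list[dict]:
    """Parse comma-separated, no-header nvidia-smi output into per-GPU dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, temp, util, ecc = (field.strip() for field in line.split(","))
        gpus.append({"index": int(idx), "temp_c": int(temp),
                     "util_pct": int(util), "ecc_uncorrected": int(ecc)})
    return gpus

def unhealthy_gpus(gpus: list[dict], max_temp_c: int = 85,
                   max_ecc_uncorrected: int = 0) -> list[int]:
    """Return GPU indices breaching the (assumed) health thresholds."""
    return [g["index"] for g in gpus
            if g["temp_c"] > max_temp_c
            or g["ecc_uncorrected"] > max_ecc_uncorrected]

# Sample fleet snapshot: GPU 1 runs hot, GPU 2 shows uncorrected ECC errors.
sample = "0, 62, 88, 0\n1, 91, 99, 0\n2, 70, 10, 3"
flagged = unhealthy_gpus(parse_gpu_report(sample))  # -> [1, 2]
```

In a real fleet, flagged indices would feed automated remediation (draining the node from the scheduler, opening a ticket) rather than being returned to a caller; richer telemetry would typically come from DCGM rather than parsing `nvidia-smi` directly.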
04/02/2026
Full time
Overview Our Ethernet NIC product portfolio is designed for high-performance computing and networking applications, including AI and ML. This is driven by the growing need for high server bandwidth and the highest-throughput, lowest-latency networks. The NIC team is involved in the development of the next generation of Ethernet NIC solutions for AI/ML and high-performance computing applications. We are looking for excellent software and firmware engineers to join the NIC product development team. Responsibilities Design and development of the RoCE driver, RoCE tools, RoCE library, and firmware features for Ethernet NIC products. Design, develop, and maintain driver code in the Linux kernel and firmware code for embedded systems using C/C++. Develop, optimize, and debug low-level drivers, protocols, and real-time features. Work closely with architecture teams, silicon design teams, and other software/firmware teams to architect, design, and implement scalable, high-performance applications. Author and contribute to software design, development, validation, and documentation to deliver high-quality, high-performance, and functionally excellent products. Work with the QA team to define high-quality test cases, review tests, and provide support through the release development cycle. Work closely with Customer Support Engineers on customer field issues and provide timely resolutions. Work with the Linux community to upstream driver code to the public repo. Requirements BE in Computer Science / Electronics & Communications and 12+ years of experience, or MS and 10+ years of experience. Significant experience in the RDMA protocol, Linux systems programming, the Linux kernel, Linux network drivers, Linux kernel networking, virtual switching, data center networking, and firmware development. Good understanding of the RDMA protocol; hands-on experience with RDMA is highly desired. Excellent programming skills in C, C++, and Python. 
Proficiency in developing optimized code with both x86 and ARM64 compiler toolchains. Experience analyzing and tuning performance for a variety of AI/ML and HPC workloads. Deep knowledge of the Linux kernel and Linux kernel networking is an added advantage. Experience writing test scripts to verify NIC behavior is highly desired. Understanding of schematics, datasheets, and hardware interfaces. Strong analytical, problem-solving, and debugging skills in combined software and hardware environments. Excellent written and verbal communication skills; ability to collaborate efficiently with multiple teams across geographically diverse areas. Compensation and Benefits The annual base salary range for this position is $141,300 - $226,000. This position is also eligible for a discretionary annual bonus in accordance with relevant plan documents, and equity in accordance with equity plan documents and equity award agreements. Broadcom offers a competitive and comprehensive benefits package: medical, dental, and vision plans; 401(k) participation, including company matching; an Employee Stock Purchase Program (ESPP); an Employee Assistance Program (EAP); company-paid holidays; and paid sick leave and vacation time. The company follows all applicable laws for Paid Family Leave and other leaves of absence. Equal Opportunity Employer Broadcom is proud to be an equal opportunity employer. We will consider qualified applicants without regard to race, color, creed, religion, sex, sexual orientation, national origin, citizenship, disability status, medical condition, pregnancy, protected veteran status, or any other characteristic protected by federal, state, or local law. We will also consider qualified applicants with arrest and conviction records consistent with local law. Location & Application Notes If you are located outside the USA, please be sure to fill out a home address, as this will be used for future correspondence.
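The NIC-behavior test scripts the requirements mention could look something like the following sketch: diff `ethtool -S` counters before and after a traffic run and flag any growth in error counters. The parsing format matches real `ethtool -S` output, but the counter names and keyword policy here are illustrative; names vary by driver.

```python
# Hedged sketch of a NIC verification check: compare `ethtool -S <iface>`
# statistics across a test run and flag error-counter growth. Counter
# names vary by driver; those in the sample below are illustrative.
def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Parse `ethtool -S` output lines of the form '    name: value'."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines
        name, _, value = line.partition(":")
        value = value.strip()
        if value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)  # drops the banner line
    return stats

def grown_error_counters(before: dict[str, int], after: dict[str, int],
                         keywords=("error", "drop", "crc")) -> dict[str, int]:
    """Counters matching an error keyword that increased across the run."""
    return {name: after[name] - before.get(name, 0)
            for name in after
            if any(k in name for k in keywords)
            and after[name] > before.get(name, 0)}

before = parse_ethtool_stats(
    "NIC statistics:\n  rx_packets: 100\n  rx_crc_errors: 0\n  tx_dropped: 1")
after = parse_ethtool_stats(
    "NIC statistics:\n  rx_packets: 900\n  rx_crc_errors: 2\n  tx_dropped: 1")
regressions = grown_error_counters(before, after)  # -> {'rx_crc_errors': 2}
```

In practice the two snapshots would be captured by invoking `ethtool -S` around a traffic generator or RDMA perftest run, and a non-empty `regressions` dict would fail the test.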
04/02/2026
Full time
Join to apply for the Software Engineer, Model Inference role at OpenAI. About The Team Our Inference team brings OpenAI's most capable research and technology to the world through our products. We empower consumers, enterprises, and developers alike to use and access our state-of-the-art AI models, allowing them to do things that they've never been able to do before. We focus on performant and efficient model inference, as well as accelerating research progress via model inference. About The Role We are looking for an engineer who wants to take the world's largest and most capable AI models and optimize them for use in a high-volume, low-latency, and high-availability production and research environment. In This Role, You Will Work alongside machine learning researchers, engineers, and product managers to bring our latest technologies into production. Work alongside researchers to enable advanced research through awesome engineering. Introduce new techniques, tools, and architecture that improve the performance, latency, throughput, and efficiency of our model inference stack. Build tools to give us visibility into our bottlenecks and sources of instability, and then design and implement solutions to address the highest-priority issues. Optimize our code and fleet of Azure VMs to utilize every FLOP and every GB of GPU RAM of our hardware. You Might Thrive In This Role If You Have an understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for inference. Own problems end-to-end, and are willing to pick up whatever knowledge you're missing to get the job done. Have at least 5 years of professional software engineering experience. Have, or can quickly gain, familiarity with PyTorch, Nvidia GPUs, and the software stacks that optimize them (e.g., NCCL, CUDA), as well as HPC technologies such as InfiniBand, MPI, NVLink, etc. Have experience architecting, building, observing, and debugging production distributed systems.
Bonus points if you have worked on performance-critical distributed systems. Have needed to rebuild or substantially refactor production systems several times over due to rapidly increasing scale. Are self-directed and enjoy figuring out the most important problem to work on. Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed. About OpenAI OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity. We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other applicable legally protected characteristic. For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement. Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates.
For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse, and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss, or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations. To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link. OpenAI Global Applicant Privacy Policy At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology. Compensation Range: $325K - $490K
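For context on this role's emphasis on utilizing every GB of GPU RAM, inference serving capacity often comes down to KV-cache arithmetic. A minimal sketch, with illustrative model dimensions (a Llama-2-7B-like configuration, not any specific OpenAI model):

```python
# Back-of-envelope KV-cache sizing for transformer inference.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache: two tensors (K and V) per layer, per token."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes),
# batch 8, 4096-token context.
gib = kv_cache_bytes(8, 4096, 32, 32, 128, 2) / 2**30  # 16 GiB
```

At 16 GiB of cache for a batch of 8, this kind of estimate drives batch-size limits and motivates techniques such as grouped-query attention and paged KV caches.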
04/02/2026
Full time
Sr. Recruiter / Talent Consultant at AMD. WHAT YOU DO AT AMD CHANGES EVERYTHING At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career. SENIOR GPU FIRMWARE ENGINEER The Role Join AMD's datacenter firmware application team as a Firmware Application Engineer, supporting our GPU customers across Cloud, HPC, and OEM segments. In this customer-centric role, you will collaborate with external OEM partners, internal development and validation teams, and cross-functional stakeholders to bring next-generation server platforms powered by AMD's Instinct Accelerators to market, and to ensure their successful deployment in customer data centers. The Person The ideal candidate is familiar with embedded/firmware development, GPU drivers and runtimes, OS kernel internals, microcontroller fundamentals, hardware power/frequency controls, etc., and is comfortable performing quantitative workload analysis, pinpointing issues, and driving improvements together with the upper layers of the stack to achieve the best possible performance. You are a hands-on technical problem solver who thrives at the intersection of hardware and software. You enjoy collaborating directly with customers and internal engineering teams to turn complex system challenges into actionable solutions. You'll Excel In This Role If You Are energized by customer engagement and technical troubleshooting.
Have strong analytical instincts and a structured approach to problem solving. Communicate clearly and proactively across technical and non-technical audiences. Enjoy collaborating across hardware, firmware, and software disciplines. Bring curiosity, creativity, and persistence to complex engineering challenges. Key Responsibilities Manage technical interaction with OEM/ODM partners to enable deployment of AMD Instinct Accelerators in partner systems. Work alongside hardware and upper software layers to co-optimize the whole AI software stack. Design and build tools for better collecting and presenting GPU performance details correlated with low-level hardware characteristics. Support partners in the bring-up and validation of AMD Instinct GPUs in their systems; guide partners on the use of AMD tools, qualification test methods, and analysis of test results. Lead the debug of partner/customer issues (firmware, HW, driver), working with a cross-functional team and driving the root-cause investigation. Work with partners on the development of manufacturing/screening tests to ensure reliability at scale. Understand partner requirements and schedules, identify gaps in the AMD offering, and work with other stakeholders to close them. Author design guidelines, technical presentations, and training material. Provide recommendations to improve the customer experience with our SW and HW. Preferred Experience Experience with firmware development. Experience with embedded software development. Experience with power management and control theory. Experience working on system-level reliability and resiliency features. Familiarity with OS kernel/driver internals. Familiarity with GPU architectures and runtimes. Familiarity with microcontroller fundamentals (caches, buses, memory controllers, DMA, etc.). Strong C/C++ programming skills. Strong knowledge of PC/server architecture and interfaces, and experience with system-level debug.
Strong system-level debugging skills, with hands-on experience in system bring-up, HW debug, and performance optimization on various system architectures. Understanding of, and experience working with, enterprise Linux environments (Ubuntu, CentOS/RHEL, and SLES). Excellent oral and written communication skills to communicate technical results clearly and accurately. Experience with or knowledge of server firmware/BIOS settings, the boot process, and server monitoring and management SW. Solid knowledge of shell/Bash, C/C++, Python, or similar. Experience with OpenCL, CUDA, or ROCm is a plus. Preferred Academic Credentials BS/MS (Computer Science, Computer Engineering, Electrical Engineering, or related equivalent). Location Santa Clara, CA EEO Statement AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
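As a hedged sketch of the telemetry-tooling responsibility described in this posting (the sample format and throttle threshold are hypothetical illustrations, not an actual rocm-smi interface), one way to correlate GPU performance with low-level hardware behavior is to compute a time-weighted average clock from sampled telemetry and flag throttled intervals:

```python
# Illustrative analysis of (timestamp_s, sclk_mhz) telemetry samples.
# In practice the samples would come from a GPU management interface;
# the numbers here are made up for demonstration.

def weighted_avg_clock(samples):
    """Time-weighted average clock; samples sorted by timestamp."""
    total = 0.0
    for (t0, c0), (t1, _) in zip(samples, samples[1:]):
        total += c0 * (t1 - t0)  # clock c0 holds over [t0, t1)
    span = samples[-1][0] - samples[0][0]
    return total / span

def throttled_intervals(samples, floor_mhz):
    """Intervals whose clock fell below a (hypothetical) throttle floor."""
    return [(t0, t1) for (t0, c0), (t1, _) in zip(samples, samples[1:])
            if c0 < floor_mhz]

samples = [(0.0, 1700), (1.0, 1700), (2.0, 800), (3.0, 1700), (4.0, 1700)]
avg = weighted_avg_clock(samples)          # 1475.0 MHz
low = throttled_intervals(samples, 1000)   # one dip: [(2.0, 3.0)]
```

Correlating such dips with power caps, thermal limits, or workload phases is the kind of root-cause work the role centers on.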
04/02/2026
Full time
Junior Distributed Machine Learning Engineer Join to apply for the Junior Distributed Machine Learning Engineer role at Jobright.ai. Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust. Job Summary: Mohamed bin Zayed University of Artificial Intelligence is dedicated to research, innovation, and empowering brilliant minds in AI. The Distributed Machine Learning Engineer will optimize performance for machine learning software stacks, develop new systems, and work alongside researchers to tackle challenges in AI development. Responsibilities: • Understand, analyze, profile, optimize, and provide guidance to the team on deep learning workloads on state-of-the-art hardware and software platforms to improve their efficiency at different levels of optimization. • Design and implement performance benchmarks and testing methodologies to evaluate application performance. • Build tools to automate workload analysis, workload optimization, and other critical workflows. • Triage system issues, identify bottlenecks and inefficiencies by analyzing the sources of issues and their impact on hardware and network, and propose solutions to enhance GPU utilization. • Support the team in developing appropriate kernels and systems for new model architectures and algorithms. • Participate in, or lead, design reviews with peers and stakeholders to decide among available technologies. • Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
• Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback. • Represent MBZUAI at industry conferences and events, showcasing the institution's cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation. • Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives. Qualifications: Required: • Ph.D. in CS, EE, or CSEE with 1+ years of working experience, OR • Masters in CS, EE, or CSEE, or equivalent experience, with 2+ years of working experience • Strong background in parallel computing • Hands-on experience in system-level coding • Large-scale machine learning experience Company: Official account of Mohamed bin Zayed University of Artificial Intelligence. Dedicated to research, innovation, and empowering brilliant minds in AI. Founded in 2019, the company is headquartered in Abu Dhabi, Abu Dhabi, ARE, with a team of 51-200 employees. The company is currently Growth Stage. MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) has a track record of offering H1B sponsorships. Seniority level: Entry level. Employment type: Full-time. Industries: Software Development.
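The efficiency-analysis responsibilities above often start with model FLOPs utilization (MFU): achieved training throughput divided by hardware peak. A minimal sketch, assuming the standard ~6NT FLOPs estimate for a transformer forward+backward pass and illustrative hardware numbers:

```python
# MFU calculation for a training step. Model size, throughput, and
# peak FLOP/s below are illustrative assumptions, not measured data.

def train_step_flops(params, tokens):
    """Standard ~6 * N * T estimate for transformer fwd+bwd FLOPs."""
    return 6 * params * tokens

def mfu(params, tokens_per_sec, peak_flops_per_sec):
    """Fraction of hardware peak actually achieved by the workload."""
    return train_step_flops(params, tokens_per_sec) / peak_flops_per_sec

# Example: 7e9-param model at 3,000 tokens/s per GPU, against a
# 312 TFLOP/s peak (A100 bf16 dense) -> roughly 0.40 MFU.
util = mfu(7e9, 3_000, 312e12)
```

An MFU well below typical values for the model class points at bottlenecks (communication, data loading, kernel inefficiency) that the role is asked to triage.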
04/02/2026
Full time
Junior Distributed Machine Learning Engineer Join to apply for the Junior Distributed Machine Learning Engineer role at Jobright.ai Junior Distributed Machine Learning Engineer 1 day ago Be among the first 25 applicants Join to apply for the Junior Distributed Machine Learning Engineer role at Jobright.ai Get AI-powered advice on this job and more exclusive features. Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust. Job Summary: Mohamed bin Zayed University of Artificial Intelligence is dedicated to research, innovation, and empowering brilliant minds in AI. The Distributed Machine Learning Engineer will optimize performance for machine learning software stacks, develop new systems, and work alongside researchers to tackle challenges in AI development. Responsibilities: • Understand, analyze, profile, optimize, and provide guidance to the team on deep learning workloads on state-of-the-art hardware and software platforms to improve their efficiency with different levels of optimization • Design and implement performance benchmarks and testing methodologies to evaluate application performance • Build tools to automate workload analysis, workload optimization, and other critical workflows • Triage system issues and identify bottleneck and inefficiencies by analyzing the sources of issues and the impact on hardware, network and propose solutions to enhance GPU utilization • Support the team to develop appropriate kernels and systems for new model architectures and algorithms • Participate in, or lead design reviews with peers and stakeholders to decide amongst available technologies. • Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency). 
• Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback. • Represent MBZUAI at industry conferences and events, showcasing the institution's cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation. • Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives. Qualifications: Required: • Ph.D. in CS, EE or CSEE with 1+ years working experience, OR • Masters in CS, EE or CSEE or equivalent experience with 2+ year working experience • Strong background in parallel computing • Hands-on experience in system level coding • Large-scale machine learning experience Company: Official account of Mohamed bin Zayed University of Artificial Intelligence. Dedicated to research, innovation, and empowering brilliant minds in AI. Founded in 2019, the company is headquartered in Abu Dhabi, Abu Dhabi, ARE, with a team of 51-200 employees. The company is currently Growth Stage. MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) has a track record of offering H1B sponsorships. Seniority level Seniority levelEntry level Employment type Employment typeFull-time Job function IndustriesSoftware Development Referrals increase your chances of interviewing at Jobright.ai by 2x Inferred from the description for this job Medical insurance Vision insurance 401(k) Get notified when a new job is posted. 
Join to apply for the Lead Site Reliability Engineer role at Bridge Defense.

About Bridge Defense: Bridge Defense is redefining how modern defense technology is delivered. Based in Washington, D.C., we are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. We provide full-stream national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges. Our services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities, ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting-edge cyber defense. We are led by technologists and veterans with firsthand mission experience, which enables us to understand both the operational realities and the innovation needed to succeed. Our approach is agile and outcome-based, delivering results in weeks rather than months whenever possible. At Bridge Defense we value people, integrity, and excellence. We foster an environment where innovation thrives in support of traditional mission requirements. Our team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on the nation's most critical projects.

Core Values: Innovation & Responsiveness: We push beyond legacy models with efficient, tech-led solutions built to scale and evolve. Trusted Performance: Security, compliance, and deep experience delivering in demanding environments guide all we do. Mission-Focused Expertise: From veteran leadership to cleared engineers, our people understand both the technology and the mission.
About The Role: As the Lead Site Reliability Engineer for our ComputeBridge Engagement, you'll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high-performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a nine-figure government program. This is a hands-on engineering leadership role that bridges physical infrastructure and modern DevOps automation, ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You'll Do: Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments. Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays. Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms. Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.). Operate and maintain distributed networking meshes across multiple classified and unclassified domains. Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.)
for remote troubleshooting and control. Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads. Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance. Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement. Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows.

What You'll Bring: 3+ years of experience in site reliability, systems engineering, or hardware operations roles. Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting. Strong experience with Linux systems administration, imaging, and automated deployment. Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes environments. Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines). Experience configuring and managing networking and mesh architectures. Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks. Proficiency with out-of-band management and IPMI/iDRAC tooling. Certifications: Linux+ and Security+ (required or in progress). Excellent communication, documentation, and problem-solving skills. Clearance: Active TS/SCI required, or the ability to obtain one.

Bonus Points For: Experience operating in secure DoD or intelligence environments. Familiarity with Palantir platforms or other government data systems. Prior experience supporting AI/ML infrastructure in production or tactical settings. Experience with performance tuning and monitoring of HPC or GPU-accelerated clusters.

General Factors: Depending on project requirements, you may need to work a compressed schedule; expect overtime when schedules demand it. Willingness to travel, if needed. No relocation.
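The out-of-band management work above can be sketched as a small fleet-sweep script. The `ipmitool -I lanplus ... chassis power status` invocation is a standard IPMI-over-LAN query, but the hostnames, credentials, and function names here are illustrative assumptions; verify the exact flags against your BMC vendor's documentation:

```python
# Hedged sketch: query chassis power state over the out-of-band network for a
# list of BMCs, the kind of health sweep an SRE might automate. All inventory
# details are hypothetical.
import subprocess

def ipmi_power_cmd(host: str, user: str, password: str) -> list[str]:
    """Build the ipmitool command to query chassis power state over LAN."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "chassis", "power", "status"]

def sweep(hosts: list[str], user: str, password: str) -> dict[str, str]:
    """Run the power query against each BMC and collect the raw status line."""
    results = {}
    for host in hosts:
        proc = subprocess.run(ipmi_power_cmd(host, user, password),
                              capture_output=True, text=True, timeout=10)
        results[host] = proc.stdout.strip() or proc.stderr.strip()
    return results
```

In practice you would pull credentials from a vault rather than passing them on the command line, and feed the results into your monitoring pipeline.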
Why Bridge Defense: Shape how advanced computing supports national security missions at scale. Lead engineering for a major government program with direct mission impact. Competitive compensation, benefits, and growth opportunities in a mission-driven environment. Bridge Defense is committed to building a collaborative and mission-focused team. Bridge Defense reserves the right to modify job duties or requirements at any time. Employment with Bridge Defense is at will. Candidates must be eligible to work in the United States and complete any required background checks or security clearance processes as a condition of employment. Seniority level: Mid-Senior level. Employment type: Full time. Job function: Engineering and Information Technology. Industries: Defense and Space Manufacturing.
Posted: 04/02/2026 · Full time
A leading IT company in Massachusetts is seeking an HPC/AI Solutions Architect to design high-performance computing solutions. The role involves consulting with clients and recommending multi-vendor solutions, and requires 10+ years of technical experience in the IT sector. Candidates should possess excellent communication skills and be able to work independently. The company offers a competitive salary along with various benefits, including multiple health insurance options and a 401(k) plan.
Posted: 04/02/2026 · Full time
Advanced Micro Devices, Inc.
Santa Clara, California
SENIOR GPU FIRMWARE ENGINEER

WHAT YOU DO AT AMD CHANGES EVERYTHING. At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE: Join AMD's Datacenter firmware application team as a Firmware Application Engineer, supporting our GPU customers across Cloud, HPC, and OEM segments. In this customer-centric role you will collaborate with external OEM partners, internal development and validation teams, and cross-functional stakeholders to bring to market next-generation server platforms powered by AMD's Instinct Accelerators, and to ensure their successful deployment in customer data centers.

THE PERSON: An ideal candidate is familiar with embedded/firmware development, GPU drivers/runtimes, OS kernel internals, microcontroller fundamentals, hardware power/frequency controls, and related areas. They should be comfortable performing quantitative analysis of workloads, pinpointing issues, and driving improvements together with the upper layers of the stack to achieve peak performance. You are a hands-on technical problem solver who thrives at the intersection of hardware and software. You enjoy collaborating directly with customers and internal engineering teams to turn complex system challenges into actionable solutions. You'll excel in this role if you: Are energized by customer engagement and technical troubleshooting.
Have strong analytical instincts and a structured approach to problem solving. Communicate clearly and proactively across technical and non-technical audiences. Enjoy collaborating across hardware, firmware, and software disciplines. Bring curiosity, creativity, and persistence to complex engineering challenges.

KEY RESPONSIBILITIES: Manage technical interaction with OEM/ODM partners to enable deployment of AMD Instinct Accelerators in partner systems. Work alongside hardware and upper software layers to co-optimize the whole AI software stack. Design and build tools to better collect and present GPU performance details, correlating them with low-level hardware characteristics. Support partners in the bring-up and validation of AMD Instinct GPUs in their systems; guide partners on use of AMD tools, qualification test methods, and analysis of test results. Lead the debug of partner/customer issues (firmware, hardware, driver), working with a cross-functional team and driving root-cause investigations. Work with partners on the development of manufacturing/screening tests to ensure reliability at scale. Understand partner requirements and schedules, identify gaps in AMD's offering, and work with other stakeholders to close them. Author design guidelines, technical presentations, and training material. Provide recommendations to improve the customer experience with our software and hardware.

PREFERRED EXPERIENCE: Experience with firmware development. Experience with embedded software development. Experience with power management and control theory. Experience working on system-level reliability and resiliency features. Familiarity with OS kernel/driver internals. Familiarity with GPU architectures and runtimes. Familiarity with microcontroller fundamentals (caches, buses, memory controllers, DMA, etc.). Strong C/C++ programming skills. Strong knowledge of PC/server architecture and interfaces, and experience with system-level debug.
Strong system-level debugging skills with hands-on experience in system bring-up, hardware debug, and performance optimization on various system architectures. Understanding of, and experience working with, Enterprise Linux environments (Ubuntu, CentOS/RHEL, and SLES). Excellent oral and written communication skills to convey technical results clearly and accurately. Experience with or knowledge of server firmware/BIOS settings, the boot process, and server monitoring and management software. Solid knowledge of shell/Bash, C/C++, Python, or other frameworks. Experience with OpenCL, CUDA, or ROCm is a plus.

PREFERRED ACADEMIC CREDENTIALS: BS/MS (Computer Science, Computer Engineering, Electrical Engineering, or related equivalent).

LOCATION: Santa Clara, CA

Benefits offered are described at: AMD benefits at a glance. AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
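The "tools for collecting and presenting GPU performance details" responsibility can be sketched with a tiny telemetry summarizer. This is not AMD tooling; the field names and units are assumptions chosen for illustration:

```python
# Illustrative sketch (not an AMD tool): reduce sampled GPU telemetry to a
# perf-per-watt figure, the kind of correlation a firmware application
# engineer might surface when debugging power/frequency behavior.

def perf_per_watt(samples: list[dict]) -> float:
    """Average throughput (GFLOPS) divided by average board power (W)."""
    if not samples:
        raise ValueError("no telemetry samples")
    avg_gflops = sum(s["gflops"] for s in samples) / len(samples)
    avg_watts = sum(s["watts"] for s in samples) / len(samples)
    return avg_gflops / avg_watts

telemetry = [
    {"gflops": 900.0, "watts": 300.0},
    {"gflops": 1100.0, "watts": 340.0},
]
print(round(perf_per_watt(telemetry), 2))  # 1000 GFLOPS / 320 W = 3.12
```

A production version would read these samples from the driver or management interface and correlate them with clock and thermal state rather than hard-coded values.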
Posted: 04/02/2026 · Full time
As a Senior Network Engineer at Together, you are responsible for designing, implementing, and maintaining our network infrastructure to ensure seamless connectivity and optimal performance for all user-facing services and production systems. As both a strategic planner and a hands-on engineer, you apply sound networking principles, operational discipline, and advanced automation to our network environments. You specialize in networking systems, including routing, switching, network security, and protocols, implementing best practices for availability, reliability, and scalability. You have a keen interest in network design, optimization, and emerging technologies in HPC-based data center networking. Outstanding problem-solving abilities and a comprehensive understanding of fundamental network theory are also critical to your success.

Requirements: 8+ years of professional experience building, managing, and supporting large-scale hybrid data center networks (excluding enterprise networks). High level of proficiency with TCP/IP networking architecture and technologies such as BGP, OSPF, VXLAN, EVPN, and QoS. Experience developing network automation pipelines using Python, Ansible, or other languages/tools used in infrastructure automation. Proficiency with tools such as Wireshark, tcpdump, nmap, MTR, and curl to identify connectivity issues, latency problems, and network bottlenecks. Experience designing and supporting multi-tenant networks. Hands-on experience deploying and supporting network devices from Cisco, Arista, Juniper, and Mellanox. Experience working with cloud networks such as AWS, GCP, and Azure. Solid experience working in and troubleshooting within a Linux environment.

Responsibilities: Design, deploy, manage, and maintain global multi-vendor, multi-protocol high-performance compute networks. Analyze data to diagnose network issues and identify their root causes to minimize downtime.
Evaluate and recommend network technologies, hardware, and software solutions. Participate in design reviews to ensure the proposed network architecture aligns with business needs and is optimized for performance, scalability, and reliability. Manage relationships with external vendors and partners to test and verify hardware and software selections. Develop and deploy systems and tools to keep all networks running reliably and efficiently. Establish and implement industry best practices and contribute to the design of new scalable network solutions. Ensure compliance with IT governance standards and best practices. Lead projects that address complex technical challenges, contributing directly to roadmaps and partnering with the best engineers in the industry to develop world-class solutions.

Preferred: Knowledge of RoCE and InfiniBand protocols a plus. Experience with Docker, Kubernetes, or Slurm a plus. Understanding of AI training workloads and the demands they place on networks a plus.

About Together AI: Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

Compensation: We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $190,000 - $250,000 + equity + benefits. Our salary ranges are determined by location, level, and role.
Individual compensation will be determined by experience, skills, and job related knowledge. Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
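The multi-tenant requirement above lends itself to a small automation sketch: enforcing that each device's address sits inside its tenant's expected prefix, an invariant a fabric-validation pipeline might check. This uses only the standard-library `ipaddress` module; the tenant names, prefixes, and function name are made up for illustration and say nothing about Together's actual tooling:

```python
# Hypothetical multi-tenant address validation using only the stdlib.
import ipaddress

# Illustrative tenant-to-prefix mapping; real inventories come from an IPAM.
TENANT_PREFIXES = {
    "tenant-a": ipaddress.ip_network("10.10.0.0/16"),
    "tenant-b": ipaddress.ip_network("10.20.0.0/16"),
}

def misplaced(assignments: dict[str, tuple[str, str]]) -> list[str]:
    """Return devices whose IP falls outside their assigned tenant's prefix.

    `assignments` maps device name -> (tenant, ip address).
    """
    bad = []
    for device, (tenant, ip) in assignments.items():
        if ipaddress.ip_address(ip) not in TENANT_PREFIXES[tenant]:
            bad.append(device)
    return bad

devices = {
    "leaf1": ("tenant-a", "10.10.3.7"),
    "leaf2": ("tenant-b", "10.10.9.1"),  # wrong prefix for tenant-b
}
print(misplaced(devices))  # ['leaf2']
```

Checks like this are cheap to run in CI against rendered device configs before anything touches production.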
Posted: 04/02/2026 · Full time