Job Description

Cox Automotive is looking for a Senior Site Reliability Engineer (SRE) to join our Manheim Logistics SRE team. The SRE team is tasked with designing and maintaining AWS infrastructure and deployment pipelines for Manheim Logistics' 15+ development teams. The team has currently standardized on a Docker-based infrastructure solution and is adding functionality to support new development team requests and architectural patterns (such as Lambda, Step Functions, Fargate, etc.). The SRE team has a strong focus on IaC with Terraform and best practices such as least privilege access, proactive monitoring and alerting, etc. This role will work directly with a release train and help with IaC and SRE activities such as improving monitoring/alerting, defining an error budget, assisting with DevSecOps, etc.

As a Senior Site Reliability Engineer at Cox Automotive you will:
- Take complex problems and come up with a technically reasonable solution
- Work with and define SLOs, error budgets, etc.
- Have innate curiosity about how things work
- Design and assist in the authoring of software tools that reliably manage application delivery & performance
- Design and assist in the setup and maintenance of application monitoring and alerting
- Engage with engineering teams to ensure best practices are implemented
- Improve predictability and reliability of software releases, workflows, and operating software
- Reduce mean time to recovery (MTTR) by helping troubleshoot, monitor, alert, and automate recovery
- Bring solid written communication, problem solving, and process management skills

Qualifications:
- Bachelor's degree in Computer Science or a related discipline and minimum 4 years' experience in a related field. The right candidate could also have a different combination, such as a Master's degree and 2 years' experience; a Ph.D. and up to 1 year of experience; or 16 years' experience in a related field.
- Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM/OPT or visa sponsorship now or in future.
- Minimum 4 years in software development and architecture/solutioning experience
- Strong automation experience: testing, deploying, monitoring, etc.
- Experience with Terraform
- Experience with Amazon AWS technologies, especially ECS and Lambda
- Monitoring/observability tools such as New Relic, Splunk, PagerDuty
- Experience with agile development, continuous integration and automated testing

Preferred Skills:
- Broad AWS platform skills including Cognito, WAF, ElastiCache (Redis), Elasticsearch, SNS, SQS, S3, Systems Manager
- Experience automating Terraform at scale
- Experience with Database Server infrastructure (RDS, MySQL, Postgres, etc.)
- .NET Core development experience
- GitHub Actions
- Experience with GitHub, Docker, and Linux administration

USD 101 100.00 per year

Compensation:
Compensation includes a base salary of $101,500.00 - $169,100.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits:
The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.
01/15/2026
About the Team

We're building a Reliability Engineering team that architects and scales intelligent automation platforms for reliability at Cox Automotive. This isn't traditional operations: we write code that prevents incidents, build AI-driven response systems, and create platforms that make our engineering organization more resilient. This is a ground-floor opportunity within our Center of Excellence to shape the future of how we respond to, learn from, and ultimately prevent incidents across our complex cloud-based services.

About the Position

You are a software engineer with a passion for operations who wants to build intelligent incident response automation at scale. Imagine writing code that prevents incidents before they happen, building AI-driven response systems that learn from every outage, and creating platforms that empower engineers across the organization to build more resilient systems.

You are an ambitious, independent self-starter who finds joy in exploring and evolving reliability solutions with engineering teams. You're a quick study who gets excited about tackling challenging operational problems with software and seeing the impact of your work across the enterprise.

As a Senior SRE at Cox Automotive you could:
- Build automation that reduces toil and empowers engineering teams.
- Create tools and platforms that help teams understand and improve their system reliability.
- Reimagine how we learn from incidents and turn insights into preventive measures.
- Experiment with new approaches to observability, monitoring, and alerting.
- Bring your engineering expertise to complex production challenges.
- Explore how AI can transform incident detection, triage, and response.
- Partner with teams across the organization to review & analyze incidents and solve reliability problems at scale.
- Drive technical conversations that shape how Cox Automotive builds resilient systems.
- Turn operational pain points into engineering opportunities.
- Define what modern incident response engineering looks like for our organization.

Qualifications:
- Professional experience with static languages (Java, C#, Go) and dynamic languages (Python, Ruby, JavaScript), and an understanding of the tradeoffs of each.
- Distributed systems expertise and understanding of failure modes.
- Experience building internal platforms, developer tools, or automation that scales.
- Git/version control and CI/CD pipeline (link removed)
- Infrastructure as code and API design experience.
- Track record eliminating toil through intelligent automation.
- Production ownership experience (on-call, incident response, observability).
- Systems-thinking mindset: understanding how components interact at scale.
- Eager to dig into problems and bring proposed solutions to group discussion.
- Open to feedback and able to creatively adapt multiple ideas into solutions.
- Strong technical writing, including high- and low-level diagramming techniques.
- Analytical skills and careful attention to detail.
- Bachelor's degree in a related discipline and 4 years' experience in a related field. The right candidate could also have a different combination, such as a master's degree and 2 years' experience; a Ph.D. and up to 1 year of experience; or 16 years' experience in a related field.

Pluses: Chaos engineering, lean methodologies, open-source contributions, public speaking skills.

Why This Role Is Different
- You're Building Product: Own a roadmap, write design docs, ship features, not just keep the lights on.
- Ground Floor: Shape SRE practices that ensure reliability, availability, and performance across Cox Automotive.
- AI at the Forefront: Work with cutting-edge LLM technology to solve real production problems.
- Leadership Path: Grow into technical acumen, work with leadership across all levels, shape reliability strategy.

About the Company

Cox Automotive Inc. is transforming the way the world buys, sells, owns and uses cars with industry-leading digital marketing, software, financial, wholesale and e-commerce solutions for consumers, dealers, manufacturers and the overall automotive ecosystem worldwide. The global company has over 30,000 team members in more than 200 locations and is partner to more than 40,000 auto dealers, as well as most major automobile manufacturers. Cox Automotive is a subsidiary of Cox Enterprises, Inc., with revenues of $18 billion.

USD 101 100.00 per year

Compensation:
Compensation includes a base salary of $101,500.00 - $169,100.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits:
The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.
01/14/2026
The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools and, post-incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies
- Engineering/Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational (link removed)
- Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
- Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
- AI-Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
- Executive Communication: Ability to distill complex technical issues into concise, business-relevant summaries for senior leadership.
- Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.
- DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.
- Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).

Key Responsibilities of This Role

Here's how it typically looks when not tied to active on-call:

Post-Incident Review Development
- Draft and deliver executive summaries post-incident.
- Develop and coach teams on blameless postmortems.
- Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
- Maintain a central library of learnings and cross-cutting (link removed)

Incident Process Improvement
- Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
- Navigate and analyze data from observability platforms to make informed inferences about root causes.
- Analyze the effectiveness of incident response to identify systemic reliability gaps.
- Standardize incident response workflows (incident roles, comms, escalation paths).
- Create or refine runbooks, incident command frameworks, and severity classification guides.

Metrics and Insights
- Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.
- Use incident data to drive reliability OKRs or engineering investments.

Tooling & AI Solutions
- Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.
- Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.
- Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.
- Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.
- Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.

Cross-Team Collaboration
- Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.
- Partner with product teams, infra, and leadership to socialize reliability best practices.
- Act as a reliability "consultant" to squads that have impactful incidents.
- Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.

USD 99 000.00 per year

Compensation:
Compensation includes a base salary of $99,000.00 - $165,000.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits:
The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.
12/17/2025