Performance & Reliability Engineer
San Antonio, TX: Hybrid
US Citizenship
We are seeking a Performance & Reliability Engineer to support the EDUCATION-DCC program. This is a great opportunity for someone who enjoys collaborating across teams, solving complex technical challenges, and improving system reliability.
Job Description: In this role, you will help maintain and enhance the reliability, availability, and performance of our applications and services. You will apply your expertise in AWS operations, infrastructure as code, and deployment automation to streamline processes, reduce downtime, and improve overall system performance.
Key Responsibilities:
Ensure the reliability, availability, and performance of applications and services through proactive monitoring, incident response, and capacity planning.
Manage and optimize AWS cloud infrastructure to support scalable and resilient application operations.
Develop, implement, and maintain infrastructure as code using tools such as Terraform, CloudFormation, or similar.
Automate deployment processes to ensure consistent and reliable delivery of software updates and infrastructure changes.
Collaborate with development teams to design and implement solutions that enhance system performance and reliability.
Conduct root cause analysis for incidents and implement strategies to prevent recurrence.
Establish and maintain monitoring, alerting, and logging frameworks to ensure visibility into system health and performance.
Participate in on-call rotations to provide 24/7 support for critical systems and applications.
Drive continuous improvement initiatives to enhance operational efficiency and reduce technical debt.
Minimum Qualifications:
Strong expertise in AWS cloud services such as EC2, S3, RDS, and Lambda.
Proficiency in infrastructure as code tools such as Terraform, CloudFormation, or similar.
Experience with deployment automation tools and frameworks (e.g., Jenkins, Ansible, Puppet, Chef).
Solid understanding of monitoring, alerting, and logging tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, ELK Stack).
Strong scripting and automation skills using languages such as Python, Bash, or PowerShell.
Excellent problem-solving and troubleshooting skills.
Strong communication and collaboration abilities.
Other Job Specific Skills