Job Description
Contract: 6-month initial term with conversion potential
Work Authorization: Must be authorized to work in the U.S. without sponsorship requirements
Ellofant is a modern consulting firm built for those who want to do work that actually moves the needle. We help companies navigate change, complexity, and scale through a blend of strategic thinking, trusted technology, and hands-on execution. Our clients rely on us not just for advice, but for building systems, launching products, and driving outcomes that matter. At Ellofant, we value clarity over jargon, momentum over perfection, and people over process. We're looking for curious, driven individuals who want to solve real problems with real impact. If you're excited about challenging what's possible and delivering meaningful change while you're at it, Ellofant might be your next move.
About the Role
We're seeking an experienced Site Reliability Engineer to join our infrastructure resiliency team. In this role, you'll be responsible for ensuring the stability, performance, and reliability of critical systems across diverse technology stacks including mainframe, Windows, and cloud environments (OpenShift, AWS, Azure). You'll work at the intersection of software engineering and operations, driving automation, implementing resiliency patterns, and responding to critical events to maintain exceptional service availability.
What You'll Be Doing
1. Coordinate responses to critical events with application support teams and the Site Reliability Center
2. Triage and respond to alerts generated through the BigPanda event correlation platform
3. Assess cross-domain impacts and engage appropriate support teams or escalate as needed
4. Participate in on-call rotations to provide 24x7 coverage for critical systems
5. Conduct blameless post-mortems and root cause analysis to drive continuous improvement
6. Design and implement automated monitoring and alerting systems using Dynatrace, Grafana, LogScale, CrowdStrike, Prometheus, Splunk, Moogsoft, and Datadog
7. Create robust dashboards and implement SLAs/SLOs through comprehensive monitoring
8. Analyze metrics from operating systems and applications to assist in performance tuning and fault detection
9. Develop and implement chaos engineering practices using Litmus, Gremlin, Azure Chaos Studio, and Chaos Mesh
10. Design fault injection experiments to validate system resilience using AWS Resilience Hub
11. Build self-healing capabilities and automated remediation workflows
12. Implement health checks and autoscaling solutions using AWS Lambda, Kubernetes, OpenShift, and Istio service mesh
13. Manage infrastructure across mainframe systems, Windows, RHEL, and cloud platforms (AWS, Azure, OpenShift)
14. Work with containerized environments, event streaming platforms (Kafka), and database systems (Oracle, SQL)
15. Maintain virtualization infrastructure (VMware) and storage systems (NAS)
16. Leverage ServiceNow for incident management, Jira for issue tracking, and CA7 for job scheduling
17. Identify opportunities to improve application stability and evangelize SRE best practices
18. Maintain comprehensive knowledge bases and runbooks in Confluence
19. Mentor junior team members on resiliency patterns and operational excellence
What We're Looking For
1. 3-5 years of relevant experience in site reliability, infrastructure, or DevOps engineering
2. Strong expertise in monitoring and observability tools (Dynatrace, Grafana, Prometheus, Splunk, or similar)
3. Experience with incident management and event correlation platforms (BigPanda, ServiceNow, Moogsoft)
4. Proficiency with Linux/Unix systems (RHEL) and Windows Server environments
5. Hands-on experience with cloud platforms: AWS, Azure, or OpenShift
6. Strong knowledge of containerization and orchestration: Kubernetes, Docker, OpenShift
7. Experience with chaos engineering and fault injection frameworks (Litmus, Gremlin, AWS FIS, Azure Chaos Studio)
8. Solid understanding of networking, database systems (Oracle, SQL), and distributed architectures
9. Experience with event streaming platforms (Kafka) and service mesh technologies (Istio)
10. Familiarity with mainframe systems and legacy infrastructure
11. Experience with infrastructure as code and automation tools
12. Knowledge of job scheduling systems (CA7 or similar) and middleware technologies
13. Proficiency with Jira, Confluence, and ITSM tools
14. Experience working in financial services or other highly regulated industries preferred
15. Relevant certifications valued: AWS/Azure architecture, RHCE, VCP, Kubernetes (CKA/CKAD)
16. Strong analytical thinking, problem-solving abilities, and troubleshooting skills
17. Excellent written and verbal communication skills for cross-functional collaboration
Our Commitment
We're committed to building resilient systems that deliver exceptional reliability and performance at scale. You'll work with cutting-edge resiliency engineering tools and practices, collaborate with talented engineers across domains, and have opportunities to mentor junior team members while continuously advancing your own skills in this evolving field.
Equal Employment Opportunity & Inclusivity
Ellofant is proud to be an Equal Employment Opportunity Employer. We do not discriminate in employment on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, disability, age, protected veteran status, genetic information, or any other characteristic protected by law. This policy applies to all aspects of employment, including recruitment, hiring, placement, promotion, compensation, benefits, training, termination, and other conditions of employment.
Commitment to Diversity, Equity & Inclusion
We believe diversity, equity, and inclusion are fundamental to innovation, thoughtful client service, and a thriving workplace culture. We welcome and value different perspectives, experiences, and backgrounds including but not limited to race, gender, ethnicity, sexual orientation, disability status, veteran status, and neurodiversity. Accommodations are available upon request during the application and interview process.
Pay Transparency & Benefits
Compensation will be commensurate with your experience, skill set, and job location. Ellofant offers a competitive benefits package, which may include medical and dental coverage, retirement savings plans, paid time off, and professional development support. Salary range will be disclosed to candidates as part of the interview process where permitted.
Application Privacy & Process
Your personal information, including resume, interview feedback, and any background checks, will be collected and used solely for recruitment purposes and handled in accordance with applicable privacy laws. Employment is contingent upon successful verification of employment eligibility and completion of reference checks. You must be authorized to work in the U.S.
Fraud Awareness
Please be aware of potential recruitment fraud. Ellofant will never request payment or sensitive financial information at any stage of the hiring process. All legitimate communication will come from ellofant. If you receive suspicious outreach claiming to be from Ellofant, please contact us directly.
We appreciate your interest in Ellofant and encourage candidates from all backgrounds who are excited about challenging what's possible to apply.
Dallas stands as one of the nation's premier technology and financial services hubs, offering an exceptional environment for Site Reliability Engineers to thrive. The city's vibrant tech ecosystem is anchored by major corporations, innovative startups, and world-class universities that create a collaborative atmosphere for engineering talent. With no state income tax and a significantly lower cost of living compared to coastal tech centers, Dallas provides an attractive financial proposition for professionals seeking career growth without sacrificing quality of life. The Dallas-Fort Worth metroplex boasts excellent infrastructure, including DFW International Airport providing direct connections to destinations worldwide, making it ideal for both business travel and personal exploration. Beyond work, the city offers diverse neighborhoods ranging from the trendy Uptown and Deep Ellum districts to family-friendly suburbs with top-rated schools. The thriving arts scene, professional sports teams, renowned dining experiences, and year-round outdoor activities ensure there's always something to enjoy outside the office.