Job Title: InfraOps Reliability Administrator
Location: Hybrid
Regular/Temporary: Regular
Full/Part Time: Full-Time
Job ID: 60506
Department: This position is within FSU's Department of Information Technology Services (ITS).
Responsibilities: The FSU College of Medicine Infrastructure and Operations team designs, builds, and manages infrastructure and servers to support other IT teams, faculty, staff, researchers, and students within the college. The team leverages the latest in automation and observability solutions to make complex work easier to accomplish. Design, build, automate, and optimize infrastructure using modern tools and site reliability engineering practices. Manage primarily Windows servers in a hybrid cloud environment, with a focus on reliability, observability, security, and continuous improvement. Collaborate across teams and leverage automation, scripting, data-informed decision-making, and self-directed professional development to deliver secure, scalable, and customer-focused solutions.
Infrastructure and configuration as code: Use tools such as Terraform, Azure DevOps, Visual Studio Code, and scripting languages like PowerShell and Bash to manage infrastructure as code (IaC) and configuration as code (CaC), ensuring consistency, repeatability, and auditability of systems. Use observability solutions, such as Elastic, to monitor deployments and support data-informed decisions and rapid experiments that drive continuous improvement. Work with CI/CD pipelines to automate deployment, validation, and testing processes, ensuring systems are secure by design, mitigate vulnerabilities, and comply with security policies and standards. Follow secure coding practices, adhere to coding standards, and leverage version control, automated testing, and test-driven development to produce high-quality, secure, and maintainable code. Use AI-assisted tools to accelerate development, validation, and troubleshooting. Participate in pair programming sessions as appropriate to write code and resolve deployment issues.
Provision and manage server infrastructure: Deploy and manage Windows and Linux servers across a hybrid environment that includes Microsoft Azure and over a dozen geographically dispersed on-premises locations. This includes ensuring that all systems are secure by design, follow zero trust principles, and are scalable, observable, and aligned with business needs. Provision infrastructure with reliability, maintainability, and consistency in mind, and implement observability prior to production to support proactive monitoring and data-informed decisions. Collaborate with cross-functional teams and stakeholders throughout the infrastructure lifecycle to ensure solutions align with customer needs; prioritize high-value work, assess feasibility, and conduct security reviews of new systems and applications; deliver exceptional customer service and maintain clear communication to support successful outcomes.
Automation: Automation is not just a task; it is a mindset and a strategic enabler of reliability, consistency, and scalability. Design and implement solutions that make work easier, reduce manual effort, improve system reliability, and streamline operations across provisioning, configuration, monitoring, and remediation. Use AI, scripting, workflow automation, or robotic process automation (RPA) tools to reduce operational overhead and accelerate delivery. Use observability tools to monitor automation performance, ensure reliability, and identify data-informed opportunities for continuous improvement. Collaborate with peers and stakeholders to prioritize high-value automation opportunities and ensure that solutions are effective, secure, and aligned with business needs.
Network administration: Manage and troubleshoot enterprise-grade network infrastructure, including wireless access points, switches, routers, load balancers, and next-generation firewalls. Diagnose and resolve network issues using packet captures, OS command outputs, diagnostic consoles, logs, or other tools. Leverage network observability tools to make data-informed decisions and identify opportunities for improvement. Implement and maintain security measures to protect data, systems, and network availability. Collaborate with network and security teams to validate new systems and configurations, expand observability, reduce exploitable vulnerabilities, implement security controls, and enhance system resilience and usability for customers.
Documentation and process improvement: Create and maintain clear, concise documentation for knowledge sharing, process repeatability, and operational continuity. Develop system diagrams, deployment guides, and standard operating procedures (SOPs) that support usability, compliance, and reliability. Continuously refine documentation and processes as systems evolve, incorporating feedback and lessons learned. Ensure all procedures align with FSU ITS Security Policies and Standards. Participate in peer reviews to validate documentation for accuracy, clarity, and usability.
Support and incident response: Respond to system alerts, outages, and support requests in accordance with established incident management procedures, collaborating with peers and stakeholders to ensure rapid resolution. Use observability tools to support rapid diagnosis and resolution, and create new monitoring as needed to improve visibility. Participate in post-incident reviews, highlighting key data points and observability insights to identify root causes and opportunities for system or process improvements. Implement improvements to prevent the recurrence of issues and to enhance system reliability. Participate in an on-call rotation, typically one week per month, which includes after-hours support for deployments, changes, or incidents, including on holidays and weekends. Actively work to reduce the need for after-hours assistance by leveraging automated deployment solutions, improving system reliability, and lowering the risk and complexity of changes. Assist with IT security investigations as needed. Ensure incident response processes align with the expectations of IT management, technical teams, and customers.
Professional development: Continuous learning and technical curiosity are key expectations of this role. Complete both assigned and self-directed professional development to stay current with evolving technologies, tools, and practices. Explore technical subjects that interest you, even beyond current projects. Use provided learning platforms, such as LinkedIn Learning. Participate in the ITS Professional Development Bonus Plan by completing manager-approved certifications. Pursue relevant training, certifications, and conferences aligned with team goals, subject to approval. Approved training resources will be paid for by the organization. Research and validate emerging tools, including AI, automation, observability, and other innovations, to assess their value for our organization. Apply a mindset of rapid experimentation using data to guide decisions, improvements, and the next experiment. Participation in knowledge-sharing sessions, communities of practice, and collaborative learning opportunities is encouraged.
Qualifications: Bachelor's degree in Computer Science, MIS, or other appropriate degree and two years of experience, or a high school diploma or equivalent and six years of experience. (Note: or a combination of appropriate post-high-school education and experience equal to six years.)
Preferred Qualifications
- Proven ability to learn new tools and technologies quickly, with a track record of self-directed learning and adaptability in fast-paced environments.
- Demonstrated commitment to continuous learning and professional development.
- Proficient in scripting for infrastructure automation using PowerShell, with the ability to write, debug, and maintain scripts independently or with tools like GitHub Copilot; familiarity with Python or Bash is a plus.
- Experience using infrastructure and configuration as code tools such as Terraform, Ansible, PowerShell, or similar, with version control practices using Git, and integrated development environments like Visual Studio Code.
- Experience creating and troubleshooting CI/CD pipelines using tools such as Azure DevOps, GitHub Actions, or GitLab to automate infrastructure deployment and configuration.
- Experience provisioning and managing infrastructure in cloud environments such as Azure, AWS, or Google Cloud, with an understanding of repeatable deployment processes, and troubleshooting network connectivity with next-generation firewalls.
- Experience deploying containers and familiarity with container orchestration technologies such as Kubernetes.
- Proficient using observability tools such as Elastic, Dynatrace, Prometheus, Grafana, Splunk, Datadog, or others, to ingest new types of data, build dashboards and alerts, and derive insights for performance tuning and incident response.
- Experience improving infrastructure design, automation, or troubleshooting by testing ideas, learning from results, and making thoughtful adjustments over time.
- Experience supporting Windows and Linux systems in an Active Directory domain, including deployment, configuration, and troubleshooting, as well as managing virtual infrastructure using platforms such as Hyper-V or VMware.
- Experience leveraging AI tools to accelerate task completion and improve operational efficiency.
- Demonstrated ability to write and troubleshoot firewall rules and quickly diagnose issues across firewalls, switches ...