Sr. Manager, Site Reliability Engineering
MIS
Dallas, TX
September 23, 2024
What does it mean to be a BrinkerHead? We play like a team, take
Job Summary
As a Senior Site Reliability Manager, you will play a crucial role in ensuring the stability and scalability of our systems. You will be responsible for leading a team of talented engineers and driving initiatives to enhance the reliability of our technology systems, streamline operations, and minimize downtime. Your technical expertise, coupled with strong communication skills and strategic thinking, will be instrumental in fostering collaboration across teams and implementing best practices throughout the Digital Guest Experience team.
Your Key Job Functions
- Lead and mentor a team of Site Reliability Engineers, providing guidance and support, while also implementing best practices and resolving complex technical challenges.
- Collaborate with cross-functional teams to define reliability requirements, establish service level objectives (SLOs), and develop a strategic vision with defined action items that keep the team accountable.
- Monitor system performance, conduct root cause analysis of incidents, and implement and document solutions to prevent recurrence.
- Implement monitoring and alerting systems using tools including, but not limited to, New Relic, Noibu, and GCP Logs/AWS Cloud Logs to proactively identify and resolve issues.
- Develop and maintain incident response plans, including documentation of procedures, solutions, and escalation pathways.
- Drive automation initiatives to streamline operations, improve efficiency, and reduce manual intervention.
- Ensure compliance with relevant regulations and standards, including the Americans with Disabilities Act (ADA), California Consumer Privacy Act (CCPA), and General Data Protection Regulation (GDPR).
- Create standardized documentation for all systems, processes, and procedures to support knowledge sharing across the team, and keep it current and relevant.
- Stay informed about industry trends and emerging technologies, evaluate their potential impact on our current systems, and make data-based recommendations for adoption.
What You Bring to the Team
- Master's degree and/or bachelor's degree in combination with equivalent experience in Computer Science, Engineering, or related field.
- 5+ years as a Site Reliability Engineer or in a similar role, with a demonstrated track record of successfully managing the reliability and scalability of large-scale systems.
- Proven technical proficiency in cloud-based environments, including, but not limited to, Google Cloud Platform (GCP).
- Proficiency with tools that monitor and track reliability and system performance and gather data for troubleshooting, including, but not limited to, New Relic, Noibu, Datadog, and GCP Logs.
- Demonstrated ability to build and maintain dashboards in tools, including, but not limited to, New Relic and Noibu.
- Excellent written and verbal communication skills, with a proven ability to combine varied communication methods and strong interpersonal skills to explain complex technical concepts to stakeholders and team members with varying degrees of technical knowledge.
- Demonstrated leadership experience, with a passion for mentoring and developing team members.
- Proven ability to solve complex problems in a timely fashion.
- Proven ability to adapt quickly to a dynamic environment as a "self-starter".
- Certifications in cloud computing, including, but not limited to, Google Cloud - Professional Cloud Architect.
- Familiarity with continuous integration and deployment (CI/CD) pipelines and infrastructure as code (IaC) principles.
- Previous experience working within Agile or DevOps organizations.