Senior Site Reliability Engineer

Join our innovative Identity Security Cloud software development team as a Senior Site Reliability Engineer (SRE) and play a critical role in ensuring the reliability, scalability, and performance of our services. Collaborate with cross-functional teams, including software engineers, infrastructure platform services, and engineering managers, to drive excellence in system design, architecture, and operations.

Key Responsibilities

  • Partner with development teams to resolve performance issues, optimize system scalability, and implement solutions that enhance reliability, availability, and performance.
  • Design, develop, and deploy alerts, dashboards, and monitoring tools to proactively identify and resolve issues, leveraging expertise from technical leaders and infrastructure platform services.
  • Own and improve key operational metrics and practices, including SLIs, SLOs, error budgets, monitoring, and alerting, to drive continuous improvement and inform data-driven decisions.
  • Conduct post-incident reviews and blameless postmortems to identify areas for improvement and implement changes that prevent future incidents.
  • Develop and maintain comprehensive monitoring and alerting systems to ensure proactive issue identification and resolution.
  • Create and optimize dashboards to provide actionable insights, collaborating with technical leads and stakeholders to inform strategic decisions.
  • Collaborate with technical teams, including DevOps, SRE, and infrastructure, to plan capacity, optimize resources, and ensure seamless system operations.
  • Identify and address production performance bottlenecks through profiling, tuning, and optimization, leveraging expertise in programming languages and software engineering principles.
  • Automate repetitive tasks and processes to improve efficiency, reduce manual errors, and enhance overall system reliability.
  • Work closely with software, performance, and test engineers to influence system design and architecture, ensuring that reliability, scalability, and performance are integrated into every aspect of our services.
  • Maintain accurate and up-to-date documentation for systems, processes, runbooks, and procedures, ensuring that knowledge is shared and accessible across the organization.
  • Participate in a 24/7 on-call rotation to develop subject matter expertise and provide timely support for critical system issues.
  • Lead incident postmortem efforts, compiling timely and comprehensive reports that inform future improvements and optimizations.
  • Leverage exceptional diagnostic and problem-solving skills to analyze complex systems and data, identifying areas for improvement and implementing changes that drive excellence in system reliability and performance.

Requirements

  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
  • 5+ years of proven experience as a Site Reliability Engineer, with a strong understanding of SRE principles and practices.
  • Experience with cloud platforms, including AWS, GCP, or Azure, and proficiency in at least one scripting language, such as Python, Bash, or Go.
  • Familiarity with monitoring and logging tools, including Prometheus, Grafana, Honeycomb, and OpenSearch, and experience with containerization and orchestration technologies, such as Docker and Kubernetes.
  • Strong problem-solving and troubleshooting skills, with excellent communication and collaboration abilities.
  • Ability to work independently and as part of a team, with a strong focus on driving continuous improvement and excellence in system reliability and performance.

Preferred Qualifications

  • Experience with technologies such as Kafka, relational databases, and performance tuning (JVM, Go).
  • Familiarity with Grafana k6 for continuous performance testing.

Onboarding Timeline

  • In the first 30 days, you will:
    • Meet the team and understand the team’s mission and vision.
    • Gain clarity on roles and expectations.
    • Complete development environment setup and mandatory training.
    • Read guides and documentation, and learn company processes and benefits.
  • By 6 months, you should:
    • Understand team goals and OKRs for the quarter and beyond.
    • Complete initial analysis and implementation of SRE team assignments.
    • Be comfortable with tools, systems, and processes used on a day-to-day basis.
    • Complete project work, both supervised and unsupervised.
