Senior Site Reliability Engineer

Join our innovative Identity Security Cloud software development team as a Senior Site Reliability Engineer (SRE) and play a critical role in ensuring the reliability, scalability, and performance of our services. Collaborate with cross-functional teams, including software engineers, infrastructure platform services, and engineering managers, to drive excellence in system design, architecture, and operations.

Key Responsibilities

Partner with development teams to resolve performance issues, optimize system scalability, and implement solutions that enhance reliability, availability, and performance.
Design, develop, and deploy alerts, dashboards, and monitoring tools to proactively identify and resolve issues, leveraging expertise from technical leaders and infrastructure platform services.
Own and improve key operational metrics, including SLIs, SLOs, and error budgets, along with the monitoring and alerting built on them, to drive continuous improvement and inform data-driven decisions.
Conduct post-incident reviews and blameless postmortems to identify areas for improvement and implement changes that prevent future incidents.
Develop and maintain comprehensive monitoring and alerting systems to ensure proactive issue identification and resolution.
Create and optimize dashboards to provide actionable insights, collaborating with technical leads and stakeholders to inform strategic decisions.
Collaborate with technical teams, including DevOps, SRE, and infrastructure, to plan capacity, optimize resources, and ensure seamless system operations.
Identify and address production performance bottlenecks through profiling, tuning, and optimization, leveraging expertise in programming languages and software engineering principles.
Automate repetitive tasks and processes to improve efficiency, reduce manual errors, and enhance overall system reliability.
Work closely with software, performance, and test engineers to influence system design and architecture, ensuring that reliability, scalability, and performance are integrated into every aspect of our services.
Maintain accurate and up-to-date documentation for systems, processes, runbooks, and procedures, ensuring that knowledge is shared and accessible across the organization.
Participate in a 24/7 on-call rotation to develop subject matter expertise and provide timely support for critical system issues.
Lead incident postmortem efforts, compiling timely and comprehensive reports that inform future improvements and optimizations.
Leverage exceptional diagnostic and problem-solving skills to analyze complex systems and data, identifying areas for improvement and implementing changes that drive excellence in system reliability and performance.

Requirements

Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
5+ years of proven experience as a Site Reliability Engineer, with a strong understanding of SRE principles and practices.
Experience with cloud platforms, such as AWS, GCP, or Azure, and proficiency in at least one scripting or programming language, such as Python, Bash, or Go.
Familiarity with monitoring and logging tools, including Prometheus, Grafana, Honeycomb, and OpenSearch, and experience with containerization and orchestration technologies, such as Docker and Kubernetes.
Strong problem-solving and troubleshooting skills, with excellent communication and collaboration abilities.
Ability to work independently and as part of a team, with a strong focus on driving continuous improvement and excellence in system reliability and performance.

Preferred Qualifications

Experience with technologies such as Kafka, relational databases, and performance tuning (JVM, Go).
Familiarity with Grafana k6 for continuous performance testing.

Onboarding Timeline

In the first 30 days, you will:
- Meet the team and understand the team’s mission and vision.
- Gain clarity on roles and expectations.
- Complete development environment setup and mandatory training.
- Read guides and documentation, and learn company processes and benefits.
By 6 months, you should:
- Understand team goals and OKRs for the quarter and beyond.
- Complete initial analysis and implementation of SRE team assignments.
- Be comfortable with tools, systems, and processes used on a day-to-day basis.
- Complete project work, both supervised and unsupervised.
