Site Reliability Engineering Manager
Company: Tbwa Chiat/Day Inc
Location: Denver
Posted on: March 13, 2025
Job Description:
About Checkr
Checkr is building the data platform to power safe
and fair decisions. Established in 2014, Checkr's innovative
technology and robust data platform help customers assess risk and
ensure safety and compliance to build trusted workplaces and
communities. Checkr has over 100,000 customers including DoorDash,
Coinbase, Lyft, Instacart, and Airtable.
We're a team that thrives
on solving complex problems with innovative solutions that advance
our mission. Checkr is recognized on the Forbes Cloud 100 2024 list and
is a Y Combinator 2024 Breakthrough Company.
We're looking for a Site
Reliability Engineering Manager with extensive leadership and
observability experience in cloud-based applications. In this role,
you will lead, manage, and mentor a team of SREs, define and track
metrics for the company's SLOs, SLIs, and SLAs, and
operationalize incident management, communication, and handling.
The SRE Manager will be responsible for the availability and
performance of all external and internal facing application
endpoints that help drive Checkr's business. Extensive knowledge of
AWS, Kubernetes, and event orchestration is desired. Tooling
knowledge with Datadog, PagerDuty, and Atlassian (Jira, Confluence)
is highly preferred to identify strategies to improve our
full-stack telemetry and monitoring capabilities. You will mentor SREs
on observability-related work as well as on their
career development.
The SRE Manager will work cross-functionally
with Infrastructure, Platform, and Product Engineering, combining
operations work with software engineering principles to assist and
contribute to the high availability of Checkr's systems. You will
serve as a partner to our Product Engineering teams to strategize
on making their services more performant, scalable, observable, and
reliable. We believe every engineering team at Checkr should be
responsible for the software they build, and SREs play a critical
part in providing the tools, practices, and expertise to make that
happen.
We are evolving the SRE team to help meet Checkr's
product-first reliability goals for this year and beyond. Having
established a strong foundation--including a containerized
microservices architecture (AWS, Kong, Kubernetes, Kafka, MySQL,
and MongoDB), CI/CD, full-stack monitoring, structured incident
response, and a blameless postmortem culture--we are focused on
implementing new capabilities like:
- Automating observability and alerting across an ever-changing
landscape of microservices
- Automated Service Reliability Scorecards and Production
Readiness Standards
- Software engineering project work, proposed and driven by
individual SRE team members, to remove operational bottlenecks and
increase velocity in ways we've never considered before
What you'll do:
- Expand and improve our observability and monitoring footprint
in line with cost efficiency.
- Drive and delegate the day-to-day escalations and incidents
with on-call engineering teams.
- Collaborate with other Engineering Managers to define metrics
and dashboarding requirements.
- Ensure stakeholders and partners are informed of incidents and
incident trends while working with other departments, such as
account managers, legal, and marketing, for outbound
communication.
- Review the work of the SRE team, help them get unblocked, and
provide mentoring.
- Meet with the team and individuals weekly to collaborate and
discuss topics related to processes, planning, and goals.
- Manage and assist the on-call incident commander and owners in
resolving production reliability issues, ensuring timely
communication, retrospectives, and postmortems are performed and
delivered.
- Participate in design and production reviews for new features,
products, or infrastructure.
- Assist in planning for the growth of Checkr's infrastructure,
reliability/resiliency, and resources.
What we look for:
- 8+ years working in a relevant role, including 4+ years of
technical leadership experience mentoring engineers
- 4+ years of experience architecting and administering
observability stacks, either managed or self-hosted (e.g., Datadog,
New Relic, Prometheus, Elastic Stack/ELK, OpenTelemetry)
- Experience with operation of containerized microservices
running on the public cloud, asynchronous event processing, and
databases
- Knowledge of Linux, Git, and CI/CD pipelines
- On-call support of highly available production systems
- Designing and building new tools to automate repetitive tasks,
prevent incidents, or improve MTTR using a programming language such
as Python
- Experience with automation and Infrastructure as Code using
tools like Terraform, Terragrunt, or CloudFormation
- Understanding of how application components interact and
experience contributing to architectural discussions
- Unwavering commitment to operational security and best
practices
- Ownership: identify problems, propose solutions, and then coach
and guide a team to implement them.
- Connection: motivated to help other teams improve their service
reliability and to continuously improve tooling and services.
What you get:
- A fast-paced and collaborative environment
- Learning and development allowance
- Competitive compensation and opportunity for advancement
- 100% medical, dental, and vision coverage
- Up to 25K reimbursement for fertility, adoption, and parental
planning services
- Flexible PTO policy
- Monthly wellness stipend, home office stipend
At Checkr, we
believe a hybrid work environment strengthens collaboration, drives
innovation, and encourages connection. Our hub locations are
Denver, CO, San Francisco, CA, and Santiago, Chile. Individuals are
expected to work from the office 2 to 3 days a week. In-office
perks are provided, such as lunch four times a week, a commuter
stipend, and an abundance of snacks and beverages.
One of Checkr's
core values is Transparency. To live by that value, we've made the
decision to disclose salary ranges in all of our job postings. We
use geographic cost of labor as an input to develop ranges for our
roles and as such, each location where we hire may have a different
range. If this role is remote, we have listed the top to the bottom
of the possible range, but we will specify the target range for an
exact location when you are selected for a recruiting discussion.
For more information on our compensation philosophy, see our
website.
The base salary range for this role is $197,000 to
$232,000 in Denver, CO.
Equal Employment Opportunities at Checkr
Checkr is committed to hiring talented and qualified
individuals with diverse backgrounds for all of its tech, non-tech,
and leadership roles. Checkr believes that the gathering and
celebration of unique backgrounds, qualities, and cultures enriches
the workplace.
Checkr also welcomes the opportunity to consider
qualified applicants with prior arrest or conviction records.
Checkr's commitment to diversity extends to hiring talented
individuals in spite of a prior criminal history in accordance with
local, state, and/or federal laws, including San Francisco's
Fair Chance Ordinance.