Principal Site Reliability Operations Engineer
Company: Roblox Corporation
Location: San Mateo
Posted on: January 23, 2025
Job Description:
Every day, tens of millions of people come to Roblox to explore,
create, play, learn, and connect with friends in 3D immersive
digital experiences, all created by our global community of
developers and creators.

At Roblox, we're building the tools and
platform that empower our community to bring any experience that
they can imagine to life. Our vision is to reimagine the way people
come together, from anywhere in the world, and on any device. We're
on a mission to connect a billion people with optimism and
civility, and looking for amazing talent to help us get there.

A career at Roblox means you'll be working to shape the future of
human interaction, solving unique technical challenges at scale,
and helping to create safer, more civil shared experiences for
everyone.

We're looking for a Principal Site Reliability Operations
Engineer with a passion for problem-solving to join our Reliability
Response team. The ideal candidate will have demonstrated expertise
in handling incidents and thrive in a dynamic, complex, and
ever-evolving distributed environment. Your ability to identify and
address root causes will be crucial for driving sustainable,
long-term solutions and achieving success in this role.

As a Principal Site Reliability Operations Engineer on the Reliability
Team, you will handle production incidents and improve Roblox's
incident processes. You will maintain reliability service-level
objectives, lead incident resolution with determination, and
collaborate with service teams to identify and implement actionable
improvements during the incident postmortem process. If you are
passionate about maintaining uptime in a sophisticated distributed
environment full of continuous change, you'll be right at home with
our Reliability team.

This role will report to the Senior Manager, Reliability and
requires 3 in-office days per week.

You Will:
- Lead and manage production incidents.
- Collaborate cross-functionally to troubleshoot and resolve
sophisticated technical challenges.
- Guide the implementation of incident management processes and
procedures, ensuring fast and effective responses to minimize
impact.
- Continually monitor system health, performance and capacity,
proactively addressing potential issues.
- Conduct comprehensive post-mortem analysis to ascertain the
root cause of incidents and formulate corrective measures.
- Contribute substantially to the design and improvement of
system architecture to boost reliability and performance.
- Leverage coding skills to automate daily routine tasks and
improve system efficiency.
- Serve in the Incident Manager On-Call rotation.
- Mentor junior team members.

You Have:
- At least 8 years of experience in a comparable role within a
Site Reliability Team.
- Advanced knowledge of systems and network infrastructure
protocols.
- Demonstrated ability in managing, troubleshooting, and
resolving incidents in distributed environments.
- Experience solving problems.
- An ability to distill complex technical issues into clear and
concise language.
- Familiarity with at least one scripting or programming language
to automate routine tasks (Python, Golang, or similar languages
preferred).
- A Bachelor's degree, or equivalent experience, in
Computer Science, Computer Engineering, or a similar technical
field.

You Are:
- A great communicator; you are able to explain complex systems
clearly to stakeholders and fellow engineers.
- Able to operate in potentially ambiguous circumstances during a
production incident.
- Familiar with the interactions of services in a distributed
system.
- Tenacious in driving challenging production incidents to
resolution.

For roles that are based at our headquarters in San
Mateo, CA: The starting base pay for this position is as shown
below. The actual base pay is dependent upon a variety of
job-related factors such as professional background, training, work
experience, location, business needs and market demand. Therefore,
in some circumstances, the actual salary could fall outside of this
expected range. This pay range is subject to change and may be
modified in the future. All full-time employees are also eligible
for equity compensation and for benefits.

Annual Salary Range: $226,450 - $262,150 USD

Roles that are based in our San Mateo, CA Headquarters are in-office
Tuesday, Wednesday, and Thursday, with optional in-office on Monday
and Friday (unless otherwise noted).

You'll Love:
- Excellent medical, dental, and vision coverage
- A rewarding 401k program
- Flexible vacation policy (varies by exemption status)
- Roflex - Flexible and supportive work policy
- At Roblox HQ:
  - Free catered lunches five times a week and several fully
  stocked kitchens with unlimited snacks
  - Onsite fitness center and fitness program credit
  - Annual CalTrain Go Pass

Roblox provides equal employment
opportunities to all employees and applicants for employment and
prohibits discrimination and harassment of any type without regard
to race, color, religion, age, sex, national origin, disability
status, genetics, protected veteran status, sexual orientation,
gender identity or expression, or any other characteristic
protected by federal, state or local laws. Roblox also provides
reasonable accommodations for all candidates during the interview
process.