Practices and Tactics for Surviving Oncall
As a Production Engineer on the Operating Systems team at Facebook, I'm a member of one of the more active oncalls for one of the largest fleets of servers in the world. In this talk I'm going to cover the mindset of our oncall, how the values of Production Engineering manifest in the practices of our oncalls, and how I ensure that the workload stays manageable. I'll set internal tooling aside and focus on best practices and mindset.
Some of the areas I will cover:
You're the XO of the ship for your team, not the only person on the ship. Direct, delegate, communicate, and follow up to make sure that issues are addressed, but remember that you're not the only one responsible for making sure that the work gets done.
Values => Practice
* Curiosity => actually dig into the problem and re-evaluate assumptions frequently (ignore the desire to complete the plan or solve the problem you think you have without validating what the problem actually is)
* Respect => responsiveness, trusting other engineers when they report an issue
* Accountability => follow through, not just on the big outage-related items but on the little annoyances too
* Reliability => resilience
* Visibility => fuck up publicly, solve publicly, talk about things loudly