I used to lead teams that owned message bus, a stream processing framework and a distributed scheduler (like k8s) at Facebook.
The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.
Another thing is, we had to cut down on the complexity - reduce number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.
Now Facebook being Facebook, valued speed and complexity over stability and simplicity. Specially when it comes to career growth discussions. So it’s hard to build good infra in the company.
It's been a pretty poor mantra from the beginning anyway. How about we move at a realistic pace and deliver good features, without burning out, and without leaving a trail of half-baked code behind us?
The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.
Another thing is, we had to cut down on the complexity - reduce number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.
Now Facebook being Facebook, valued speed and complexity over stability and simplicity. Specially when it comes to career growth discussions. So it’s hard to build good infra in the company.