We have weekly internal talks called Clarisights Talks where we share learnings with each other. We’ve had talks pertaining to multiple domains, from “How to get started with Figma” to “How to build scalable distributed systems”.
Suraj recently gave a talk on “how to debug things under fire”, which made us think about how we could make it better, extend it, share it, and learn from others outside our company too.
So Suraj did what he does best — he tweeted about it.
The response was overwhelmingly positive, and we decided to make it a real thing after @hashfyre replied.
Now it's a real thing, and here's the plan:
Stories, NOT Talks
Every fire is unique; it is never the same bunch of missteps happening over and over again. For that reason, we believe this should not be a bullet-point slideshow presentation. It is the why, how and what that make this meetup interesting. This is more about a community trying to help each other out of the fire than about paying attention to the fireman (😉) who saved the day. So come in, tell us about your outages, how you recovered, and what measures you added to protect against potential future hazards.
The first meetup is happening on 25th January, 2020. The frequency is still to be decided, but we are thinking of holding it bi-monthly (once every two months).
It's scheduled from 4 to 6 pm :) We decided to hold it in the late afternoon so everyone can get a sound sleep 😆
Clarisights will host the first meetup, and after that it will be open for anyone to host. If we are unable to find anyone else, we always have a backup.
Why should this meet-up exist?
If you look back, you will see that you have been leveling up: adding more and more abstractions to make your life easier and moving up the stack. Not having to worry about low-level systems might sound great, but it also brings additional complexity, since each new system that we depend on and interact with is more complicated.
Eventually, the more things you add to the stack, the more states are introduced, leading to more points of failure.
You can build a reliable system from unreliable components, and it will work under normal conditions. To an extent, almost all distributed systems fit this description: you take a bunch of unreliable components and build a reliable system out of them.
Hence the saying "all distributed systems run in degraded state". But if you think about it objectively, this is actually expected behaviour. These systems fail when more than one component fails.
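To put rough numbers on this, here is a minimal back-of-the-envelope sketch. It assumes every component is up 99% of the time and that components fail independently of each other. Both are made-up simplifications for illustration (real outages are often correlated), and the function name `system_availability` is our own.

```python
from math import comb


def system_availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n components are up.

    Assumes each component is up with probability p and that
    failures are independent (a simplification, not a given).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


single = 0.99  # hypothetical availability of one component

# A 3-replica system that keeps working as long as any 2 replicas are up.
cluster = system_availability(3, 2, single)

# Chance that all 3 replicas are healthy at the same moment.
all_healthy = system_availability(3, 3, single)

print(f"single component:          {single:.6f}")
print(f"2-of-3 cluster:            {cluster:.6f}")
print(f"nothing degraded (3 of 3): {all_healthy:.6f}")
```

Under these assumptions the cluster is more available than any single component, yet it spends a noticeable share of its time with one replica down. That is the "degraded state", and an actual outage needs a second failure on top of it.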
Running distributed systems is hard, so traditional wisdom says: don't run a distributed system if you don't need one. But in reality, to have a reliable service you need to run distributed systems. Even if you are not running one yourself, you are using one through your cloud provider, so understanding them is extremely useful when things go south.
The goal of this meetup is to share our failure stories with each other, discuss how things fail, and learn to get better at running systems at scale.
PS: did we just start a meetup with a tweet?
We obviously also have a website, failuremodes.dev, which will be the source of truth for our knowledge centre. Feel free to raise a PR in case you want to add more resources.
Also, here's the invitation form for the meetup in case you need it :)