Google published the original book on Site Reliability Engineering, and it’s still a strong resource.

Time for the age-old questions that I’ve asked every interviewer for the past almost-decade, “What does SRE mean to you? Does it differ from DevOps?”

  • WhereIsJA

    Hello! A few questions for you:

    1. If someone in the QA or Tech-Support fields (with no real coding experience) were to choose the SRE career path, what advice would you give them?
    2. Based on your experience, what’s the best tool/app you’ve encountered so far?
    3. What’s your advice to developers/engineers to make your life easier?

    Thank you

    • gepandzOPM
      1. Build your coding chops, especially in scripting-friendly languages like Bourne shell (sh, bash, zsh, etc. – I use zsh almost exclusively, but that’s a holy war I’d like to sidestep for now 😉), PowerShell (if you can’t avoid Windows systems, it’s a decent tool), Python, Ruby, etc. It helps to know a little about a lot, so even a few online tutorials for whatever language your developers are writing in (Java, C++, C#, PHP, JavaScript, TypeScript, etc.) will come in handy when you have to dive in and figure out what that log error message is trying to tell you.

      Uh, I learned most of my programming from a combination of a bachelor’s and a master’s in CS, plus a LOT of cursing… My dad let me have my first computer in my room when I turned 12, a ZeOS running MS-DOS 6.22 and Windows 3.11 for Workgroups, along with an inch-thick book of printed source code for QBASIC games, and told me I could play any game I could write. 😅 We’ve come a long way since the mid-90s, thankfully!

      There are some great tutorials online that I recommend running through, then look to sites like Advent of Code, a coding advent calendar that comes out every Christmas season with 50 coding prompts you can try to solve in any language (it’s how I taught myself Rust). You only get better by cursing at something, in my experience.
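      (For a taste of the kind of small script this all leads to, here’s a minimal sketch in Python that pulls the error lines out of a log and counts them; the log path and the error pattern are made-up placeholders, not from any real system.)

      ```python
      #!/usr/bin/env python3
      """Tiny log-scanning script -- the kind of thing an SRE writes constantly.
      The path and pattern below are made-up examples; point them at your own logs."""
      import re
      import sys
      from collections import Counter

      LOG_PATH = "/var/log/myapp/app.log"        # hypothetical log file
      ERROR_RE = re.compile(r"\b(ERROR|FATAL)\b\s+(.*)")

      def main(path: str) -> None:
          counts = Counter()
          with open(path, encoding="utf-8", errors="replace") as fh:
              for line in fh:
                  match = ERROR_RE.search(line)
                  if match:
                      # Count each distinct error message (truncated for sanity).
                      counts[match.group(2).strip()[:80]] += 1
          for message, count in counts.most_common(10):
              print(f"{count:6d}  {message}")

      if __name__ == "__main__":
          main(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH)
      ```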

      Once you’ve got some basic coding skills (or as a way to build those skills 😉), work with your devs on their testing suites. I’ve never worked in a shop with too many tests, and knowing what your code should do helps a lot when you’re trying to figure out what it’s actually doing; tests are how you, well, test that your code is working as expected.

      Understand how their tests get run: see if they have a CI system like CircleCI, Jenkins, Travis CI, etc., and if they don’t, see if you can help them build some kind of basic pipeline so that once they commit their code to the repository (assuming they’re using Git or another VCS (version-control system) – and if they’re not, they’re wrong 😜), a job kicks off to run their entire test suite for them. There are TONS of guides for how to set that kinda thing up, and the details depend on the language they’re using, whether you’re in a cloud, and how and where your repositories are hosted. I once set up a basic but functional CI pipeline using an in-house server running nothing but Git and some Git Hooks (scripts that run when certain events occur; they’re built into Git itself and provide a LOT of power that few, if any, people even know exists). The trick is to get the devs to teach you how their stuff works and what they find painful and annoying, and then help them out by making it suck less.
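      (To make that Git Hooks trick a little more concrete: hooks are just executables dropped into the repository’s hooks/ directory, so they can be shell, Python, whatever. Here’s a rough sketch, not what I actually ran, of a post-receive hook that checks out each push to main into a scratch work tree and runs the test suite; the repo path, branch name, and test command are placeholders for whatever your shop uses.)

      ```python
      #!/usr/bin/env python3
      """Minimal post-receive hook sketch: run the test suite on every push to main.
      Drop this (marked executable) into hooks/post-receive on the server-side repo.
      The repo path, branch, and test command are placeholders."""
      import subprocess
      import sys
      import tempfile

      REPO = "/srv/git/myapp.git"                    # hypothetical bare repo on the CI box
      BRANCH = "refs/heads/main"
      TEST_CMD = ["python3", "-m", "pytest", "-q"]   # or make test, mvn test, etc.

      def main() -> int:
          for line in sys.stdin:                     # "<old> <new> <ref>" per updated ref
              old, new, ref = line.split()
              if ref != BRANCH:
                  continue
              with tempfile.TemporaryDirectory() as workdir:
                  # Check the pushed commit out into a throwaway work tree...
                  subprocess.run(["git", "--work-tree", workdir, "--git-dir", REPO,
                                  "checkout", "-f", new], check=True)
                  # ...and run the tests there; a real hook would mail or page on failure.
                  result = subprocess.run(TEST_CMD, cwd=workdir)
                  print(f"tests {'passed' if result.returncode == 0 else 'FAILED'} for {new[:8]}")
                  return result.returncode
          return 0

      if __name__ == "__main__":
          sys.exit(main())
      ```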

      This is a hard question to answer, since SREs and DevOps Engineers need a broad background. Everything could affect site reliability, so you have to know a little bit about a LOT, usually enough to find an error message, figure out whether it means anything, and either fix it or identify who you need to wake up next to fix it, since you’re usually doing this at 3AM. Oh, and the error messages could be spread across a dozen (or a thousand) servers, each running some part of your multi-tier, microservices application, so I hope you can read logs for your load balancer, web server, application server, database, and whatever storage system your DB is using… 😂😭 All while trying to get your kid up for school, finding someone who can pick up the threads for an hour or so while you drive your kid to school, or explaining to the incident commander and your manager that they’re gonna have to hold their horses until you’re back online (yes, I’ve had this EXACT experience… 😬).
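      (Since the “grep the same error across a pile of servers” chore comes up constantly, here’s a rough sketch of the sort of thing you end up scripting for it. The hostnames, log path, and pattern are made up, and it assumes you already have SSH keys to each box.)

      ```python
      #!/usr/bin/env python3
      """Sketch: fan out over a fleet and pull recent error lines from each host.
      Hostnames, log path, and pattern are made-up placeholders; assumes working SSH keys."""
      import subprocess
      from concurrent.futures import ThreadPoolExecutor

      HOSTS = ["lb-01", "web-01", "web-02", "app-01", "db-01"]   # hypothetical fleet
      LOG = "/var/log/myapp/app.log"                             # hypothetical path
      PATTERN = "ERROR"

      def tail_for_errors(host: str) -> str:
          cmd = ["ssh", host, f"grep -h {PATTERN} {LOG} | tail -n 5"]
          try:
              out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
              body = out.stdout or "(no matches)"
          except subprocess.TimeoutExpired:
              body = "(ssh timed out)"
          return f"=== {host} ===\n{body}"

      if __name__ == "__main__":
          # Query every host in parallel and print the reports in a stable order.
          with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
              for report in pool.map(tail_for_errors, HOSTS):
                  print(report)
      ```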

      The real force-multipliers for any SRE are two things: their Rolodex (who they can call in when things go beyond their experience level) and a shared library of bitter experiences (usually a wiki, though I used a shared text file back in my mainframe days; at one shop I set up a wiki and dubbed it the Bitter Experience Document, or BED, “since the incident isn’t over until it’s been put to BED.” 😂 To quote a former, now-retired mentor, “I’m only laughing to keep from crying.”). Knowing who to call in, and when, for what helps you in the immediate term and gets the site back up faster.

      Documenting the root cause in a blameless way helps in the long term: record the error that you (the collective “you”) observed as the trigger, the chain of events that led to the failure (especially if you can build a timeline), the points where better automation or alerting could have avoided the situation, and action items for future-you if it crops up again. All of that improves reliability over time, and it CANNOT live only in one person’s notes or between one person’s ears; that’s not scalable, and it falls apart the moment that person does a person-thing, like get sick, take a vacation, have a kid, retire, or drop dead.

      Tl;dr: improve your coding skills, make friends with your coworkers and know who to lean on for what, and document the daylights out of every event that occurs, since you WILL forget it, especially at 2AM.

      2. Not to be flippant, but the best tool is the one you actually use and that fits its purpose. I’ve used Puppet, Ansible, Jenkins, Pivotal Cloud Foundry, debugged another team’s CFEngine configs… too many to remember, let alone list. The thing to keep in mind is that, as long as the tool does what you need it to do and is something you can teach your coworkers to use, it’s meeting your needs and is fine. If you find a tool that works faster, is easier, or that more of your coworkers will use (so you’re not perpetually on-call), switch to that one. 🤷‍♂️ Tools are far less important than people, so use the tool that helps people the most.

      3. (I’m answering this in the context of what makes an SRE’s life easier…) Write good tests and error messages, for the love of all that’s holy, above or below. A good test is quick, verifies what it needs to verify, and either passes quietly or fails fast and loud. One of the most unforgivable things I’ve seen was a developer decision in a Spring Boot app that every response should always come back as HTTP 200 (https://http.cat/200), the “All is well” code, even when it had REALLY hit an HTTP 500 (https://http.cat/500) or 503 (https://http.cat/503) error underneath, and the only way to spot it was to parse the body of the response message. 🤬 I couldn’t set up monitors and traps on actual error codes, because the app was hiding errors behind clean responses, which made understanding the health of the application significantly more difficult. We had to depend on app logs shipped into an ELK stack, plus the couple of the couple-dozen microservices that also logged their HTTP responses to a different ELK stack, at least in a form where I could scan the message bodies for error strings. 😭
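      (To make that concrete, here’s a rough sketch of the kind of check that mess forces you to write: because you can’t trust the status code, you have to sniff the response body too. The URL and the error markers are placeholders, not from the app in question.)

      ```python
      #!/usr/bin/env python3
      """Health check for an app that hides errors behind HTTP 200.
      The URL and error markers are placeholders for whatever your app actually emits."""
      import sys
      from urllib.request import urlopen
      from urllib.error import URLError

      URL = "https://myapp.example.com/api/health"                  # hypothetical endpoint
      ERROR_MARKERS = ('"status": "ERROR"', "Internal Server Error")

      def check(url: str) -> bool:
          try:
              with urlopen(url, timeout=5) as resp:
                  body = resp.read().decode("utf-8", errors="replace")
          except URLError as exc:
              print(f"DOWN: {exc}")
              return False
          # The status code alone says nothing here, so scan the body as well.
          if resp.status != 200 or any(marker in body for marker in ERROR_MARKERS):
              print(f"UNHEALTHY: status={resp.status} body={body[:200]!r}")
              return False
          print("OK")
          return True

      if __name__ == "__main__":
          sys.exit(0 if check(URL) else 1)
      ```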

      I often get asked, “What should I test?” My simple answer is, “Only test the things you care about.” If your app doesn’t depend on running on a specific OS, don’t test which OS you’re running under (and you’re hopefully writing code that’s as OS-agnostic as possible). If your code could fail because some user (NB: there’s no “L” in “user” 😜) sends a bunch of garbage into your input field, write a test for garbage in your input field, but don’t test every possible combination of characters from ‘a’ to “ZZZZZZZZZZZZZZZZZZZZZ” – cover your bases, especially the bizarre edge cases, and call it good. Every failure, every page, every incident should result in at least one new test in each affected system, because it happened precisely where your group missed a real event that could occur and didn’t test for it. You’re smarter now, if less well-rested… 😂
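      (Here’s a toy sketch of what “a test for garbage in your input field” can look like; parse_quantity() is a made-up stand-in for your real input handler, and the garbage list is the kind you grow one entry at a time after each incident.)

      ```python
      """Sketch: test the garbage you actually fear, not every possible input.
      parse_quantity() is a made-up function standing in for your real input handler."""
      import pytest

      def parse_quantity(raw: str) -> int:
          """Toy example: parse a user-supplied quantity field, rejecting junk."""
          value = int(raw.strip())
          if not 1 <= value <= 10_000:
              raise ValueError(f"quantity out of range: {value}")
          return value

      def test_normal_input():
          assert parse_quantity(" 42 ") == 42

      @pytest.mark.parametrize("garbage", ["", "   ", "lots", "9" * 50, "1; DROP TABLE orders"])
      def test_garbage_input_is_rejected(garbage):
          # Cover the weird edge cases you've actually been burned by, then stop.
          with pytest.raises(ValueError):
              parse_quantity(garbage)
      ```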

      It’s really nice if you can work with your observability folks (also usually SREs) to add event-hooks to your code that feed performance and health-check data into their monitoring system, like Grafana, New Relic, Datadog, etc., since more data usually means more information, which we can mine for better intelligence and, ideally, turn into earlier and more effective action (a trick I learned from a VP a LOOONG time ago was the Data -> Information -> Intelligence -> Action pipeline; I, an engineer, was talking Data at a VP who was looking for the Action he needed to take, and it didn’t go well).
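      (At the code level, those event-hooks can be as simple as firing StatsD-style counters and timers over UDP, which most of those monitoring stacks can ingest one way or another; here’s a bare-bones sketch where the host, port, and metric names are all placeholders.)

      ```python
      #!/usr/bin/env python3
      """Bare-bones StatsD-style metric emitter; host, port, and metric names are placeholders."""
      import socket
      import time

      STATSD_ADDR = ("metrics.example.internal", 8125)   # hypothetical StatsD/agent endpoint
      _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

      def emit(metric: str, value: float, kind: str) -> None:
          """Send one metric in the plain-text StatsD format: '<name>:<value>|<type>'."""
          _sock.sendto(f"{metric}:{value}|{kind}".encode(), STATSD_ADDR)

      def handle_request() -> None:
          start = time.monotonic()
          # ... your actual request handling would go here ...
          emit("myapp.requests", 1, "c")                                     # counter
          emit("myapp.request_ms", (time.monotonic() - start) * 1000, "ms")  # timer

      if __name__ == "__main__":
          handle_request()
      ```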

      Tl;dr: write tests that check just enough to be trustworthy and that grow over time, write error messages that are meaningful and point you toward what’s likely wrong (especially if they link to a wiki page or something about how to fix that error state 🙌), and coordinate with the monitoring and observability folks to get them as many useful metrics as you can.


      Sorry to wall-of-text at you, but these are questions I’ve gotten a LOT, and I think they’re important to ask and to answer. I usually answer them over beer or when teaching an intern or something, and verbal communication is so much faster than written comms in some ways…

      Hope this helps and that I didn’t scare you off from being an SRE! It’s… never dull, which is what I like about it, but the constant interrupt-driven nature of the work can be wearing at times; that’s a separate problem, though, and really a culture problem, not an engineering one. 😅