Running Production Systems: Level 1, Software Firefighting

You Build It, You Run It. The slogan is spreading across software engineering teams all around the world. It works well: successful teams care not only about writing good code, but also about how that code serves end users in production.

For high-load projects, running software has turned into a separate discipline called Site Reliability Engineering. As a colleague of mine (a former DevOps engineer, now an SRE) told me:

SRE is the next level of DevOps

I think engineering teams should know how their software will run in production from the very beginning of the project. That includes knowledge of the infrastructure, data stores, monitoring, and the deployment pipeline. It’s even better if you know the budget for all of these.

In this series of blog posts, we will look at the approaches I’ve seen during my career, from the simplest to the most sophisticated.

Let’s check out the very first level, which I call Software Firefighting.

Why put firefighting foam on your code

Software Firefighting is where engineers usually start. You do not need any experience to begin, and doing it can actually help you grow… up to a point.

You write some code, push it to production, and hope that everything works fine. You do not have much information about how the system behaves or how healthy it is.

This tactic is OK when it’s not a big deal if your service does not operate properly:

Pet projects
Students’ projects
Hackathons (I’ve never had proper monitoring during a hackathon, have you?)
Prototypes/demo projects

But I used to work in an environment where business-related applications did not have any telemetry.

My boss called me at 11 PM to tell me that the piece of ~~sh*t~~ fantastic software was not working, and that the next morning we were demoing the system to the customer who was supposed to pay for it.

Trust me, it’s not funny at all. After such a call, you connect via SSH to the box that runs the software and try to understand what’s going on. In Software Firefighting mode, you try to reproduce the issue in production to see something useful in the logs. Another option, if your logs are not very verbose or you want to go to a lower level, is to attach a debugger to the running process.

If you stick with the Software Firefighting approach for running production software, I see a high probability of you working late at night. Very late at night… That might be OK for fans of nighttime coding. Not sure you will be paid for the overtime, though.

A vendor from another continent called me and asked us to expedite fixes for issues with our software that were happening right then, during an exhibition. It was 3 AM for me. An exciting experience that I would like to avoid in the future.

When you find the cause, you experiment on the live instance, try to patch the code, eventually create a PR whose title starts with the word hotfix, and deploy it straight to production without proper peer review. Your team will learn about it later… hopefully not from a new incident.

Ultimate Software Firefighting workflow

Well, I warned you that it’s not suitable for business applications. Now I will share my thoughts on how to do it.

Identify the problem.
- Connect to the production environment.
- Collect live stats (the next section discusses how to get trained in that area).
Reproduce the problem.
- Localize: find the place where it’s happening. The smaller the localized area, the better for you: service, module, class, method, code statement, variable…
- Repeat the harmful action to verify that it is the right place (only if that’s safe from the business perspective).
Prepare the hotfix.
- Find similar problems in your incidents registry/closed Jira tickets or on Stack Overflow.
- Make the change locally.
- Test it without touching production.
- If you’re sure about the change - move it to production.
- Verify that the problem is solved (and that no new issues were created).
Write a blog post for your team to share what you learned from the firefighting session.
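
For the Localize step, a small script can show which part of the system is producing errors. Here is a minimal sketch in Python. The log line format, the logger names in the sample, and the function name count_errors_by_source are my assumptions for illustration, so adjust the pattern to whatever your service actually writes.

```python
import re
from collections import Counter

# Assumed (hypothetical) log line format:
#   2024-05-01 23:04:12 ERROR billing.invoice: could not render PDF
LOG_LINE = re.compile(
    r"^\S+ \S+ (?P<level>[A-Z]+) (?P<source>[\w.]+): (?P<message>.*)$"
)


def count_errors_by_source(lines, levels=("ERROR", "CRITICAL")):
    """Count log lines at the given levels, grouped by logger name."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Lines that do not follow the format (e.g. tracebacks) are skipped.
        if match and match.group("level") in levels:
            counts[match.group("source")] += 1
    return counts


# Hypothetical sample; in a real session you would pass an open log file.
sample = [
    "2024-05-01 23:04:11 INFO api.http: GET /invoices/42 200",
    "2024-05-01 23:04:12 ERROR billing.invoice: could not render PDF",
    "2024-05-01 23:04:13 ERROR billing.invoice: could not render PDF",
    "2024-05-01 23:04:15 CRITICAL storage.s3: upload timed out",
    "Traceback (most recent call last):",
]

for source, hits in count_errors_by_source(sample).most_common():
    print(f"{hits:>4}  {source}")
```

The source with the most hits is a reasonable first candidate for the localized area, though a noisy logger is not always the guilty one.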

Train the firefighters

Troubleshooting in production is a great skill. I wish I had it at a higher level than I do now (but I do not want any business to pay for that). Here are the levels of troubleshooting, ordered from the easiest to the more complex:

Check the state of the box that is running the application (healthy or not).
- top - the command collects resource usage statistics on your machine and provides a dynamic view of resource utilization. That’s the easiest thing you can do to gain some situational awareness.
Check the statements that the software is logging. The logs for your service, web server, load balancer - all the things. grep and tail are your best friends there.
Investigate the suspicious running process(es) without restarting them. Since it’s not always straightforward to reproduce bugs in production, I was thrilled when I learned that we could attach to the code that is already serving our customers. It lets you take a much deeper look into the details. Some tools that can give you a clue:
- gdb - the GNU debugger, which allows you to set breakpoints and temporarily stop the execution of a process to see the values of variables in real time. The better you understand the program’s source code, the easier it is to find the error. When you need to investigate a crash, you can set a breakpoint just before the program crashes. Some tips:
  - You can see waiting threads using gdb
  - You can extend gdb using Python (you can also do that with GoLang AFAIK)
The tool helped me to figure out an issue with regular expressions that caused a major incident in a large cloud system.

gdb gives you more insight than you get from looking only at logs. You can find many blog posts on how to use the debugger with your technology stack. I’d like to share a link to the blog post Debugging of CPython processes with gdb by Roman Podoliaka. For backend applications written in Python, it can be more convenient to use pdb. Here’s a great tutorial on how to start using the tooling.
Trace all the things on the box (network traffic, library calls, system calls). If none of the above helps, look into the universe of tools and approaches for introspecting your software. Check out these links for learning:
- Choosing a Linux Tracer (2015)
- Linux Performance
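
Going back to the first level (the state of the box): when you want the same kind of quick look that top gives you, but in a form you can paste into an incident note, a few lines of Python are enough. This is a rough sketch that assumes a Linux box with /proc/meminfo. The function names and the thresholds (10% free memory, 90% disk usage) are hypothetical, picked only for illustration.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numeric values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def box_snapshot(meminfo_path="/proc/meminfo", disk_path="/"):
    """Collect a one-shot view of load, memory and disk (Linux only)."""
    load_1m, _, _ = os.getloadavg()
    with open(meminfo_path) as handle:
        mem = parse_meminfo(handle.read())
    disk = shutil.disk_usage(disk_path)
    return {
        "cpus": os.cpu_count() or 1,
        "load_1m": load_1m,
        "mem_available_pct": 100.0 * mem["MemAvailable"] / mem["MemTotal"],
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }


def find_warnings(snapshot, min_mem_pct=10.0, max_disk_pct=90.0):
    """Return human-readable warnings; the thresholds are illustrative."""
    found = []
    if snapshot["load_1m"] > snapshot["cpus"]:
        found.append("load average exceeds CPU count")
    if snapshot["mem_available_pct"] < min_mem_pct:
        found.append("low available memory")
    if snapshot["disk_used_pct"] > max_disk_pct:
        found.append("disk almost full")
    return found


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    snapshot = box_snapshot()
    print(snapshot)
    print(find_warnings(snapshot) or "nothing suspicious at first glance")
```

It does not replace top, which shows per-process details and updates live. It only answers the first question: is the box itself healthy or not?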

Also, we have a couple of tools suggested by subscribers (Carlos Neira and Amie Wang) from the related Twitter thread:

If you’re especially interested in performance optimization, I highly recommend watching the videos (90 mins) about Linux Performance Tools by Brendan Gregg, Part 1 and Part 2. He explains performance optimization methodologies and provides a good number of practical examples. Also read his blog; it contains a lot of knowledge.

While learning the topic, I also found the slides from Theo Schlossnagle interesting.
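
One more note on the tip about seeing waiting threads. For a Python service you can get a similar picture from inside the process, without gdb. The sketch below is not the gdb workflow described above; it is an in-process alternative that has to be shipped with your code in advance. It relies on sys._current_frames(), which is a CPython-specific function, and the names dump_thread_stacks and stuck-worker are made up for the demo.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return the current stack of every thread in this process as text."""
    names = {thread.ident: thread.name for thread in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append(f"--- thread {names.get(ident, '?')} ({ident}) ---")
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


# Demo: a worker that blocks on an event, like a thread stuck on a lock.
entered = threading.Event()
release = threading.Event()


def wait_for_release():
    entered.set()    # tell the main thread that the worker is running
    release.wait()   # block here until the main thread lets us go


worker = threading.Thread(target=wait_for_release, name="stuck-worker")
worker.start()
entered.wait()

report = dump_thread_stacks()  # taken while the worker is still blocked

release.set()
worker.join()
print(report)
```

The report lists each thread with the call chain it is currently in, so a thread that never leaves wait_for_release stands out. The standard library’s faulthandler module can produce a similar dump when the process receives a signal, which is closer to the attach-without-restarting idea.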

Pros and Cons

Let’s analyze the benefits of the cowboy style of running production software.

Pros:

Heroism. You feel like a hero rescuing your business from disasters.
Cheap. No investments in your infrastructure are needed.

Cons:

Affects the reputation of your business. You become aware of a problem only when customers or business owners discover that the service does not work. Example: you learn from your Twitter feed that your service is down for end users and they’re moving to a competitor. It sucks.
Requires the engineering team to be trained.
Time-consuming. You need to reproduce the issue in prod to gather telemetry right on the box.
Not accountable. Leads to hotfixes running in production that might not exist in your repository. And the other engineers on your team might learn nothing about how to fix such issues.
Stressful. Dangerous to your mental and physical health. As well as your personal life.
Non-cooperative. It’s hard to hand over work if you need to step out.

Again, I do not recommend running business software applications in the Firefighting mode.

We can remove some disadvantages of the approach. To achieve that we need to grow and reach the next level of maturity. The second blog post in the series will be published soon.

Well, what’s your favorite debugging/performance tool?

P.S. This blog post started as a Twitter thread. You can subscribe to my Twitter account or blog so you do not miss the next knowledge sharing session about backend software engineering.

AllYouNeedIsBackend