Running Production Systems: Level 1, Software Firefighting

You Build It, You Run It. The slogan has spread across software engineering teams all around the world. It’s working great - successful teams care not only about writing good code, but also about how the code serves end-users in the production environment.

For Highload projects, running software has turned into a separate discipline called Site Reliability Engineering. As a colleague of mine (a former DevOps Engineer, now an SRE) told me:

SRE is the next level of DevOps

I think that engineering teams should know how the software will run in production from the very beginning of the project. That includes knowledge about infrastructure, data storage, monitoring, and the deployment pipeline. It’s even better if you know the budget for all of these things.

In this series of blog posts, we focus on the approaches that I’ve seen during my career, from the simplest to the most sophisticated.

Let’s check out the very first level, which I call Software Firefighting.

What Does a Software Tech Lead Do?

Tech Lead is a relatively new role in the hierarchy of software development organizations. When I heard about the role for the first time, my first thought was

Is that a software architect + team lead?

I do not think that the definition is correct, but it’s a good way of thinking about the role. In the post, I look back at 3.5 years of my experience in the position, which includes:

  • leading one of the teams for Atlassian Stride - a complete team communication solution. Over almost 2 years the team ranged from 5 to 10 engineers.
  • leading KPIdata - a non-profit organization that developed software for assessing the quality of higher education in Kyiv Polytechnic Institute. The team expanded to 10 core members (only 3 software engineers including myself) and eventually 180+ individual contributors helped us to deliver the project.
  • leading a team of 4 engineers (including myself) at Video Internet Technologies Ltd for Integration of Video Management Systems (CCTV).

Note that the same position might have different responsibilities in different companies.

Check out the blog post to learn about my reality of being a full-time owner of software systems. I elaborate on the pros and cons of being in a Tech Lead position.

From the practical standpoint - the list of the most critical skills for the position is provided at the very end of the blog post.

Does Your Engineering Team Help Your Business To Win?

Yay, a DevOps Book Club was started in the office. I joined since I love DevOps, increasing the productivity of my team, and, of course, reading books. I could not even imagine how useful it would be for solving day-to-day organizational challenges with my team and for my coaching as a tech lead.

The first book for the club was The Phoenix Project, written by Gene Kim, Kevin Behr, and George Spafford. People call the genre business fiction - it’s a story about an IT manager (ex-marine) who was unexpectedly promoted to VP of IT Operations.

In the blog post, you can find my thoughts and notes on the book, seen through the prism of my experience working on a team and leading teams. Actionable items are provided as usual.

Note that we won’t be covering the plot of the novel. If you’re interested in that - read the book.

2017 Tech Accomplishments

Evaluating accomplishments motivates me and gives me a breath of fresh air for the new ones. I believe that it’s an essential exercise for goal setting.

I’m proud to have been a part of the Atlassian Stride team in 2017. Working for the company accelerates professional growth enormously.

During my vacation, I analyzed the last year of really hard work (the hardest in my career) to make the list of highlights.

No Tests - No Pull Request, Right? Types of Tests that Should Be in Your Codebase

As the blog post Pull Requests: The Good, The Bad and The Ugly claims:

If you do not have time to write tests today - you will find the time for fixing bugs on Friday night

In other words, to establish solid reliability in production tomorrow we need to invest our time today. Your need for tests for your current project depends on:

  • Size of the team that maintains the codebase: return True if team.size > 1 else False. Having more engineers means more views on the same items. Tests help to document the opinions on how a class or function can be used.
  • Size of the codebase: return True if project.modules > 1 else False. You can’t remember the color of socks that you wore two days ago. Can you remember everything in the project?
  • Duration of development and maintenance phases of the project. The script that you run only once can perfectly live without a solid test coverage. If you’re building a system for decades - please, prepare a good legacy for the next generations of developers.

I have a strong feeling that you think that your code needs tests since you’re still reading this.

In the blog post, I will guide you through the types of automated tests that should be implemented by software engineers: unit, integration, external, and performance ones. It does not cover testing efforts by quality engineers, but the article can still be valuable for them.

You will find code examples that use Python, but you do not have to know the language.

The SQL I Love <3. Efficient pagination of a table with 100M records

I am a huge fan of databases. I even wanted to make my own DBMS when I was in university. Now I work with both RDBMS and NoSQL solutions, and I am very enthusiastic about that. You know, there’s no Golden Hammer: each problem has its own solution, or a set of solutions.

In the series of blog posts The SQL I Love <3, I walk you through some problems solved with SQL which I found particularly interesting. The solutions are tested using a table with more than 100 million records. All the examples use MySQL, but the ideas apply to other relational data stores like PostgreSQL, Oracle and SQL Server.

This chapter focuses on efficiently scanning a large table using pagination with an offset on the primary key. This is also known as keyset pagination.
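To make the idea concrete, here is a minimal sketch of keyset pagination. It is not the code from the blog post: it uses Python's built-in sqlite3 module with an in-memory database instead of MySQL so that it runs anywhere, and the customers table, its columns, and the iter_pages helper are hypothetical names made up for illustration.

```python
import sqlite3


def iter_pages(conn, page_size):
    """Yield pages of (id, name) rows ordered by primary key.

    Hypothetical helper: each query starts right after the last id
    seen on the previous page instead of using OFFSET.
    """
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, name FROM customers "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # remember where this page ended


# Tiny in-memory table standing in for the 100M-record one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 11)],
)

pages = list(iter_pages(conn, page_size=4))
```

Because every page is requested with a condition on the primary key rather than a growing OFFSET, the database can typically jump straight to the right place in the index instead of reading and discarding all the rows that were already visited.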

Never Give Up, Retry: How Software Should Deal with Failures

It’s doubtful that your backend has everything within one process: you need to read configuration, store customers’ data, write logs and metrics about the status of your software.

If you’re working on a network application - it’s even more complicated: your database can be far far away from the running code.

Some things can go wrong: a network blip might happen, the remote database can be overloaded by incoming requests, a query can reveal some bug in the DBMS and crash it, your data can be out of order on that side for some reason, and so on.

Microservice architecture encourages cross-process communication over the network. Now your service asks another one for its configuration, which is stored somewhere in a database. You should prepare the software for non-deterministic failures that might occur during the data transfer. And not only then.

In the blog post, we will look into some common failures that can be solved with proper retrying. The basic ideas are described using Python, but experience with the language is not required for understanding.
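As a taste of what proper retrying can look like, here is a minimal sketch, not the implementation from the blog post. The retry helper, its parameters, and the flaky_fetch_config function are hypothetical names invented for this example; the sketch assumes that only ConnectionError and TimeoutError are worth retrying and uses exponential backoff with random jitter between attempts.

```python
import random
import time


def retry(operation, attempts=3, base_delay=0.1,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    Hypothetical helper: waits base_delay * 2**n (plus jitter)
    between tries and re-raises the error from the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# A stand-in for a remote call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch_config():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network blip")
    return {"feature_enabled": True}


# sleep is replaced with a no-op so the example finishes instantly.
result = retry(flaky_fetch_config, attempts=5, sleep=lambda seconds: None)
```

Errors that are not listed in retry_on are raised immediately, since repeating a request that failed because of a bug in the caller is unlikely to help.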

What Is a Highload Project?

Highload. It was the main buzzword for me 5 or 6 years ago. Ever since The Social Network movie was released, I wanted to develop that kind of software.

The domain area did not matter to me then: dating services for founders of dating services, illegal online casinos or websites which stream questionable video content - everything would be okay. I wanted to be a part of the team which solves complex engineering problems at scale and delivers a product to many thousands and millions of users simultaneously.

I had read dozens of definitions on the Internet from different sources, but I did not understand what highload means. Now, after years of developing various highload projects, I have created my very own definition of highload.

Pull Requests: The Good, The Bad and The Ugly

I remember that at my first paid software job we did not have a version control system: all code was uploaded to a folder on an FTP server.

Source Code Management was not very sophisticated: we just had the previous revision of the most important source files suffixed with .old in the same directory. Now having ‘just Git’ without a fancy dashboard like the one supplied by Bitbucket, GitHub, or GitLab does not look suitable to me. The tools save a lot of time and are extremely convenient.

Working on our product, a software engineer submits pull requests almost every day. During the last year, I spent approximately 200 hours doing code review - it’s more than 1 month of work!

I believe that merging good pull requests and declining ugly ones is essential for the success of your product.

What about bad ones? Well, we can do some work to make them either good or ugly. Let’s review the examples representing different aspects of a pull request. Some ideas are explained using Python, but they are applicable to any other non-esoteric programming language.

🌮 Tacos Delivery Over HTTP/2

Recently I looked into HTTP/2 and its comparison with HTTP/1.1. The adoption of the technology is growing - 16.1% of top Alexa websites already use the latest version of the protocol.

I wanna understand HTTP/2 better. For now, I do not see any project where I could apply the technology in production. But we all know that another great way to learn something is to teach somebody else. It happened that y’all are selected as the audience for that :)

In teaching and learning, it’s vital to keep things interesting.

Tacos are definitely not boring. In the blog post, we will try to imagine that we live in a world where web servers deliver tacos instead of HTML pages. Let’s contemplate the pros of serving this delicious Tex-Mex food over HTTP/2 instead of regular HTTP/1.1.

Do not read it if you’re hungry!

© Viach Kakovskyi 2018
