<h1><a href="http://allyouneedisbackend.com/blog/2018/09/21/software-firefighting-running-production-software">Running Production Systems: Level 1, Software Firefighting</a></h1>
<p><a href="http://allyouneedisbackend.com/">All You Need Is Backend</a> - Backend Software Engineering and Technical Leadership within distributed teams. From Viach Kakovskyi. Published 2018-09-21.</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/this-is-fine.jpg" alt="Running Software in Production. Levels of Maturity" class="image-right" width="300" height="300" layout="fixed">
</amp-img>
</div>
<p><strong>You Build It, You Run It.</strong> The slogan has spread across software engineering teams all around the world. It works well: successful teams care not only about writing good code, but also about how that code serves end users in the production environment.</p>
<p>For <a href="http://allyouneedisbackend.com/blog/2017/08/30/what-is-highload/" target="_blank">Highload projects</a>, running software has turned into a separate discipline called <strong>Site Reliability Engineering</strong>. As one of my colleagues (a former DevOps Engineer, now an SRE) told me:</p>
<blockquote>
<h4 id="sre-is-the-next-level-of-devops">SRE is the next level of DevOps</h4>
</blockquote>
<p>I think that engineering teams should know how their software will run in production from the very beginning of a project. That includes knowledge of the infrastructure, data stores, monitoring, and deployment pipeline. Even better if you know the budget for all of it.</p>
<p>In this <em>series of blog posts</em>, we focus on the approaches that I’ve seen during my career, from the simplest to the most sophisticated.</p>
<p>Let’s check out the very first level, which I call <strong>Software Firefighting</strong>.</p>
<!--more-->
<h2 id="why-put-firefighting-foam-on-your-code">Why put firefighting foam on your code</h2>
<p><strong>Software Firefighting</strong> is where engineers usually start. You do not need any experience to begin. Doing it can actually help you grow… up to a point.</p>
<p>You write some code, push it to production, and hope that everything works fine. Yes, you do not have much information about how the system behaves or how healthy it is.</p>
<p>This tactic is OK for the cases when it’s not a big deal if your service does not operate properly:</p>
<ul>
<li>Pet projects</li>
<li>Students’ projects</li>
<li>Hackathons (I’ve never had proper monitoring during a hackathon, have you?)</li>
<li>Prototypes/demo projects</li>
</ul>
<p>But I used to work in an environment where business applications did not have any telemetry.</p>
<blockquote>
<h4 id="my-boss-called-me-at-11-pm-to-tell-that-the-piece-of-sht-fantastic-software-is-not-working-and-the-next-morning-were-demoing-the-system-for-the-customer-that-should-pay-for-that">My boss called me at 11 PM to tell me that the piece of <del>sh*t</del> fantastic software was not working, and that the next morning we were demoing the system to the customer who was supposed to pay for it.</h4>
</blockquote>
<p>Trust me; it’s not funny at all. After such calls, you connect via SSH to the box that runs the software and try to understand what’s going on. In Software Firefighting mode, you try to reproduce the issue in production to see something useful in the logs. Another option, if your logs are not very verbose or you want to go a level lower, is to attach a debugger to the running process.</p>
<p>If you stick with the Software Firefighting approach for running production software, I see a high probability of you working late at night. Very late at night… That might be OK for fans of nightly coding. Not sure whether you’ll be paid for the overtime though.</p>
<blockquote>
<h4 id="the-vendor-from-another-continent-dialed-me-they-asked-to-expedite-issues-with-our-software-that-are-happening-right-away-during-the-exhibition-it-happened-to-me-at-3-am-the-exciting-experience-that-i-would-like-to-avoid-in-the-future">A vendor from another continent called me. They asked us to expedite fixes for issues with our software that were happening right then, during an exhibition. For me it was 3 AM. An exciting experience that I would like to avoid in the future.</h4>
</blockquote>
<p>When you find the cause, you experiment on the live instance, try to patch the code, eventually create a PR whose title starts with the word <code class="highlighter-rouge">hotfix</code>, and deploy it straight to production without proper peer review. Your team will learn about that later… hopefully not from a new incident.</p>
<h2 id="ultimate-software-firefighting-workflow">Ultimate Software Firefighting workflow</h2>
<p>Well, I warned you that it’s not suitable for business applications. Now I will share my thoughts on how to do it.</p>
<ol>
<li>Identify the problem.
<ul>
<li>Connect to the production environment.</li>
<li>Collect live stats (the next section discusses how to get trained in that area).</li>
</ul>
</li>
<li>Reproduce the problem.
<ul>
<li>Localize - find the place where it’s happening. The smaller the localized area, the better for you: service, module, class, method, code statement, variable…</li>
<li>Repeat the harmful action to verify if that is the right place (only if that’s safe from the business perspective).</li>
</ul>
</li>
<li>Prepare the hotfix.
<ul>
<li>Find similar problems in your incidents registry/closed Jira tickets or on Stack Overflow.</li>
<li>Make the change locally.</li>
<li>Test it without touching production.</li>
<li>If you’re sure about the change - move it to production.</li>
<li>Verify that the problem is solved (and that no new issues were created).</li>
</ul>
</li>
<li>Write a blog post for your team to share what you learned from the firefighting session.</li>
</ol>
<h2 id="train-the-firefighters">Train the firefighters</h2>
<p>Troubleshooting in production is a great skill. I wish I were better at it than I am now (but I do not want any business to pay for that). Here are the levels of troubleshooting, ordered from the easiest to the most complex:</p>
<ol>
<li><strong>Check the state of the box that is running the application (healthy or not)</strong>.
<ul>
<li><strong><a href="https://linux.die.net/man/1/top" target="_blank">top</a></strong> - the command collects resource usage statistics on your machine and provides a dynamic view of resource utilization. That’s the easiest thing you can do to gain some situational awareness. Links for learning:
<ul>
<li><a href="http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html" target="_blank">Linux Load Averages: Solving the Mystery</a></li>
<li><a href="https://gtacknowledge.extremenetworks.com/articles/How_To/Understanding-the-output-of-the-TOP-command" target="_blank">Understanding the output of TOP command</a></li>
<li><a href="https://www.thegeekstuff.com/2010/01/15-practical-unix-linux-top-command-examples/" target="_blank">15 Practical Linux Top Command Examples</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Check the statements that the software is logging</strong>. The logs of your service, web server, load balancer - all the things. <code class="highlighter-rouge">grep</code> and <code class="highlighter-rouge">tail</code> are your best friends here.</p>
</li>
<li>
<p><strong>Investigate the suspicious running process(es) without restarting.</strong> Since it’s not always straightforward to reproduce bugs in production, I was thrilled when I learned that we could connect to the code that is already serving our customers. That makes it possible to look much deeper into the details. Some tools that can give you a clue:</p>
<ul>
<li>
<p><strong><a href="https://www.gnu.org/software/gdb/">gdb</a></strong> is the GNU debugger. It allows you to set breakpoints and temporarily stop the execution of a process to see the values of variables in real time. The better you understand the source code of the program, the easier it is to find the error. When you need to investigate a crash, you can set a breakpoint just before the point where the program crashes. Some tips:</p>
<ul>
<li>You can <a href="https://benbernardblog.com/my-startling-encounter-with-python-debuggers/" target="_blank">see waiting threads using gdb</a></li>
<li>You can <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python.html" target="_blank">extend gdb using Python</a> (you can also do that with GoLang AFAIK)</li>
</ul>
</li>
</ul>
<blockquote>
<h4 id="the-tool-helped-me-to-figure-out-an-issue-with-regular-expressions-that-caused-a-major-incident-in-a-large-cloud-system">The tool helped me to figure out an issue with regular expressions that caused a major incident in a large cloud system.</h4>
</blockquote>
<p><strong>gdb</strong> gives you more insight than you get from looking only at logs. You can find many blog posts on how to use the debugger with your technology stack. I’d like to share a blog post about <a href="http://podoliaka.org/2016/04/10/debugging-cpython-gdb/" target="_blank">Debugging of CPython processes with gdb</a> by <a href="https://twitter.com/rpodoliaka">Roman Podoliaka</a>. For backend applications written in Python, it can be more convenient to use <a href="https://docs.python.org/3/library/pdb.html" target="_blank">pdb</a>. Here’s <a href="https://github.com/spiside/pdb-tutorial" target="_blank">a great tutorial</a> on how to start using it.</p>
</li>
<li>
<p><strong>Trace all the things on the box (network traffic, library calls, system calls)</strong>. If none of the above helps, look into the universe of tools and approaches for introspecting your software. Check out the links for learning:</p>
<ul>
<li><a href="http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html" target="_blank">Choosing a Linux Tracer (2015)</a></li>
<li><a href="http://www.brendangregg.com/linuxperf.html" target="_blank">Linux Performance</a></li>
</ul>
</li>
</ol>
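<p>As a rough illustration of the first two levels, here is a minimal Python sketch that takes a quick health reading of the box and then does the <code class="highlighter-rouge">grep</code> + <code class="highlighter-rouge">tail</code> routine on log lines. It assumes a Unix-like machine; the function names and the sample log lines are made up for this example, and it is not a replacement for <code class="highlighter-rouge">top</code>.</p>

```python
import os
import re
from collections import deque
from typing import Iterable, List


def load_per_cpu() -> float:
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def last_matching_lines(lines: Iterable[str], pattern: str, limit: int = 20) -> List[str]:
    """Keep the last `limit` lines matching `pattern`, like `grep PATTERN | tail`."""
    regex = re.compile(pattern)
    matches = (line.rstrip("\n") for line in lines if regex.search(line))
    # A bounded deque remembers only the most recent matches.
    return list(deque(matches, maxlen=limit))


if __name__ == "__main__":
    print(f"1-minute load per CPU: {load_per_cpu():.2f}")

    # Hypothetical log lines; in practice you would pass an open log file.
    sample_log = [
        "2018-09-21 23:01:10 INFO request served\n",
        "2018-09-21 23:01:12 ERROR database timeout\n",
        "2018-09-21 23:01:15 ERROR database connection refused\n",
    ]
    for line in last_matching_lines(sample_log, r"ERROR", limit=10):
        print(line)
```

<p>Because <code class="highlighter-rouge">last_matching_lines</code> accepts any iterable of strings, an open file object works as well, and the file is then read line by line instead of being loaded into memory.</p>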
<p>Also, we have a couple of tools suggested by subscribers (<a href="https://twitter.com/CarlosN26157061" target="_blank">Carlos Neira</a> and <a href="https://twitter.com/Amie42" target="_blank">Amie Wang</a>) in <a href="https://twitter.com/BackendAndBBQ/status/1026138692514205699" target="_blank">the related Twitter thread</a>:</p>
<ul>
<li><a href="https://github.com/iovisor/bcc" target="_blank">BPF Compiler Collection - Tools</a></li>
<li><a href="https://www.kernel.org/doc/html/v4.15/dev-tools/kgdb.html" target="_blank">kgdb</a></li>
<li><a href="https://docs.oracle.com/cd/E18752_01/html/816-5041/intro-1.html" target="_blank">mdb</a> for Solaris OS</li>
<li><a href="http://valgrind.org/" target="_blank">Valgrind</a></li>
</ul>
<p>If you’re especially interested in performance optimization, I highly recommend watching the videos (90 mins) about <strong>Linux Performance Tools</strong> by <a href="http://www.brendangregg.com/" target="_blank">Brendan Gregg</a> - <a href="https://www.youtube.com/watch?v=FJW8nGV4jxY" target="_blank">Part 1</a> and <a href="https://www.youtube.com/watch?v=zrr2nUln9Kk" target="_blank">Part 2</a>. He explains <a href="http://www.brendangregg.com/methodology.html" target="_blank">performance optimization methodologies</a> and provides a good number of practical examples. And <a href="http://www.brendangregg.com/overview.html" target="_blank">read his blog</a>. It’s a lot of knowledge.</p>
<p>While learning the topic, I also found <a href="http://lethargy.org/~jesus/misc/production-troubleshooting.pdf" target="_blank">the slides</a> from <a href="https://lethargy.org/~jesus/page/about/" target="_blank">Theo Schlossnagle</a> interesting.</p>
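<p>Coming back to investigating a running process without restarting it: for a Python backend, one lightweight option is the standard library’s <code class="highlighter-rouge">faulthandler</code> module. The sketch below assumes CPython on a Unix-like box and that you can add a couple of lines to the service before the incident happens; the helper names <code class="highlighter-rouge">install_stack_dump</code> and <code class="highlighter-rouge">remove_stack_dump</code> are made up for this example.</p>

```python
import faulthandler
import signal
import sys


def install_stack_dump(signum=signal.SIGUSR1, file=None):
    """On `signum`, write the Python tracebacks of all threads to `file`.

    The process keeps running afterwards; nothing is restarted.
    `file` must stay open and have a real file descriptor.
    """
    faulthandler.register(signum, file=file or sys.stderr, all_threads=True)


def remove_stack_dump(signum=signal.SIGUSR1):
    """Uninstall the handler; returns True if one was registered."""
    return faulthandler.unregister(signum)


if __name__ == "__main__":
    # Call this once at service startup.
    install_stack_dump()
```

<p>With that in place, running <code class="highlighter-rouge">kill -USR1 &lt;pid&gt;</code> from your SSH session makes the process print where every thread currently is. It shows far less than a debugger does, but it does not pause the service.</p>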
<h2 id="pros-and-cons">Pros and Cons</h2>
<p>Let’s analyze the benefits and drawbacks of the cowboy style of running production software.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><strong>Heroism.</strong> You feel like a hero rescuing your business from disasters.</li>
<li><strong>Cheap.</strong> No investments in your infrastructure are needed.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><strong>Affects the reputation of your business.</strong> You become aware of a problem only when customers or business owners find that the service does not work. Example: you learn from your Twitter feed that your service does not work for end users and they’re moving to a competitor. It sucks.</li>
<li><strong>Requires engineering team to be trained.</strong></li>
<li><strong>Time-consuming.</strong> You need to reproduce the issue in prod to gather telemetry right on the box.</li>
<li><strong>Not accountable.</strong> Leads to running hotfixes in production that might not exist in your repository. And the other engineers on your team might learn nothing about how to fix such issues.</li>
<li><strong>Stressful.</strong> Dangerous to your mental and physical health. As well as your personal life.</li>
<li><strong>Non-cooperative.</strong> It’s hard to hand over work if you need to step out.</li>
</ul>
<p><strong>Again, I do not recommend running business software applications in the Firefighting mode.</strong></p>
<p>We can remove some disadvantages of the approach. To achieve that we need to grow and reach the next level of maturity. The second blog post in the series will be published soon.</p>
<p><strong>Well, what’s your favorite debugging/performance tool?</strong></p>
<p><strong>P.S.</strong> The blog post started as <a href="https://twitter.com/BackendAndBBQ/status/1026138692514205699" target="_blank">the Twitter thread</a>. You can subscribe to my Twitter account or blog so that you do not miss the next knowledge sharing session about backend software engineering.</p>
<h1><a href="http://allyouneedisbackend.com/blog/2018/08/03/what-does-a-tech-lead-do">What Does a Software Tech Lead Do?</a></h1>
<p>Published 2018-08-03.</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/tech-lead-small.png" alt="What Does a Tech Lead Do?" class="image-right" width="256" height="208" layout="fixed">
</amp-img>
</div>
<p><strong>Tech Lead</strong> is a relatively new role in the hierarchy of software development organizations. When I heard about the role for the first time, my first thought was:</p>
<blockquote>
<p>Is that a <strong>software architect + team lead</strong>?</p>
</blockquote>
<p>I do not think that the definition is correct, but it’s a good way of thinking about it. In this post, I look back at 3.5 years of my experience in the position, which includes:</p>
<ul>
<li>leading one of the teams for <strong><a href="https://www.stride.com/" target="_blank">Atlassian Stride</a></strong> - a complete team communication solution. Over almost 2 years, the team had between 5 and 10 engineers.</li>
<li>leading <strong>KPIdata</strong> - a non-profit organization that developed software for assessing the quality of higher education at <a href="https://en.wikipedia.org/wiki/Igor_Sikorsky_Kyiv_Polytechnic_Institute" target="_blank">Kyiv Polytechnic Institute</a>. The team grew to 10 core members (only 3 software engineers, including myself), and eventually 180+ individual contributors helped us deliver the project.</li>
<li>leading a team of 4 engineers (including myself) at <strong><a href="https://www.linkedin.com/company/vit-ltd/" target="_blank">Video Internet Technologies Ltd</a></strong> for Integration of Video Management Systems (CCTV).</li>
</ul>
<p>Note that the same position might have different responsibilities in different companies.</p>
<p>Check out the blog post to learn about my reality of being a full-time owner of software systems. I elaborate on the <strong>pros and cons</strong> of being in a Tech Lead position.</p>
<p>From the practical standpoint, the list of <strong>the most critical skills</strong> for the position is provided at the very end of the post.</p>
<p><!--more--></p>
<h2 id="being-a-full-time-owner">Being a full-time owner</h2>
<p>The first thing that I found in the position was that now <strong>I’m 100% responsible</strong> for one of the chapters of an engineering organization. The good part was that the new chapter did not have anything in production yet. So, I did not have any legacy code from previous maintainers to support and extend. That was nice.</p>
<p>However, it’s not the rule, and every company is different. I think that more often you get to enhance an existing software system instead of creating something from scratch. So, be ready to be responsible for projects that were not started and designed by your team.</p>
<p>What does it mean to be a full-time owner?</p>
<ul>
<li><strong>You receive tasks that are tied to specific business goals.</strong> Actually, the tasks are projects. Moreover, the requirements may be only partially defined. You need to go ahead and figure out all the requirements and constraints. You want to specify the desired outcome of the project as precisely as you can to prevent scope creep. Understanding and defining the end goals is the very first step.</li>
<li><strong>You can do whatever is reasonable for you within your time/budget to achieve the goals.</strong> We look into that in more detail in the next section.</li>
<li><strong>All successes (and failures) of the software that is created to achieve the goal are associated with you.</strong> If something is broken in your system or does not work as expected - it’s your responsibility and fault. In the case when the goal is overachieved - great job! Do not forget to give credit to your team for the successes though. The people deserve it.</li>
</ul>
<h2 id="the-space-for-the-engineering-creativity">The space for the engineering creativity</h2>
<p><strong>Yes, you can do anything to achieve the engineering goals.</strong> Here’s the list of the things that I was able to change or implement. Note that you should get buy-in from your team to make the changes stick. People make software. Happy people make working software.</p>
<ul>
<li><strong>Software development methodology</strong>. Strongly depends on the goals of the project and deadlines. Answer the questions to define that:
<ul>
<li>How many days are in an iteration?</li>
<li>What’s the planning process? Which tasks should be estimated? How to estimate tasks?</li>
<li>Should we accept changes in requirements or not between iterations?</li>
<li>What are the rules for tasks of different types/priorities? Example: all bugs for the <em>Billing</em> component must be fixed ASAP regardless of severity.</li>
<li>How to demo that to the rest of the organization?</li>
</ul>
</li>
<li><strong>Technical stack</strong> for the project. It can include, but is not limited to, programming languages, frameworks, data stores, libraries, and monitoring solutions. Sometimes you have a preset dictated by the company’s policies. For our chapter the stack was the following:
<ul>
<li>Python 3, asynchronous programming, asyncio</li>
<li>MySQL, Elasticsearch, Redis</li>
<li>AWS (EC2, RDS, ElastiCache, S3, SQS, CloudFormation, CloudWatch)</li>
<li>DataDog, Elasticsearch/Logstash/Kibana, ElastAlert, Splunk</li>
</ul>
</li>
<li><strong>Software architecture</strong>. You define the structural parts of the software system. You can build something new. You can reuse existing in-company or third-party services. Designing interfaces between different components is also your responsibility if you’re a Tech Lead. Have fun with all that!</li>
<li><strong>Non-functional requirements</strong>. That’s about defining the border between <em>good enough</em> and <em>perfect</em> software. I was never encouraged to make an ideal commercial solution. Usually, people <em>just</em> need a stable solution to solve their business problems. The solution should be flexible enough to let us apply new changes fast. For me, that means setting reasonable expectations for engineers to make the business happy. Examples:
<ul>
<li>The component should be resilient to database restarts…</li>
<li>…but if the connection cannot be established within 60 seconds - please, alert</li>
</ul>
</li>
<li><strong>Internal milestones</strong>. You can set the focus for the team for different stages of the project as well as define deliverables for that.
<ul>
<li>For example, the project roadmap can be optimized to get a version of the system into production ASAP, to establish the CI/CD pipeline as well as to ensure that your ideas work in principle.</li>
<li>Another example - you can aim to make your teammates as autonomous as possible (a good idea when you’re all geographically distributed) - then you need to spend more time on planning to define independent work streams.</li>
</ul>
</li>
<li><strong>Service Level Indicators</strong>. As a Tech Lead, you’re in charge of defining when your software provides the needed quality of service. Picking the right set of the indicators that reflect the reality of your business is vital because it sets the target for your team as well as the direction for engineering improvements. Examples from my experience:
<ul>
<li><strong>Availability.</strong> Can the service be used?</li>
<li><strong>Number of processed jobs.</strong> Do we still need the service? How much useful work are we doing?</li>
<li><strong>Success rates for the principal components.</strong> They help us to see problems at the middle level.</li>
</ul>
</li>
<li><strong>Rollout schedule</strong>. It includes how often to deploy the software to different environments.
<ul>
<li>As soon as a pull request is merged</li>
<li>OR do releases once per 4 months.</li>
</ul>
</li>
<li><strong>Communication.</strong> How does the team communicate about the daily progress?
<ul>
<li>30-minute video calls two times per day</li>
<li>Text standup once per week <em>(maybe)</em></li>
</ul>
</li>
<li><strong>Split of work</strong>. How are the tasks in your Jira assigned?
<ul>
<li>You assign every task to a specific engineer, and they have no chance to change that without your written permission (not a very good tactic)</li>
<li>Everybody can take any task regardless of priority and dependencies</li>
</ul>
</li>
<li><strong>Code review policy</strong>. Who should approve a pull request to let the creator merge it to master? Options:
<ul>
<li>Consensus - all concerns are answered and all default reviewers approved the changes</li>
<li>At least 2 approvals from Senior engineers should be received to proceed</li>
<li>I can approve my own PR and deploy 2 hours after the last commit</li>
</ul>
</li>
<li><strong>Retrospectives.</strong> How often to do them? My recommendation is once per 4 weeks, but I know that some teams do it every 2 weeks. Btw, how often do you do them?</li>
</ul>
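<p>To make the non-functional requirement from the list above more concrete (survive database restarts, but alert if the connection cannot be established within 60 seconds), here is a minimal Python sketch. It is only an illustration under assumptions: <code class="highlighter-rouge">connect</code> and <code class="highlighter-rouge">alert</code> are hypothetical callables that you would supply, and the sketch assumes that a failed connection attempt raises <code class="highlighter-rouge">ConnectionError</code>.</p>

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_deadline(
    connect: Callable[[], T],
    alert: Callable[[str], None],
    deadline_seconds: float = 60.0,
    retry_delay: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `connect` until it succeeds; alert once the deadline has passed.

    After alerting, the last error is re-raised so the caller decides
    whether to keep retrying or to give up.
    """
    started = clock()
    while True:
        try:
            return connect()
        except ConnectionError as error:
            if clock() - started >= deadline_seconds:
                alert(f"database unreachable for {deadline_seconds:.0f}s: {error}")
                raise
            # Short outages (for example a database restart) are absorbed here.
            sleep(retry_delay)
```

<p>The clock and sleep functions are parameters only so that the behaviour can be checked in a unit test without actually waiting for a minute.</p>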
<p>I omitted some things so feel free to add your ideas as comments.</p>
<h2 id="how-technical-is-a-tech-lead">How technical is a Tech Lead?</h2>
<p>My mission was to enable the team to implement the right solution to the problem.</p>
<blockquote>
<p>You do not write much code on a daily basis</p>
</blockquote>
<p>Before I became a Tech Lead on the latest team, I had worked for more than 1.5 years in Intermediate/Senior Software Engineer positions in the same area with the same group of people. It was essential for me to gain the needed practical experience with asynchronous programming, relational and non-relational databases, instant messaging, and highload systems.</p>
<p>To make your project successful, first of all, you should <strong>read</strong> a lot of:</p>
<ul>
<li>Code
<ul>
<li>Pull requests made by your team.</li>
<li>The solutions that your systems reuse.</li>
<li>Code of third-party services maintained by other teams that you need to work with.</li>
</ul>
</li>
<li>Technical documentation
<ul>
<li>Description of the services that you can re-use (both in-house and third-party ones).</li>
<li>Implementation details of the solutions.</li>
<li>Known issues for them (nothing is perfect) - to understand risks and plan mitigation for them.</li>
</ul>
</li>
</ul>
<p>After a lot of reading you <strong>write</strong> a bit:</p>
<ul>
<li>Engineering proposals - <a href="https://www.atlassian.com/team-playbook/plays/daci">DACI</a> is a useful framework. I love it.</li>
<li>After the proposals are decided - design pages.</li>
<li>And at the very end - tickets for some work (my team runs on <del>caffeine</del> Jira Software).</li>
</ul>
<p>And after the writing - you <strong>discuss</strong>:</p>
<ul>
<li>Reach agreement with your teammates regarding non-trivial tasks.</li>
<li>Educate your teammates if your specification is incomplete or you did not provide all the data sources.</li>
<li>Negotiate contracts with other teams.</li>
<li>Demo results of your work as well as promote your solutions within the company.</li>
</ul>
<p>At the end of the day, you might have a couple of hours left for individual contribution. For me it was something like the following:</p>
<ul>
<li>Hotfixes. Needed to fix something when the world was about to explode.</li>
<li>Make a proof of concept for a pull request without writing tests. After that, ask somebody from the team to turn it into the production-grade software.</li>
<li>Commit database or configuration changes.</li>
<li>Investigate a weird bug that can hardly be reproduced in the development environment.</li>
<li>Pull some data from a metrics/logging solution to validate an implementation idea.</li>
</ul>
<p>I think that a Tech Lead should have solid practical software engineering experience to be able to make and support reasonable decisions.</p>
<blockquote>
<p>On small teams (up to 3 direct reports) I think that it’s still possible to make a good amount of individual contribution.</p>
</blockquote>
<p>At the moment of writing, I have not developed my engineering leadership skills enough to make a sustainable individual contribution on larger teams.</p>
<h2 id="pros-and-cons-of-being-a-tech-lead">Pros and Cons of being a Tech Lead</h2>
<p><strong>Pros:</strong></p>
<ul>
<li>You become a subject matter expert in the area of your project.</li>
<li>You have a complete understanding of how the software system works and how to apply changes to it with minimal risk. You can replicate that in other systems now.</li>
<li>You become a good communicator because you’re responsible for understanding requirements and explaining technical solutions.</li>
<li>You reach some level of competency (not always very high, though) in various areas of software development:
<ul>
<li><strong>System design</strong> - to architect your software and validate all the risks at early stages.</li>
<li><strong>Operations</strong> - to keep your systems up and running.</li>
<li><strong>Quality engineering</strong> - to prevent losses of your company’s reputation.</li>
<li><strong>Engineering management</strong> - to delegate implementation to your team or even other teams.</li>
</ul>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>At the end of a workday, you often do not have a feeling of accomplishment. You have generated some new work for your team, resolved some blockers but it does not feel like real work.</li>
<li>Not enough coding on larger teams.</li>
<li>You’re the entry point for your team. You should be able to accept tasks from multiple sources:
<ul>
<li>Your teammates</li>
<li>Your management</li>
<li>Partner teams</li>
<li>Customer support team</li>
<li>Other people that have heard about your team</li>
</ul>
</li>
<li>Sometimes it’s stressful because it’s a lot of responsibility. Eventually, you should learn how to handle all that.</li>
</ul>
<p>I think that the position is worth trying, and I’m happy that I had the opportunity to serve in it for years. I’d do it again.</p>
<h2 id="tl-starter-pack">TL-starter pack</h2>
<p>If you’re interested in a Tech Lead position and would like to prepare for it, here’s the list of skills that I found valuable at the very beginning of the path:</p>
<ul>
<li><strong>Practical proficiency in the programming languages</strong> from your stack - to be able to make good technical choices and do code review as well. A proper start to the project is crucial, and your coding skills can help dramatically in defining its structure and basic components.</li>
<li><strong>Good level of skills related to data stores</strong> - I think that in the majority of projects you deal with information that is read from or stored somewhere. Also, that knowledge is a perfect foundation for system design competence.</li>
<li><strong>Project management</strong> - for organizing your own work in the new multi-tasking environment as well as the work of other people.</li>
<li><strong>Communication skills</strong> - the position is about enabling other people to do technical work.</li>
</ul>
<p>I believe that these 4 skills are enough and the rest of the skills can be built during the project on top of them. I hope that the blog post will help to improve technical leadership in software teams.</p>
<p><strong>P.S.</strong> In this post, when I say “you do something,” I mean “you’re responsible for something.” As a Tech Lead, you can delegate some complex engineering to the experts on your team, but you should be able to verify, approve, or correct their solutions. Also, being a decision-maker does not mean being a dictator and ignoring the voices of other people.</p>
<p><strong>P.P.S.</strong> From my perspective, the difference between <strong>Team Lead and Tech Lead</strong> is in responsibilities:</p>
<ul>
<li>Team Lead is responsible for people, not the project.</li>
<li>Team Lead does People Management.</li>
<li>Team Lead is not supposed to make individual contributions.</li>
</ul>
<p><strong>P.P.P.S.</strong> Also, in my opinion, the difference between <strong>Architect and Tech Lead</strong>:</p>
<ul>
<li>Architect has more practical and diverse experience.</li>
<li>Architect is needed for more extensive and more complex systems.</li>
<li>Architect position is more about doing the most laborious work instead of enabling the rest of the team to do all the work.</li>
</ul>
<h1><a href="http://allyouneedisbackend.com/blog/2018/03/25/your-engineering-team-helps-your-business-to-win">Does Your Engineering Team Help Your Business To Win?</a></h1>
<p>Published 2018-03-25.</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/the-phoenix-project-small.png" alt="Does Your Engineering Team Help Your Business To Win?" class="image-right" width="256" height="208" layout="fixed">
</amp-img>
</div>
<p>Yay, a <em>DevOps Book</em> club was started in the office. I joined since I love DevOps, increasing the productivity of my team, and, of course, reading books. I did not even imagine how useful it would be for solving day-to-day organizational challenges with my team and for my coaching as a tech lead.</p>
<p>The first book for the club was <strong><a href="https://itrevolution.com/book/the-phoenix-project/" target="_blank">The Phoenix Project</a></strong> written by Gene Kim, Kevin Behr, and George Spafford. People call the genre <em>business fiction</em> - it’s a story about an IT manager (an ex-marine) who was unexpectedly promoted to VP of IT Operations.</p>
<p>In this post, you can find my thoughts and notes on the book <strong>through the prism of my experience working on a team and leading teams</strong>. Actionable items are provided as usual.</p>
<p>Note that we won’t be covering the plot of the novel; if you’re interested in that - read the book.</p>
<p><!--more--></p>
<h2 id="high-level-takeaways">High-level takeaways</h2>
<ul>
<li><strong>Learn why your business needs you to make every single bit of software</strong>. If you work on a commercial project, you should know why you’re paid. And how the cash for your paycheck is generated. Each code commit should provide additional value to your company.
Examples:
<ul>
<li>
<p>A new feature that attracts new users or makes current users more satisfied with the product. Or even causes customers to buy the premium version of the product.</p>
</li>
<li>
<p>A bugfix that makes existing customers happier and retains them with your product instead of making them think about solutions made by competitors.</p>
</li>
<li>
<p>An improvement that makes engineering/product/support teams more productive, freeing up their time for other beneficial work.</p>
</li>
<li>
<p>Maintenance that prevents future issues or incidents that can lead to loss of trust of your customers. Also, incidents eat your time that should be dedicated to other work.</p>
</li>
</ul>
<p>I think that in commercial development, each software system is related to some business goal. If you do not know the goal that your team is working towards - ask your management. If there is no objective, ask yourself: why does the company need to pay you for the work?</p>
</li>
<li><strong>Break down business OKRs or KPIs into engineering deliverables</strong>.
<ul>
<li>
<p>Need more users? Learn why you’re not gaining MAU and find how your code can add them.</p>
</li>
<li>
<p>Need the product to have higher reliability? Invest in that by setting up a special team to do preventive maintenance.</p>
</li>
<li>
<p>Need to have more revenue? Pay attention to making paid features more attractive. Make sure that the things that you and your team do help the organization to achieve at least one of the goals. Otherwise, it’s a waste of your talent and the company’s money.</p>
</li>
</ul>
</li>
<li>
<p><strong>Identify the types of work that your team is doing</strong>. According to the book, four types of work exist:</p>
<ol>
<li>
<p><em>Business Projects</em> - that’s what you’re asked to do by product managers/project sponsors. Goals of such projects are tied to business objectives. Executing such projects increases MAU, customer satisfaction, or revenue.</p>
</li>
<li>
<p><em>Internal Projects</em> - that’s what you need to keep achieving your business goals. Engineering initiatives from your team or asks from external teams fall into this category. The results of finishing such projects are not so visible outside the groups of people involved in them. But when such projects are skipped or executed poorly - it impacts business functions.</p>
</li>
<li>
<p><em>Changes</em> include actual deployment of deliverables made during the two types of projects listed above. Also, the category covers all small housekeeping work.</p>
</li>
<li>
<p><em>Unplanned Work</em> - dealing with incidents and emergencies. You probably do not have time in your work schedule allocated for that. Doing that distracts you from doing other types of work. That sucks.</p>
</li>
</ol>
</li>
<li><strong>Implement a change management process</strong> within your organization. Changes in code or infrastructure that are done by one team can affect another team. You do not want to be surprised when they “upgrade” the version of the company-wide database to the one that does not have your favorite deprecated function. A more realistic case - database schema changes that can hardly be reverted. I know that this can be hard, but you (or your management) should:
<ul>
<li>Build a product roadmap for each team for at least one quarter.</li>
<li>Make it aligned across various teams.</li>
<li>Communicate about all backward-incompatible changes ahead of time if you have any.</li>
</ul>
<p>To succeed in leading your team, you should think outside your team/department or even product. Think about how the significant changes that you’re going to bring affect the company and its customers as a whole. Always evaluate the risky changes.</p>
</li>
<li>
<p><strong>Define your development process and find the bottleneck</strong>. The process might vary from team to team even within one organization. The steps might include the following actions (the list is <em>very</em> simplified):</p>
<ol>
<li>
<p>Evaluate customers’ feedback. If your customers vote in public Jira for some functionality or open support tickets - you’re lucky. Use that as the input.</p>
</li>
<li>
<p>Identify the real need behind the request to define a feature. Define functional requirements for the software. Prioritize the feature and put it into the product roadmap.</p>
</li>
<li>
<p>Allocate engineering resources within the organization to implement the functionality. Define non-functional requirements.</p>
</li>
<li>
<p>Design how the feature should be implemented. Make a work breakdown structure and plan the execution. Communicate with external teams if any assistance is needed.</p>
</li>
<li>
<p>Write code. Cover the code with tests. Perform peer-to-peer review. Fix comments. Deploy to staging. Perform testing in staging. Find bugs. Fix bugs. Deploy the code to production.</p>
</li>
<li>
<p>Enable the feature for the customers. Receive customers’ feedback. <strong>GOTO p.1</strong> :)</p>
</li>
</ol>
<p>Each of the steps involves different skills and roles - from support engineers and product managers to software and quality engineers.
<strong>Your goal is to find the constraint - the slowest/busiest chain link and make it faster</strong>.</p>
<p>According to Eliyahu M. Goldratt’s “Theory of Constraints”:</p>
</li>
</ul>
<blockquote>
<h4 id="any-improvements-made-anywhere-besides-the-bottleneck-are-an-illusion">Any improvements made anywhere besides the bottleneck are an illusion.</h4>
</blockquote>
<p>Only in that case will you see the improvement in feature delivery and, as a result, achieve business goals.</p>
<ul>
<li><strong>Set limits for Work-In-Progress</strong>. Having the number of tasks in progress higher than your throughput means the following:
<ul>
<li>
<p>Your organization already paid for something to be done but your customers do not get the value from it.</p>
</li>
<li>
<p>Since the queued work is waiting for resources, the money is not being used to maximize value for the organization.</p>
</li>
</ul>
<p>On my team, we limit the number of In Progress tickets to the number of engineers. It’s very unlikely that one engineer works on two tasks at the same time. In that case, one of the tickets is probably blocked or waiting on somebody else to provide input.
You should focus on finishing the in-progress work instead of starting work on new tasks.</p>
</li>
<li>
<p><strong>Standardize the work that your team is doing</strong>. Your squad hardly ever faces unique tasks. Common tasks can be:</p>
<ul>
<li>Perform database schema change.</li>
<li>Change infrastructure configuration.</li>
<li>Add more verbose logging; track more metrics and add dashboards for them.</li>
<li>Add an API endpoint.</li>
<li>Add usage of a new API endpoint provided by an external team.</li>
<li>Profile and optimize some piece of software.</li>
<li>Change business rules for data transformation.</li>
<li>Refactor a module for better maintainability.</li>
<li>Investigate a customer’s request.</li>
</ul>
<p>The idea is to collect historical data about the way the tasks are resolved as well as the time needed to implement the changes.</p>
<p>The information should help you to achieve two objectives: first, train new team members - they can look at how similar issues were resolved; second, provide estimates for your business owners.</p>
</li>
<li>
<p><strong>Do not raise critical players</strong>. If only one person on a team can perform some tasks, it makes the person extremely busy. And the team becomes exceptionally dependent on the engineer.</p>
<p>Eventually, jobs are waiting for the person to be free. The engineer becomes the constraint for your project. To have a sustainable development process, you want to have <em>at least two people</em> who can do any given task.</p>
<p>To eliminate the bus factor, initiate internal knowledge sharing, and invest in automation and in documentation for the things that cannot be described as code. It helps you to raise team players. The excuse <em>“it’s easier for me to do that than to explain”</em> should never be accepted.</p>
</li>
<li>
<p><strong>Track all work requests that come to your team</strong>. Product managers and support engineers will be distracting your team from achieving the goals that they defined for you. Yes, it sounds like a paradox. But it happens. I think the reason is that it’s hard to evaluate all customers’ needs and prioritize them.</p>
</li>
<li>
<p><strong>Ideally, you should budget some time for urgent/unplanned work.</strong> Each request should have a Jira ticket. Asks from external engineering teams should be tracked and prioritized as well. For example, if your service exposes a private API that is used by ten other engineering teams - be ready that some of them will ask you for some customization or non-trivial support. And vice versa - your vendors inside your company can change the rules of the game because of their needs.</p>
</li>
<li>
<p><strong>Enable your business to make experiments</strong>. Engineering teams should provide the ability to verify product assumptions with minimal investment into implementation or without coding at all.</p>
<p>I think that this quarter our team succeeded in this area: the business was able to run some experiments without distracting the team from its other commitments. The way to do that was not clear to me at the beginning of the journey, but after reading the book and having actual results, I see the full picture.</p>
</li>
<li>
<p><strong>Make your business accept risk when they do not give you resources/time/budget</strong>. We as engineers can suggest the priority for some maintenance tasks and preventive actions. Good, if we can provide insights about possible customer impact. There is <em>never</em> enough time to fix all bugs and build all the features. Your business owners should understand the trade-off and decide your priorities.</p>
</li>
<li>
<p><strong>Freeze low-priority work</strong>. It’s better to have your team working on a couple of in-flight projects and finishing them on time rather than to have Work-In-Progress that already consumes resources and does not yet provide value for the business, customers, or your team.</p>
</li>
<li>
<p><strong>Consider using cloud providers.</strong> They offer opportunities to think less and rent resources instead of buying them or mastering more complex/efficient algorithms. If processing some background job takes an unacceptably long time with your current codebase/infrastructure - consider parallelizing it and enabling additional computational resources only for the duration of the job.</p>
</li>
<li><strong>Consider outsourcing.</strong> Some parts of your business or legacy applications can be given to external vendors. It can reduce cost and free up the smartest brains on your team. But make sure that the contract includes not only maintenance of the system but also the implementation of the changes needed to support your possible business initiatives. Also, make sure that the outsourcing team is capable of making the required changes in a timely manner.</li>
</ul>
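<p>The Work-In-Progress rule from the list above (no more In Progress tickets than engineers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ticket dictionaries, the <code class="highlighter-rouge">status</code> values, and the function names are hypothetical and are not the API of Jira or any other tracker.</p>

```python
from typing import Dict, List


def in_progress_count(tickets: List[Dict[str, str]]) -> int:
    """Count tickets whose (hypothetical) status field is 'In Progress'."""
    return sum(1 for ticket in tickets if ticket["status"] == "In Progress")


def wip_limit_exceeded(tickets: List[Dict[str, str]], engineer_count: int) -> bool:
    """Return True when more tickets are in progress than there are engineers."""
    return in_progress_count(tickets) > engineer_count


# Made-up board state used only for illustration.
board = [
    {"key": "TEAM-1", "status": "In Progress"},
    {"key": "TEAM-2", "status": "In Progress"},
    {"key": "TEAM-3", "status": "To Do"},
    {"key": "TEAM-4", "status": "In Progress"},
]

if __name__ == "__main__":
    # Three tickets are in progress, so a team of two is over the limit.
    print(wip_limit_exceeded(board, engineer_count=2))
```

<p>A check like this could run before anyone pulls a new ticket; when it reports that the limit is exceeded, the team finishes or unblocks existing work first.</p>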
<h2 id="other-engineering-tips">Other engineering tips</h2>
<ul>
<li>
<p><strong>Automate installation/provisioning of the environments needed for development, quality assurance, staging, and production.</strong> Keep the environments as close to each other as you can - same versions of OS, databases, and libraries. You should be able to access them fast: keep them provisioned and pay for that or make the provisioning fast. Manual instructions should die, and manual changes should never be applied.</p>
<p>Remove humans from the deployment process. Maximum involvement should be clicking the <em>Deploy Now</em> button. Setting up a development environment for a new teammate should take no more than a day or so.</p>
</li>
<li>
<p><strong>Set up a delivery pipeline and measure the throughput</strong>. Classically, it includes writing code and deploying code to production to deliver value to the end-customers. In my opinion, it also includes identifying a need (a business or an engineering one), prioritizing it, and scheduling the work.</p>
</li>
<li>
<p><strong>Document (even better, automate!) “magic fixes” for all incidents</strong>. You need to be able to replicate them if the issue occurs again. Keep them in your project’s knowledge base. You cannot rely on the hope that the engineer who solved the problem the last time will always be available to assist. That is, changes to the systems that you own should be transparent and repeatable.</p>
</li>
<li>
<p><strong>Proactively find all fragile parts of your software</strong>. If you work on a system that was developed before you joined the team - be ready for surprises. Things can break where you do not expect it. Besides the codebase and project documentation (if your team has good enough documentation), your sources to learn that can be: results of load testing, metrics and logs from production, the registry of closed bugs, and customer support tickets.</p>
</li>
<li>
<p><strong>Stabilize infrastructure to be focused on development, not firefighting.</strong> It’s hard to make reasonable estimates and avoid working overtime when you need not only to develop new features but also to keep existing buggy software up and running. I will post a separate blog on the topic. Stay tuned.</p>
</li>
<li>
<p><strong>Include slack time in your business commitments</strong>. If the engineers on your team are 100% loaded according to your plan, any unplanned work has to wait in a queue (this is bad) or the commitments won’t be met. Having some idle time for your engineers is fine since you cannot predict the actual time needed to accomplish work or the changes in requirements.</p>
</li>
<li>
<p><strong>Avoid handoff of tasks between engineers and across teams.</strong> Context switching kills productivity. Having more than one responsible person encourages corporate ping-pong and makes it harder to get things done.</p>
</li>
<li>
<p><strong>Measure how often your code CAN be deployed to production</strong>. Do you know how many deployments per day your business needs? How many of them can you do without affecting your routine? You would like to know the answers at least for the case when an incident occurs, and you need to push the hotfix to prevent loss of the company’s reputation.</p>
</li>
<li>
<p><strong>Make all code changes accountable and authorized</strong>. Just like infrastructure changes, they should go through a version control system and a peer review process, and <em>sometimes</em> be approved by business/budget owners or external teams.</p>
</li>
<li>
<p><strong>Move your working code to production ASAP</strong>. Until the code is in production and is enabled for customers - no value is generated from doing product research, creating Jira tickets, design meetings, writing code, and reviewing pull requests.</p>
</li>
<li>
<p><strong>Make faster releases</strong> and do that in small batches. For me, the days when we gave our customers a new version of backend software that <em>runs on our infrastructure</em> once per month are over. Every merged pull request should be deployed individually (and be revertible individually). In that case, you can observe how the change affects your system and find failures fast.</p>
</li>
<li>
<p><strong>Prepare a rollback strategy for deployment of all large/risky changes.</strong> Examples: altering database tables with dozens of records, extreme refactoring, data migrations, switching vendors. If you think that your testing is not enough (or it’s expensive to cover all needed cases) - I would invest in that.</p>
</li>
<li>
<p><strong>Know about your incidents before your customers or business find out.</strong> First of all, it gives you more time to investigate and fix the issue. Second, a status page that is updated in a timely manner is the face of your team. It’s just caring about the feelings of your customers.</p>
</li>
<li>
<p><strong>Build a passionate team that is OK to work late hours and weekends to rescue the business when it’s really needed</strong>. Eventually, it should be compensated somehow, including additional days off to recover and spend time with family or friends. You can also set up an on-call rotation to have somebody on duty 24/7, ready to fix any problems.</p>
</li>
</ul>
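<p>One of the tips above is to measure how often your code can be deployed to production. As a minimal sketch, assuming you can export deployment timestamps from your CI server as ISO 8601 strings, the count per day can be computed like this. The <code class="highlighter-rouge">deploy_log</code> data and the function name are made up for illustration.</p>

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List


def deployments_per_day(timestamps: List[str]) -> Dict[str, int]:
    """Group ISO 8601 deployment timestamps by calendar date."""
    days = Counter(
        datetime.fromisoformat(ts).date().isoformat() for ts in timestamps
    )
    return dict(days)


# Hypothetical export from a CI server; replace with your own data.
deploy_log = [
    "2018-03-19T10:15:00",
    "2018-03-19T16:40:00",
    "2018-03-20T09:05:00",
]

if __name__ == "__main__":
    for day, count in sorted(deployments_per_day(deploy_log).items()):
        print(day, count)
```

<p>Tracking this number over a few weeks gives you a baseline, so you know whether pushing an urgent hotfix fits into your normal routine.</p>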
<p>I’m happy that <strong>The Phoenix Project</strong> book was selected for the DevOps book club. Reading the book, discussing it with other engineers, and looking back in retrospect help me to define the next steps to improve the development process in our backend engineering team.</p>
<p><strong>What’s favorite book about DevOps?</strong></p>2017 Tech Accomplishments2018-01-01T00:00:00+00:002018-01-01T00:00:00+00:00http://allyouneedisbackend.com/blog/2018/01/01/2017-tech-accomplishments<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/stride-2017-medium.png" alt="Tech Accomplishments in 2017" class="image-right" width="256" height="320" layout="fixed">
</amp-img>
</div>
<p>Evaluating accomplishments motivates me and gives a breath of fresh air for the new ones. I believe that it’s an essential exercise for goal setting.</p>
<p>I’m proud to be a part of the <a href="http://stride.com" target="_blank">Atlassian Stride</a> team in 2017. Working for the company accelerates professional growth enormously.</p>
<p>During my vacation, I analyzed the last year of really hard work (the hardest in my career) to make a list of highlights.</p>
<p><!--more--></p>
<h2 id="the-list">The list</h2>
<ul>
<li>
<p><strong>Our product, Atlassian Stride, is announced!</strong> It’s not a secret anymore. You can apply for the <a href="https://signup.stride.com/" target="_blank">Early Access Program</a> and use it. We will be inviting Hipchat Cloud customers to <a href="https://www.stride.com/help-center/upgrade-guide" target="_blank">upgrade to Stride</a>.</p>
</li>
<li>
<p><strong>I became a technical leader of a geo-distributed backend team</strong>, part of the Atlassian Stride product. The transition happened in November 2016, but our team (called <strong>Stride Transformers</strong>) delivered its first project in February 2017. I think that I finally understood the new role when the services started working in the production environment, serving the needs of real people.</p>
</li>
<li>
<p><strong>The Stride Transformers engineering team grew from 4 to 8 members</strong> including myself. Having all these talented and passionate people moving towards common product goals was essential on the road to success. Here’s a shortlist of some things that we delivered playing as a team:</p>
<ul>
<li>
<p><strong>We built an asynchronous Python framework to wrap up the existing codebase and reused it for other projects in the same domain.</strong> It saved us a lot of time, reduced the number of mistakes and boring tasks, and attracted members of other teams to join us and learn the framework :). By using the framework, we created around 40 Python services for 10 other related projects. All infrastructure for that was defined as code and described with CloudFormation templates.</p>
</li>
<li>
<p><strong>We built a Python library that generates … other async Python libraries</strong> - clients for internal APIs made by other teams. After that, compatibility-breaking changes stopped being a nightmare. It’s cheap for us to update our codebase.</p>
</li>
<li>
<p><strong>Using the tooling mentioned above, we successfully built software from scratch</strong>, delivered the scheduled projects, and moved them to production. The “intimate feeling” of enabling the services in production is unforgettable.</p>
</li>
<li>
<p><strong>Besides the planned work, we had some nerd fun.</strong> Our team won <a href="https://www.atlassian.com/company/shipit" target="_blank">Atlassian ShipIt</a> (a quarterly hackathon) two times in a row this year - in June 2017 and in September 2017. Both times in the Austin location and in the People’s Choice nomination (other fellow Atlassians vote for projects). I learned that making software that works in the staging environment is possible within 24 hours. The main thing - the services built for the first project were productized, polished accordingly, and are already running in production. Speaking about the latest ShipIt project - it was selected for the Stride Award. Looking forward to tackling it to deliver to our customers.</p>
</li>
</ul>
</li>
</ul>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/shipit1.png" alt="June 2017 - We Won Atlassian ShipIt in Austin" class="image-right" width="600" height="288" layout="fixed">
</amp-img>
</div>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/shipit2.jpg" alt="June 2017 - We Won Atlassian ShipIt in Austin" class="image-right" width="600" height="444" layout="fixed">
</amp-img>
</div>
<ul>
<li>
<p><strong>I think that I learned how to do cross-team collaboration in the right way.</strong> I’m happy that in Software Engineering you can engage talent worldwide. This year I collaborated with teams located in Texas, Ukraine, Australia, and California. I like the moment when you finally meet, for the first time, a person who has worked with you for a couple of months. And go for lunch :)</p>
</li>
<li><strong>I slightly improved my presentation skills</strong> and gave two public talks at the Austin Python Meetup. Also, I gave a company-wide talk about the technology that we built as well as a dozen demos for different Stride milestones. The slides from the publicly available talks can be found here:
<ul>
<li><a href="/talks/#austin-python-meetup-2017-2" target="_blank">How to Stop Worrying and Start a Project with Python 3</a></li>
<li><a href="/talks/#austin-python-meetup-2017-1" target="_blank">What’s New in Pythons 3.5 and 3.6?</a></li>
</ul>
</li>
<li>
<p><strong>Started the <a href="http://AllYouNeedIsBackend.com" target="_blank">All You Need Is Backend</a> blog</strong> and published 9 posts. Some of them were featured on Hacker News and stayed in the Top 5 for a couple of days. I found sharing my thoughts very useful for keeping knowledge in order.</p>
</li>
<li>
<p><strong>Completed 15 technical online courses</strong>. Primarily on Amazon Web Services, Distributed Systems, and various Data Storages: Kafka, Cassandra, Hadoop, Riak, and CouchDB.</p>
</li>
<li><strong>Finally, my tech stack from 2017:</strong>
<ul>
<li><a href="https://www.python.org/" target="_blank">Python</a>, and it’s only Python 3</li>
<li><a href="https://docs.python.org/3/library/asyncio.html" target="_blank">Asyncio</a>, <a href="https://aiohttp.readthedocs.io" target="_blank">aiohttp</a>, and other <a href="https://github.com/aio-libs" target="_blank">aio-libs</a></li>
<li><a href="https://www.mysql.com/" target="_blank">MySQL</a>, <a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a>, and <a href="https://redis.io/" target="_blank">Redis</a></li>
<li>Amazon Web Services: <a href="https://aws.amazon.com/ec2/" target="_blank">EC2</a>, <a href="https://aws.amazon.com/s3/" target="_blank">S3</a>, <a href="https://aws.amazon.com/sqs/" target="_blank">SQS</a>, <a href="https://aws.amazon.com/rds/" target="_blank">RDS</a>, <a href="https://aws.amazon.com/elasticache/" target="_blank">ElastiCache</a>, <a href="https://aws.amazon.com/cloudformation/" target="_blank">CloudFormation</a>, <a href="https://aws.amazon.com/cloudwatch/" target="_blank">CloudWatch</a></li>
<li>Monitoring: <a href="https://www.datadoghq.com/" target="_blank">Datadog</a>, <a href="https://www.elastic.co/products" target="_blank">Elasticsearch/Logstash/Kibana</a>, <a href="https://elastalert.readthedocs.io" target="_blank">ElastAlert</a>, <a href="https://www.splunk.com/" target="_blank">Splunk</a></li>
<li>Atlassian tools: <a href="https://www.stride.com/" target="_blank">Stride</a>, <a href="https://www.atlassian.com/software/jira" target="_blank">Jira</a>, <a href="https://www.atlassian.com/software/bitbucket" target="_blank">Bitbucket</a>, <a href="https://www.atlassian.com/software/bamboo" target="_blank">Bamboo</a>, <a href="https://www.atlassian.com/software/confluence" target="_blank">Confluence</a>, and <a href="https://www.atlassian.com/software/trello" target="_blank">Trello</a></li>
</ul>
</li>
</ul>
<p>Writing the list was great, and I really enjoyed that. I am so grateful that the Atlassian company and Stride organization gave me this opportunity to grow. It was hard to achieve all the things, but we’re doing the right ones.</p>
<p><strong>Kudos to my wife Tania</strong> for her patience when I had late meetings with Sydney teams (we’re <em>8 hours ahead</em> of them) and extremely early collaboration with my teammates in Ukraine after that (Texas is <em>8 hours behind</em> them).</p>
<p>I wrote a list of my tech goals for 2018 but will share it with you in a year. We will see what gets accomplished over time.</p>
<p>P.S. I also gained 20 pounds eating BBQ and TexMex. Will try to gain more the next year.</p>No Tests - No Pull Request, Right? Types of Tests that Should Be in Your Codebase2017-10-09T00:00:00+00:002017-10-09T00:00:00+00:00http://allyouneedisbackend.com/blog/2017/10/09/no-tests-no-pull-request-types-of-automated-tests-in-backend-software<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/no-tests-no-pull-request-disk.png" alt="No Tests - No Pull Request, Right? Types of Tests that Should Be in Your Codebase." class="image-right" width="256" height="320" layout="fixed">
</amp-img>
</div>
<p>As the blog post <strong><a href="http://allyouneedisbackend.com/blog/2017/08/24/pull-requests-good-bad-and-ugly/" target="_blank">Pull Requests: The Good, The Bad and The Ugly</a></strong> claims:</p>
<blockquote>
<h4 id="if-you-do-not-have-time-to-write-tests-today---you-will-find-the-time-for-fixing-bugs-fridays-night">If you do not have time to write tests today - you will find the time for fixing bugs Friday’s night</h4>
</blockquote>
<p>In other words, to establish solid reliability in production tomorrow we need to invest our time today. Your need for tests for your current project depends on:</p>
<ul>
<li>Size of the team that maintains the codebase: <code class="highlighter-rouge">return True if team.size > 1 else False</code>. Having more engineers means more views on the same items. Tests help to document the opinions on how a class or function can be used.</li>
<li>Size of the codebase: <code class="highlighter-rouge">return True if project.modules > 1 else False</code>. You can’t remember the color of socks that you wore two days ago. Can you remember everything in the project?</li>
<li>Duration of development and maintenance phases of the project. The script that you run only once can perfectly live without a solid test coverage. If you’re building a system for decades - please, prepare a good legacy for the next generations of developers.</li>
</ul>
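<p>The tongue-in-cheek one-liners above can be combined into a runnable sketch. The function name, its parameters, and the way the three criteria are joined with <code class="highlighter-rouge">or</code> are my assumptions for illustration; the list itself does not prescribe how to weigh them.</p>

```python
def needs_tests(team_size: int, module_count: int, long_lived: bool) -> bool:
    """Return True if any criterion from the list above suggests writing tests.

    Assumption: a single criterion is enough, so they are joined with `or`.
    """
    return team_size > 1 or module_count > 1 or long_lived


if __name__ == "__main__":
    # A one-off script written by one person in a single module.
    print(needs_tests(team_size=1, module_count=1, long_lived=False))
    # A system built for decades by a team.
    print(needs_tests(team_size=5, module_count=40, long_lived=True))
```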
<p>I have a strong feeling that you think that your code needs tests since you’re still reading this.</p>
<p>In this blog post, I will guide you through the types of automated tests that should be implemented by software engineers: <strong>unit</strong>, <strong>integration</strong>, <strong>external</strong>, and <strong>performance</strong> ones. It does not cover testing efforts by quality engineers, but the article can still be valuable for them.</p>
<p>You will find code examples that use Python, but you do not have to know the language.</p>
<p><!--more--></p>
<hr />
<h2 id="what-is-an-automated-test">What is an automated test?</h2>
<p>A software test is a thing that consumes time that could otherwise be spent developing unstable features. Always ask your leadership or business owners what’s preferred for the product. It helps to define proper priorities.</p>
<p>Inexperienced software developers often think that testing is something that should be done exclusively by the quality assurance team. I tend to disagree. Good engineers own their shit.</p>
<p>In a test, you call a function that is already written or does not exist yet (read more about <strong><a href="https://en.wikipedia.org/wiki/Test-driven_development" target="_blank">Test-Driven Development</a></strong>). You pass some parameters and expect the function to return a specific value. If the value is wrong, that means that the test failed and the code is broken. <em>Or the test is implemented poorly.</em></p>
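<p>A minimal sketch of such a test, assuming a hypothetical function <code class="highlighter-rouge">apply_discount</code> that exists only for this illustration: the test passes known parameters and compares the result with the value we expect.</p>

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the price reduced by the given percentage."""
    return price * (100 - percent) / 100


def test_apply_discount() -> None:
    # Pass known parameters and compare with the value we expect back.
    assert apply_discount(200.0, 25.0) == 150.0
    # A zero discount must leave the price unchanged.
    assert apply_discount(80.0, 0.0) == 80.0


if __name__ == "__main__":
    test_apply_discount()
    print("all assertions passed")
```

<p>If an assertion fails, Python raises <code class="highlighter-rouge">AssertionError</code>, and either the function or the test needs a second look.</p>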
<p>Some programming languages, like Python, provide the ability to embed tests into the documentation. It’s called <strong><a href="https://en.wikipedia.org/wiki/Doctest" target="_blank">doctests</a></strong>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">multiply</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Repeats a string multiple times.

    Args:
        s (str): name to repeat.
        n (int): multiplier.

    Examples:
        >>> multiply('Backend', 2)
        'BackendBackend'
        >>> multiply('Omn', 3)
        'OmnOmnOmn'
    """</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">*</span> <span class="n">n</span>
</code></pre></div></div>
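<p>As a minimal sketch of how such examples get executed, the snippet below runs the docstring of <code class="highlighter-rouge">multiply</code> through Python’s standard <code class="highlighter-rouge">doctest</code> module. The helper name <code class="highlighter-rouge">run_doctests</code> is an assumption made for illustration; in everyday use, <code class="highlighter-rouge">python -m doctest your_module.py</code> or <code class="highlighter-rouge">doctest.testmod()</code> does the same job.</p>

```python
import doctest


def multiply(s: str, n: int) -> str:
    """Repeats a string multiple times.

    >>> multiply('Backend', 2)
    'BackendBackend'
    >>> multiply('Omn', 3)
    'OmnOmnOmn'
    """
    return s * n


def run_doctests(func) -> doctest.TestResults:
    """Run the examples found in func's docstring (hypothetical helper)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Expose the function under its own name so the examples can call it.
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


if __name__ == '__main__':
    # Prints the number of failed and attempted examples.
    print(run_doctests(multiply))
```

<p>If an example in the docstring stops matching the real return value, the runner reports it as a failure, so the documentation cannot silently drift away from the code.</p>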
<p>It’s easier to write good tests if you test a <strong><a href="https://en.wikipedia.org/wiki/Pure_function" target="_blank">pure function</a></strong>: the output of such a function is completely determined by its inputs. Running a pure function has no side effects.</p>
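<p>To make the difference concrete, here is a small sketch; the function names and the expiry rule are made up for illustration. The first function depends on the system clock, so its result changes from day to day. The second one receives the current date as a parameter, which makes it pure and trivial to test.</p>

```python
from datetime import date


def is_expired_impure(expires_on: date) -> bool:
    # Impure: the result depends on hidden state (the system clock).
    return date.today() > expires_on


def is_expired(expires_on: date, today: date) -> bool:
    # Pure: the output is completely determined by the two inputs.
    return today > expires_on


if __name__ == '__main__':
    # The pure version can be checked with fixed dates, no clock patching needed.
    assert is_expired(date(2018, 9, 1), today=date(2018, 9, 21))
    assert not is_expired(date(2018, 9, 30), today=date(2018, 9, 21))
```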
<p>Automated tests do not exist by themselves. They are executed by Continuous Integration servers like <a href="https://www.atlassian.com/software/bamboo" target="_blank">Bamboo</a>, <a href="https://www.g2crowd.com/products/jenkins/reviews" target="_blank">Jenkins</a>, or <a href="https://travis-ci.org/" target="_blank">Travis CI</a>. Usually, the tests are executed for each submitted pull request. If the build is green, the branch can be considered for merging into the master branch after code review. Obviously, engineers run tests locally before pushing code. Nobody likes reviewing pull requests that are known in advance to be broken.</p>
<p>In the next sections, you can find an overview of the tests that I recommend shipping with backend software.</p>
<h2 id="unit-tests">Unit tests</h2>
<p>This type of test is the most popular and the best known. One of the right questions to ask during a job interview at a new company can be “Does your team write unit tests for new code?”.</p>
<p>Jokes aside, the purpose of a unit test is to ensure that an atomic unit of code works as expected. Usually, the unit of code is a function or a method.
Unit tests must be small and fast, and they must keep everything inside the one process that runs the test suite, without interacting with anything else. This type of test is a great tool when you need to check the correctness of business rules in your code.</p>
<p>Here’s an example of a unit test for the function <code class="highlighter-rouge">parse_fullname</code>, which parses the full name of a person into a first name and a last name:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">unittest</span> <span class="kn">import</span> <span class="n">TestCase</span>

<span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">parse_fullname</span>


<span class="k">class</span> <span class="nc">ParseFullnameTestCase</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">test_parse_fullname</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">cases</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">(</span><span class="s">'John Doe'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John'</span><span class="p">,</span> <span class="s">'Doe'</span><span class="p">),</span> <span class="s">'first and last name'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'John David Doe'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John David'</span><span class="p">,</span> <span class="s">'Doe'</span><span class="p">),</span> <span class="s">'first, middle, last name'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'John David van Eck de la Nova Doe'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John David van Eck de la Nova'</span><span class="p">,</span> <span class="s">'Doe'</span><span class="p">),</span>
             <span class="s">'many name parts'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'John'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John'</span><span class="p">,</span> <span class="s">'John'</span><span class="p">),</span> <span class="s">'single name'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'John David Doe, Jr.'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John David'</span><span class="p">,</span> <span class="s">'Doe Jr'</span><span class="p">),</span> <span class="s">'Jr. suffix'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'John Doe II'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John'</span><span class="p">,</span> <span class="s">'Doe II'</span><span class="p">),</span> <span class="s">'II suffix'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'Mr. John Doe'</span><span class="p">,</span> <span class="p">(</span><span class="s">'John'</span><span class="p">,</span> <span class="s">'Doe'</span><span class="p">),</span> <span class="s">'Mr. prefix'</span><span class="p">),</span>
            <span class="p">(</span><span class="s">'Вячеслав Каковський'</span><span class="p">,</span> <span class="p">(</span><span class="s">'Вячеслав'</span><span class="p">,</span> <span class="s">'Каковський'</span><span class="p">),</span> <span class="s">'unicode chars'</span><span class="p">)</span>
        <span class="p">]</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">expected_output</span><span class="p">,</span> <span class="n">description</span> <span class="ow">in</span> <span class="n">cases</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">parse_fullname</span><span class="p">(</span><span class="n">name</span><span class="p">),</span> <span class="n">expected_output</span><span class="p">,</span> <span class="n">msg</span><span class="o">=</span><span class="s">'Failed for {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">description</span><span class="p">))</span>
</code></pre></div></div>
<p>The test checks whether the returned value matches the expected one for each of the cases and provides an explanation when an assertion fails.</p>
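<p>One possible refinement, sketched below under the assumption that you use Python’s standard <code class="highlighter-rouge">unittest</code>: wrapping each case in <code class="highlighter-rouge">subTest</code> makes the runner report every failing case instead of stopping at the first one. The <code class="highlighter-rouge">split_name</code> function is a deliberately simplified stand-in, not the real <code class="highlighter-rouge">parse_fullname</code>, and <code class="highlighter-rouge">run_suite</code> is a hypothetical helper.</p>

```python
import unittest


def split_name(fullname: str) -> tuple:
    """Simplified stand-in: everything before the last word is the first name."""
    first, _, last = fullname.rpartition(' ')
    return first, last


class SplitNameTestCase(unittest.TestCase):
    def test_split_name(self):
        cases = [
            ('John Doe', ('John', 'Doe'), 'first and last name'),
            ('John David Doe', ('John David', 'Doe'), 'first, middle, last name'),
        ]
        for name, expected, description in cases:
            # Each failing case is reported separately with its description.
            with self.subTest(description):
                self.assertEqual(split_name(name), expected)


def run_suite() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitNameTestCase)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == '__main__':
    run_suite()
```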
<p>Again, it’s better if unit tests are fast. For example, running hundreds of unit tests for production software takes seconds, rarely minutes.</p>
<p>How do you write good unit tests without side effects?
We can use <strong><a href="https://en.wikipedia.org/wiki/Dependency_injection" target="_blank">Dependency Injection</a></strong> to substitute objects that perform heavy operations with <strong><a href="https://martinfowler.com/articles/mocksArentStubs.html" target="_blank">Mocks, Stubs, or Fake objects</a></strong>. Yes, your unit tests should not perform I/O operations, like reading from or writing to a database, performing HTTP calls, and so on. Check out a unit test for the <code class="highlighter-rouge">@retry</code> decorator, which re-attempts execution if an exception of the specified type occurs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">aiohttp</span> <span class="kn">import</span> <span class="n">DisconnectedError</span>
<span class="kn">from</span> <span class="nn">asynctest</span> <span class="kn">import</span> <span class="n">TestCase</span><span class="p">,</span> <span class="n">CoroutineMock</span> <span class="k">as</span> <span class="n">Mock</span>

<span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">retry</span>


<span class="k">class</span> <span class="nc">RetryTest</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_retry</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_func</span> <span class="o">=</span> <span class="n">Mock</span><span class="p">(</span><span class="n">return_value</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
                          <span class="n">side_effect</span><span class="o">=</span><span class="p">[</span><span class="n">DisconnectedError</span><span class="p">,</span>
                                       <span class="n">DisconnectedError</span><span class="p">,</span> <span class="mi">200</span><span class="p">])</span>

        <span class="nd">@retry</span><span class="p">(</span><span class="n">DisconnectedError</span><span class="p">)</span>
        <span class="n">async</span> <span class="k">def</span> <span class="nf">get_http_status</span><span class="p">():</span>
            <span class="k">return</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_func</span><span class="p">()</span>

        <span class="n">res</span> <span class="o">=</span> <span class="n">await</span> <span class="n">get_http_status</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="mi">200</span><span class="p">)</span>
</code></pre></div></div>
<p>We use a <code class="highlighter-rouge">Mock</code> object to introduce a function with predefined behavior: it raises <code class="highlighter-rouge">DisconnectedError</code> two times and then returns status code 200, which means a successful HTTP request. Thankfully, we do not have to perform an actual request to some web server and do all the slow I/O work. Also, we do not need to tweak the configuration of the server or load balancer to break the connection for each execution of the test.</p>
<p>I encourage you to read about the <code class="highlighter-rouge">retry</code> function in another blog post of mine, <strong><a href="http://allyouneedisbackend.com/blog/2017/09/15/how-backend-software-should-retry-on-failures/" target="_blank">Never Give Up, Retry: How Software Should Deal with Failures</a></strong>. I found the technique very useful when building a backend that depends on various other services.</p>
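<p>To experiment with a test like the one above without the original <code class="highlighter-rouge">utils</code> module, a minimal sketch of such a decorator might look like this. It is an assumption for illustration, not the implementation from the linked post: the <code class="highlighter-rouge">attempts</code> parameter, the local <code class="highlighter-rouge">DisconnectedError</code> class, the <code class="highlighter-rouge">demo</code> coroutine, and the use of the standard <code class="highlighter-rouge">unittest.mock.AsyncMock</code> (Python 3.8+) instead of <code class="highlighter-rouge">asynctest</code> are all choices made for this sketch.</p>

```python
import asyncio
import functools
from unittest.mock import AsyncMock


class DisconnectedError(Exception):
    """Local stand-in for a network error raised by an HTTP client."""


def retry(*exceptions, attempts=3):
    """Re-run a coroutine function when one of `exceptions` is raised."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
        return wrapper
    return decorator


async def demo():
    # Fails twice, then succeeds; no real network I/O is involved.
    func = AsyncMock(side_effect=[DisconnectedError, DisconnectedError, 200])

    @retry(DisconnectedError)
    async def get_http_status():
        return await func()

    status = await get_http_status()
    return status, func.await_count


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

<p>The mock’s await counter shows that the wrapped coroutine was attempted three times before the successful result came back.</p>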
<p>Examples from real life when a unit test is a good fit:</p>
<ul>
<li>all types of parsing: messages, documents, arguments, and configuration</li>
<li>checking business rules and corner cases</li>
<li>input validation or other verification of chains of complex if-else statements</li>
<li>calculation of math formulas, like business rules for discounts</li>
<li>complex data transformations from one format to another</li>
<li>verification of SQL queries compiled by an ORM (not to be confused with executing the queries against a database)</li>
<li>when you need to check that invocation of one function leads to a call of another one; I highly recommend using mocks for that</li>
<li>verifying that network operations are triggered, but do not forget to replace the actual I/O operations with stubs.</li>
</ul>
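<p>The case of “one function must call another one” can be sketched with the standard <code class="highlighter-rouge">unittest.mock</code> module. The <code class="highlighter-rouge">register_user</code> function and its <code class="highlighter-rouge">send_email</code> dependency are hypothetical; the point is that the dependency is injected as a parameter, so the test can pass a mock instead of a real mail client.</p>

```python
from unittest.mock import Mock


def register_user(email: str, send_email) -> dict:
    """Create a user record and send a welcome email via the injected callable."""
    user = {'email': email.strip().lower()}
    send_email(to=user['email'], subject='Welcome!')
    return user


def check_register_user_sends_email() -> None:
    send_email = Mock()  # replaces the real, I/O-bound mail client
    user = register_user('  Viach@Example.com ', send_email=send_email)

    assert user == {'email': 'viach@example.com'}
    # Fails with a readable diff if the call is missing or has other arguments.
    send_email.assert_called_once_with(to='viach@example.com', subject='Welcome!')


if __name__ == '__main__':
    check_register_user_sends_email()
```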
<p>For most modern programming languages, you can find a great toolbox for writing good unit tests fast. It might include:</p>
<ul>
<li>primitives for implementing Mocks, Stubs, and Fake objects</li>
<li>hooks that run before/after each test in a test suite</li>
<li>a utility for running a set of tests from the command line</li>
<li>tools for running tests under various environments, like versions of an interpreter/virtual machine.</li>
</ul>
<h2 id="integration-tests">Integration tests</h2>
<p>The main purpose of this type of test is to verify cooperation between various modules and components that <em>you develop</em>. Here you’re encouraged to perform I/O operations in your tests; therefore, the test suites might run slowly.</p>
<p>These tests focus on the API contracts of your subsystems as well as on integration with the data storages that you use.</p>
<blockquote>
<h4 id="the-main-feature-of-integration-tests-for-me-is-that-they-do-not-have-to-run-only-inside-one-process-tested-code-can-perform-a-syscall-or-execute-a-query-against-a-real-database">The main feature of integration tests for me is that they do not have to run only inside one process: tested code can perform a syscall or execute a query against a real database.</h4>
</blockquote>
<p>Check out the integration tests for the <code class="highlighter-rouge">SQLAlchemyEngine</code> class, which wraps a database behind these high-level methods:</p>
<ul>
<li><code class="highlighter-rouge">execute</code>: executes a SQLAlchemy query and returns the number of affected rows</li>
<li><code class="highlighter-rouge">fetchone</code>: shorthand for fetching one DB entry</li>
<li><code class="highlighter-rouge">fetchall</code>: shorthand for fetching all DB entries.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">uuid</span>

<span class="kn">from</span> <span class="nn">asynctest</span> <span class="kn">import</span> <span class="n">TestCase</span>

<span class="kn">from</span> <span class="nn">database</span> <span class="kn">import</span> <span class="n">SQLAlchemyEngine</span><span class="p">,</span> <span class="n">users</span>
<span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">get_config</span>


<span class="k">class</span> <span class="nc">DBEngineTests</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="n">async</span> <span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_user_data</span> <span class="o">=</span> <span class="p">{</span><span class="s">'id'</span><span class="p">:</span> <span class="mi">100500</span><span class="p">,</span>
                           <span class="s">'external_id'</span><span class="p">:</span> <span class="n">uuid</span><span class="o">.</span><span class="n">uuid4</span><span class="p">()</span><span class="o">.</span><span class="nb">hex</span><span class="p">,</span>
                           <span class="s">'name'</span><span class="p">:</span> <span class="s">'Viach'</span><span class="p">}</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_get_db_engine</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_user</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_user</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_user_data</span><span class="p">)</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">tearDown</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># Clean the table so that every test starts from the same state.</span>
        <span class="n">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">acquire</span><span class="p">()</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>
            <span class="n">await</span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">delete</span><span class="p">())</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_fetchone</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">fetched_user</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">fetchone</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">fetched_user</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user_data</span><span class="p">)</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_execute_delete_user</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">rowcount</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">rowcount</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">fetched_user</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">fetchone</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertIsNone</span><span class="p">(</span><span class="n">fetched_user</span><span class="p">)</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_fetchone_not_exists</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">fetched_user</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">fetchone</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertIsNone</span><span class="p">(</span><span class="n">fetched_user</span><span class="p">)</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_fetchall</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">fetched_users</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">fetchall</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">fetched_users</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">fetched_users</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">_user_data</span><span class="p">)</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">_create_user</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_data</span><span class="p">):</span>
        <span class="n">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">acquire</span><span class="p">()</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>
            <span class="n">await</span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">insert</span><span class="p">()</span><span class="o">.</span><span class="n">values</span><span class="p">(</span><span class="o">**</span><span class="n">user_data</span><span class="p">))</span>
        <span class="c1"># Read the row back through the engine so that self._user supports ['id'].</span>
        <span class="k">return</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_db_engine</span><span class="o">.</span><span class="n">fetchone</span><span class="p">(</span>
            <span class="n">users</span><span class="o">.</span><span class="n">select</span><span class="p">()</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="nb">id</span> <span class="o">==</span> <span class="n">user_data</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">_get_db_engine</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">await</span> <span class="n">SQLAlchemyEngine</span><span class="o">.</span><span class="n">from_config</span><span class="p">(</span><span class="n">get_config</span><span class="p">()[</span><span class="s">'mysql'</span><span class="p">][</span><span class="s">'userbase'</span><span class="p">],</span>
                                                  <span class="n">loop</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">loop</span><span class="p">)</span>
</code></pre></div></div>
<p>I think that an integration test is a perfect idea when you need to verify database-related code:</p>
<ul>
<li>embedding a third-party driver for a datastore in your codebase; a smoke test that inserts a record and fetches it back is usually enough</li>
<li>complex queries that depend on the state of the database; do not forget to set the state in a pre-test hook</li>
<li>simple queries in the case when you do not use an ORM and cannot check compiled statements (with an ORM you can do this with unit tests)</li>
<li>homebrew wrappers/patches of existing database drivers.</li>
</ul>
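<p>If you do not have a MySQL instance at hand, the same idea can be tried with SQLite from the standard library. The sketch below is only a stand-in for the test above: the <code class="highlighter-rouge">UserStore</code> class, the table layout, and the file name are made up, and the database lives in a temporary file so that the test performs real disk I/O.</p>

```python
import os
import sqlite3
import tempfile
import unittest


class UserStore:
    """Hypothetical thin wrapper around a SQLite database."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            'CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)')

    def add(self, user_id: int, name: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute('INSERT INTO users VALUES (?, ?)', (user_id, name))

    def fetchone(self, user_id: int):
        cursor = self._conn.execute(
            'SELECT id, name FROM users WHERE id = ?', (user_id,))
        return cursor.fetchone()

    def close(self) -> None:
        self._conn.close()


class UserStoreTests(unittest.TestCase):
    def setUp(self):
        # Pre-test hook: create a fresh database and put it into a known state.
        self._tmpdir = tempfile.TemporaryDirectory()
        self._store = UserStore(os.path.join(self._tmpdir.name, 'users.db'))
        self._store.add(100500, 'Viach')

    def tearDown(self):
        self._store.close()
        self._tmpdir.cleanup()

    def test_fetchone(self):
        self.assertEqual(self._store.fetchone(100500), (100500, 'Viach'))

    def test_fetchone_not_exists(self):
        self.assertIsNone(self._store.fetchone(100501))


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserStoreTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == '__main__':
    run_suite()
```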
<p>Other possible applications of integration tests from my experience:</p>
<ul>
<li>testing of contracts between your subsystems, like public interfaces between modules</li>
<li>verification of communications between your services; say, a test that ensures that a service performs a request against another one for some task.</li>
</ul>
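<p>A test of communication between two services can be sketched with the standard library alone. In the example below, a throwaway HTTP server plays the role of the second service; the <code class="highlighter-rouge">/health</code> path, the JSON payload, and the <code class="highlighter-rouge">fetch_status</code> helper are hypothetical.</p>

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the service on the other end."""

    def do_GET(self):
        body = json.dumps({'status': 'ok', 'path': self.path}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the test output quiet


# Ignore proxy settings from the environment: the target is always local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(base_url: str) -> dict:
    """Code under test: asks the other service for its status."""
    with _opener.open(base_url + '/health', timeout=5) as response:
        return json.loads(response.read())


def check_services_talk() -> dict:
    server = HTTPServer(('127.0.0.1', 0), HealthHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch_status('http://127.0.0.1:{}'.format(server.server_port))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(check_services_talk())
```

<p>Unlike a unit test, this one opens a real socket and crosses a thread boundary, which is exactly the kind of cooperation that integration tests are meant to verify.</p>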
<p>Note that you still can and should use mocks to replace some parts of the software: it makes setting up the test environment easier and the tests faster. It helps to keep the running time of this type of test in the range between minutes and a few dozen minutes.</p>
<p>We have reviewed unit and integration tests. The purpose of the first category is to verify that individual components work as expected; the reason to write the second is to check that the pieces you implemented play well as a team.</p>
<p>But what if your product involves software that is not written by your team and does not run under the control of your organization? Time to look into the next category of tests.</p>
<h2 id="external-tests">External tests</h2>
<p>You should write external tests when you need to track contracts between your software and third-party services that cannot be controlled by your team. These can be services maintained by other teams inside your company or software that runs on the infrastructure of your vendors, partners, or even competitors.</p>
<p>Real examples of things to be tested with an external test:</p>
<ul>
<li>API contracts between teams in your company</li>
<li>usage of services provided by your cloud provider; it’s AWS in my case</li>
<li>integrations with Developer APIs of services like <a href="https://developer.atlassian.com/cloud/stride/" target="_blank">Stride</a>, <a href="https://developer.atlassian.com/hipchat" target="_blank">Hipchat</a>, <a href="https://developer.atlassian.com/bitbucket/api/2/reference/" target="_blank">Bitbucket</a>, <a href="https://developers.trello.com/v1.0/reference" target="_blank">Trello</a>, <a href="https://developer.github.com/" target="_blank">GitHub</a>, <a href="https://api.slack.com/" target="_blank">Slack</a>, etc.</li>
</ul>
<p>You might wonder why external tests form a separate category instead of being a part of integration tests.</p>
<p>Firstly, your team cannot necessarily fix a failing external test. If the system is broken on the other end, you can only file a bug report and try to get it prioritized.</p>
<p>Secondly, since you do not control the environment on the other end, some failures can be random: issues with availability, poor deployments, etc.</p>
<p>Check out the example of an external test for verification of code that works with Amazon SQS:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">asynctest</span> <span class="kn">import</span> <span class="n">TestCase</span>

<span class="c1"># SQSClient is the project's own wrapper; the module name below is an assumption.</span>
<span class="kn">from</span> <span class="nn">sqs</span> <span class="kn">import</span> <span class="n">SQSClient</span>
<span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">get_config</span>


<span class="k">class</span> <span class="nc">SQSTestCase</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="n">async</span> <span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">config</span> <span class="o">=</span> <span class="n">get_config</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">await</span> <span class="n">SQSClient</span><span class="o">.</span><span class="n">from_config</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">loop</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">loop</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">tearDown</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

    <span class="n">async</span> <span class="k">def</span> <span class="nf">test_send_and_receive</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">msg</span> <span class="o">=</span> <span class="p">{</span><span class="s">'id'</span><span class="p">:</span> <span class="mi">100500</span><span class="p">,</span>
               <span class="s">'description'</span><span class="p">:</span> <span class="s">'some important SQS task'</span><span class="p">}</span>
        <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">send_message</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">receive_messages</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="p">[</span><span class="n">msg</span><span class="p">])</span>
</code></pre></div></div>
<p>In <em>some cases</em> it can be okay to ignore failures of external tests so as not to block deployments. But you still need to figure out the reason for red builds.</p>
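<p>One way to keep flaky external tests from blocking every build is to make them opt-in. A minimal sketch, assuming an environment variable named <code class="highlighter-rouge">RUN_EXTERNAL_TESTS</code> (a made-up name) that only a dedicated CI job sets; the class names and the placeholder test body are hypothetical as well.</p>

```python
import os
import unittest


class ExternalTestCase(unittest.TestCase):
    """Base class: tests are skipped unless external tests are switched on."""

    def setUp(self):
        if os.environ.get('RUN_EXTERNAL_TESTS') != '1':
            self.skipTest('set RUN_EXTERNAL_TESTS=1 to run external tests')


class ThirdPartyContractTests(ExternalTestCase):
    def test_contract(self):
        # A real test would call the third-party API here.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ThirdPartyContractTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == '__main__':
    result = run_suite()
    print('skipped:', len(result.skipped))
```

<p>Skipped tests stay visible in the report, so a regular build is not blocked while the dedicated job still shows when a contract is broken.</p>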
<h2 id="performance-tests">Performance tests</h2>
<blockquote>
<h4 id="the-purpose-of-performance-testing-is-to-predict-when-we-fuk-production">The purpose of performance testing is to predict when we fu*k production.</h4>
</blockquote>
<p>In other words, it helps to find out the conditions under which our algorithms, architecture, or infrastructure cannot handle the load properly.
I believe that performance testing is a must for a high load system. I published a short blog post, <strong><a href="http://allyouneedisbackend.com/blog/2017/08/30/what-is-highload/" target="_blank">What Is a Highload Project?</a></strong>, about my definition of the term; check it out if you’re interested.</p>
<p>Here are the steps to add performance testing to your workflow.</p>
<ol>
<li><strong>Identify</strong> how the load might grow. Possible cases:
<ul>
<li>More users start using the software that you build.</li>
<li>More data is sent through your processing pipeline.</li>
<li>You need to shrink the capacity of your servers because of budget changes.</li>
<li>All sorts of unexpected edge cases.</li>
</ul>
</li>
<li><strong>Define</strong> the heaviest and most frequent operations. Examples:
<ul>
<li>Insertions into data storages.</li>
<li>Calculations and other CPU-bound tasks.</li>
<li>Calls to external services.</li>
</ul>
</li>
<li><strong>Identify</strong> how to trigger the operations above from a user’s perspective. For example:
<ul>
<li>HTTP/XMPP/your-favorite-protocol handlers.</li>
<li>REST API endpoints.</li>
<li>Periodic processing of collected data.</li>
</ul>
</li>
<li><strong>Set up collection of product metrics</strong> for the identified operations:
<ul>
<li>Application metrics can be collected using StatsD. Most often I use counters and timers.</li>
<li>For per-instance metrics, I can recommend CollectD. Top 5 metrics to look into: Load Average, CPU, RAM, Bytes received/sent, and Free disk.</li>
</ul>
</li>
<li><strong>Create a tool</strong> that behaves like gazillions of customers using your product and triggering the heavy operations. For a web server such actions can be:
<ul>
<li>Establishing network connections.</li>
<li>Making HTTP requests, sending data and retrieving information.</li>
</ul>
</li>
<li><strong>Run the tooling</strong> in the staging environment with metrics collection enabled. Increase the load gradually to investigate the behaviour of your system, and pay attention to any spikes. You can also test autoscaling of the infrastructure if you use any.</li>
</ol>
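<p>The metrics part of the steps above can be sketched without any third-party client, because the StatsD wire format is a plain-text line of the form <code class="highlighter-rouge">name:value|type</code> sent over UDP. The class name, the metric names, and the host below are assumptions made for illustration; 8125 is the port StatsD listens on by default.</p>

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name: str, value, metric_type: str) -> bytes:
    """Build a StatsD line: type 'c' is a counter, 'ms' is a timer."""
    return '{}:{}|{}'.format(name, value, metric_type).encode('ascii')


class StatsdClient:
    """Tiny fire-and-forget client; host and port are assumptions."""

    def __init__(self, host: str = '127.0.0.1', port: int = 8125):
        self._address = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name: str, count: int = 1) -> None:
        self._send(format_metric(name, count, 'c'))

    @contextmanager
    def timer(self, name: str):
        started = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - started) * 1000)
            self._send(format_metric(name, elapsed_ms, 'ms'))

    def _send(self, payload: bytes) -> None:
        try:
            self._sock.sendto(payload, self._address)  # UDP: no delivery guarantee
        except OSError:
            pass  # metrics must never break the application


if __name__ == '__main__':
    statsd = StatsdClient()
    statsd.incr('api.requests')
    with statsd.timer('api.latency'):
        time.sleep(0.01)  # stand-in for a heavy operation
```

<p>UDP is used on purpose: if the metrics daemon is down, the application keeps working and only the data points are lost.</p>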
<p>Possible results of a successful session of performance testing:</p>
<ul>
<li>You know how many requests per second can be served by a particular configuration of infrastructure.</li>
<li>You know how the system behaves when the limit is exceeded.</li>
<li>You see the bottlenecks of the platform.</li>
<li>You understand if some part of the system can be scaled.</li>
</ul>
<p><a href="https://locust.io/" target="_blank">LocustIO</a> can be a good starting point for implementing performance testing. It’s written in Python, runs load tests distributed over multiple hosts, and supports various protocols including <a href="https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol" target="_blank">HTTP</a>, <a href="https://en.wikipedia.org/wiki/XMPP" target="_blank">XMPP</a>, and <a href="https://en.wikipedia.org/wiki/XML-RPC" target="_blank">XML-RPC</a>.</p>
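<p>Before adopting a full framework, the idea of a load tool can be sketched with the standard library alone. The snippet below is an illustration under assumptions, not a replacement for a real tool: <code class="highlighter-rouge">run_load</code>, <code class="highlighter-rouge">timed_call</code> and <code class="highlighter-rouge">percentile</code> are hypothetical helper names, and <code class="highlighter-rouge">operation</code> stands for whatever triggers one heavy action in your system.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(operation):
    """Run one operation; return (succeeded, duration in seconds)."""
    started = time.perf_counter()
    try:
        operation()
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, time.perf_counter() - started


def percentile(sorted_values, fraction):
    """Nearest-rank style percentile over an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def run_load(operation, total_requests, concurrency):
    """Call `operation` total_requests times using `concurrency` threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_call(operation), range(total_requests))
        )
    elapsed = time.perf_counter() - started
    latencies = sorted(duration for _, duration in results)
    return {
        "requests": total_requests,
        "errors": sum(1 for ok, _ in results if not ok),
        "rps": total_requests / elapsed,
        "p50": percentile(latencies, 0.50),
        "p95": percentile(latencies, 0.95),
    }
```

<p>In a staging run, <code class="highlighter-rouge">operation</code> could be a function that sends one HTTP request to the endpoint under test; raising <code class="highlighter-rouge">concurrency</code> step by step gives the gradual ramp-up described in the steps above.</p>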
<h2 id="summary">Summary</h2>
<p>In this blog post, we briefly introduced four types of tests. From my experience, the tests should be provided by the engineering teams that are actively involved in the development of your product, not by a separate quality engineering team.</p>
<p>Check out the summary of each kind of test below.</p>
<p><strong>Unit tests:</strong></p>
<ul>
<li>Small and isolated.</li>
<li>Keep everything within one process; the code under test should not lead to system calls.</li>
<li>Extremely fast.</li>
<li>Good fit for checking business rules.</li>
<li>Run inside your development environment.</li>
</ul>
<p><strong>Integration tests:</strong></p>
<ul>
<li>Main purpose: verify that various components work well together.</li>
<li>Can perform I/O operations.</li>
<li>Slow.</li>
<li>Execution flow can be distributed across processes.</li>
<li>Run in your development environment.</li>
</ul>
<p><strong>External tests:</strong></p>
<ul>
<li>Ensure that the agreed API contracts are implemented properly.</li>
<li>Involve calls to software that runs outside of your direct control.</li>
<li>Run between your development environment and third-party servers.</li>
<li>Tend to be very slow.</li>
</ul>
<p><strong>Performance tests:</strong></p>
<ul>
<li>Help protect the reputation of your business if the product may come under high load.</li>
<li>Require additional preparation but are worth it.</li>
<li>Are executed in the staging environment.</li>
<li>Very slow; should be run within scheduled windows.</li>
</ul>
<p>I think that each mature programming language has its own variation of an <a href="https://en.wikipedia.org/wiki/XUnit" target="_blank">xUnit</a>-like toolset for writing automated tests for fun and profit.</p>
<p>I hope you found the practical examples in this blog post useful for your team. During the last three years, our teams invested a lot of resources into providing various types of tests as part of a pull request. We found this rewarding and valuable for our product.</p>
<p>The teams feel healthier working in an environment with enough test coverage, since we’re protected from code regressions.</p>
<p><strong>What types of tests do you provide for your code?</strong></p>
<p><strong>How much time do you spend dealing with fixing features that were delivered a while ago?</strong></p>
The SQL I Love <3. Efficient pagination of a table with 100M records2017-09-24T00:00:00+00:002017-09-24T00:00:00+00:00http://allyouneedisbackend.com/blog/2017/09/24/the-sql-i-love-part-1-scanning-large-table<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/sql-i-love.jpg" alt="The SQL Queries I Loved <3" class="image-right" width="298" height="270" layout="fixed">
</amp-img>
</div>
<p>I am a huge fan of databases. I even wanted to make my own DBMS when I was in university. Now I work with both <a href="https://en.wikipedia.org/wiki/Relational_database_management_system" target="_blank">RDBMS</a> and <a href="https://en.wikipedia.org/wiki/NoSQL" target="_blank">NoSQL</a> solutions, and I am very enthusiastic about that. You know, there’s no <a href="https://en.wikipedia.org/wiki/Law_of_the_instrument" target="_blank">Golden Hammer</a>; each problem has its own solution, or a set of solutions.</p>
<p>In the series of blog posts <strong>The SQL I Love <3</strong> I walk you through some problems solved with SQL which I found particularly interesting. The solutions are tested using a table with more than 100 million records. All the examples use MySQL, but the ideas apply to other relational data stores like PostgreSQL, Oracle and SQL Server.</p>
<p>This chapter focuses on efficiently scanning a large table using pagination with an <code class="highlighter-rouge">offset</code> on the primary key. This is also known as <strong>keyset pagination</strong>.</p>
<p><!--more--></p>
<hr />
<h2 id="background">Background</h2>
<p>In this chapter, we use the following database structure as an example. The canonical example about users should fit any domain.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="nv">`users`</span> <span class="p">(</span>
<span class="nv">`user_id`</span> <span class="n">int</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span> <span class="n">unsigned</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
<span class="nv">`external_id`</span> <span class="n">varchar</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`name`</span> <span class="n">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="k">COLLATE</span> <span class="n">utf8_unicode_ci</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="nv">`metadata`</span> <span class="n">text</span> <span class="k">COLLATE</span> <span class="n">utf8_unicode_ci</span><span class="p">,</span>
<span class="nv">`date_created`</span> <span class="k">timestamp</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">DEFAULT</span> <span class="k">CURRENT_TIMESTAMP</span><span class="p">,</span>
<span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="nv">`user_id`</span><span class="p">),</span>
<span class="k">UNIQUE</span> <span class="k">KEY</span> <span class="nv">`uf_uniq_external_id`</span> <span class="p">(</span><span class="nv">`external_id`</span><span class="p">),</span>
<span class="k">UNIQUE</span> <span class="k">KEY</span> <span class="nv">`uf_uniq_name`</span> <span class="p">(</span><span class="nv">`name`</span><span class="p">),</span>
<span class="k">KEY</span> <span class="nv">`date_created`</span> <span class="p">(</span><span class="nv">`date_created`</span><span class="p">)</span>
<span class="p">)</span> <span class="n">ENGINE</span><span class="o">=</span><span class="n">InnoDB</span> <span class="k">DEFAULT</span> <span class="n">CHARSET</span><span class="o">=</span><span class="n">utf8</span> <span class="k">COLLATE</span><span class="o">=</span><span class="n">utf8_unicode_ci</span><span class="p">;</span>
</code></pre></div></div>
<p>A few comments about the structure:</p>
<ul>
<li><code class="highlighter-rouge">external_id</code> column stores a reference to the same user in another system, in UUID format</li>
<li><code class="highlighter-rouge">name</code> represents <code class="highlighter-rouge">Firstname Lastname</code></li>
<li><code class="highlighter-rouge">metadata</code> column contains a JSON blob with all kinds of unstructured data</li>
</ul>
<p>The table is relatively large and contains around 100 000 000 records. Let’s start our learning journey.</p>
<h2 id="scanning-a-large-table">Scanning a Large Table</h2>
<p><strong>Problem</strong>: You need to walk through the table, extract each record, transform it inside your application’s code and insert it into another place. In this post, we focus on the first stage: <em>scanning the table</em>.</p>
<p><strong>Obvious and wrong solution</strong></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span><span class="p">;</span>
</code></pre></div></div>
<p>In my case with 100 000 000 records, the query never finishes. The DBMS just kills it. Why? Probably because it led to an attempt to load the whole table into RAM before returning data to the client. Another assumption: it took too much time to pre-load the data before sending, and the query timed out.
Anyway, our attempt to get all records in time failed. We need to find some other solution.</p>
<p><strong>Solution #2</strong></p>
<p>We can try to get the data in pages. Since records in a table are not guaranteed to be ordered on a physical or logical level, we need to sort them on the DBMS side with an <code class="highlighter-rouge">ORDER BY</code> clause.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">user_id</span> <span class="k">ASC</span>
<span class="k">LIMIT</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10000</span><span class="p">;</span>
<span class="c1">-- 10 000 rows in set (0.03 sec)</span>
</code></pre></div></div>
<p>Sweet. It worked. We asked for the first page of 10 000 records, and it took only <code class="highlighter-rouge">0.03</code> sec to return it. However, how would it work for the 5 000th page?</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">user_id</span> <span class="k">ASC</span>
<span class="k">LIMIT</span> <span class="mi">50000000</span><span class="p">,</span> <span class="mi">10000</span><span class="p">;</span> <span class="c1">-- 5 000th page * 10 000 page size</span>
<span class="c1">-- 10 000 rows in set (40.81 sec)</span>
</code></pre></div></div>
<p>Indeed, this is very slow. Let’s see how much time is needed to get the data for the last page.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">user_id</span> <span class="k">ASC</span>
<span class="k">LIMIT</span> <span class="mi">99990000</span><span class="p">,</span> <span class="mi">10000</span><span class="p">;</span> <span class="c1">-- 9999th page * 10 000 page size</span>
<span class="c1">-- 10 000 rows in set (1 min 20.61 sec)</span>
</code></pre></div></div>
<p>This is insane. However, it can be OK for solutions that run in the background. One more hidden problem with the approach is revealed if records are deleted from the table in the middle of scanning it. Say you finished the 10th page (100 000 records are already visited) and are going to scan the records between 100 001 and 110 000. But records 99 998 and 99 999 are deleted before the next <code class="highlighter-rouge">SELECT</code> execution. In that case, the following query returns an unexpected result:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">user_id</span> <span class="k">ASC</span>
<span class="k">LIMIT</span> <span class="mi">100000</span><span class="p">,</span> <span class="mi">10000</span><span class="p">;</span>
<span class="c1">-- N, id, ...</span>
<span class="c1">-- 1, 100 003, ...</span>
<span class="c1">-- 2, 100 004, ...</span>
</code></pre></div></div>
<p>As you can see, the query skipped the records with ids 100 001 and 100 002. The application’s code will never process them with this approach, because after the two delete operations they fall within the first 100 000 records. Therefore, the method is unreliable if the dataset is mutable.</p>
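<p>The skipping effect is easy to reproduce on a tiny table. The sketch below uses Python with an in-memory SQLite database as a stand-in for MySQL (an assumption made only to keep the example self-contained); the table has 30 rows, the page size is 10, and the function names are hypothetical.</p>

```python
import sqlite3


def make_users(conn, count):
    """Create a small users table with ids 1..count."""
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO users (user_id, name) VALUES (?, ?)",
        [(i, f"user-{i}") for i in range(1, count + 1)],
    )


def fetch_page_by_offset(conn, offset, page_size):
    """Classic LIMIT/OFFSET paging ordered by primary key."""
    rows = conn.execute(
        "SELECT user_id FROM users ORDER BY user_id ASC LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return [row[0] for row in rows]


def scan_with_concurrent_delete(total=30, page_size=10, deleted=(8, 9)):
    """Scan by offset while already visited rows get deleted.

    Returns the ids that still exist in the table but were never visited.
    """
    conn = sqlite3.connect(":memory:")
    make_users(conn, total)
    seen = []
    offset = 0
    while True:
        page = fetch_page_by_offset(conn, offset, page_size)
        if not page:
            break
        seen.extend(page)
        if offset == 0:
            # Simulate another process deleting rows from the first page.
            conn.executemany(
                "DELETE FROM users WHERE user_id = ?", [(i,) for i in deleted]
            )
        offset += page_size
    alive = [row[0] for row in conn.execute("SELECT user_id FROM users")]
    conn.close()
    return sorted(set(alive) - set(seen))
```

<p>With the defaults, deleting ids 8 and 9 after the first page shifts every later row two positions to the left, so the rows with ids 11 and 12 are still in the table but never returned by any page.</p>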
<p><strong>Solution #3 - the final one for today</strong></p>
<p>The approach is very similar to the previous one because it still uses paging, but now instead of relying on the number of scanned records, we use the <code class="highlighter-rouge">user_id</code> of the last visited record as the <code class="highlighter-rouge">offset</code>.</p>
<p>Simplified algorithm:</p>
<ol>
<li>We get <code class="highlighter-rouge">PAGE_SIZE</code> number of records from the table. Starting offset value is 0.</li>
<li>Use the max returned value for <code class="highlighter-rouge">user_id</code> in the batch as the offset for the next page.</li>
<li>Get the next batch from the records which have <code class="highlighter-rouge">user_id</code> value higher than current <code class="highlighter-rouge">offset</code>.</li>
</ol>
<p>The query in action for the 5 000th page, where each page contains data about 10 000 users:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">external_id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">metadata</span><span class="p">,</span> <span class="n">date_created</span>
<span class="k">FROM</span> <span class="n">users</span>
<span class="k">WHERE</span> <span class="n">user_id</span> <span class="o">></span> <span class="mi">51234123</span> <span class="c1">-- value of user_id for 50 000 000th record</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">user_id</span> <span class="k">ASC</span>
<span class="k">LIMIT</span> <span class="mi">10000</span><span class="p">;</span>
<span class="c1">-- 10 000 rows in set (0.03 sec)</span>
</code></pre></div></div>
<blockquote>
<h4 id="wow-it-is-significantly-faster-than-the-previous-approach-more-than-1000-times">Wow, it is significantly faster than the previous approach. More than 1000 times.</h4>
</blockquote>
<p>Note that the values of <code class="highlighter-rouge">user_id</code> are not sequential and can have gaps: for example, 25 348 can come right after 25 345. The solution also keeps working when records are deleted during the scan, whether from pages already visited or from pages still ahead: the query never skips a surviving record. Sweet, right?</p>
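<p>The simplified algorithm can be sketched in application code as follows. This is an illustration under assumptions: it uses Python with SQLite as a self-contained stand-in for MySQL, a reduced <code class="highlighter-rouge">users</code> table with only <code class="highlighter-rouge">user_id</code> and <code class="highlighter-rouge">name</code>, and a hypothetical function name <code class="highlighter-rouge">scan_users</code>.</p>

```python
import sqlite3


def scan_users(conn, page_size):
    """Yield pages of (user_id, name) rows using keyset pagination.

    The offset is the largest user_id seen so far, not a row count,
    so gaps and deletions in earlier pages do not shift later pages.
    """
    last_seen_id = 0  # starting offset, as in step 1 of the algorithm
    while True:
        rows = conn.execute(
            "SELECT user_id, name FROM users "
            "WHERE user_id > ? ORDER BY user_id ASC LIMIT ?",
            (last_seen_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Rows are ordered by user_id, so the last row holds the max value.
        last_seen_id = rows[-1][0]
```

<p>Because the generator only runs the next query when the caller asks for the next page, the application can transform and store each batch before more data is fetched.</p>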
<h2 id="explaining-performance">Explaining performance</h2>
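<p>For further learning, I recommend investigating the results of <code class="highlighter-rouge">EXPLAIN EXTENDED</code> for each version of the query that gets the next 10 000 records after 50 000 000 (on MySQL 5.7 and later, plain <code class="highlighter-rouge">EXPLAIN</code> reports the same columns).</p>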
<table>
<thead>
<tr>
<th>Solution</th>
<th>Time</th>
<th>Join type</th>
<th>Possible keys / Key</th>
<th>Rows</th>
<th>Filtered</th>
<th>Extra</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. Obvious</td>
<td>Never</td>
<td>ALL</td>
<td>NULL</td>
<td>100M</td>
<td>100.00</td>
<td>NULL</td>
</tr>
<tr>
<td>2. Paging using number of records as offset</td>
<td>40.81 sec</td>
<td>index</td>
<td>NULL / PRIMARY</td>
<td>50M</td>
<td>200.00</td>
<td>NULL</td>
</tr>
<tr>
<td>3. Keyset pagination using user_id as offset</td>
<td>0.03 sec</td>
<td>range</td>
<td>PRIMARY / PRIMARY</td>
<td>50M</td>
<td>100.00</td>
<td>Using where</td>
</tr>
</tbody>
</table>
<p>Let’s focus on the key difference between execution plans for 2nd and 3rd solutions since the 1st one is not practically useful for large tables.</p>
<ul>
<li><strong>Join type</strong>: <code class="highlighter-rouge">index</code> vs <code class="highlighter-rouge">range</code>. The first one means that the whole index tree is scanned to find the records. The <code class="highlighter-rouge">range</code> type tells us that the index is used only to find matching rows within a specified range. So, <code class="highlighter-rouge">range</code> is faster than <code class="highlighter-rouge">index</code>.</li>
<li><strong>Possible keys</strong>: <code class="highlighter-rouge">NULL</code> vs <code class="highlighter-rouge">PRIMARY</code>. The column shows the keys that MySQL can choose from. Looking at the <strong>key</strong> column, we can see that the <code class="highlighter-rouge">PRIMARY</code> key is eventually used for both queries.</li>
<li><strong>Rows</strong>: <code class="highlighter-rouge">50 010 000</code> vs <code class="highlighter-rouge">50 000 000</code>. The value is the estimated number of records examined before returning the result. For the 2nd query, the value depends on how deep our scroll is. For example, if we try to get the next <code class="highlighter-rouge">10 000</code> records after the 9999th page, then <code class="highlighter-rouge">99 990 000</code> records are examined. In contrast, the 3rd query showed a constant value in my runs; it did not matter whether we loaded data for the 1st page or the very last one. It was always about half the size of the table. Keep in mind that this is the optimizer’s estimate: thanks to <code class="highlighter-rouge">LIMIT</code>, the range scan can stop as soon as 10 000 matching rows are read.</li>
<li><strong>Filtered</strong>: <code class="highlighter-rouge">200.00</code> vs <code class="highlighter-rouge">100.00</code>. The column indicates the estimated percentage of examined rows that remain after the table condition is applied. The value of <code class="highlighter-rouge">100.00</code> means that no rows were filtered out. For the 2nd query, the value was not constant and depended on the page number: if we ask for the 1st page, the value of the filtered column would be <code class="highlighter-rouge">1000000.00</code>. For the very last page, it would be <code class="highlighter-rouge">100.00</code>. Values above 100 are best read as an artifact of the estimate rather than as a real percentage.</li>
<li><strong>Extra</strong>: <code class="highlighter-rouge">NULL</code> vs <code class="highlighter-rouge">Using where</code>. Provides additional information about how MySQL resolves the query. Usage of <code class="highlighter-rouge">WHERE</code> on the <code class="highlighter-rouge">PRIMARY</code> key makes the query execution faster.</li>
</ul>
<p>I suspect that <strong>join type</strong> is the parameter that contributed the most to making the 3rd query faster. Another important thing is that the 2nd query is extremely dependent on the number of pages to scroll. Deeper pagination is slower in that case.</p>
<p>More guidance about understanding the output of the <code class="highlighter-rouge">EXPLAIN</code> command can be found in <a href="https://dev.mysql.com/doc/refman/5.6/en/explain-output.html" target="_blank">the official documentation for your RDBMS</a>.</p>
<h2 id="summary">Summary</h2>
<p>The main topic of this blog post was scanning a large table with 100 000 000 records using <code class="highlighter-rouge">offset</code> with a primary key (keyset pagination). Overall, 3 different approaches were reviewed and tested on the corresponding dataset. I recommend only one of them if you need to scan a mutable large table.</p>
<p>Also, we reviewed the usage of the <code class="highlighter-rouge">EXPLAIN EXTENDED</code> command to analyze the execution plan of MySQL queries. I am sure that other RDBMS have analogs for the functionality.</p>
<p>In the next chapter, we will pay attention to data aggregation and storage optimization. Stay tuned!</p>
<p><strong>What’s your method of scanning large tables?</strong></p>
<p><strong>Do you remember any other purpose of using keyset pagination like in Solution #3?</strong></p>
<p><a href="http://use-the-index-luke.com/no-offset" target="_blank">
<img src="http://use-the-index-luke.com/static/no-offset-banner-468x60.white.-ImIEcEh.png" width="468" height="60" alt="Do not use OFFSET for pagination" />
</a></p>
Never Give Up, Retry: How Software Should Deal with Failures2017-09-15T00:00:00+00:002017-09-15T00:00:00+00:00http://allyouneedisbackend.com/blog/2017/09/15/how-backend-software-should-retry-on-failures<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/never-give-up-retry.png" alt="Never Give Up, Retry: How Backend Software Should Deal with Failures" class="image-right" width="256" height="320" layout="fixed">
</amp-img>
</div>
<p>It’s doubtful that your backend has everything within one process: you need to read configuration, store customers’ data, write logs and metrics about the status of your software.</p>
<p>If you’re working on a network application - it’s even more complicated: your database can be far far away from the running code.</p>
<p>Some things can go wrong: a network blip might happen, the remote database can be overloaded by incoming requests, a query can reveal some bug in the DBMS and crash it, your data can be out of order on that side because of some reason, and so on.</p>
<p>Microservice architecture encourages cross-process communication over the network. Now your service asks another one for its configuration, which is stored somewhere in the database. You should prepare the software for non-deterministic failures that might occur during the data transfer. And not only then.</p>
<p>In this blog post, we will look into some common failures that can be solved with proper retrying. The basic ideas are described using Python, but experience with the language is not required for understanding.</p>
<!--more-->
<hr />
<h2 id="failures-when-retry-might-help">Failures When Retry Might Help</h2>
<p>Retry helps in the places where our code acts as a client of some other backend. It can be a wrapper for a database client, a client for an HTTP server, etc. As a consumer, we expect (sometimes without any reason) that the problem that prevents successful processing can be fixed by someone else shortly.</p>
<p>I can name two categories of problems where I consider retrying:</p>
<ul>
<li>errors during data transfer</li>
<li>failures of processing because of load issues</li>
</ul>
<p>In <strong>the first one</strong>, our request does not even reach the eventual endpoint: the application code that handles your query on the other side. From my experience, the possible reasons vary:</p>
<ul>
<li><code class="highlighter-rouge">DNSError</code> - the domain name cannot be resolved into an IP address. It does not always mean that the system is not available. It might happen when your destination is being redeployed.</li>
<li><code class="highlighter-rouge">ConnectionError</code> - we failed to perform a connection handshake successfully. The connection might not be stable on the network level at the moment of making the request.</li>
<li><code class="highlighter-rouge">Timeout</code> - a server fails to send a byte within the specified timeout, although the connection was previously established. Some previous requests might even have been processed successfully using the same instance of your <code class="highlighter-rouge">HTTPClient</code>. The error may occur when the infrastructure of your destination scaled down, and the particular server that you used does not exist anymore, while the service itself is still available. You need to recreate <code class="highlighter-rouge">HTTPClient</code> and try to establish a new connection.</li>
</ul>
<p>Some common database issues from the same group:</p>
<ul>
<li><code class="highlighter-rouge">OperationalError</code> - the error occurs when the connection to storage is lost or cannot be established at the moment. I think it is worth recreating the client instance and trying to connect again in such cases. Here are examples of the error codes returned by PostgreSQL:
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Class 08 — Connection Exception
08000 connection_exception
08003 connection_does_not_exist
08006 connection_failure
08001 sqlclient_unable_to_establish_sqlconnection
08004 sqlserver_rejected_establishment_of_sqlconnection
</code></pre></div> </div>
</li>
<li><code class="highlighter-rouge">ProtocolError</code> is an example from Redis. The exception is raised when the Redis server receives a sequence of bytes that translates to a meaningless operation. Since you test your software before deploying it, it’s unlikely that the error occurs because of poorly written code. Let’s blame our transport layer :).</li>
</ul>
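<p>The “recreate the client and try again” advice for transport-level errors can be sketched as below. This is a minimal illustration under assumptions: <code class="highlighter-rouge">call_with_retry</code>, <code class="highlighter-rouge">make_client</code> and <code class="highlighter-rouge">operation</code> are hypothetical names, and Python’s built-in <code class="highlighter-rouge">ConnectionError</code> and <code class="highlighter-rouge">TimeoutError</code> stand in for whatever exceptions your HTTP or database driver actually raises.</p>

```python
import time

# Assumption: these built-ins stand in for driver-specific exception classes.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def call_with_retry(make_client, operation, attempts=3, base_delay=0.1,
                    sleep=time.sleep):
    """Run operation(client), recreating the client after transient errors.

    Delays grow exponentially: base_delay, 2 * base_delay, 4 * base_delay...
    The last error is re-raised when all attempts are used up.
    """
    last_error = None
    for attempt in range(attempts):
        client = make_client()  # fresh client, so a dead connection is dropped
        try:
            return operation(client)
        except TRANSIENT_ERRORS as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

<p>Only exceptions listed in <code class="highlighter-rouge">TRANSIENT_ERRORS</code> are retried; anything else propagates immediately, so genuine bugs are not hidden behind retries.</p>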
<p>Thinking about the 2nd category of failures aka <strong>load issues</strong> I’d like to review the following responses from HTTP server:</p>
<ul>
<li><code class="highlighter-rouge">408 Request Timeout</code> is returned when the server did not receive a complete request within the time it was prepared to wait. Possible reason: a resource is overwhelmed by a lot of incoming requests. Waiting and retrying after some delay can be a good strategy to eventually finish data processing on your side.</li>
<li><code class="highlighter-rouge">429 Too Many Requests</code> means that you sent more requests than the server allows during some time frame. This technique used by the server is also known as rate-limiting. A good thing is that the server MAY return a <code class="highlighter-rouge">Retry-After</code> header that recommends how long you need to wait before making the next request.</li>
<li><code class="highlighter-rouge">500 Internal Server Error</code>. That’s the most infamous HTTP server error. The diversity of reasons for the error depends only on the good faith of the developers: the response is returned for every uncaught exception that occurs on the server. I do not have a strong opinion that we should continuously retry on such errors. For each service that you use, you should learn the reason behind the response.
<blockquote>
<p>For the developers of web servers who are reading these lines, I suggest avoiding this type of response where possible. Think about using a more specific response when you know the reason for the failure.</p>
</blockquote>
</li>
<li><code class="highlighter-rouge">503 Service Unavailable</code> - the service currently cannot handle the request because of <em>temporary</em> overload. You can expect that it will be alleviated after some delay. The server MAY send a <code class="highlighter-rouge">Retry-After</code> header, as mentioned for the <code class="highlighter-rouge">429 Too Many Requests</code> status code.</li>
<li><code class="highlighter-rouge">504 Gateway Timeout</code> is similar to <code class="highlighter-rouge">408 Request Timeout</code> but means that the gateway or reverse-proxy that stands in front of the server did not receive a timely response from it.</li>
</ul>
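<p>The status codes above can be turned into a small decision helper. The sketch below is one possible reading of the list, not a definitive policy: <code class="highlighter-rouge">retry_delay</code>, <code class="highlighter-rouge">parse_retry_after</code> and the default numbers are assumptions, <code class="highlighter-rouge">500</code> is deliberately left out of the retryable set, and <code class="highlighter-rouge">headers</code> is assumed to be a plain dict with the exact key <code class="highlighter-rouge">Retry-After</code>.</p>

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Statuses treated as "load issues" worth retrying; 500 is excluded on purpose.
RETRYABLE_STATUSES = {408, 429, 503, 504}


def parse_retry_after(value, now=None):
    """Convert a Retry-After value (seconds or HTTP-date) to seconds."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparsable header: fall back to our own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_delay(status, headers, attempt, base=0.5, cap=30.0):
    """Return seconds to wait before retrying, or None if we should not."""
    if status not in RETRYABLE_STATUSES:
        return None
    hinted = parse_retry_after(headers.get("Retry-After"))
    if hinted is not None:
        return min(hinted, cap)  # trust the server, but never wait forever
    return min(cap, base * 2 ** attempt)  # exponential backoff otherwise
```

<p>Preferring the server’s hint over the local backoff keeps the client polite under rate-limiting, while the cap protects it from an unreasonably large <code class="highlighter-rouge">Retry-After</code> value.</p>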
<p>But, hopefully, we do not live in a world where everything is transmitted over the HTTP protocol. I want to share my experience of retrying on database errors:</p>
<ul>
<li><code class="highlighter-rouge">OperationalError</code>. Yes, we already met this guy in the blog post. For both PostgreSQL and MySQL, it additionally covers failures that are not under the control of a software engineer. Examples: a memory allocation error occurred during processing, or a transaction could not be processed. I recommend retrying on them.</li>
<li><code class="highlighter-rouge">IntegrityError</code> - this is a tricky one. It can be raised when a foreign key constraint is violated, like when you try to insert a <code class="highlighter-rouge">Record A</code> that depends on <code class="highlighter-rouge">Record B</code>, and <code class="highlighter-rouge">Record B</code> has not been added yet because of the asynchronous nature of your system. In this case, I’d retry. On the other hand, the exception is also raised when your attempt to add a record leads to duplication of a unique key. It’s unlikely that we want to retry in that case. You might ask me how to distinguish such cases and retry only when it’s needed. Hopefully, your DBMS returns the code of the error, and your SQL driver already knows how to map the codes to exception classes. Here’s the example for MySQL:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># from pymysql.err</span>
<span class="n">_map_error</span><span class="p">(</span><span class="n">IntegrityError</span><span class="p">,</span> <span class="n">ER</span><span class="o">.</span><span class="n">DUP_ENTRY</span><span class="p">,</span> <span class="n">ER</span><span class="o">.</span><span class="n">NO_REFERENCED_ROW</span><span class="p">,</span>
<span class="n">ER</span><span class="o">.</span><span class="n">NO_REFERENCED_ROW_2</span><span class="p">,</span> <span class="n">ER</span><span class="o">.</span><span class="n">ROW_IS_REFERENCED</span><span class="p">,</span>
<span class="n">ER</span><span class="o">.</span><span class="n">ROW_IS_REFERENCED_2</span><span class="p">,</span> <span class="n">ER</span><span class="o">.</span><span class="n">CANNOT_ADD_FOREIGN</span><span class="p">,</span>
<span class="n">ER</span><span class="o">.</span><span class="n">BAD_NULL_ERROR</span><span class="p">)</span>
<span class="c"># from pymysql.constants.ER</span>
<span class="n">BAD_NULL_ERROR</span> <span class="o">=</span> <span class="mi">1048</span>
<span class="n">DUP_ENTRY</span> <span class="o">=</span> <span class="mi">1062</span>
<span class="n">NO_REFERENCED_ROW</span> <span class="o">=</span> <span class="mi">1216</span>
<span class="n">ROW_IS_REFERENCED</span> <span class="o">=</span> <span class="mi">1217</span>
<span class="n">ROW_IS_REFERENCED_2</span> <span class="o">=</span> <span class="mi">1451</span>
<span class="n">NO_REFERENCED_ROW_2</span> <span class="o">=</span> <span class="mi">1452</span>
<span class="n">CANNOT_ADD_FOREIGN</span> <span class="o">=</span> <span class="mi">1215</span>
</code></pre></div></div>
<p>With that information, you can make your software skip the retry on <code class="highlighter-rouge">IntegrityError</code> when the code is <code class="highlighter-rouge">DUP_ENTRY</code> and retry in the other reasonable cases.
References:</p>
<ul>
<li><a href="https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/constants/ER.py" target="_blank">constants for MySQL errors</a></li>
<li><a href="https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/err.py#L77" target="_blank">the mapping between exception types in PyMySQL and error codes</a></li>
</ul>
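<p>The error-code check for <code class="highlighter-rouge">IntegrityError</code> could look roughly like the sketch below. To keep it runnable without a database, it uses a stand-in exception class instead of <code class="highlighter-rouge">pymysql.err.IntegrityError</code>, and it assumes that the driver puts the numeric error code into <code class="highlighter-rouge">args[0]</code>. The name <code class="highlighter-rouge">should_retry</code> and the choice of retryable codes are hypothetical.</p>

```python
# Error codes copied from the pymysql.constants.ER listing above.
DUP_ENTRY = 1062
NO_REFERENCED_ROW = 1216
NO_REFERENCED_ROW_2 = 1452

# Codes for which a later attempt may succeed, e.g. the referenced row
# has not been written yet (an assumption made for this sketch).
RETRYABLE_INTEGRITY_CODES = {NO_REFERENCED_ROW, NO_REFERENCED_ROW_2}


class IntegrityError(Exception):
    """Stand-in for pymysql.err.IntegrityError, used to keep the sketch runnable."""


def should_retry(err):
    """Decide whether an IntegrityError is worth retrying, based on its code.

    Assumes the numeric error code is the first exception argument.
    """
    code = err.args[0] if err.args else None
    return code in RETRYABLE_INTEGRITY_CODES
```

<p>A duplicate key (<code class="highlighter-rouge">DUP_ENTRY</code>) is deliberately left out of the retryable set, since repeating the same insert would fail in the same way.</p>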
<h2 id="naive-implementation-of-retry-decorator">Naive Implementation of Retry Decorator</h2>
<p>I can think of retrying only for I/O operations. Since 2014, we have tended to build production software that performs such operations asynchronously. Twisted and asyncio have been our friends all along.</p>
<p>Python has the expressive concept of decorators: syntax for adding new functionality to existing functions using the capabilities of <a href="https://en.wikipedia.org/wiki/Higher-order_programming" target="_blank">higher-order programming</a>.</p>
<p>For example, we have the <code class="highlighter-rouge">fetch</code> function to make an asynchronous HTTP request and download a page from <code class="highlighter-rouge">python.org</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Example is taken from http://aiohttp.readthedocs.io/en/stable/#getting-started</span>
<span class="kn">import</span> <span class="nn">aiohttp</span>
<span class="kn">import</span> <span class="nn">asyncio</span>

<span class="n">async</span> <span class="k">def</span> <span class="nf">fetch</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
    <span class="n">async</span> <span class="k">with</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">as</span> <span class="n">response</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">await</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">()</span>

<span class="c"># Client code, provided for reference</span>
<span class="n">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">async</span> <span class="k">with</span> <span class="n">aiohttp</span><span class="o">.</span><span class="n">ClientSession</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
        <span class="n">html</span> <span class="o">=</span> <span class="n">await</span> <span class="n">fetch</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="s">'http://python.org'</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">html</span><span class="p">)</span>

<span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">get_event_loop</span><span class="p">()</span>
<span class="n">loop</span><span class="o">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>
<p>The <code class="highlighter-rouge">fetch</code> function works fine, but it might not be reliable enough. It’s not protected from any of the HTTP errors listed above. But we can make it better:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@retry</span><span class="p">(</span><span class="n">aiohttp</span><span class="o">.</span><span class="n">DisconnectedError</span><span class="p">,</span> <span class="n">aiohttp</span><span class="o">.</span><span class="n">ClientError</span><span class="p">,</span>
       <span class="n">aiohttp</span><span class="o">.</span><span class="n">HttpProcessingError</span><span class="p">)</span>
<span class="n">async</span> <span class="k">def</span> <span class="nf">fetch</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
    <span class="n">async</span> <span class="k">with</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">as</span> <span class="n">response</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">await</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">()</span>
</code></pre></div></div>
<p>Now the function does not give up when the specified exceptions occur. It tries to perform the request several times. This easy trick can make your software more reliable and remove various intermittent bugs. Let’s look at how the naive implementation works under the hood:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">wraps</span>

<span class="n">log</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">RetryExhaustedError</span><span class="p">(</span><span class="nb">Exception</span><span class="p">):</span>
    <span class="s">"""Raised when all retry attempts have failed."""</span>

<span class="k">def</span> <span class="nf">retry</span><span class="p">(</span><span class="o">*</span><span class="n">exceptions</span><span class="p">,</span> <span class="n">retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">cooldown</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="s">"""Decorate an async function to execute it a few times before giving up.
    Hopes that problem is resolved by another side shortly.

    Args:
        exceptions (Tuple[Exception]) : The exceptions expected during function execution
        retries (int): Number of retries of function execution.
        cooldown (int): Seconds to wait before retry.
        verbose (bool): Specifies if we should log about not successful attempts.
    """</span>

    <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
        <span class="nd">@wraps</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
        <span class="n">async</span> <span class="k">def</span> <span class="nf">inner</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
            <span class="n">retries_count</span> <span class="o">=</span> <span class="mi">0</span>

            <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">result</span> <span class="o">=</span> <span class="n">await</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
                <span class="k">except</span> <span class="n">exceptions</span> <span class="k">as</span> <span class="n">err</span><span class="p">:</span>
                    <span class="n">retries_count</span> <span class="o">+=</span> <span class="mi">1</span>
                    <span class="n">message</span> <span class="o">=</span> <span class="s">"Exception during {} execution. "</span> \
                              <span class="s">"{} of {} retries attempted"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">func</span><span class="p">,</span> <span class="n">retries_count</span><span class="p">,</span> <span class="n">retries</span><span class="p">)</span>

                    <span class="k">if</span> <span class="n">retries_count</span> <span class="o">></span> <span class="n">retries</span><span class="p">:</span>
                        <span class="n">verbose</span> <span class="ow">and</span> <span class="n">log</span><span class="o">.</span><span class="n">exception</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
                        <span class="k">raise</span> <span class="n">RetryExhaustedError</span><span class="p">(</span>
                            <span class="n">func</span><span class="o">.</span><span class="n">__qualname__</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span> <span class="k">from</span> <span class="n">err</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="n">verbose</span> <span class="ow">and</span> <span class="n">log</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>

                    <span class="k">if</span> <span class="n">cooldown</span><span class="p">:</span>
                        <span class="n">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">cooldown</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="k">return</span> <span class="n">result</span>
        <span class="k">return</span> <span class="n">inner</span>
    <span class="k">return</span> <span class="n">wrap</span>
</code></pre></div></div>
<p>As you can see, the basic idea is to catch expected exceptions until we reach the limit for the number of <code class="highlighter-rouge">retries</code>. Between each execution, we wait <code class="highlighter-rouge">cooldown</code> seconds. Also, we write logs about each failed attempt if we want to be verbose.</p>
<blockquote>
<p>Implementing something like that in your favorite programming language can be a good exercise, especially when the language does not support higher-order functions or decorators.</p>
</blockquote>
<h2 id="production-grade-solutions">Production-grade Solutions</h2>
<p>In the example above we have a minimum number of settings to configure:</p>
<ul>
<li>types of exceptions to retry</li>
<li>number of attempts</li>
<li>the time between attempts</li>
<li>verbosity for logging unsuccessful attempts</li>
</ul>
<p>Sometimes it’s enough. But I know cases when you need more features. Pick the ones that look appealing to you from the list of possible capabilities:</p>
<ul>
<li>retry on synchronous functions</li>
<li>stopping after some timeout, regardless of the number of attempts</li>
<li>waiting a random time within some boundaries between retries</li>
<li>exponential backoff between attempts</li>
<li>specifying additional attributes for exceptions to retry, like integer error codes (you remember the example about <code class="highlighter-rouge">IntegrityError</code>, right?)</li>
<li>providing hooks before and after attempts</li>
<li>retrying not on exceptions, but on some values which satisfy a predicate</li>
<li>usage of some synchronization primitive to limit the number of ongoing requests against some backend</li>
<li>reading configuration for retry logic from some other source dynamically like from Feature Flags as a Service</li>
<li>retrying forever (sounds crazy to me, but who knows about your case)</li>
</ul>
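<p>Two of the capabilities from the list, exponential backoff and a random wait within boundaries, are easy to sketch. The snippet below is only an illustration under stated assumptions: the names <code class="highlighter-rouge">backoff_delays</code> and <code class="highlighter-rouge">with_jitter</code> and all default values are hypothetical and are not taken from any of the libraries mentioned in this post.</p>

```python
import random


def backoff_delays(retries, base=0.5, factor=2.0, max_delay=30.0):
    """Yield a capped, exponentially growing delay (in seconds) for each retry."""
    for attempt in range(retries):
        yield min(max_delay, base * factor ** attempt)


def with_jitter(delay, rng=random):
    """Pick a random wait between 0 and `delay`, so that many clients
    do not retry at exactly the same moment."""
    return rng.uniform(0, delay)
```

<p>In a retry loop, each value from <code class="highlighter-rouge">backoff_delays</code> would be passed through <code class="highlighter-rouge">with_jitter</code> and then to a sleep call instead of a fixed cooldown.</p>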
<p>If you’re looking for a retry library to add to your Python-based product, you might be interested in these third-party projects:</p>
<ul>
<li><a href="https://github.com/jd/tenacity" target="_blank">tenacity</a></li>
<li><a href="https://github.com/rholder/retrying" target="_blank">retrying</a></li>
<li><a href="https://github.com/litl/backoff" target="_blank">backoff</a></li>
<li><a href="https://github.com/h2non/riprova" target="_blank">riprova</a></li>
</ul>
<p>Not using Python on your backend? At least JavaScript, Go and Java have <a href="https://github.com/search?utf8=%E2%9C%93&q=topic%3Aretry&type=Repositories" target="_blank">open-sourced implementations</a> of retry helpers.</p>
<h2 id="summary">Summary</h2>
<p>The product that you build does not depend only on the software that you write. You need to rely on external resources like databases or other services that perform some good things for your customers.</p>
<blockquote>
<p>Backend programming is about making a wrapper of calls to the software built by somebody else.</p>
</blockquote>
<p>From my experience, I/O operations are the most vulnerable places for all kinds of random failures. In the blog post, I shared my recommendations on when and why we should retry. But I would like to know:</p>
<p><strong>How do you decide when to retry?</strong></p>
<p><strong>What do you use to provide such functionality?</strong></p>
<hr />
<h1>What Is a Highload Project?</h1>
<p>2017-08-30, http://allyouneedisbackend.com/blog/2017/08/30/what-is-highload</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/what-is-highload-new.png" alt="What Is a HighLoad Project?" class="image-right" width="270" height="302" layout="fixed">
</amp-img>
</div>
<p>Highload. It was the main buzzword for me 5 or 6 years ago. Ever since <a href="https://en.wikipedia.org/wiki/The_Social_Network" target="_blank">The Social Network</a> movie was released, I had wanted to develop that kind of software.</p>
<p>The domain area did not matter to me then: dating services for founders of dating services, illegal online casinos or websites which stream questionable video content - everything would be okay. I wanted to be a part of a team which solves complex engineering problems at scale and delivers a product to <em>many thousands and millions of users simultaneously</em>.</p>
<p>I had read dozens of definitions on the Internet from different sources, but I still did not understand what highload means. Now, after years of developing various highload projects, I have created <strong>my very own definition of highload</strong>.</p>
<!--more-->
<hr />
<h2 id="what-the-internet-says-on-highload">What The Internet says on Highload</h2>
<p>Let me share with you the aggregation of my findings from different sources:</p>
<blockquote>
<p>Highload begins when one physical server becomes unable to handle data processing.</p>
</blockquote>
<p>It sounds reasonable, doesn’t it? But I cannot agree with the definition because it does not account for software on systems which cannot scale at all, like embedded ones.</p>
<blockquote>
<p>If a single instance serves 10,000 connections simultaneously - it’s highload.</p>
</blockquote>
<p>The statement is interesting and refers to the <a href="http://www.kegel.com/c10k.html" target="_blank">C10K</a> problem. But I think that it wrongfully excludes the systems which handle fewer connections.</p>
<blockquote>
<p>Your project is highload if it processes 100+ dynamic requests per second.</p>
</blockquote>
<p>It does not sound <em>serious enough</em> if we think about regular HTTP requests where an application flips a bit in a database. But if processing on the backend requires a lot of CPU work - why not? Anyway, let’s skip this one because it’s not universal.</p>
<blockquote>
<p>Highload is about serving thousands and millions of users simultaneously.</p>
</blockquote>
<p>This opinion is prevalent. Not many projects can boast such numbers. I think that having tons of customers is not necessary for a system to be highload, but it’s definitely sufficient.</p>
<blockquote>
<p>Highload starts when regular and obvious solutions stop working and you need to make some tricks to handle the traffic.</p>
</blockquote>
<p>I like this one! I kind of agree with it. I found it too late :).</p>
<blockquote>
<p>If your infrastructure cannot consume incoming streams of data and requires horizontal scaling - welcome to the Highload club.</p>
</blockquote>
<p>Horizontal scaling and highload very often go together. But it’s not a rule for me.</p>
<blockquote>
<p>Highload websites are the same as usual ones, but they have a very large audience and use a lot of optimizations to handle the load.</p>
</blockquote>
<p>It’s not only about websites. And you do not need to have a <em>large audience</em>. On the other hand, the idea is good.</p>
<blockquote>
<p>If you use one very fat machine, your project is rather a highload one.</p>
</blockquote>
<p>Yes, it can be true. And it’s also not a mandatory feature.</p>
<blockquote>
<p>Usage of Lambda Architecture and Kafka makes the system highload.</p>
</blockquote>
<p>We can rephrase it more generally: <em>Usage of technology X or architectural pattern Y makes your project highload</em>. I strongly disagree with that. A student can build a personal project that never reaches real production and is only tested by friends with non-real-world data samples. We need to put some stress on the system from real-life customers or datasets to call it <em>highload</em>.</p>
<blockquote>
<p>If you’re deployed on AWS, IBM Bluemix, Azure, or Google Cloud Platform, then you’re maintaining a highload service.</p>
</blockquote>
<p>This definition does not make a lot of sense to me. What about on-premise solutions? Does it mean that they cannot be in the Highload club? Nope. By the way, cloud computing offers a lot of services to speed up development and make scalability a bit easier.</p>
<h2 id="well-viach-whats-highload">Well, Viach, what’s Highload?</h2>
<p>As you probably noticed, all the listed definitions and tons of others did not satisfy me completely. I drew on all the experience I had to come up with this statement.</p>
<blockquote>
<h2 id="what-is-a-highload-project">What is a highload project?</h2>
<h4 id="a-project-when-an-inefficient-solution-or-a-tiny-bug-has-a-huge-impact-on-your-business-the-mistake-leads-to-increase-of-cost--or-loss-of-companys-reputation">A project where an inefficient solution or a tiny bug has a huge impact on your business. The mistake leads to an increase in cost <strong>$$$</strong> or a loss of the company’s reputation.</h4>
<h4 id="from-the-engineering-perspective-a-lack-of-resources-causes-performance-degradation">From the engineering perspective, a lack of resources causes performance degradation.</h4>
</blockquote>
<p>I like this definition more than others because:</p>
<ul>
<li>There’s nothing about numbers: servers, connections, requests, users. These characteristics can vary from project to project.</li>
<li>We do not specify technologies, vendors, and patterns.</li>
<li>It’s easy for an engineer to tell whether a project is highload at a given stage. If you can still afford to make rough decisions without impact on your business - you’re in the safe spot. For now. Otherwise - please, be responsible and careful.</li>
<li>It provides a better explanation to business stakeholders of why your team should build one solution over another, or helps to get buy-in for additional resources. Remember that we all make software to solve some real-world problem.</li>
</ul>
<p>Does the definition work for every case in the world? I’m not sure. But it helps me to define when we need to invest more time into optimizations and when we should avoid that.</p>
<p>Very likely, the backend software that you build in 2017 consists of several components. Some parts of the system might require more focused effort because of the highload nature that I mentioned. Other parts might be fine with trivial solutions copy-pasted from a tutorial.</p>
<p>Your goal as an engineer is to find the trade-off and figure this out in the best way for your business. It does not matter if you define the project as highload or not ;).</p>
<p><strong>Do you work on a highload project?</strong></p>
<p><strong>What’s your definition of highload?</strong></p>
<hr />
<p>P.S. I gave a few talks <strong>about highload</strong> some time ago, check them out if you’re interested in learning more on the topic:</p>
<ul>
<li>PyCon Ukraine 2016: <a href="/talks/#pycon-ukraine-2016" target="_blank">Maintaining a high load Python project for newcomers</a></li>
<li>PyCon Poland 2016: <a href="/talks/#pycon-poland-2016" target="_blank">Maintaining a high load Python project: typical mistakes</a></li>
</ul>
<h1>Pull Requests: The Good, The Bad and The Ugly</h1>
<p>2017-08-24, http://allyouneedisbackend.com/blog/2017/08/24/pull-requests-good-bad-and-ugly</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/the-good-the-bad-the-ugly-logo-small.png" alt="Pull Requests: The Good, The Bad and The Ugly" class="image-right" width="250" height="345" layout="fixed">
</amp-img>
</div>
<p>I remember that at my first paid software job we did not have a <a href="https://en.wikipedia.org/wiki/Version_control" target="_blank">version control system</a>: all code was uploaded to a folder on an FTP server.</p>
<p><a href="https://git-scm.com/" target="_blank">Source Code Management</a> was not very sophisticated: we just had the previous revision of the most important source files suffixed with <code class="highlighter-rouge">.old</code> in the same directory. Now having <em>‘just Git’</em> without a fancy dashboard like the one supplied by <a href="https://bitbucket.org/" target="_blank">Bitbucket</a>, <a href="https://github.com/" target="_blank">GitHub</a>, or <a href="https://gitlab.com" target="_blank">GitLab</a> no longer looks sufficient to me. The tools save a lot of time and are extremely convenient.</p>
<p>Working on our product, a software engineer submits pull requests almost every day. During the last year, I spent approximately 200 hours doing code review - it’s more than 1 month of work!</p>
<p><strong>I believe that merging good pull requests and declining ugly ones is essential for the success of your product</strong>.</p>
<p>What about bad ones? Well, we can do some work to make them either good or ugly. Let’s review the examples representing different aspects of a pull request. Some ideas are explained using <code class="highlighter-rouge">Python</code>, but they are applicable to any other <em>non-<a href="https://en.wikipedia.org/wiki/Esoteric_programming_language" target="_blank">esoteric</a></em> programming language.</p>
<!--more-->
<hr />
<h2 id="writing-description-for-a-pr">Writing description for a PR</h2>
<p><strong>Ugly.</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>No description (tumbleweed)
</code></pre></div></div>
<p>The name of a PR and its description are the first things your teammate sees. She or he has kindly switched from the current task to help you build high-quality software. Love your teammates. Do not let them be distracted. Be a good teammate and drop a line.</p>
<p><strong>Bad.</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Security fixes
--------------
Make component XXX protected from the OWASP Top Security Risks:
* A-1
* A-3
* A-8
</code></pre></div></div>
<p>Doesn’t it look better than the previous one? Of course it does! The information helps to understand which teammates should be involved in the code review process: now we can see that the review is not mandatory for people without expertise in the software security field. This is very helpful within large teams.</p>
<p>On some projects, such a description is <em>good enough</em>, but I think that we should do better: this description still leaves a reviewer without context.</p>
<p><strong>Good.</strong></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>JRA-2017: Add protection from some of Top OWASP 2017 Security Risks. Part 1
--------------
Make component XXX protected from the risks:
* A-1. Injection. Partial support: only SQL injection.
* A-3. Cross-Site Scripting (XSS)
* A-8. Cross-Site Request Forgery (CSRF)
Link: https://www.owasp.org/index.php/Top_10_2017-Top_10
</code></pre></div></div>
<p>You can notice the improvements in the description:</p>
<ul>
<li>The PR is linked to a ticket in your issue tracking system. JIRA and Trello are the industry leaders. Bitbucket, GitHub, and GitLab kindly provide integrations with the tools. I recommend using them, it’s very easy: you just need to prefix the name of the PR with the ticket’s key, and a link to the issue is rendered automatically. This is VERY helpful for reviewers.</li>
<li>We’re more specific about the version of the OWASP document that is related to the task, and we provided a link to the needed revision. Now our teammates do not need to google it, make assumptions, or ask additional questions.</li>
<li>We explained A-1, A-3, and A-8. Now the risks which are addressed in the pull request are explicitly specified. We also noted that for the <code class="highlighter-rouge">Injection</code> part only SQL injection is addressed.</li>
</ul>
<p>The way from an Ugly description to a Good one takes two minutes of your time but saves 10-15 minutes for each teammate.</p>
<h2 id="code-design">Code design</h2>
<p>This section is not a place for holy wars about different programming paradigms. But if you’re interested in a general overview - check out <a href="http://cs.lmu.edu/~ray/notes/paradigms/" target="_blank">the article from LMU</a>. We want to look into code design from a high-level perspective.</p>
<p><strong>Ugly.</strong> <em>Add a new piece of code ignoring style of everything around that.</em></p>
<p>In Python, it’s very easy since the language supports multiple programming paradigms. At some point in my career, I came across a PR where, inside code organized according to OOP guidelines, somebody had written a couple of layers of lambda functions without any justification. Well, the code was mine and I feel embarrassed about that :). It was a stupid thing - I still remember the day when I needed to make a change in that module.</p>
<p><strong>Bad.</strong> <em>The new code is formatted in the same fashion as the previous code.</em></p>
<blockquote>
<h4 id="what-why-its-considered-as-bad-viach-are-you-mad">What? Why is it considered bad, Viach? Are you mad?</h4>
</blockquote>
<p>Having new code that looks the same as the previous code is essential, but not enough to be good. Some reviewers agree to merge new code considering only that metric. It leads to trouble when poor but well-formatted code is deployed and causes unexpected behavior in production.</p>
<p><strong>Good.</strong> <em>Code follows the architectural principles accepted in the project/repo.
New functionality is implemented by reusing existing components or by new abstractions which do not contradict the guidelines.</em></p>
<p>Maintaining different styles of doing the same thing is an unacceptable luxury for commercial software projects. Avoid growing the project’s toolset without growing its functional capabilities. There should be one and only one way to do a thing.</p>
<h2 id="automated-testing">Automated testing</h2>
<p><strong>Ugly.</strong> <em>Submitting a PR without tests</em>.</p>
<p>New functionality should be supplied with tests, and bug fixes should be covered with tests. If you do not have time to write tests today - you will find the time to fix bugs on Friday night :(. Negotiate this with your management and include time for writing tests in your schedule.</p>
<p><strong>Bad.</strong> <em>Making a PR which has <strong>some</strong> tests</em>.</p>
<p>This is a dangerous spot. Reviewers might be confused.</p>
<blockquote>
<h4 id="oh-we-have-some-tests-looks-like-its-ok-to-merge-it">Oh, we have some tests, looks like it’s OK to merge it.</h4>
</blockquote>
<p>Having exactly one test per function is not a rule. Sometimes you need more, sometimes you do not need tests at all (for example, if you’re just wrapping functions written by others).</p>
<p><strong>Good.</strong> <em>Providing comprehensive tests both for the happy path and corner cases for the new/changed functionality in your PR</em>. Consider writing tests for the API of your functions before the actual implementation. I like this trick a lot since it has helped me to make my designs better.</p>
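<p>To make "happy path plus corner cases" more concrete, here is a minimal sketch with the standard <code class="highlighter-rouge">unittest</code> module. The function <code class="highlighter-rouge">split_full_name</code> and its behavior are hypothetical, invented only for this illustration; your own corner cases will differ.</p>

```python
import unittest


def split_full_name(full_name):
    """Hypothetical function under test: split 'First Last' into a tuple."""
    parts = full_name.strip().split()
    if not parts:
        raise ValueError('full_name must not be empty')
    return parts[0], ' '.join(parts[1:])


class SplitFullNameTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(split_full_name('Ada Lovelace'), ('Ada', 'Lovelace'))

    # Corner cases: the inputs which usually show up in production first.
    def test_single_word_has_empty_last_name(self):
        self.assertEqual(split_full_name('Plato'), ('Plato', ''))

    def test_extra_whitespace_is_ignored(self):
        self.assertEqual(split_full_name('  Ada   Lovelace '), ('Ada', 'Lovelace'))

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            split_full_name('   ')
```

<p>Saved in a test module, it can be run with <code class="highlighter-rouge">python -m unittest</code>. Writing the four test names first is a cheap way to notice that the API needs to define what happens on empty input.</p>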
<h2 id="refactoring-of-existing-code">Refactoring of existing code</h2>
<p><strong>Ugly.</strong> <em>Refactoring of the code which is not covered by tests.</em></p>
<blockquote>
<h4 id="wtf-refactoring-without-automated-tests-aaaaaaaah">WTF? Refactoring without automated tests? AAAAAAAAH</h4>
</blockquote>
<p>Yes, it happens. And this is definitely ugly. Do not do that.</p>
<p><strong>Bad.</strong> <em>Doing a massive refactoring as a part of some other feature ticket</em>.</p>
<p>With the best intentions, you made the code more readable and maintainable while working on your current task. Great job! But why might your business not be happy? Firstly, delivery of the original task was delayed because you worked on refactoring another part of the code. Secondly, other engineers might be working on a change in the same place/file. They might not be happy to rebase their working branch and unexpectedly consume your changes.</p>
<p><strong>Good.</strong> <em>Extracting the refactoring into a separate ticket. Negotiating the change with the team. Doing the work as a part of the Technical Debt effort.</em></p>
<p>Refactoring is a good thing, but your team might not be ready for the goodness. Speak up and prioritize the work.</p>
<h2 id="maintaining-docstrings">Maintaining docstrings</h2>
<p>In Python usage of <a href="https://en.wikipedia.org/wiki/Docstring" target="_blank">docstrings</a> is a thing. Documentation for libraries or HTTP API endpoints can be generated using them.</p>
<p><strong>Ugly.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">do_complex_thing</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="s">"""
TODO: add a docstring
"""</span>
</code></pre></div></div>
<p>My recommendation: <em>‘public’</em> functions with more than 30 LOC should have docstrings. They can be used to explain non-obvious implementation details or to link the code with the related ticket in the issue tracking system.</p>
<p><strong>Bad.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">multiply</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="s">"""Adds two integers.
Args:
a (str): The first parameter.
b (int): The second parameter.
Returns:
int: Sum.
"""</span>
<span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
</code></pre></div></div>
<p>In this example, you can see one of the sins of developers: <strong>copypasta</strong>. The name of the function, the code, and the docstring do not match. It can be very confusing for other contributors.</p>
<p>Possible interpretations of what the function should do:</p>
<ul>
<li>The function should return <em>the product of 2 integers</em>. In this instance, the docstring should be fixed: the description of the function, the type of the first parameter, and the description of the returned value.</li>
<li>The function should return <em>the sum of 2 integers</em>. The function name should be <code class="highlighter-rouge">sum</code> and the code has a typo: <code class="highlighter-rouge">*</code> must be replaced with <code class="highlighter-rouge">+</code>.</li>
<li>The function should return <em>a string repeated several times</em>. In that case, the docstring should be improved: the description of the function and the type/description of the returned value.</li>
</ul>
<p>Any docstring which is outdated or inaccurate makes your code harder to maintain.</p>
<blockquote>
<h4 id="omg-its-not-at-an-ice-cream-do-not-want">OMG! It’s not at an ice cream! Do not want!</h4>
</blockquote>
<p><strong>Good.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">multiply</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="s">"""Repeats a string multiple times.
Args:
s (str): name to repeat.
n (int): multiplier.
Examples:
>>> multiply('Backend', 2)
'BackendBackend'
>>> multiply('Omn', 3)
'OmnOmnOmn'
"""</span>
<span class="k">return</span> <span class="n">s</span> <span class="o">*</span> <span class="n">n</span>
</code></pre></div></div>
<p>A few reasons why this version is better:</p>
<ul>
<li>the docstring contains actual information about the function</li>
<li>we have an example of usage for the code</li>
<li>we use <a href="https://www.python.org/dev/peps/pep-3107/" target="_blank">type annotations</a> to specify types of parameters and returned value. They are used not only to explain the code but are also consumed by static code analyzers.</li>
</ul>
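<p>One more benefit of examples in a docstring: they can be executed, so they cannot silently go stale. A minimal sketch with the standard <code class="highlighter-rouge">doctest</code> module; the helper name <code class="highlighter-rouge">check_docstring_examples</code> is my own, not part of any library.</p>

```python
import doctest


def multiply(s: str, n: int) -> str:
    """Repeats a string multiple times.

    Examples:
        >>> multiply('Backend', 2)
        'BackendBackend'
        >>> multiply('Omn', 3)
        'OmnOmnOmn'
    """
    return s * n


def check_docstring_examples(func) -> doctest.TestResults:
    """Run the examples found in func's docstring and return the totals."""
    runner = doctest.DocTestRunner(verbose=False)
    # globs makes the function visible by name inside its own examples.
    for test in doctest.DocTestFinder().find(func, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = check_docstring_examples(multiply)
print(results.failed, results.attempted)
```

<p>In a real project you would more likely call <code class="highlighter-rouge">python -m doctest your_module.py</code> from CI; the helper above only shows what happens under the hood.</p>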
<h2 id="tracking-and-tracing">Tracking and tracing</h2>
<p>Logging and sending metrics is fertile ground for awkward moments.</p>
<p><strong>Ugly.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'DEBUG: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">password</span><span class="p">))</span>
</code></pre></div></div>
<p>Have you submitted something like that to review? I’ve done it :) Luckily, it was caught by my teammates. Yes, having users’ credentials logged somewhere is not very cool, and neither is having <code class="highlighter-rouge">print</code> calls in your production code.</p>
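<p>If you do need to log a payload that may contain credentials, one option is to mask the sensitive fields first. This is only a sketch: the set of field names in <code class="highlighter-rouge">SENSITIVE_KEYS</code> and both function names are assumptions made for the example.</p>

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical list of field names that must never reach the logs.
SENSITIVE_KEYS = frozenset({'password', 'token', 'secret'})


def mask_secrets(payload: dict) -> dict:
    """Return a copy of payload which is safe to log."""
    return {
        key: '***' if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }


def log_login_attempt(payload: dict) -> None:
    # Debug level: not emitted in production unless explicitly enabled.
    logger.debug('Login attempt: %s', mask_secrets(payload))
```

<p>For example, <code class="highlighter-rouge">mask_secrets({'user': 'alice', 'password': 'hunter2'})</code> keeps the user name and replaces the password with <code class="highlighter-rouge">'***'</code>.</p>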
<p><strong>Bad.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
<span class="c"># reading from database</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s">'Error during retrieving data from database.'</span><span class="p">)</span>
<span class="c"># doing some low-level task</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s">'DevVM only: the function XXX is called with parameter </span><span class="si">%</span><span class="s">s'</span><span class="p">)</span>
</code></pre></div></div>
<p>Here we do not use logging levels properly. For the first message, we should use <code class="highlighter-rouge">logger.error</code> and for the second one - <code class="highlighter-rouge">logger.debug</code>. Also, I think that in some cases it’s reasonable to have several loggers for different parts of your application. See the example below.</p>
<p><strong>Good.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">logging.config</span>
<span class="c"># Specifying several loggers.</span>
<span class="c"># Example is taken from aiohttp</span>
<span class="n">access_logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'aiohttp.access'</span><span class="p">)</span>
<span class="n">client_logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'aiohttp.client'</span><span class="p">)</span>
<span class="n">internal_logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'aiohttp.internal'</span><span class="p">)</span>
<span class="n">server_logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'aiohttp.server'</span><span class="p">)</span>
<span class="c"># Customization of logger</span>
<span class="n">logging</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">dictConfig</span><span class="p">({</span>
<span class="s">'version'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s">'disable_existing_loggers'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
<span class="s">'formatters'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'standard'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'format'</span><span class="p">:</span> <span class="s">'</span><span class="si">%(asctime)</span><span class="s">s [</span><span class="si">%(levelname)</span><span class="s">s] </span><span class="si">%(name)</span><span class="s">s: </span><span class="si">%(message)</span><span class="s">s'</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="s">'handlers'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'default'</span><span class="p">:</span> <span class="p">{</span>
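<span class="s">'level'</span><span class="p">:</span><span class="s">'INFO'</span><span class="p">,</span>
<span class="s">'class'</span><span class="p">:</span><span class="s">'logging.StreamHandler'</span><span class="p">,</span>
<span class="s">'formatter'</span><span class="p">:</span> <span class="s">'standard'</span><span class="p">,</span>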
<span class="p">},</span>
<span class="p">},</span>
<span class="s">'loggers'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">''</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'handlers'</span><span class="p">:</span> <span class="p">[</span><span class="s">'default'</span><span class="p">],</span>
<span class="s">'level'</span><span class="p">:</span> <span class="s">'INFO'</span><span class="p">,</span>
<span class="s">'propagate'</span><span class="p">:</span> <span class="bp">True</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<p>Customizing the logger’s configuration and having per-component loggers are good.
I recommend reading the <a href="https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/" target="_blank">interesting post</a> about proper logging in Python and the official <a href="https://docs.python.org/3/howto/logging.html" target="_blank">logging tutorial</a>.</p>
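<p>For completeness, here is how the two messages from the <em>Bad</em> example might look with proper levels. It is a sketch under assumptions: the logger name <code class="highlighter-rouge">myapp.storage</code> and the function <code class="highlighter-rouge">fetch_user</code> are hypothetical, and a plain dict stands in for the database.</p>

```python
import logging

logger = logging.getLogger('myapp.storage')  # hypothetical component name


def fetch_user(storage: dict, user_id: int):
    """Read a user from a dict which stands in for a database."""
    # Low-level detail: debug level, with the argument passed lazily.
    logger.debug('fetch_user is called with user_id=%s', user_id)
    try:
        return storage[user_id]
    except KeyError:
        # A failure: error level; logger.exception also records the traceback.
        logger.exception('Error during retrieving user_id=%s', user_id)
        return None
```

<p>Passing <code class="highlighter-rouge">user_id</code> as an argument instead of formatting the string yourself means the message is only built when the level is actually enabled.</p>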
<h2 id="making-ux-changes">Making UX changes</h2>
<blockquote>
<h4 id="backend-interview-well-you-balanced-the-tree-really-great">Backend interview: Well, you balanced the tree really great.</h4>
<h4 id="work-could-you-please-move-the-button-3px-to-the-left">Work: Could you please move the button 3px to the left?</h4>
</blockquote>
<p>This blog is about backend software engineering, but sometimes we need to modify web UI. Since frontend is not one of my core competencies, I cannot provide an <em>Ugly</em> example :).</p>
<p><strong>Bad.</strong> <em>The description of a pull request with UX changes does not differ from a backend-only one.</em>
I personally do not like this approach, because it’s not obvious what’s changed. But in teams which make UX changes on a daily basis, it might be okay.</p>
<p><strong>Good.</strong> <em>A PR has a screenshot attached.</em> If we have only one screenshot, it usually shows the new state of UX.</p>
<p><strong>Awesome.</strong> <em>Pull Request has images which show the UX both before and after</em>. I think that it’s fine to be a bit verbose. If I were your teammate I would be thankful for that. Note that UX changes may occur not only when you edit HTML or CSS directly.</p>
<h2 id="summary">Summary</h2>
<p>Pull requests are about teamwork. If you’re not interested in the opinions of your fellows - there is no need for PRs, right? You can just merge your branch into master.</p>
<p>A team is healthy if every contributor cares about making code readable and maintainable by others. Notice that the way from <strong>Ugly</strong> to <strong>Bad</strong> pull requests is not hard, but moving from <strong>Bad</strong> to <strong>Good</strong> ones is the key to success in making commercial software projects.</p>
<p><strong>What are your recommendations related to making good pull requests?</strong></p>
<hr />
<p><strong>🌮 Tacos Delivery Over HTTP/2</strong> (2017-08-20, <a href="http://allyouneedisbackend.com/blog/2017/08/20/tacos-delivery-over-http2" target="_blank">http://allyouneedisbackend.com/blog/2017/08/20/tacos-delivery-over-http2</a>)</p>
<div class="image-wrapper">
<amp-img media="(min-width: 550px)" src="http://d1vt1c82ljabfd.cloudfront.net/images/taco-delivery-over-http-latest.png" alt="Tacos Delivery Over HTTP/2" class="image-right" width="250" height="345" layout="fixed">
</amp-img>
</div>
<p>Recently I looked into <a href="https://en.wikipedia.org/wiki/HTTP/2" target="_blank">HTTP/2</a> and its comparison with <a href="https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol" target="_blank">HTTP/1.1</a>. The adoption of the technology is growing - 16.1% of top Alexa websites already use the latest version of the protocol.</p>
<p>I wanna understand HTTP/2 better. For now, I do not see any project where I could apply the technology in production. But we all know that another great way to learn something is to teach somebody else. It so happened that y’all were selected as the audience for that :)</p>
<p>In teaching and learning, it’s vital to keep things interesting.</p>
<p>Tacos are definitely not boring.
In this blog post, we will try to imagine that we live in a world where <strong>web servers deliver tacos instead of HTML pages</strong>. Let’s contemplate the pros of serving this delicious <a href="https://en.wikipedia.org/wiki/Tex-Mex" target="_blank">Tex-Mex</a> food over HTTP/2 instead of regular HTTP/1.1.</p>
<p><em>Do not read it if you’re hungry!</em></p>
<!--more-->
<hr />
<h2 id="one-day-in-the-life-of-a-taco">One day in the life of a taco</h2>
<p>The process of taco delivery is not very straightforward. Let’s review it:</p>
<ol>
<li>You realize that you’re hungry.</li>
<li>You choose the taco which is the best for you.</li>
<li>You ask a chef to cook it.</li>
<li>The chef cooks the food for you.</li>
<li>You wait and poll the chef for the status of your order since you’re hungry.</li>
<li>It’s ready! You provide the delivery address and instructions, like a gate code, etc.</li>
<li>The chef gives your order to a courier.</li>
<li>You wait and poll the courier for the status of your order since you’re very hungry…</li>
<li>The courier arrives and brings your tacos. But it’s not the end.</li>
<li>You pay the courier for the service with cash. Now you can eat. And you eat.</li>
<li>The courier takes your money and brings it to the Taco Shop.</li>
<li>You find out that you need more tacos because you’re still hungry… Go to step 2 :)</li>
</ol>
<p>In this exaggerated example, you can easily see the request/response pattern: your action leads to a reaction from the other side. And you cannot initiate another action until the previous one is processed (well, it depends).</p>
<p>In the next three sections, we will review which techniques from HTTP/1.1, HTTP/2 and HTTPS can be used to improve your experience as a taco-eater. Each point below has the following structure: an ironic example from Tex-Mex cuisine and a reference to a real feature of the protocol.</p>
<h2 id="http1x-as-a-baseline">HTTP/1.x as a baseline</h2>
<p>If your Taco Shop uses HTTP/1.0, you cannot order the next taco until the chef has finished cooking the previous one. But <strong>with HTTP/1.1 you can order a bacon and eggs taco and, right after that, a migas one without waiting for the first one to be cooked</strong>. Note, they <strong>must be</strong> served in the order in which they were requested, even if for some reason another order would be more convenient for you. This feature of HTTP/1.1 is called <a href="https://en.wikipedia.org/wiki/HTTP_pipelining" target="_blank">HTTP pipelining</a>.</p>
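<p>The price of this "predefined order" rule can be shown with a few lines of Python. The sketch below is a toy model, not a protocol implementation: it assumes the chef cooks all ordered tacos in parallel, and the cook times are made up.</p>

```python
from itertools import accumulate


def pipelined_delivery_times(cook_times):
    """HTTP/1.1 pipelining: responses must come back in request order."""
    # A finished taco has to wait until every taco ordered before it is served,
    # so its delivery time is the maximum cook time seen so far.
    return list(accumulate(cook_times, max))


def unordered_delivery_times(cook_times):
    """Each taco is served as soon as it is ready (no ordering constraint)."""
    return list(cook_times)


orders = [10, 2, 3]  # minutes to cook three hypothetical tacos
print(pipelined_delivery_times(orders))   # [10, 10, 10]
print(unordered_delivery_times(orders))   # [10, 2, 3]
```

<p>The two quick tacos are stuck behind the slow one. This effect is usually called head-of-line blocking.</p>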
<p><strong>The number of tacos delivered at the same time by one Taco Shop to the same address is limited</strong>. We have a state law (<a href="https://www.ietf.org/rfc/rfc2616.txt" target="_blank">RFC-2616</a>) that limits this to 2, but most modern Taco Shops do not respect it, so the actual value is about 6. When people need moar tacos they work around this by splitting their requests between different taco vendors or locations. Taco Shops open affiliated branches to satisfy the rule, like <em>pork.taco.shop</em>, <em>chicken.taco.shop</em>, and <em>vegan.taco.shop</em>. In that case, businesses need to maintain all these locations. This workaround from the HTTP/1.1 world is well-known as <a href="https://blog.stackpath.com/glossary/domain-sharding/" target="_blank">Domain Sharding</a>.</p>
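<p>A bit of arithmetic shows why the workaround is tempting. The function below is a back-of-the-envelope sketch; it assumes that each connection carries one taco at a time and that every taco takes the same time, which is a deliberate simplification.</p>

```python
from math import ceil


def delivery_rounds(tacos: int, per_host_limit: int = 6, hosts: int = 1) -> int:
    """Number of sequential 'rounds' needed to fetch all tacos."""
    if tacos < 0 or per_host_limit < 1 or hosts < 1:
        raise ValueError('tacos must be >= 0, limits and hosts must be >= 1')
    # Every host contributes per_host_limit parallel connections.
    return ceil(tacos / (per_host_limit * hosts))


print(delivery_rounds(30, per_host_limit=2))  # 15 rounds with the limit of 2
print(delivery_rounds(30))                    # 5 rounds with about 6 connections
print(delivery_rounds(30, hosts=3))           # 2 rounds with three sharded domains
```

<p>The model ignores the cost of the extra DNS lookups and connections which every additional domain brings.</p>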
<p><strong>Tacos taste the best with sauces</strong>. If your Taco Shop uses HTTP/1.x, they send you all sauces even if you do not need some of them. That’s it: if a Taco Shop serves 12 types of tacos and each of them is served with a different sauce - they give you all of them, even if your order consists of only 1 taco. This is done to decrease the number of sauce requests. The downside is that a courier needs to carry more stuff than you need, which delays the delivery of tacos. The example refers to <a href="https://www.w3schools.com/css/css_image_sprites.asp" target="_blank">Image Sprites</a>.</p>
<p><strong>A courier can carry only one plate with food at a time</strong>. So, if you have a party, all your orders are placed together for the fastest delivery. Yes, a vegan taco, a pork one, and chile con queso can be touching each other on the plate. It might not be very convenient :). I refer to <a href="https://hacks.mozilla.org/2012/12/fantastic-front-end-performance-part-1-concatenate-compress-cache-a-node-js-holiday-season-part-4/" target="_blank">the concatenation of CSS and JS files</a> and <a href="https://varvy.com/pagespeed/inline-small-css.html" target="_blank">assets inlining into HTML pages</a>.</p>
<h2 id="why-http2-can-feed-you-better">Why HTTP/2 can feed you better</h2>
<p><strong>You need fewer words to explain your needs</strong>. The Taco Shop understands you from half a word. It definitely saves time in the long run. Yay, <a href="https://http2.github.io/faq/#why-do-we-need-header-compression" target="_blank">Header compression</a>!</p>
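<p>The "half word" idea can be sketched as a table of headers which both sides remember. Please treat it as a toy model of the indexing idea only: it is not the real HPACK wire format, and the class name and header values are invented for the example.</p>

```python
class ToyHeaderTable:
    """Remembers headers which were already sent on a connection."""

    def __init__(self):
        self._index = {}

    def encode(self, headers):
        encoded = []
        for header in headers:
            if header in self._index:
                # Seen before: send only a small integer reference.
                encoded.append(self._index[header])
            else:
                # First time: send the full (name, value) pair and remember it.
                self._index[header] = len(self._index)
                encoded.append(header)
        return encoded


table = ToyHeaderTable()
first = table.encode([('user-agent', 'TacoBrowser/1.0'), ('accept', 'taco/*')])
second = table.encode([('user-agent', 'TacoBrowser/1.0'), ('accept', 'salsa/*')])
print(second)  # [0, ('accept', 'salsa/*')]
```

<p>On the second request the long, unchanged header shrinks to a single number; only the header which changed is sent in full.</p>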
<p><strong>Your tasks can be prioritized</strong>. You can manage the order in which you get your tacos; it’s not necessarily first-in-first-out. This is done thanks to <a href="https://chadaustin.me/2014/10/http2-request-priorities-a-summary/" target="_blank">Streams and Prioritization</a>.</p>
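<p>A very rough way to picture prioritization is a priority queue in the kitchen. The sketch below uses <code class="highlighter-rouge">heapq</code> and strict ordering by weight; real HTTP/2 prioritization works with stream dependencies and weights which share resources rather than strictly reorder, so take it as an illustration of the idea only. The function name and the orders are made up.</p>

```python
import heapq
from itertools import count


def serve_by_priority(orders):
    """Yield order names, the most important first.

    orders is an iterable of (weight, name) pairs; a larger weight means
    the customer wants it sooner. Ties keep their arrival order.
    """
    arrival = count()
    # heapq is a min-heap, so the weight is negated to pop the largest first.
    heap = [(-weight, next(arrival), name) for weight, name in orders]
    heapq.heapify(heap)
    while heap:
        _, _, name = heapq.heappop(heap)
        yield name


orders = [(1, 'dessert taco'), (10, 'migas'), (10, 'bacon and eggs'), (5, 'queso')]
print(list(serve_by_priority(orders)))
# ['migas', 'bacon and eggs', 'queso', 'dessert taco']
```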
<p><strong>If all customers enjoy the taco you order more with some additional dip, you get the dip without asking</strong>. If you do not want it today, or have an allergy, you can easily reject it. Convenient, right? In contrast, in the HTTP/1.x world you need to be explicit about all your needs. If you forget to ask for salsa or guacamole, you just do not get it, even if the chef knows that all customers prefer them served with the tacos. This is an allusion to <a href="https://en.wikipedia.org/wiki/HTTP/2_Server_Push" target="_blank">Server Push</a> from the HTTP/2 world. Do not confuse it with Server-Sent Events and WebSockets.</p>
<p><strong>The number of tacos delivered to one address is not limited by the state law (<a href="https://www.ietf.org/rfc/rfc2616.txt" target="_blank">RFC-2616</a>)</strong>. Now a courier can bring as many tacos as you need. Therefore, Taco Shops do not need to have other locations to serve different types of tacos. With <a href="http://qnimate.com/what-is-multiplexing-in-http2/" target="_blank">multiplexing support</a> in HTTP/2 we use a single TCP connection for all requests. This means that there is no need for <a href="https://blog.stackpath.com/glossary/domain-sharding/" target="_blank">Domain Sharding</a> as there was with HTTP/1.1.</p>
<p><strong>Each taco is placed on a separate plate</strong>. Putting all the food for one order together in one box does not make any sense now. No need for inlining.</p>
<p><strong>A courier brings the sauces only for the tacos which you ordered</strong>. No need for CSS spriting. We lazily load only the data which is really needed.</p>
<h2 id="a-few-words-about-https">A few words about HTTPS</h2>
<p>This topic is worth a separate blog post. For now, I wanna limit it to the context of tacos. You can have secured HTTP connections with both versions of the HTTP protocol: HTTP/1.1 and HTTP/2. It provides the following benefits for us as taco-eaters.</p>
<p><strong>Nobody knows what you order except your Taco Shop</strong>. If you tell all your friends that you’re a vegan and wanna keep the image, but like to eat pork tacos secretly: you definitely need to keep your requests encrypted. Encryption is optional for both HTTP/1.1 and HTTP/2, but in practice HTTP/2 is almost always used over TLS, since major browsers support it only that way. This is also known as Confidentiality.</p>
<p><strong>It’s ensured that your deliveries include all the things which you ordered and only them</strong>. No substitutions, and nobody can eat or even bite your taco. Or add an ingredient which you do not like or are allergic to. You always know that it’s cooked by the certified Chef whom you trust. People in suits call this Authenticity.</p>
<p><strong>You pay for tacos and you want the money to be delivered to the Taco Shop which kindly served you</strong>. If your HTTP connection is secured, nobody can steal your bucks. Integrity makes this happen.</p>
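<p>On the client side, the choice between HTTP/2 and HTTP/1.1 over a secured connection is typically negotiated during the TLS handshake via ALPN. A minimal sketch with the standard <code class="highlighter-rouge">ssl</code> module; the function name is my own, and no connection is opened here.</p>

```python
import ssl


def make_taco_client_context() -> ssl.SSLContext:
    """TLS client settings which offer HTTP/2 first and HTTP/1.1 as a fallback."""
    # The default context verifies the server certificate and the hostname.
    context = ssl.create_default_context()
    # ALPN lets client and server agree on the protocol during the handshake.
    context.set_alpn_protocols(['h2', 'http/1.1'])
    return context


context = make_taco_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

<p>After wrapping a socket with this context and completing the handshake, <code class="highlighter-rouge">selected_alpn_protocol()</code> on the TLS socket tells which protocol the server picked.</p>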
<h2 id="summary">Summary</h2>
<p>Delivering tacos over HTTP/2 provides better service to customers thanks to features like Multiplexing, Streams, Prioritization, and Server Push. At the same time, it simplifies business for Taco Shops: they can remove the workarounds introduced for HTTP/1.x, which slow down development and increase the cost of maintenance.</p>
<p>Internet giants like Facetaco, Buritter, and YourTexMex already use the newest standard. But this fact does not mean you should blindly follow them. Changes like that require investment in your backend.</p>
<p>I’ve taken a quick look into technologies which can help your product if you decide to make the step toward HTTP/2. From the Python side I’ve got the following:</p>
<ul>
<li><a href="http://twistedmatrix.com/pipermail/twisted-python/2016-July/030535.html" target="_blank">Twisted</a> supports it starting from version 16.3. Who uses Twisted in 2017? Asynchronous Python projects which were started a decade ago and are locked into the technology. I used Twisted in 2014-2016.</li>
<li><a href="https://python-hyper.org/h2/en/stable/" target="_blank">Hyper-h2</a> is an HTTP/2 protocol stack, written entirely in Python. The goal of the project is to be a common HTTP/2 stack for the Python ecosystem, usable in all programs regardless of concurrency model or environment. Never tried this, but looks interesting: they have examples for <a href="https://python-hyper.org/projects/h2/en/stable/asyncio-example.html" target="_blank">asyncio</a>, <a href="https://python-hyper.org/projects/h2/en/stable/twisted-example.html" target="_blank">Twisted</a>, <a href="https://python-hyper.org/projects/h2/en/stable/eventlet-example.html" target="_blank">Eventlet</a>, <a href="https://python-hyper.org/projects/h2/en/stable/curio-example.html" target="_blank">Curio</a>, <a href="https://python-hyper.org/projects/h2/en/stable/tornado-example.html" target="_blank">Tornado</a>, and <a href="https://python-hyper.org/projects/h2/en/stable/wsgi-example.html" target="_blank">WSGI</a>.</li>
<li><a href="https://github.com/django/daphne/issues/30" target="_blank">Django</a> supports HTTP/2 through Daphne, the server from the Django Channels project.</li>
<li>aiohttp does not support HTTP/2 yet.</li>
</ul>
<p>Anyway, it’s recommended to put a reverse-proxy on the edge, in front of your HTTP server written in Python or another programming language. The reverse-proxy keeps HTTP/2 connections with clients and establishes HTTP/1.x connections with the web server which you built.</p>
<p>The most popular reverse-proxies:</p>
<ul>
<li>Open Source version of <a href="https://www.nginx.com/blog/nginx-1-9-5/" target="_blank">NGINX</a> supports HTTP/2.</li>
<li>AWS added HTTP/2 support to its Application Load Balancer.</li>
<li>Apache HTTP server supports the protocol in <a href="https://httpd.apache.org/docs/2.4/mod/mod_http2.html" target="_blank">mod_http2</a>.</li>
<li><a href="https://info.varnish-software.com/blog/how-its-made-varnish-5.0-and-http/2" target="_blank">Varnish</a> has experimental support since 2016.</li>
<li><a href="http://haproxy.formilux.narkive.com/DICXg7vW/announce-haproxy-1-7-dev6#post2" target="_blank">HAProxy</a> does not have official support yet.</li>
</ul>
<p>I think that switching from HTTP/1.1 to HTTP/2 should be doable. Well, I’m done with the topic for today. Time to order some tacos.</p>
<hr />
<p><strong>What is your favorite taco?</strong></p>
<p><strong>Sorry, I meant, do you use HTTP/2 in production?</strong></p>