Part of the programmer job description is to fix crashes. A good rule of thumb is that any bug which can be reproduced can be fixed. The problem, of course, comes when you can’t reproduce the bug.
Today I want to address a specific breed of crash — the retail server crash. By retail I mean you’ve passed automated testing, passed the QA team, and have shipped a crash to paying customers. If this is the case and you can’t get a repro in-house then it’s probably a pretty gnarly issue.
To help diagnose and fix these nasty crashes I propose that a full crash dump should be saved and stored. Hold on a second. That’s crazy talk! A full dump could be multiple gigabytes of data. Is it crazy though? I don’t think it is.
Now let me convince you why.
Getting Started — Breakpad
For full crash dumps to be worth saving there are a few pre-requisites. First of all you need to be able to save them!
Breakpad is a Google library that lets you save and upload mini-dumps on Windows, OS X, and Linux [1]. In typical use the dumps are small: only a few hundred kilobytes to a few megabytes. These dumps do not provide a ton of data, but you do get a full callstack, which for many bugs is all you need.
If you aren’t using Breakpad, or an equivalent, then you should do so right away. I recall it being a bit of a pain to get set up and integrated into our build pipeline, but once configured it doesn’t need much maintenance. (learn more)
Source Indexing + Symbol Server
When you load a mini-dump into the debugger, wouldn’t it be great if the callstack information had proper function names instead of hex addresses? Wouldn’t it be even better if those functions were linked to the version of source code used to compile the executable that generated the dump? With a one-time setup of source indexing and a symbol server, that’s exactly what you get.
The end result is you load a mini-dump into Visual Studio, hit F5, click the callstack tab, and voila. It “just works”. The time from downloading a mini-dump to getting useful information is seconds.
If you haven’t set this up yet then I strongly encourage doing so. Detailed information can be found in a pair of articles written by Bruce Dawson. (Source Indexing is Underused Awesomeness) (Symbols the Microsoft Way)
Debugging Optimized Code
If you’ve ever tried to debug a release optimized build you know that it can have an annoying lack of information. Luckily Microsoft added a bMakeItWork compile flag that you can enable. Bruce also wrote about this in “Debugging Optimized Code – New in Visual Studio 2012” [2].
We’ve been using it for ~7 months and it’s been nothing short of fantastic. My experience has been that with the flag I have full, correct watch window information for all data on the heap. Local stack variables are also correct, but only when they haven’t been optimized away, which happens often.
Planetary Annihilation is fully dependency injected with no global variables [3]. When debugging a release crash on my local machine I can walk the stack until I find an accessible “root” pointer, such as SimWorld*. From there it’s easy to use the watch window to navigate to any entity in the game. Each entity and its components are then inspectable with complete information.
Save Full Dumps
Now we are ready to save a full dump. We want to save all of the crashed process’s memory. The whole thing. Every byte. Several gigabytes worth if need be.
The elephant in the room is cost. Isn’t storing gigabytes upon gigabytes of data expensive? You might be surprised! Amazon S3 serves as a great upper bound for storage costs. On S3 uploading data is free. Storing data costs 3 cents per GB per month. Downloading data costs 12 cents per GB [4].
Let’s assume a 1 gigabyte crash dump. For 15 cents you can upload it (free), store it for a month (3 cents), and download it once (12 cents).
You’d have to be crazy not to pay 15 cents for that! Programmers are well paid and the amount of time this saves is so enormous it should be one of the easiest decisions ever.
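The arithmetic above is easy to generalize to larger dumps or longer retention. The following is a minimal sketch that uses only the prices quoted in this post; the function and constant names are made up for illustration, and real S3 pricing changes over time.

```python
# Prices quoted in this post, in cents per GB. These are assumptions taken
# from the text, not current S3 pricing.
UPLOAD_CENTS_PER_GB = 0
STORAGE_CENTS_PER_GB_MONTH = 3
DOWNLOAD_CENTS_PER_GB = 12


def dump_cost_cents(size_gb, months_stored=1, downloads=1):
    """Return the cost in cents of uploading, storing, and fetching one dump."""
    upload = UPLOAD_CENTS_PER_GB * size_gb
    storage = STORAGE_CENTS_PER_GB_MONTH * size_gb * months_stored
    download = DOWNLOAD_CENTS_PER_GB * size_gb * downloads
    return upload + storage + download


def download_share(size_gb, months_stored=1, downloads=1):
    """Return the fraction of the total cost that comes from downloads."""
    total = dump_cost_cents(size_gb, months_stored, downloads)
    return DOWNLOAD_CENTS_PER_GB * size_gb * downloads / total
```

Calling `dump_cost_cents(1)` reproduces the 15 cent figure for a 1 GB dump kept for a month and downloaded once. A dump that is never downloaded only ever costs the storage fee.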
There have been times in my career where I’d have gladly paid thousands of dollars to trap a crash in the debugger. Now that we live in The Year of the Cloud that knowledge can be stored and accessed for a mere 15 cents. This, my friends, is the future.
Now this doesn’t solve all issues. Crashes often occur long after the root cause. A full dump can help you work backwards towards the source. That isn’t always enough, so full dumps aren’t a complete solution to all crashes ever [5], but they are an incredibly powerful first step.
Fun Story Time
In Planetary Annihilation we once had a server crash due to a NaN. If you’re familiar with NaNs then you know they are like a virus and spread to everything they touch. The patient-zero NaN was far, far removed from the actual crash.
With a full dump here’s what we see. The crash occurs due to a NaN in a weapon. The weapon aim bone pitch and yaw are both NaN. The weapon is attached to a unit. The owner unit does not have a NaN position or orientation. Its components are also all NaN free. The weapon is targeting another unit. The target unit does have both a NaN position and orientation! Its DynamicObj has partial NaN data. Velocity and orientation are non-NaN but position is NaN.
We’re getting closer. There is also NaN data in the navigation component. In fact there’s a lot of NaN data. The nav component is a member of a group, meaning multiple units were moving together. The group structure has NaN data all over the place. It’s hard to determine which one is patient-zero.
Further inspection reveals an oddity. The group has an airCoreDist of 0. That implies the group is at the center of the planet! That can’t be correct! What happens in Group::update if airCoreDist is 0? It… divides by zero!
Mystery solved. Or rather that mystery could have been solved in short order with a full dump. As it turns out this crash occurred in a time before we even considered saving full dumps. Working backwards from the weapon to the nav group took multiple patches and a great deal of time from more than a few people.
Would I have gladly paid 15 cents to fix that bug? Yes, yes I would.
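For anyone who hasn’t been bitten by this, here is a small sketch of how one bad division spreads. It is not the actual Group::update code, which isn’t shown in this post; `scaled_offset` and the arithmetic after it are hypothetical stand-ins. Python raises on float division by zero, so `ieee_divide` emulates the silent IEEE 754 behaviour you get from C++ floats.

```python
import math


def ieee_divide(a, b):
    """Divide like C++ floats do: x/0 yields inf or NaN instead of raising."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0 or math.isnan(a):
            return math.nan  # 0/0 and NaN/0 are NaN
        # Non-zero divided by zero is an infinity with the combined sign.
        return math.copysign(math.inf, a) * math.copysign(1.0, b)


def scaled_offset(offset, air_core_dist):
    """Hypothetical stand-in for the division inside Group::update."""
    return ieee_divide(offset, air_core_dist)


# Patient zero: a group sitting at distance 0 divides 0 by 0.
group_pos = scaled_offset(0.0, 0.0)

# Everything computed from the NaN is also NaN.
unit_pos = group_pos + 10.0   # hypothetical unit position derived from the group
aim_pitch = unit_pos * 0.5    # hypothetical weapon aim derived from the target
```

After this runs, `math.isnan` is true for `group_pos`, `unit_pos`, and `aim_pitch`. The crash shows up at the last value in the chain, while the cause sits several objects away, which is why being able to inspect the whole heap matters.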
If your server is crashing regularly then storing a full dump for each one could get quite expensive. I think this is an addressable issue.
When using Amazon most of the cost, 80 percent, is in data access. Data storage is the cheap part. Only remote server crashes need remote storage so internal builds should have no cost. Retail crashes should be rare so there shouldn’t be many to begin with.
A more involved solution would be to check the callstack prior to uploading the dump. There’s little reason to store more than a few dumps per callstack.
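One way to sketch that check: hash the top few frames of the callstack into a signature and stop uploading once a signature has been seen a few times. This is an illustration, not what we run in production. All names and both limits below are hypothetical, and the frames are assumed to be already symbolized function names.

```python
import hashlib
from collections import Counter

MAX_DUMPS_PER_STACK = 3   # hypothetical cap on stored dumps per callstack
FRAMES_IN_SIGNATURE = 8   # hypothetical number of top frames to compare


def stack_signature(frames, depth=FRAMES_IN_SIGNATURE):
    """Hash the top `depth` frame names into a short, stable signature."""
    top = "\n".join(frames[:depth])
    return hashlib.sha1(top.encode("utf-8")).hexdigest()[:16]


class DumpUploadPolicy:
    """Decide whether a crash with a given callstack is worth uploading."""

    def __init__(self, max_per_stack=MAX_DUMPS_PER_STACK):
        self.max_per_stack = max_per_stack
        self.seen = Counter()  # signature -> number of dumps already uploaded

    def should_upload(self, frames):
        sig = stack_signature(frames)
        if self.seen[sig] >= self.max_per_stack:
            return False  # enough dumps stored for this callstack
        self.seen[sig] += 1
        return True
```

With many game servers the counts would need to live somewhere shared rather than in process memory, since a crashed process can’t remember what it uploaded last time.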
You can of course also store the dumps on your own server and hard drive rather than in the cloud. We host our game servers across a large number of machines both physical (SoftLayer) and virtual (EC2). For easy access we want to store dumps in a central location. S3 just happens to be the simplest choice.
So far I’ve only talked about server crashes. That’s because it is 100% unacceptable to upload a multi-gigabyte crash dump from a player’s machine without permission [6]. If you really need a full dump from a customer then you need an interface to ask for their help to save and upload the file. It’s not something we’ve needed thus far [7].
I’ve also talked almost exclusively about Windows. Breakpad does save mini-dump format crash files for Linux and OS X. These provide limited information when loaded into Visual Studio but can be converted to core dump files for more thorough debugging. I don’t have any direct experience with this so I can’t comment in detail.
Server crashes suck. Hard to repro crashes suck more. Make your life easier by saving a full dump on crash. It turns nightmare bugs into easy ones for only a handful of pennies.