This article is part of the Typical scenarios series.
This is the primary purpose of EurekaLog. EurekaLog is an exception tracing tool that collects information about problems occurring in your application (such as exceptions, hangs, and leaks) and notifies you, the developer, about them.
Debugging in small scale
Debugging a single program run by a single user on a single computer is a well understood problem. It may be arduous, but follows general principles: when a user reproduces and reports an error, the programmer attaches a debugger to the running process and examines program state to deduce where algorithms or state deviated from desired behavior. When tracking particularly onerous bugs the programmer can resort to restarting and stepping through execution with the user’s data or providing the user with a version of the program instrumented to provide additional diagnostic information. Once the bug has been isolated, the programmer fixes the code and provides an updated program.
Debugging in large scale
Debugging in the large is harder. When the number of software components in a single system grows to the hundreds and the number of deployed systems grows to the millions, strategies that worked in the small, like asking programmers to triage individual error reports, fail. With hundreds of components, it becomes much harder to isolate the root cause of an error. With millions of systems, the sheer volume of error reports for even obscure bugs can become overwhelming. Worse still, prioritizing error reports from millions of users becomes arbitrary and ad hoc.
In the past, programming teams struggled to keep up with the volume and complexity of errors. Then tools were invented that could diagnose crashes in software, automatically collect a stack trace, and upload the bug report to the developer's server.
With EurekaLog data you can identify common real-world customer problems and provide a real-time solution to your customers. While customer support calls provide information about common issues, they do not always provide enough detail to debug the actual code. Further, support records capture only the problems that prompted a call; they do not capture every instance of a crash.
The law of large numbers
Broad-based trend analysis of error reporting data shows that across all the issues that exist on the affected Windows platforms and the number of incidents received:
The same results generally hold on a company-by-company basis too (according to Microsoft's research on error collection).
If you can remove humans from the critical path and scale the error reporting mechanism to accept huge numbers of error reports, then you can use the law of large numbers to your advantage. For example, you don't need to collect all error reports, just a statistically significant sample. And you don't need to collect complete diagnostic samples for every occurrence of an error with the same root cause, just enough samples to diagnose the problem and suggest correlation. Moreover, once you have enough data to fix the most frequently occurring errors, their occurrence decreases, bringing the remaining errors to the forefront. Finally, even if you make some mistakes, such as incorrectly diagnosing two errors as having the same root cause, once you fix the first, the second will reappear and dominate future samples.
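The idea of counting every occurrence while keeping only a few complete diagnostic samples per root cause can be sketched as follows. This is a minimal illustration, not how EurekaLog itself stores reports: the class name `ReportCollector`, the `bucket_id` key, and the cap of three samples are all assumptions made for the example.

```python
from collections import Counter, defaultdict

MAX_FULL_SAMPLES = 3  # hypothetical cap on detailed reports kept per bucket


class ReportCollector:
    """Counts every report but keeps only a few full samples per bucket."""

    def __init__(self, max_full_samples=MAX_FULL_SAMPLES):
        self.max_full_samples = max_full_samples
        self.counts = Counter()           # bucket_id -> number of occurrences
        self.samples = defaultdict(list)  # bucket_id -> stored full payloads

    def submit(self, bucket_id, payload):
        """Record one report; return True if the full payload was stored."""
        self.counts[bucket_id] += 1
        if len(self.samples[bucket_id]) < self.max_full_samples:
            self.samples[bucket_id].append(payload)
            return True
        # Enough samples already: only the occurrence count is updated.
        return False
```

The occurrence count stays accurate for prioritization, while storage and upload cost for the heavy diagnostic payloads stays bounded per bucket.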
If you're waiting around for users to tell you about problems with your application, then you're seeing only a tiny fraction of all the problems that are actually occurring. Most users won't bother telling you about problems. They'll just quietly stop using your application.
That's why it's important to set up an exception and error reporting facility. It's your responsibility to have an escape plan in case something goes wrong with your software. That is, you need to protect not only your users from errors, but also yourself from your own errors. Errors are inevitable, and you must be prepared before they start happening. The situation will be pretty dire at that point, but some disaster recovery is possible if you plan ahead.
You should also maintain a searchable and sortable database of errors somewhere. You need a central place where all of your errors are aggregated, a place that all your developers visit every day. That way, bug reports become a de facto TODO list for your team. You could also broadcast an error email notification to every developer, or have every crash automatically open a ticket in your bug tracking software.
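As a rough sketch of what a searchable central store could look like, the example below uses Python's built-in `sqlite3` module. The table layout, the column names, and the helper functions `open_error_db`, `record_report`, and `search_reports` are hypothetical and chosen only for illustration; a real deployment would more likely use a bug tracker or a server-side database.

```python
import sqlite3


def open_error_db(path=":memory:"):
    """Open (or create) the report database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reports (
               id INTEGER PRIMARY KEY,
               signature TEXT NOT NULL,        -- hypothetical bucket key
               exception_class TEXT NOT NULL,
               app_version TEXT NOT NULL,
               message TEXT NOT NULL
           )"""
    )
    return conn


def record_report(conn, signature, exception_class, app_version, message):
    """Store one incoming report."""
    conn.execute(
        "INSERT INTO reports (signature, exception_class, app_version, message) "
        "VALUES (?, ?, ?, ?)",
        (signature, exception_class, app_version, message),
    )
    conn.commit()


def search_reports(conn, text):
    """Return (signature, count) rows whose message contains text,
    most frequent first."""
    rows = conn.execute(
        "SELECT signature, COUNT(*) AS n FROM reports "
        "WHERE message LIKE ? GROUP BY signature ORDER BY n DESC, signature",
        (f"%{text}%",),
    )
    return rows.fetchall()
```

Because every report lands in one table, the same data can answer both "show me everything mentioning this message" and "which problems are reported most often".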
Once you have a detailed report on every crash, you can sort that data by frequency and spend your coding effort resolving the most common problems. Remember: fixing 20 percent of the top reported bugs solves 80 percent of customer issues.
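Sorting by frequency is easy to sketch once each report carries some key that identifies its bug. In the example below, the function name `rank_by_frequency` and the idea of a per-report "signature" string are assumptions for illustration; the cumulative share column shows how much of the total report volume would be covered by fixing the top entries.

```python
from collections import Counter


def rank_by_frequency(signatures):
    """Return [(signature, count, cumulative_share)], most frequent first.

    `signatures` is an iterable with one (hypothetical) bug signature
    per received report.
    """
    counts = Counter(signatures)
    total = sum(counts.values())
    ranked = []
    running = 0
    # Sort by descending count; ties are broken alphabetically.
    for signature, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += count
        ranked.append((signature, count, running / total))
    return ranked
```

Reading the list from the top, the point where the cumulative share passes a chosen threshold (say, 0.8) marks the set of bugs worth fixing first.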
If you don't have a central database of your bugs, you can't sort bugs by "popularity". If you fix a bug that no actual user will ever encounter, what have you actually fixed? Given a limited pool of developer time, it's far better to spend that time fixing the most frequently reported problems.
This data-driven feedback loop is so powerful you'll have (at least from the users' perspective) a rock-stable application in a sane number of iterations.
Automated bug reports are one of the most powerful forms of feedback from your customers. The actual problems, with stack traces and other information, are collected for you, automatically and silently.
The sooner you can get your code out of your code editor and in front of real users, the sooner you'll have data to improve your software. Of course, it's still very important to fix as many bugs as possible before shipping: the sooner you detect a bug, the cheaper it is to fix.
However, your software will ship with bugs anyway. Everyone's software does. The question isn't how many bugs you will ship with, but how fast can you fix those bugs?
If you practice the approach described above (known as Exception Driven Development, or EDD), the answer is simple: you can improve your software in almost no time at all.
Note: the term "Exception Driven Development" was coined by Jeff Atwood.