Why U So Stupid?

Last week I got into a discussion about logging with a friend. He was frustrated about their logging and I was pissed off after reading yet another completely useless wiki page about logging strategy. During our discussion I realized I needed to write about logging: what it is, how to do it and, most importantly, why you do it. This post is basically a summary of that discussion, distilled into a few pieces of advice and best practices that have helped me over the years.

The law of three #

Generally, I advise people who ask me about logging to answer three questions first. We can call this the law of three, because these questions need to be answered before you can log properly. They are:

Why do we log?
Who do we log for?
What kind of data should we log?

Why do we log? #

Give me a valid use case for logging and I can almost always answer the two follow-up questions without being told the answers, which makes this the most important question of the three.

The use case must be very well described. It cannot be something like “we need to log everything”, yet that still seems to be the most common use case out there.

The reason why that use case doesn’t work may not be obvious, but it really should be. The purpose of a use case is to provide solution criteria, a fancy way of saying that it should clearly define what the solution is supposed to contain, so that we know when we’re done. Let’s make it clearer with examples:

“I want to launch rockets”
“I want to launch rockets capable of putting a man on the moon and returning him safely to earth”

Okay, so let’s analyze these two use cases. The second one is better defined: it tells us what the rocket is supposed to carry, where it’s heading and what the purpose is. It does leave room for some scope creep, but all in all, it’s not bad for a short sentence. The first one, however, leaves a lot to be desired. The destination is unknown, so maybe it’s okay if the rocket only succeeds in lift-off and then crashes. The payload is also unknown, so there’s a lot of room for scope creep.

A badly defined use case is always bad, but when it comes to logging, it is especially bad. Logs are supposed to give us information, and without properly defining or understanding the reason why we log, we can expect a number of issues. From a hard-to-understand log format to logging the wrong information, there is a whole host of issues that give an organization a false sense of security.

We have logs, right? Well, then we can fix anything!

Who do we log for? #

A very important question to ask is who the recipient of the data we’re logging is. We already know why we are logging, but who will read it? If nobody reads it, then why do we log it? Logs cost money, especially when we’re talking about multiple logs in a load-balanced production environment. The rule is to log only what will be used; don’t log more than you will actually need.

What should we log? #

The next big question to ask is what data to include in the logs. This refers back to why we log and to who consumes the logs. Log the information that the consumer wants. If you don’t know what that is, ask the team whether they even need it and, if so, what data they would like included.

Summary #

From the questions asked above, it soon becomes obvious that we need multiple logs, serving different purposes and logging at different levels. Why? Well, we probably don’t want management looking for KPI information to have to dig through the debug log to find it. Similarly, we most likely don’t want to clutter the debug log with KPI data, since it’s of no use to developers who are debugging.
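
As a rough illustration of keeping logs for different audiences apart, here is a minimal Python sketch using the standard `logging` module. The logger names `kpi` and `debug`, the `build_logger` helper and the in-memory streams are assumptions made for this example; in a real system each stream would be its own file or table.

```python
import io
import logging

# One shared layout for every log, so the entries are easy to parse later.
FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def build_logger(name, stream, level):
    """Create a logger that writes only to its own stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep entries out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(FORMAT))
    logger.handlers = [handler]
    return logger


# In-memory streams stand in for separate files or tables.
kpi_stream = io.StringIO()
debug_stream = io.StringIO()

kpi_log = build_logger("kpi", kpi_stream, logging.INFO)
debug_log = build_logger("debug", debug_stream, logging.DEBUG)

# Each audience gets its own log: management reads one, developers the other.
kpi_log.info("orders_completed=42")
debug_log.debug("cache miss for key=user:17")
```

The point of the design is that neither reader has to filter out the other’s data: the KPI entry never reaches the debug stream, and the debug entry never reaches the KPI stream.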

A logging effort is also not just a development issue; it touches all levels of your organization and should be treated as such. Treat your logs as hard currency!

Recommendations #

Here are a few general recommendations that will help you with the practical side of logging:

Keep all logs in the same format. Agree on one format and stick to it. If you some day want to use logging or metrics tools, a common log format makes the logs easy to parse and saves you a lot of work later. In the short term, reading logs will be easier, since all logs look the same.
Use filters to log requests and responses in your backend applications. Filters should normally not be used, since they can hide logic and make debugging much harder, but for logging purposes they are perfect. They ensure you never miss a request or response, and there is no need to log these manually.
Use log levels correctly. A log entry with the level ERROR should be reason for concern and cause immediate escalation, because something very bad just happened. Log unrecoverable errors at the ERROR level and recoverable problems at the WARN level.
Split your logging over several files or tables. Don’t mix your logs together into one massive log; let each file fulfill its own purpose.
Use a tracker id if you need to follow logging over multiple files. A unique id assigned to each request or transaction makes it easier to follow what happens even if your logs are split into several files. It also lets you quickly find the debug information for something found in the error or warn log.
Keep log entries to the point. Don’t write novels; log what is needed and get out. Some logs will contain massive amounts of data, especially the debug log, and the more garbage you add to it, the harder it will be to find the data you’re looking for.
Force your logs to rotate when you have little traffic. Find out when your system has little traffic and rotate all your logs at the same time. This will help you look in the right log if you need to search for a specific date and time.
Make sure to use the same time zone on all your machines. Trying to follow a log that spans multiple time zones will induce vomiting.
Plan for log retention. Logs take a lot of disk space, so archive them as soon as you can. That way you don’t run out of disk space just as the application prepares to log that intermittent stack trace you’ve been chasing for a month.
If you’re running a load-balanced environment with a lot of machines, please invest in a proper log analytics solution such as the Elastic Stack or Splunk. Even with the above principles in place, making your ops or dev teams search through multiple debug logs for errors should be classified as torture.
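
A few of the recommendations above (one shared format, one time zone, a tracker id that ties entries together across split logs, and WARN for recoverable problems) can be sketched in Python with the standard `logging` module. This is only a sketch under assumptions: the names `tracker_id`, `TrackerIdFilter`, `UtcFormatter`, `build_logger` and `handle_request`, the format string, and the in-memory streams standing in for files are all made up for illustration.

```python
import contextvars
import io
import logging
import time
import uuid

# Hypothetical name: holds the tracker id of the request being handled.
tracker_id = contextvars.ContextVar("tracker_id", default="-")


class TrackerIdFilter(logging.Filter):
    """Copy the current tracker id onto every record for the formatter."""

    def filter(self, record):
        record.tracker_id = tracker_id.get()
        return True  # never drop a record, only annotate it


class UtcFormatter(logging.Formatter):
    converter = time.gmtime  # render timestamps in UTC on every machine


# One agreed format, shared by all logs.
FORMAT = "%(asctime)s %(levelname)s %(name)s [%(tracker_id)s] %(message)s"
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(UtcFormatter(FORMAT, datefmt=DATE_FORMAT))
    handler.addFilter(TrackerIdFilter())
    logger.handlers = [handler]
    return logger


# In-memory streams stand in for the separate error/warn and debug files.
error_stream = io.StringIO()
debug_stream = io.StringIO()
error_log = build_logger("error", error_stream)
debug_log = build_logger("debug", debug_stream)


def handle_request(payload):
    """Parse a numeric payload, falling back to 0 when it is invalid."""
    tracker_id.set(uuid.uuid4().hex)  # one fresh id per request
    debug_log.debug("request received size=%d", len(payload))
    try:
        return int(payload)
    except ValueError:
        # Recoverable: the caller still gets a value, so WARN rather than ERROR.
        error_log.warning("invalid payload, using default")
        return 0


handle_request("not a number")
```

With this in place, the id printed in square brackets on a line in the error/warn log can be searched for in the debug log to pull out everything that happened during the same request. For rotation at a quiet hour, the standard library’s `logging.handlers.TimedRotatingFileHandler` (with its `when`, `atTime` and `utc` arguments) may be a reasonable starting point in place of the stream handlers used here.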