Logging

Logging tells you what's going on inside a complex system. Good logging tells you what shouldn't be going on inside a complex system, with plenty of time to fix it.

Why do we log?

Logging isn't a free operation. It takes time and consideration to design good logging into a product, and to provide the infrastructure to support it. Therefore, we log with good reason. Logging helps us to…

  • Automatically alert teams to errors and warnings
  • Form an audit trail and history of user actions
  • Provide an insight into user behavior and traffic
  • Discover new bugs
  • Determine performance under load
  • Track down configuration problems

What do we log?

To achieve the objectives set out above, we log as much detail as is required to determine the action being performed, the result of that action, any errors which occurred, and the objects important to the action's success or failure. Good candidates are identifiers for the request, the originator, exceptions, and response events. A good guide is to log every user action, every internal state change, and any resulting errors.

While the temptation may be to log everything, we must be careful to never log sensitive information (for example, user-specific responses), nor to log information which isn't valuable in reconstructing a user journey. It's very easy to double the size of your logs by adding an unnecessary line.

Top tips:

  • Log semantic events
  • Include correlation IDs (and common tracing fields like app_event)
  • Don't be afraid to generate multiple events for a single user action
  • Err on the side of more information
  • Bracket longer transactions with "entering" and "exiting" events for each stage, even across request boundaries
  • It should be possible to identify and extract all possible user flows through the system by using these events
  • Log every error or unhappy path event, even if there is no user-visible effect
  • Include numerical data where possible for later statistical analysis
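Several of the tips above can be seen in one short sketch. This is plain Ruby, not our internal tooling; the format_event helper is hypothetical, and the field names follow the examples later in this document:

```ruby
require "logger"
require "securerandom"

logger = Logger.new($stdout)

# Hypothetical helper: render a semantic event as key="value"; pairs.
def format_event(fields)
  fields.map { |k, v| %(#{k}="#{v}";) }.join(" ")
end

# A correlation ID ties together every event for one user action.
request_id = SecureRandom.uuid

# One user action can legitimately generate multiple events,
# bracketing the transaction with entering/exiting markers and
# including numerical data (duration_ms) for later analysis.
logger.info(format_event(app_event: "standard.create.entering", request_id: request_id))
logger.info(format_event(app_event: "standard.create", status: "200", duration_ms: 42, request_id: request_id))
logger.info(format_event(app_event: "standard.create.exiting", request_id: request_id))
```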

Estimating impact

When a problem occurs, it is usually important to be able to state how wide the impact of the problem was. By including identifiers and IP addresses, we can not only quickly count the number of distinct users who have been affected by an issue, but also identify those users so that remedial actions might be taken.

If users form natural groupings, such as belonging to companies, then including this information (in an anonymised fashion) will also allow you to do similar things with those groups. This can drastically reduce the time needed to perform impact analysis.
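One simple way to include a grouping in an anonymised fashion is to log a salted hash of the group identifier rather than the name itself. This is a sketch of the idea, not our actual scheme; the salt value and helper name are illustrative:

```ruby
require "digest"

# Hypothetical salt; in practice this would come from configuration,
# not be hard-coded.
SALT = "example-salt"

def anonymised_group(name)
  Digest::SHA256.hexdigest("#{SALT}:#{name}")[0, 16]
end

# The same company always hashes to the same token, so distinct-group
# counts and per-group impact queries still work, without logging the
# company name itself.
puts %(organisation="#{anonymised_group("wearefriday.com")}";)
```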

Log levels

All projects, regardless of language, agree on the six log levels defined by the Ruby Logger.

Standard log levels

  • UNKNOWN: An unknown message that should always be logged; prefer a more specific level where possible
  • FATAL: An unhandleable error that results in a program crash
  • ERROR: A handleable error condition
  • WARN: A warning about an undesirable state that doesn't prevent normal operation
  • INFO: Generic (useful) information about system operation
  • DEBUG: Low-level information for developers (e.g. underlying requests/data changes)

In production, we log at the INFO level.
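Ruby's standard Logger exposes these levels directly, and setting the level to INFO, as we do in production, suppresses DEBUG output:

```ruby
require "logger"
require "stringio"

out = StringIO.new
logger = Logger.new(out)
logger.level = Logger::INFO

logger.debug("low-level detail")  # suppressed at INFO
logger.info("system operation")   # logged
logger.warn("a warning")          # logged

puts out.string
```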

Technologies

Ruby and Rails

Rails projects use the Lograge gem to tame Rails' logging, and Sinatra or plain Ruby projects use the Logging gem.

Consistency is important to us, so both approaches output in an identical format. We've written custom formatters for each and packaged them as internal gems.
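In a Rails project, wiring in Lograge amounts to a few lines of configuration. The enabled, formatter, and custom_options hooks below are real Lograge settings; FridayFormatter stands in for our internal gem, and the request_id field assumes it has been added to the event payload (for example via append_info_to_payload in the controller):

```ruby
# config/environments/production.rb (illustrative)
Rails.application.configure do
  config.lograge.enabled = true
  # FridayFormatter is a stand-in for our internal formatter gem.
  config.lograge.formatter = FridayFormatter.new
  # Merge extra fields into every request event.
  config.lograge.custom_options = lambda do |event|
    { request_id: event.payload[:request_id] }
  end
end
```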

Log aggregation with Splunk

We've evaluated a number of technologies for log aggregation (including logstash and Graylog2) and settled on Splunk.

Our logs are aggregated into Splunk under an enterprise license by dedicated logging infrastructure. We maintain a search indexer for each environment (for example, one per production datacenter, and one for in-house testing environments) and a search head that allows us to run distributed queries over all environments at once.

Example log format

Splunk performs best when each log line begins with a UTC timestamp, followed by a list of key="value"; pairs. Our custom formatters enforce this format across all projects.

Rails web request
2014-03-17T12:03:09.78356 method="GET"; path="/standards"; format="html"; controller="standards"; action="index"; status="200"; duration="2085.9"; view="1903.8"; db="8.35"; protocol="http"; remote_ip="127.0.0.1"; request_id="4e4c3d95-e8c1-43e8-ab69-4bfb1ef6c715"
Rails controller action
2014-03-17T12:03:09.78356 message="Listing standards"; app_event="standard.list"; user="20130827135250791408654-bcb5182deee3298e670c3dfb6968995fbbf70552"; organisation="wearefriday.com"; remote_ip="127.0.0.1"; request_id="c0ff842a-78ad-43ac-ad9d-41a1f19bbab0"

Structured data within a key-value pair is rendered as JSON. This takes advantage of Splunk's parsing and nested query capabilities.
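The essence of a formatter producing this format is small. This is a minimal sketch in the spirit of our internal gems, not their actual implementation; the real formatters handle more field types and edge cases:

```ruby
require "logger"
require "time"
require "json"

# Emit a UTC timestamp followed by key="value"; pairs, rendering
# nested hashes as JSON for Splunk's nested query support.
kv_formatter = proc do |_severity, time, _progname, msg|
  fields = msg.is_a?(Hash) ? msg : { message: msg }
  pairs = fields.map do |k, v|
    v.is_a?(Hash) ? %(#{k}=#{v.to_json};) : %(#{k}="#{v}";)
  end.join(" ")
  "#{time.utc.iso8601(5)} #{pairs}\n"
end

logger = Logger.new($stdout)
logger.formatter = kv_formatter
logger.info(message: "Listing standards", app_event: "standard.list")
logger.info(app_event: "standard.update", changes: { name: "Logging" })
```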

The risks of logging

Heisenbugs

Heisenbugs are problems which appear to change or disappear when you attempt to study them. Adding any code around a sensitive feature increases its complexity, and with it the likelihood of introducing further bugs.

Since logging changes the code paths around features, we must be careful that our logging doesn't hide or change behavior. We encourage unit testing of logging behavior, to complement good test coverage overall.
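Logging output is straightforward to unit test by pointing the logger at an in-memory IO. A minimal sketch in plain Ruby; a real project would use its test framework's assertions, and create_standard is a hypothetical method under test:

```ruby
require "logger"
require "stringio"

# Hypothetical application code under test; emits a semantic event.
def create_standard(logger)
  logger.info(%(app_event="standard.create";))
  :created
end

out = StringIO.new
result = create_standard(Logger.new(out))

# Assert both the behavior and the logging it produces.
raise "wrong result" unless result == :created
raise "missing event" unless out.string.include?(%(app_event="standard.create";))
```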

Pollution and timing

We strive to separate logging code and application code, since they're different responsibilities. Bundling them together can introduce nasty timing issues that are hard to unpick. By separating them, you can easily insert buffering, or custom handling of certain classes of events, and get a more testable implementation of logging overall.
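One way to achieve that separation is to have application code publish semantic events to a subscriber that owns the logging, so buffering or special handling lives in one place. This is a sketch of the idea, not our implementation; EventBus is a hypothetical class:

```ruby
require "logger"

# Application code publishes events; only subscribers know what
# happens to them.
class EventBus
  def initialize
    @subscribers = []
  end

  def subscribe(&block)
    @subscribers << block
  end

  def publish(event)
    @subscribers.each { |s| s.call(event) }
  end
end

bus = EventBus.new
logger = Logger.new($stdout)

# The logging subscriber formats and writes; swapping in buffering or
# custom handling for certain event classes touches only this block.
bus.subscribe do |event|
  logger.info(event.map { |k, v| %(#{k}="#{v}";) }.join(" "))
end

bus.publish(app_event: "standard.list", status: "200")
```

Because the subscriber is the only code that touches the logger, it can also be tested in isolation from the application code that publishes events.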

Expensive log operations

Most loggers also accept blocks, so if a message is expensive to compile, we pass it as a block that is only evaluated if that log level is enabled.

# Only evaluated if log level is INFO or lower.
logger.info { "Result: #{expensive_method}" }

In conclusion

Logging is a balancing act — we log only what is useful to diagnose problems, assure us of normal operation, or alert us to potential issues. We don't expose sensitive data in our log files. We do log in a standard format across all projects, using the same log levels and to infrastructure which allows us to query over every system.

For Friday, logging is a conscious effort, and a worthwhile one. It allows us to provide reliable service, and identify the symptoms of problems before the problems develop.