Thoughts about Logging

Oh, I’m a lumberjack, and I’m okay,

Monty Python

Logging is one of the first things most programmers learn, but (IMHO) one of the least well understood. We all know about print and its friends printf, log, etc. Or do we?

What is logging? Logging is the process of capturing, recording, aggregating, alerting on, reporting on and searching Events. In this context, an Event is something significant which occurred within the system.

Why do we log? For many reasons, but the major ones are:

  1. Business Process Event auditing
  2. Operational integrity and troubleshooting
  3. Diagnostics and debug

Business Process Event auditing
Business process logs capture the information required to understand and record the business. Usually these are stored in a database, but they are often also emitted to a log file. An example of a business process log is the rejection of a credit card at a store till. This is a significant Event, which makes it a candidate for logging. It is a Business Process Exception Event because it occurs as part of a business flow and is not the “green” or “good” flow for the process. It is also highly likely to contain Personally Identifiable Information (PII).

Business Process Events should be treated as a separate stream of event information and handled in a different pipeline to the other types of logs. Note: HTTP access logs are nearly always BPE, except when they record 5xx unhandled internal system errors. One key consideration is that PII should be handled securely and passwords/credentials should _never_ be logged. PII handling is a good reason to separate BPE logs from operational logs, as operators usually don’t need to see PII.
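
As a minimal sketch of that separation, assuming Go’s standard log/slog package: two loggers write to two streams, the BPE file path and field names here are invented for illustration, and only redacted card data is ever logged.

```go
// Hypothetical sketch: splitting Business Process Events and operational
// logs into separate streams, using Go's standard log/slog package.
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Operational stream: stdout, scraped by the ops pipeline.
	opsLog := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// BPE stream: a separate file ("bpe.log" is an assumed path) so PII
	// never reaches operator-facing tooling.
	bpeFile, err := os.OpenFile("bpe.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		opsLog.Error("cannot open BPE stream", "err", err)
		os.Exit(1)
	}
	defer bpeFile.Close()
	bpeLog := slog.New(slog.NewJSONHandler(bpeFile, nil))

	// Business event: a card rejection. Log the card's last four digits
	// only; never the full card number, and never credentials.
	bpeLog.Warn("card rejected",
		"till", 42,
		"card_last4", "1234",
		"rejection_code", "51",
	)

	// Operational event: contains no PII.
	opsLog.Info("payment gateway connection re-established")
}
```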

Operational Integrity and Troubleshooting
These types of events relate to the operation of the system. For example, a new Kubernetes pod reaching operational readiness will usually emit an INFO level log which indicates to a human that it is ready to begin accepting requests. It does _not_ signal that condition to Kubernetes; Kubernetes uses Readiness Probes to perform that task. Other operational integrity events include unhandled or fatal exception messages (non-fatal exceptions may also sometimes be emitted if they cause a non-business-related event such as an HTTP transient failure). These messages are the ones an SRE or Ops group will be interested in.
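
To make the distinction concrete, here is a hedged sketch in Go (assuming log/slog and net/http): the INFO message is for humans, while the /readyz endpoint, an assumed path, is what a Kubernetes readiness probe would actually poll.

```go
// Minimal sketch: the INFO log tells a human the service is ready; the
// /readyz endpoint is what a Kubernetes readiness probe actually checks.
package main

import (
	"log/slog"
	"net/http"
	"os"
	"sync/atomic"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	var ready atomic.Bool

	// Machine-facing readiness signal, polled by the probe.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// ... perform startup work here ...

	ready.Store(true)
	logger.Info("ready to accept requests") // for humans, not for Kubernetes

	logger.Error("server stopped", "err", http.ListenAndServe(":8080", nil))
}
```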

Diagnostics and Debug
These logs are the verbose stream of events emitted by a program which is stepping through the various states of an algorithm. They are used during program development to provide insight into the operation of the program. They should _not_ be on by default in the production environment, although you will often hear developers demand that they be left on “just in case!”.
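
A small sketch of keeping that discipline, assuming Go’s log/slog; the LOG_LEVEL environment variable is an illustrative convention, not a standard:

```go
// Sketch: DEBUG stays off unless explicitly requested at deploy time.
package main

import (
	"log/slog"
	"os"
)

func main() {
	level := slog.LevelInfo // production default: INFO and above
	if os.Getenv("LOG_LEVEL") == "DEBUG" {
		level = slog.LevelDebug // opt in explicitly, never on by default
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))

	logger.Debug("entering retry loop", "attempt", 1) // dropped unless opted in
	logger.Info("service started")
}
```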

In production environments, metrics should be used to indicate that the application is operating and processing information correctly, and to record counts of non-fatal operational or Business Process Exceptions, rather than emitting 10,000,000 log lines which are not easy to analyze. Distributed tracing tools should be used to provide performance and throughput feedback as well.
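
For example, counting non-fatal exceptions with a metric rather than a log line might look like this sketch using the Prometheus Go client (github.com/prometheus/client_golang); the metric name and port are illustrative:

```go
// Hedged sketch: count non-fatal exceptions with a Prometheus counter
// instead of logging each occurrence.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var cardRejections = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "card_rejections_total",
	Help: "Count of rejected card transactions.",
})

func main() {
	prometheus.MustRegister(cardRejections)

	// In the transaction handler: one cheap increment, zero log lines.
	cardRejections.Inc()

	// Expose /metrics for scraping instead of shipping millions of log lines.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```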

This brings me to one of my rules of thumb: An application operating nominally should emit close to zero operational log messages. Only major events like “starting, started, transient error, unhandled error, faulted, stopping” should cause operational log messages to be emitted. Of these, unhandled errors are the ones which should cause alerts.

What information should be in a log message? Usually you see the following:

  1. Timestamp (important to cross-correlate events from multiple distributed sources)
  2. Trace IDs such as X-B3 which represent injected metadata for distributed tracing (more precise than timestamps)
  3. Log Level, typically one of { FATAL, ERROR, WARNING, INFO(RMATION), DEBUG, TRACE }, but occasionally other things. Tip: try to keep to this standard set; don’t get clever with levels
  4. Labels, a set of key=value pairs describing the source of the log message, e.g. env=prod, cluster=prod-1, component=postgres
  5. Short message or message summary
  6. Detailed (long) message – this may sometimes contain an exception trace, but usually the trace is carried separately (see item 7)
  7. Exception trace if this is an exception report

Additional fields which may appear as part of the metadata sometimes include:

  1. End user ID(s) on whose behalf the system was acting. This may also sometimes be a {tenant, customer} tuple in systems which host multiple tenants with different customer pools
  2. Additional structured data about the event, e.g. a BPE for a credit card transaction may include the card ID, till number, rejection code, total dollar amount etc.
  3. Source file and line of the source code which emitted the event
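
Pulling those fields together, here is a hedged sketch using Go’s log/slog with a JSON handler; every value shown (labels, trace ID, tenant, rejection code) is invented for illustration:

```go
// Illustrative sketch of a single log event carrying the fields above.
package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With(
		// Labels describing the source (item 4).
		"env", "prod",
		"cluster", "prod-1",
		"component", "checkout",
	)

	// Timestamp and level are added by the handler (items 1 and 3).
	base.Warn("card rejected", // short message (item 5)
		"trace_id", "80f198ee56343ba864fe8b2a57d3eff7", // item 2, e.g. an X-B3 trace ID
		"tenant", "acme", "customer", "store-17", // end-user identity (additional field 1)
		"rejection_code", "51", "amount_usd", "42.00", // structured event data (additional field 2)
	)
}
```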

General rules of thumb:

  1. When logging, less is more. Log as little as is necessary to get the job done
  2. Don’t log PII unless you absolutely have to, and if you do have to, log it securely
  3. Separate your log streams for different audiences. Operators usually don’t care about BPE information, business analysts don’t care about HTTP transient errors or API call rate throttling
  4. If you can, use a metric not a log. Don’t spam log files with 1,000,000s of lines of “I did a thing! It worked!”
  5. Ensure that if you emit a log event, it contains enough information to identify the source and be actionable
  6. Use structured log formats – JSON or some other tagged wire format
  7. Emit Java, Go or other stack traces as a single log event (see the sketch after this list). Common configurations of log subsystems can emit each stack frame as a separate log message, which makes it hard for the log aggregator to handle them
  8. Keep logs in production to INFO, WARN or higher (ERROR, FATAL). Only enable DEBUG as a last resort
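
As promised in rule 7, here is a sketch of emitting a whole stack trace as a single event, assuming Go’s log/slog and runtime/debug:

```go
// Sketch of rule 7: the whole stack trace travels in one field of one
// event, rather than one log line per frame.
package main

import (
	"log/slog"
	"os"
	"runtime/debug"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	defer func() {
		if r := recover(); r != nil {
			// One event, one "stack" field; the aggregator sees a single record.
			logger.Error("unhandled panic", "panic", r, "stack", string(debug.Stack()))
		}
	}()

	panic("boom")
}
```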

We use Loki, Tempo and Grafana for log aggregation, reporting, alerting and searching.