Configuring Prometheus metrics on a containerised Django app

Introduction

Properly setting up the metrics endpoints on a Django app running in a Docker container has a few important gotchas that you should be aware of. This article is specifically aimed at people running the stack described below, but much of it is general and should also work in other contexts.

I’m assuming that the person reading this is already comfortable programming Django applications and knows the basics of how Django settings work.

I’m also assuming a basic knowledge of both Dockerfiles and Kubernetes apply files. If you are working with Kubernetes, I’d strongly advise also looking at Helm, since that makes deploying and managing your applications in Kubernetes so much simpler.

Stack

  • Django, using django-prometheus to expose metrics
  • Gunicorn webserver
  • App + server bundled into Docker container
  • Container deployed into Kubernetes
  • Kubernetes uses prometheus-operator to monitor applications in the cluster

The Issue: Handling multiple workers

Like most Python WSGI servers, Gunicorn spawns worker processes to better handle concurrent requests. However, django-prometheus assumes by default that it is running in a single process. Without further configuration, exposing a metrics endpoint on the app via urls.py as shown in the README (see below) will result in each metrics request being routed to a different worker, leading to counters jumping up and down in value between scrapes.

from django.conf.urls import include, url

urlpatterns = [
    ...
    url('', include('django_prometheus.urls')),
]

The Solution: Each worker exposes a metrics endpoint on its own port

How to handle this is documented in the library, but that documentation is not linked from the README and is easy to miss. The solution is to specify a port range; each worker (running its own copy of the app) will then try to bind a metrics endpoint to one of the ports in that range.

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8050)

However, there are still some issues that must be solved.

  1. How to match the port range to the number of workers
  2. How to expose those ports on a Kubernetes pod spec & service
  3. How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

Matching the port range to the number of workers

There are a number of ways of doing this, but the easiest is to specify the number of workers by setting the environment variable WEB_CONCURRENCY=<desired number of workers> on the Docker container running your app. Gunicorn uses WEB_CONCURRENCY as the default value for its workers setting, so no extra wiring is needed on the Gunicorn side.

Note that if you do this, you MUST NOT set the number of workers on either the command line or in your Gunicorn config file, as either of those will override the environment setting.

Then, if you also set the variable METRICS_START_PORT=<desired start port for the worker metrics endpoints>, you can add the following code to your Django settings.py to read in both variables and set the port range appropriately.

import os

# Both variables are set on the container (see the Kubernetes deployment
# below). With METRICS_START_PORT=5001 and WEB_CONCURRENCY=5 this yields
# range(5001, 5006), i.e. one metrics port per worker on 5001-5005.
start_port = int(os.environ.get("METRICS_START_PORT"))
workers = int(os.environ.get("WEB_CONCURRENCY"))

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)

Caveat

Adding this setting to a settings file that is also used for development with manage.py runserver will give an error about the autoreloader not working in this configuration. Either use a separate settings file for development, or run your local development server with manage.py runserver --noreload.
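
If you go the separate-settings route, one possible layout is to keep the port-range block in a production-only settings module. The sketch below is just an illustration; the module names (settings/base.py, settings/production.py, settings/dev.py) are assumptions to adapt to your own project layout.

# settings/production.py (sketch)
import os

# Shared settings live in base.py and do not define the port range.
from .base import *  # noqa: F401,F403

start_port = int(os.environ.get("METRICS_START_PORT"))
workers = int(os.environ.get("WEB_CONCURRENCY"))

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)

A development module (settings/dev.py) then simply imports from base.py and never defines PROMETHEUS_METRICS_EXPORT_PORT_RANGE, so manage.py runserver keeps its autoreloader.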

How to expose those ports on a Kubernetes pod spec & service

This is a little cumbersome: you need to add each port as a uniquely named port to the pod’s container spec. While doing so, you can also add the WEB_CONCURRENCY and METRICS_START_PORT variables to the container’s environment.

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: app-django
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/name: app-django
    spec:
      containers:
      - env:
        - name: METRICS_START_PORT
          value: "5001"
        - name: WEB_CONCURRENCY
          value: "5"
        image: mycompany/app:v0.1.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health/
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: django
        ports:
        - containerPort: 5000
          name: http
          protocol: TCP
        - containerPort: 5001
          name: metrics-1
          protocol: TCP
        - containerPort: 5002
          name: metrics-2
          protocol: TCP
        - containerPort: 5003
          name: metrics-3
          protocol: TCP
        - containerPort: 5004
          name: metrics-4
          protocol: TCP
        - containerPort: 5005
          name: metrics-5
          protocol: TCP

Note how we create ports named metrics-1 through metrics-5, matching the 5 workers we’ve configured for our app container running under Gunicorn.

service.yml

apiVersion: v1
kind: Service
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: metrics-1
    port: 5001
    protocol: TCP
    targetPort: metrics-1
  - name: metrics-2
    port: 5002
    protocol: TCP
    targetPort: metrics-2
  - name: metrics-3
    port: 5003
    protocol: TCP
    targetPort: metrics-3
  - name: metrics-4
    port: 5004
    protocol: TCP
    targetPort: metrics-4
  - name: metrics-5
    port: 5005
    protocol: TCP
    targetPort: metrics-5
  selector:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  type: ClusterIP

Here again, each named port has to match the corresponding container port in the deployment. The prometheus-operator will use the selector plus the named ports to find all the endpoints on the pods to scrape.

How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

servicemonitor.yml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    ...
  labels:
    ...
  name: app-django
  namespace: app
spec:
  endpoints:
  - interval: 15s
    port: metrics-1
  - interval: 15s
    port: metrics-2
  - interval: 15s
    port: metrics-3
  - interval: 15s
    port: metrics-4
  - interval: 15s
    port: metrics-5
  selector:
    matchLabels:
      app.kubernetes.io/name: app-django

See how the selector matches the labels on the service, and the endpoint port names match the named ports on both the deployment and the service.

Bonus: easy creation of the YAML templates with Helm

Since there is a lot of repetition in the Kubernetes apply files, it makes sense to automate their creation using Helm. A full discussion of Helm is beyond the scope of this article, but I can show how to use a Helm range loop to generate the repeated parts of the files above.

First, in your chart’s values.yaml, add the following block:

env:
  config:
    METRICS_START_PORT: 5001
    WEB_CONCURRENCY: 5

In the env block of deployment.yml, use this to add all environment variables defined in values.yaml or passed to the chart at installation time:

env:
  {{- range $key, $value := .Values.env.config }}
  - name: {{ $key }}
    value: {{ $value | quote }}
  {{- end }}

In the ports block of deployment.yml, use this to auto-generate the numbered metrics-# ports.

ports:
  - name: http
    containerPort: 5000
    protocol: TCP            
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - name: metrics-{{ add1 $index }}
    containerPort: {{ $port }}
    protocol: TCP
  {{- end }}
  {{- end }}

In the ports block of service.yml:

ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
    {{- with .Values.env.config }}
    {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
    {{- range $index, $port := $metricsPort }}
    - port: {{ $port }}
      targetPort: metrics-{{ add1 $index }}
      protocol: TCP
      name: metrics-{{ add1 $index }}
    {{- end }}
    {{- end }}

And in the endpoints block of servicemonitor.yml:

endpoints:
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  
  {{- range $index, $port := $metricsPort }}
  - port: metrics-{{ add1 $index }}
    interval: 15s
  {{- end }}
  {{- end }}

Using Structlog with Gunicorn

Structlog is an awesome tool for outputting useful log information that can easily be picked up by central logging tools like an ELK stack. Setting it up to emit rich logging events plus context to stdout in JSON format takes you a long way towards the ideals of 12-factor app logging.
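
For reference, a minimal structlog setup on the application side might look like the sketch below. The processor list and the example event are purely illustrative, not code from my project.

import structlog

# Render every log event as one JSON object per line on stdout,
# with level and timestamp added as separate fields.
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,               # adds "level"
        structlog.processors.TimeStamper(fmt="iso"),  # adds "timestamp"
        structlog.processors.JSONRenderer(),          # final render to JSON
    ],
)

log = structlog.get_logger()
log.info("user_logged_in", user_id=42)
# {"user_id": 42, "event": "user_logged_in", "level": "info", "timestamp": "..."}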

However, the first Google hit for “structlog gunicorn” is a FAQ from structlog version 16 that implies you should probably use another library to output Gunicorn logs in JSON.

Since I found this to be more than a little disappointing, I did some more digging and discovered that setting up Gunicorn to use structlog is relatively simple and, with a small addition, completely awesome.

Configuring structlog

The key to configuring structlog in Gunicorn is that, on startup, Gunicorn looks for a file called gunicorn.conf.py and executes its contents. In there you can set all of Gunicorn’s runtime settings, and since it is a normal Python file, you can also run arbitrary Python code. The key setting here is:

logconfig_dict

You can set this to configure all aspects of logging, including the key part: the formatters. Creating a json_formatter entry lets you hook structlog into the stdlib logging machinery, as described in the structlog documentation in the section “Rendering Using structlog-based Formatters Within logging”. The foreign_pre_chain entry also allows you to chain structlog processors to do interesting things to each log entry (more on that later). Here’s the code you need to add to gunicorn.conf.py:

import structlog

# --- Structlog logging initialisation code
timestamper = structlog.processors.TimeStamper(fmt="%Y-%m-%d %H:%M:%S")
pre_chain = [
    # Add the log level and a timestamp to the event_dict if the log entry
    # is not from structlog.
    structlog.stdlib.add_log_level,
    structlog.stdlib.add_logger_name,
    timestamper
]

logconfig_dict = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json_formatter": {
            "()": structlog.stdlib.ProcessorFormatter,
            "processor": structlog.processors.JSONRenderer(),
            "foreign_pre_chain": pre_chain,
        }
    },
    "handlers": {
        "error_console": {
            "class": "logging.StreamHandler",
            "formatter": "json_formatter",
        },
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "json_formatter",
        }
    },
}

Adding this block to your gunicorn.conf.py file will make Gunicorn emit all of its log entries as JSON, adding level, logger and timestamp fields to each entry.

Notes

By trial and error, I discovered that Gunicorn uses two log handlers:

  • console
  • error_console

Overriding these in logconfig_dict is what makes this method work: Gunicorn merges logconfig_dict into its own default logging configuration, whose gunicorn.error and gunicorn.access loggers already point at these handler names, so replacing the handlers’ formatter is all that’s needed.

Access logs as structured data

If you look at the output from Gunicorn on a running server, you will see a lot of events in the JSON logs containing what looks like the Apache combined log format (because that’s exactly what it is). Wouldn’t it be nice to split that wonderful contextual information into separate fields, so you don’t have to do it later in your logging platform?

Structlog makes that easy: Just write a processor function.

Processor functions have the following signature:

def some_processor_func(logger, log_method, event_dict):
    # Do something with the event_dict here
    ...
    return event_dict

You can add them to the processing chain in the same way the timestamper has been added to the pre_chain list in the configuration above. For this, I borrowed some code from https://www.seehuhn.de/blog/52.html to parse the combined log fields into a dictionary and merge them into the event_dict, like this:

# structlog_helper.py
import re

def combined_logformat(logger, log_method, event_dict):
    if event_dict.get('logger') == "gunicorn.access":
        message = event_dict['event']

        parts = [
            r'(?P<host>\S+)',  # host %h
            r'\S+',  # indent %l (unused)
            r'(?P<user>\S+)',  # user %u
            r'\[(?P<time>.+)\]',  # time %t
            r'"(?P<request>.+)"',  # request "%r"
            r'(?P<status>[0-9]+)',  # status %>s
            r'(?P<size>\S+)',  # size %b (careful, can be '-')
            r'"(?P<referer>.*)"',  # referer "%{Referer}i"
            r'"(?P<agent>.*)"',  # user agent "%{User-agent}i"
        ]
        pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')
        m = pattern.match(message)
        if m is None:
            # Not a combined-format line; pass the entry through untouched
            return event_dict
        res = m.groupdict()

        if res["user"] == "-":
            res["user"] = None

        res["status"] = int(res["status"])

        if res["size"] == "-":
            res["size"] = 0
        else:
            res["size"] = int(res["size"])

        if res["referer"] == "-":
            res["referer"] = None

        event_dict.update(res)

    return event_dict
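
If you want to see what the helper does before wiring it into Gunicorn, you can call it directly with a hand-built event dict. This is just a quick sketch; the sample line mirrors the access-log output shown at the end of this article.

import structlog_helper

entry = {
    "logger": "gunicorn.access",
    "event": '10.96.78.6 - - [15/Aug/2019:14:19:23 +0000] "GET /metrics HTTP/1.1" 200 14134 "-" "Prometheus/2.11.0"',
}
# The original keys are preserved, and host, user, time, request, status,
# size, referer and agent are added as separate fields.
print(structlog_helper.combined_logformat(None, "info", entry))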

Note that the code checks the logger field in the event_dict to see if the entry comes from the gunicorn.access logger, and only then tries to parse it. To use this helper, import the module and add the function to the pre_chain list. Example below:

import structlog
from . import structlog_helper # or wherever you decide to put the file

# --- Structlog logging initialisation code
timestamper = structlog.processors.TimeStamper(fmt="%Y-%m-%d %H:%M:%S")
pre_chain = [
    # Add the log level and a timestamp to the event_dict if the log entry
    # is not from structlog.
    structlog.stdlib.add_log_level,
    structlog.stdlib.add_logger_name,
    timestamper,
    structlog_helper.combined_logformat # This does the magic!
]

# rest of config is identical to first version
...

So what do you get? Here’s the output from a demo project I’ve been writing for work. Sorry, I can’t publish the sources:

{"event": "Booting worker with pid: 30099", "level": "info", "logger": "gunicorn.error", "timestamp": "2019-08-15 14:19:14"}
{"event": "Booting worker with pid: 30100", "level": "info", "logger": "gunicorn.error", "timestamp": "2019-08-15 14:19:14"}
{"event": "10.96.78.6 - - [15/Aug/2019:14:19:23 +0000] \"GET /metrics HTTP/1.1\" 200 14134 \"-\" \"Prometheus/2.11.0\"", "level": "info", "logger": "gunicorn.access", "timestamp": "2019-08-15 14:19:23", "host": "10.96.78.6", "user": null, "time": "15/Aug/2019:14:19:23 +0000", "request": "GET /metrics HTTP/1.1", "status": 200, "size": 14134, "referer": null, "agent": "Prometheus/2.11.0"}

As you can see, the access logs are output with the combined log fields split out into their own JSON fields, ready for parsing by a central logger listening to the container output.

Summary

Structlog is an awesome tool for writing applications that log machine-readable output with context information in fields instead of hard-to-parse strings. Using this configuration, you can easily get Gunicorn to emit context-rich output in JSON as well.