Configuring Prometheus metrics on a containerised Django app

Introduction

Properly setting up the metrics endpoints on a Django app running in a Docker container has a few important gotchas that you should be aware of. This article is specifically aimed at people running the stack described below, but much of it is general and should also work in other contexts.

I’m assuming that the person reading this is already comfortable programming Django applications and knows the basics of how Django settings work.

I’m also assuming a basic knowledge of both Dockerfiles and Kubernetes apply files. If you are working with Kubernetes, I’d strongly advise also looking at Helm, since that makes deploying and managing your applications in Kubernetes so much simpler.

Stack

  • Django, using django-prometheus to expose metrics
  • Gunicorn webserver
  • App + server bundled into Docker container
  • Container deployed into Kubernetes
  • Kubernetes uses prometheus-operator to monitor applications in the cluster

The Issue: Handling multiple workers

Like most Python WSGI servers, Gunicorn spawns worker processes to handle concurrent requests. By default, however, django-prometheus assumes it is running in a single process. Unless you configure it explicitly, exposing a metrics endpoint on the app via urls.py as shown in the README (see below) will result in each metrics request being routed to a different worker, causing counters to jump up and down in value between scrapes.

urlpatterns = [
    ...
    url('', include('django_prometheus.urls')),
]

The Solution: Each worker exposes a metrics endpoint on its own port

How to handle this is documented in the library, but the documentation is not linked from the README and is easy to miss. The solution is to specify a port range; each worker (running its own copy of the app) will then try to bind its metrics endpoint to a port in that range.

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8050)

However, there are still some issues that must be solved.

  1. How to match the port range to the number of workers
  2. How to expose those ports on a Kubernetes pod spec & service
  3. How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

Matching the port range to the number of workers

There are a number of ways of doing this, but the easiest is to set the number of workers by exporting the environment variable WEB_CONCURRENCY=<desired number of workers> to the Docker container running your app.

Note that if you do this, you MUST NOT set the number of workers on the command line or in gunicorn_config.py, as either will override the environment setting.
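For reference, a minimal gunicorn_config.py might look like the sketch below. The bind port and other values are assumptions of mine (chosen to match the http container port used later in this article); the important point is that there is deliberately no workers setting, so Gunicorn falls back to WEB_CONCURRENCY.

# gunicorn_config.py -- minimal sketch; deliberately no `workers` setting here,
# so Gunicorn falls back to the WEB_CONCURRENCY environment variable.
bind = "0.0.0.0:5000"   # assumed app port; matches the `http` container port below
timeout = 30            # seconds before an unresponsive worker is killed and restarted
accesslog = "-"         # write the access log to stdout so Docker can capture it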

Then, if you also set the variable METRICS_START_PORT=<desired start port for the worker metrics endpoints>, you can add the following code to your Django settings.py to read both variables and set the port range appropriately.

import os

# Both variables are set on the container; each Gunicorn worker will bind its own
# metrics endpoint to one port in the resulting range.
start_port = int(os.environ.get("METRICS_START_PORT"))
workers = int(os.environ.get("WEB_CONCURRENCY"))

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)


Caveat

Adding this setting to a settings file that is used for development with manage.py runserver will give an error about the autoreloader not working in this configuration. Either use a separate settings file for development, or run your local test server with manage.py runserver --noreload.
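Alternatively, if you prefer to keep a single settings file, one option (a sketch of my own, not something taken from the django-prometheus docs) is to only enable the port range when the environment variables are actually present, so that a plain manage.py runserver during development is unaffected:

import os

# Only expose per-worker metrics ports when the container sets both variables,
# i.e. when running under Gunicorn rather than the development server.
if "METRICS_START_PORT" in os.environ and "WEB_CONCURRENCY" in os.environ:
    start_port = int(os.environ["METRICS_START_PORT"])
    workers = int(os.environ["WEB_CONCURRENCY"])
    PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)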

How to expose those ports on a Kubernetes pod spec & service

This is a little cumbersome: you need to add each port as a uniquely named port to the pod container spec. While doing so, you can also add the WEB_CONCURRENCY and METRICS_START_PORT environment variables to the container.

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: app-django
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/name: app-django
    spec:
      containers:
      - env:
        - name: METRICS_START_PORT
          value: "5001"
        - name: WEB_CONCURRENCY
          value: "5"
        image: mycompany/app:v0.1.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health/
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: django
        ports:
        - containerPort: 5000
          name: http
          protocol: TCP
        - containerPort: 5001
          name: metrics-1
          protocol: TCP
        - containerPort: 5002
          name: metrics-2
          protocol: TCP
        - containerPort: 5003
          name: metrics-3
          protocol: TCP
        - containerPort: 5004
          name: metrics-4
          protocol: TCP
        - containerPort: 5005
          name: metrics-5
          protocol: TCP

Note how we create ports named metrics-1 through metrics-5 to match the five workers (WEB_CONCURRENCY=5) passed to our app container running under Gunicorn.

service.yml

apiVersion: v1
kind: Service
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: metrics-1
    port: 5001
    protocol: TCP
    targetPort: metrics-1
  - name: metrics-2
    port: 5002
    protocol: TCP
    targetPort: metrics-2
  - name: metrics-3
    port: 5003
    protocol: TCP
    targetPort: metrics-3
  - name: metrics-4
    port: 5004
    protocol: TCP
    targetPort: metrics-4
  - name: metrics-5
    port: 5005
    protocol: TCP
    targetPort: metrics-5
  selector:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  type: ClusterIP

Here again, each port name and port number has to match. The prometheus-operator will use the selector plus the named ports to find all the endpoints on the pods to scrape.

How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

servicemonitor.yml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    ...
  labels:
    ...
  name: app-django
  namespace: app
spec:
  endpoints:
  - interval: 15s
    port: metrics-1
  - interval: 15s
    port: metrics-2
  - interval: 15s
    port: metrics-3
  - interval: 15s
    port: metrics-4
  - interval: 15s
    port: metrics-5
  selector:
    matchLabels:
      app.kubernetes.io/name: app-django

See how the selector matches the labels on the service, and how the endpoint port names match the named ports on both the deployment and the service.

Bonus: easy creation of the YAML templates with Helm

Since there is a lot of repetition in the Kubernetes apply files, it makes sense to automate their creation using Helm. A full discussion of Helm is beyond the scope of this article, but I can show how to use a Helm range loop to generate the repeated parts of the files above.

First, in your chart’s values.yaml, add the following block:

env:
  config:
    METRICS_START_PORT: 5001
    WEB_CONCURRENCY: 5

In the env block of deployment.yml, use this to add all environment variables defined in values.yaml or passed to the chart at installation time:

env:
  {{- range $key, $value := .Values.env.config }}
  - name: {{ $key }}
    value: {{ $value | quote }}
  {{- end }}

In the ports block of deployment.yml, use this to auto-generate the numbered metrics-# ports.

ports:
  - name: http
    containerPort: 5000
    protocol: TCP            
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - name: metrics-{{ add1 $index }}
    containerPort: {{ $port }}
    protocol: TCP
  {{- end }}
  {{- end }}

In the ports block of service.yml:

ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
    {{- with .Values.env.config }}
    {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
    {{- range $index, $port := $metricsPort }}
    - port: {{ $port }}
      targetPort: metrics-{{ add1 $index }}
      protocol: TCP
      name: metrics-{{ add1 $index }}
    {{- end }}
    {{- end }}

And in the endpoints block of servicemonitor.yml:

endpoints:
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  
  {{- range $index, $port := $metricsPort }}
  - port: metrics-{{ add1 $index }}
    interval: 15s
  {{- end }}
  {{- end }}

Dynamic Docker config management

I’ve been working on building a Prometheus monitoring stack in Docker Swarm, and I ran into an interesting challenge: how to separate my Prometheus configuration update process from my container deployment process. The solution I came up with is one that I think can be adapted to other applications with similar properties.

Configuration repository with polling

Prometheus, like many open-source DevOps tools, uses configuration files for its configuration. The challenge was to find a way of connecting these files to the Prometheus Docker container while also allowing the configuration to be updated from version control. Prometheus already has an external trigger to load an updated configuration:

# Start Prometheus with --web.enable-lifecycle to allow configuration reloads via HTTP POST
prometheus --web.enable-lifecycle [other startup flags...]

# Trigger configuration reload
curl -X POST http://prometheus:9090/-/reload

The trick is to do the following:

  • pass --web.enable-lifecycle as a command-line parameter to your Prometheus container
  • mount an external volume into your Prometheus container at /etc/prometheus (or wherever you have configured Prometheus to look for its configuration)
  • use a second container that also mounts the configuration volume and runs an update script on a schedule. The script:
    • pulls the latest version of the config from version control
    • if there is an update, copies the config to the configuration volume
    • sends a signal to Prometheus to reload the configuration

Here is an example of the update script:

#!/bin/bash

CONFIGDIR=$PROMETHEUS_CONFIG_DIR/
WORKDIR=/root
REPODIR=${WORKDIR}/prom-config
REPO=$PROMETHEUS_CONFIG_REPO
NOTIFY_HOST=$PROMETHEUS_HOST_DNS
NOTIFY_PORT=$PROMETHEUS_PORT
NOTIFY_PATH=$PROMETHEUS_NOTIFY_PATH
NOTIFY_METHOD=POST

copy_config () {
  rsync -a --exclude='.*' $REPODIR/ $CONFIGDIR
  echo "config deployed"
}

notify_endpoints () {
  for IP in $(dig $NOTIFY_HOST +short); do curl -X $NOTIFY_METHOD $IP:$NOTIFY_PORT$NOTIFY_PATH; done
  echo "endpoints notified"
}

git_initial_clone () {
    git clone $REPO $REPODIR
    echo "initial clone"
}

git_no_repo () {
    git init
    git remote add origin $REPO
    git pull origin master --force
    echo "clone to existing non-repo dir"
}

if [[ ! -d $REPODIR ]]; then
    git_initial_clone
    copy_config
    notify_endpoints
else
    cd $REPODIR
    if [ ! -d .git ]; then
        git_no_repo
        copy_config
        notify_endpoints
    else
        git remote update
        UPSTREAM=${1:-'@{u}'}
        LOCAL=$(git rev-parse @)
        REMOTE=$(git rev-parse "$UPSTREAM")
        BASE=$(git merge-base @ "$UPSTREAM")

        if [ $LOCAL = $REMOTE ]; then
            echo "Up-to-date"
        elif [ $LOCAL = $BASE ]; then
            git pull --force
            echo "changes pulled"
            copy_config
            notify_endpoints
        fi
    fi
fi

And here is an example docker-compose file that puts it all together.

---
version: '3.7'
services:
  scheduler:
    image: nralbers/scheduler:latest
    configs:
      - source: prom_update_script
        target: /etc/periodic/1min/update_prometheus
        mode: 0555
      - source: ssh_config
        target: /root/.ssh/config
        mode: 0400
    environment:
      - PROMETHEUS_CONFIG_DIR=/etc/prometheus
      - PROMETHEUS_CONFIG_REPO=<your config git repo>
      - PROMETHEUS_HOST_DNS=tasks.prometheus
      - PROMETHEUS_PORT=9090
      - PROMETHEUS_NOTIFY_PATH=/-/reload
    secrets:
      - source: ssh_key
        target: id_rsa
        mode: 0400
    volumes:
    - prom-config:/etc/prometheus  

  prometheus:
    image: prom/prometheus:latest
    ports:
    - 9090:9090
    command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'
    - '--storage.tsdb.path=/prometheus'
    - '--web.console.libraries=/usr/share/prometheus/console_libraries'
    - '--web.console.templates=/usr/share/prometheus/consoles'
    volumes:
    - prom-data:/prometheus
    - prom-config:/etc/prometheus
    depends_on:
    - scheduler
    - cadvisor
  cadvisor:
    image: google/cadvisor:latest
    ports:
    - 8080:8080
    volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
configs:
  ssh_config:
    file: ssh_config
  prom_update_script:
    file: update_prometheus.sh
secrets:
  ssh_key:
    file: ${HOME}/.ssh/id_rsa
volumes:
  prom-config: {}
  prom-data: {}

The scheduler image consists of an Alpine image modified to add some extra cron schedules and to run the update script. It is hosted on Docker Hub; the source code is here:
https://github.com/nralbers/docker-scheduler

FROM alpine:latest
LABEL maintainer="nralbers@gmail.com"
LABEL version="1.0"
LABEL description="Image running crond with additional schedule options for /etc/periodic/1min and /etc/periodic/5min. \
The image has bash, bind-tools, git & openssh installed. To use: bind mount the scripts you want to schedule to /etc/periodic/<period>"
RUN apk update && apk add bash bind-tools openssh git curl rsync
RUN mkdir -p /etc/periodic/1min && echo "*       *       *       *       *       run-parts /etc/periodic/1min" >> /etc/crontabs/root
RUN mkdir -p /etc/periodic/5min && echo "*/5     *       *       *       *       run-parts /etc/periodic/5min" >> /etc/crontabs/root
ENTRYPOINT ["crond", "-f", "-d", "8"]

General applications

It should be clear that, while this example is aimed at dockerised deployments of Prometheus, the approach will also work in other situations as long as the application has the following properties (a minimal sketch of a generic update loop follows the list):

  • The application uses configuration files for configuration
  • It has an external means of forcing a configuration reload
  • It is possible to share the configuration storage volume between the application container and the script that pulls the updated config from version control and triggers the reload
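To make the pattern concrete, here is a minimal Python sketch of the same poll-copy-reload cycle for a hypothetical application. The repository path, config directory and reload URL are placeholders of mine, not part of the setup above; the bash script earlier in this article is what actually runs in my stack.

import shutil
import subprocess
import urllib.request

# Hypothetical locations -- adjust these for your own application.
REPO_DIR = "/root/app-config"                # working clone of the config repository
CONFIG_DIR = "/etc/myapp"                    # volume shared with the application container
RELOAD_URL = "http://myapp:8080/-/reload"    # the application's reload endpoint

def update_config():
    """Pull the config repo; if anything changed, copy it into place and trigger a reload."""
    before = subprocess.check_output(["git", "-C", REPO_DIR, "rev-parse", "HEAD"])
    subprocess.check_call(["git", "-C", REPO_DIR, "pull", "--ff-only"])
    after = subprocess.check_output(["git", "-C", REPO_DIR, "rev-parse", "HEAD"])
    if before == after:
        return  # nothing new, leave the running config alone
    # Copy the repository contents (minus .git) onto the shared configuration volume.
    shutil.copytree(REPO_DIR, CONFIG_DIR, dirs_exist_ok=True,
                    ignore=shutil.ignore_patterns(".git*"))
    # Ask the application to reload its configuration.
    urllib.request.urlopen(urllib.request.Request(RELOAD_URL, method="POST"))

if __name__ == "__main__":
    update_config()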

Next steps…

  • Add a mechanism to validate the config before pushing to the destination system. Ideally, this should also happen on push to the configuration repository.
  • Show how to run different versions of the config from the same repository using deployment branches for dev, acceptance and production
  • Mechanism to secure configuration in a repository. This could be achieved using ansible-vault to encrypt the configuration pre-checkin and to decrypt on pull. The encryption key can be attached to the scheduler container using a docker secret.