Configuring prometheus metrics on a containerised Django app


Properly setting up the metrics endpoints on a django app running in a docker container has a few important gotcha’s that you should be aware of. This article is specifically aimed at people running the stack described below, but aspects of this are general, so should also work in other contexts.

I’m assuming that the person reading this is already comfortable programming django applications, and knows the basics of how Django settings work.

I’m also assuming a basic knowledge of both Dockerfiles and Kubernetes apply files. If you are working with Kubernetes, I’d strongly advise also looking at Helm, since that makes deploying and managing your applications in Kubernetes so much simpler.


  • Django, using django-prometheus to expose metrics
  • Gunicorn webserver
  • App + server bundled into Docker container
  • Container deployed into Kubernetes
  • Kubernetes uses prometheus-operator to monitor applications in the cluster

The Issue: Handling multiple workers

Like most python WSGI servers, Gunicorn spawns worker processes to better handle concurrent requests. However, without further configuration, django-Prometheus assumes it is running in a single process. Unless you explicitly configure it further, exposing a metrics endpoint on the app via as shown in the README (see below) will result in each metrics request being forwarded to a different worker, leading to counters jumping up and down in value between scrapes.

urlpatterns = [
    url('', include('django_prometheus.urls')),

The Solution: Each worker exposes a metrics endpoint on it’s own port

How to handle this is documented in the library, but this documentation is not linked in the README and is easy to miss. The solution is to specify a port range that each worker (running it’s own copy of the app) will then try to bind a metrics endpoint to.


However, there are still some issues that must be solved.

  1. How to match the port range to the number of workers
  2. How to expose those ports on a kubernetes pod spec & service
  3. How to define a prometheus-operator serviceMonitor that will automatically scrape those ports for metrics

Matching the port range to the number of workers

There are a number of ways of doing this, but the easiest is to specify the number of workers by exporting the environment variable WEB_CONCURRENCY=<desired number of workers> to the docker container running your app.

Note that if you do this, you MUST NOT set the number of workers on either the commandline or in as this will override this environment setting.

Then if you also expose the variable METRICS_START_PORT=<desired start port number for worker metrics endpoints>, you can add the following code to your Django to read in these variables and set the port range appropriately.

import os

start_port = int(os.environ.get("METRICS_START_PORT"))
workers = int(os.environ.get("WEB_CONCURRENCY"))

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)


Adding this setting to a settings file that is used for development with runserver will give an error about the autoreloader not working in this configuration. Either use a separate settings file for development, or run your local test server with runserver --noreload

How to expose those ports on a kubernetes pod spec & service

This is a little cumbersome, but you will need to add each port as a uniquely named port to the pod container spec. While doing so, you can also add the WEB_CONCURRENCY and METRICS_START_PORT settings to your environment.


apiVersion: apps/v1
kind: Deployment
  labels: app app-django
  name: app-django
  namespace: app
  replicas: 1
  revisionHistoryLimit: 10
    matchLabels: app app-django
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
      creationTimestamp: null
      labels: app app-django
      - env:
        - name: METRICS_START_PORT
          value: "5001"
        - name: WEB_CONCURRENCY
          value: "5"
        image: mycompany/app:v0.1.0
        imagePullPolicy: Always
          failureThreshold: 3
            path: /health/
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: django
        - containerPort: 5000
          name: http
          protocol: TCP
        - containerPort: 5001
          name: metrics-1
          protocol: TCP
        - containerPort: 5002
          name: metrics-2
          protocol: TCP
        - containerPort: 5003
          name: metrics-3
          protocol: TCP
        - containerPort: 5004
          name: metrics-4
          protocol: TCP
        - containerPort: 5005
          name: metrics-5
          protocol: TCP

Note how we create ports named metrics-1 through metrics-5, to match the 5 workers we’ve passed to our app container running with gunicorn.


apiVersion: v1
kind: Service
  labels: app app-django
  name: app-django
  namespace: app
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: metrics-1
    port: 5001
    protocol: TCP
    targetPort: metrics-1
  - name: metrics-2
    port: 5002
    protocol: TCP
    targetPort: metrics-2
  - name: metrics-3
    port: 5003
    protocol: TCP
    targetPort: metrics-3
  - name: metrics-4
    port: 5004
    protocol: TCP
    targetPort: metrics-4
  - name: metrics-5
    port: 5005
    protocol: TCP
    targetPort: metrics-5
  selector: app app-django
  type: ClusterIP

Here again we have to match each port name and port. The prometheus-operator will use the selector + the named ports to find all the endpoints on the pods to scrape.

How to define a prometheus-operator serviceMonitor that will automatically scrape those ports for metrics.


kind: ServiceMonitor
  name: app-django
  namespace: app
  - interval: 15s
    port: metrics-1
  - interval: 15s
    port: metrics-2
  - interval: 15s
    port: metrics-3
  - interval: 15s
    port: metrics-4
  - interval: 15s
    port: metrics-5
    matchLabels: app-django

See how the selector matches the labels on the service, and the named port names match the port names on both the deployment and the service.

Bonus: easy creation of yml template with helm.

Since there is a lot of repetition in the kubernetes apply files, it makes sense to automate their creation using helm. A full discussion of helm goes beyond the scope of this article, but I can show how to create a helm range loop to create the repeated parts of the above files.

First, in your chart’s values.yml, add the following block:


In the env block of deployment.yml, use this to add all environment variable values defined in values.yml or passed to the chart at installation time:

  {{- range $key, $value := .Values.env.config }}
  - name: {{ $key }}
    value: {{ $value | quote }}
  {{- end }}

In the ports block of deployment.yml, use this to auto-generate the numbered metrics-# ports.

  - name: http
    containerPort: 5000
    protocol: TCP            
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - name: metrics-{{ add1 $index }}
    containerPort: {{ $port }}
    protocol: TCP
  {{- end }}
  {{- end }}

In the ports block of service.yml

    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
    {{- with .Values.env.config }}
    {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
    {{- range $index, $port := $metricsPort }}
    - port: {{ $port }}
      targetPort: metrics-{{ add1 $index }}
      protocol: TCP
      name: metrics-{{ add1 $index }}
    {{- end }}
    {{- end }}

And in the endpoints block of servicemonitor.yml

  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - port: metrics-{{ add1 $index }}
    interval: 15s
  {{- end }}
  {{- end }}

Create Prometheus metrics from a dynamic source in Python

While the process for adding Prometheus metrics to a Python application is well documented in the prometheus_client documentation, dealing with adding metrics when you only know what the metric name or labels are going to be at runtime is trickier. Normal metric classes expect to be declared at module level so the default collector can pick them up. The documentation hints at a solution however. Use a Custom Collector.

The maintainer of the python client library has already done an excellent write-up on how to use custom collectors to take data from existing systems and create an exporter with them. The article (on extracting Jenkins job information) is here:

This article will describe how I took a Django application I wrote to store information on service level agreements, and expose component service window information as metrics to the application’s own metrics endpoint (Implemented with the excellent django-prometheus package).


To add a Custom collector to a Django application, you will need to do three things:

  1. Have a model or models that supply data you want to turn into metrics.
  2. Write the collector class.
  3. Register the class with the prometheus client global registry ONCE ONLY, and make sure this happens AFTER the database has initialised, and only when the django app is actually running. This last part is probably the part that caused me the most grief.

Assuming you’ve already carried out step one, this is how you go about steps 2 and 3:

Step 2: Write the collector

A collector class is a class that implements the ‘collect’ method. The ‘collect’ method is a generator, that yields <type>MetricFamily objects, where <type> can be a Counter, GaugeHistogram, Gauge, Histogram, Info, StateSet, Summary, Unknown, or Untyped metric type.

Example (

from prometheus_client.core import GaugeMetricFamily
from django.utils import timezone
from .models import Component

SERVICE_WINDOW_LAST_START_METRIC = 'service_window_last_start'
SERVICE_WINDOW_LAST_START_DOC = 'Last start time of the service window'
SERVICE_WINDOW_LAST_END_METRIC = 'service_window_last_end'
SERVICE_WINDOW_LAST_END_DOC = 'Last end time of the service window'
SERVICE_WINDOW_NEXT_START_METRIC = 'service_window_next_start'
SERVICE_WINDOW_NEXT_START_DOC = 'Next start time of the service window'
SERVICE_WINDOW_NEXT_END_METRIC = 'service_window_next_end'
SERVICE_WINDOW_NEXT_END_DOC = 'Next end time of the service window'
SERVICE_WINDOW_IN_WINDOW_METRIC = 'service_window_in_window'
SERVICE_WINDOW_IN_WINDOW_DOC = 'Is the service window active (1 for yes, 0 for no)'

class ComponentCollector(object):
    def collect(self):
        moment =
        components = Component.objects.all()
        metrics = {}

        for component in components:
            labels = component.get_labels()
            prefix ='-', '_') + "_"
            metrics[] = {
                'last_start': GaugeMetricFamily(''.join( (prefix, SERVICE_WINDOW_LAST_START_METRIC)),
                                                SERVICE_WINDOW_LAST_START_DOC, labels=labels.keys()),

                'last_end': GaugeMetricFamily(''.join( (prefix, SERVICE_WINDOW_LAST_END_METRIC)),
                                              SERVICE_WINDOW_LAST_END_DOC, labels=labels.keys()),

                'next_start': GaugeMetricFamily(''.join( (prefix, SERVICE_WINDOW_NEXT_START_METRIC)),
                                                SERVICE_WINDOW_NEXT_START_DOC, labels=labels.keys()),

                'next_end': GaugeMetricFamily(''.join( (prefix, SERVICE_WINDOW_NEXT_END_METRIC)),
                                              SERVICE_WINDOW_NEXT_END_DOC, labels=labels.keys()),

                'in_window': GaugeMetricFamily(''.join( (prefix, SERVICE_WINDOW_IN_WINDOW_METRIC)),
                                               SERVICE_WINDOW_IN_WINDOW_DOC, labels=labels.keys()),

        for comp in metrics.keys():
            for metric in metrics[comp].values():
                yield metric

In this example, I’ve taken a Component model, that exposes the service window last and next start & end times, plus indicates if the current time is in a service window for the component. The metrics:

  • <component_name>_service_window_last_start
  • <component_name>_service_window_last_end
  • <component_name>_service_window_next_start
  • <component_name>_service_window_next_end
  • <component_name>_service_window_in_window

are created, and the labels added to the component are added as metric labels to the metrics.

The <type>MetricFamily class does the rest of the work. The default prometheus registry class will run the collect once to store the metric definitions, then run collect to obtain updated metric values on each scrape.

Step 3: Registering the collector

This involves some Django trickery in the module of your project.

You will need to do the following:

  1. Write a migration hook to register if you are running a migration instead of the actual application.
  2. Write another hook to register when you’ve connected to the database.
  3. Register both hooks in the AppConfig ready method.
  4. Register your Collector class with the prometheus registry the first time the database connection hook fires ONLY.

Example (

from django.apps import AppConfig

from django.db.models.signals import post_migrate
from django.db.backends.signals import connection_created
from prometheus_client import REGISTRY

migration_executed = False
monitoring_initialised = False

def post_migration_callback(sender, **kwargs):
    global migration_executed'Migration executed')
    migration_executed = True

def connection_callback(sender, connection, **kwargs):
    global monitoring_initialised
    # Check to see if we are not running a unittest temp db
    if not connection.settings_dict['NAME'] == 'file:memorydb_default?mode=memory&cache=shared':
        if not monitoring_initialised:
            from .monitoring import ComponentCollector
            monitoring_initialised = True

class ComponentSlaMonitorConfig(AppConfig):
    name = 'component_sla_monitor'

    def ready(self):
        global migration_executed
        post_migrate.connect(post_migration_callback, sender=self)

        if not migration_executed:

Note that we only import the Collector in the connection_callback hook. This is because importing at the top of the file will cause django database errors.

Also, note the check to see if the DB connection is with an in-memory database. This is to disable monitoring registration during unit tests.

This code is based on Django 2.2. The ready method, and some of the hooks have only been available since Django 1.7

Prometheus: Adding a label to a target

Prometheus relabel configs are notoriously badly documented, so here’s how to do something simple that I couldn’t find documented anywhere: How to add a label to all metrics coming from a specific scrape target.


  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    - targets: ['localhost:9090']
    # Add your relabel config under the scrape configs
        # source label must be one that exists, so use __address__
      - source_labels: [__address__]
        # target label is the one you want to create
        target_label: my_new_label
        replacement: "my-new-label-value"

And there you have it.

This will create a new label “my_new_label” with the fixed value “my-new-label-value“.

How does it work?

If you don’t supply them, the default settings for a relabel_config are:

  • action: replace
  • regex: (.*)
  • separator: ;

By choosing a single always existing source label (__address__ always exists), you are guaranteed to get a source match for replacing the target_label with. The default regex wil always match, which causes the replacement to be carried out. However, we’re not specifying any match group’s in our replacement string, so the entire string is just copied into target_label. This is just a very specific case of how you can use a relabel_config to copy (parts of) a label into another (new) label.

Prometheus Alertmanager cluster in Docker Swarm

Prometheus monitoring and Docker combine together really well, but configuring an Alertmanager cluster can be a bit of a challenge if you don’t find the trick. This article shows a method that both works, and isn’t overly complicated to set up.

The trick is, that while it isn’t possible to pass the cluster.peer parameters correctly to a single service entry, you can use 2 or more numbered service entries instead, and define a network alias to combine them into a DNS-searchable whole for your further configuration.

Docker compose configuration

This is a sample docker compose that can be instantiated in Docker swarm using docker stack deploy –compose-file …

# docker-compose.yml
version: '3.7'
    image: prom/alertmanager:latest
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.peer=tasks.alertmanager_2:9094'
      mode: global
          - node.hostname == swarm-manager000000
          - alertmanager
        - '19093:9093'
      - alertmanager-data:/alertmanager
      - alertmanager-config:/etc/alertmanager
    image: prom/alertmanager:latest
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.peer=tasks.alertmanager_1:9094'
      mode: global
          - node.hostname == swarm-manager000001
          - alertmanager
        - '29093:9093'
      - alertmanager-data:/alertmanager
      - alertmanager-config:/etc/alertmanager
    driver: overlay
    attachable: true 
  alertmanager-data: {}  
  alertmanager-config: {} 

Note the following aspects:

  • Each alertmanager gets a named service, locked to a single node via placement constraints (fill in your own node names here).
  • The cluster.peer setting refers to the service name of the other alertmanager service(s)
  • This configuration is ready to use my method for updating configuration using git push as described in my article on Dynamic Docker configuration management
  • Because we’ve defined a network alias on each alertmanager service, we can use DNS service discovery in our Prometheus config file to find the alertmanagers, using docker swarm’s task.<servicename> DNS entries.


# Alertmanager configuration
  - dns_sd_configs:
    - names:
      - 'tasks.alertmanager'
      type: 'A'
      port: 9093

Things to be aware of…

  • When running on a cluster, if you are using a volume for storing the alertmanager configuration, you should be using a shared storage volume driver. My own swarm is running on Docker for Azure, and uses the cloudstor:azure driver.
    • If you can’t do this, you’ll have to attach your config files using configs: blocks. For static configurations this is fine, but in an active environment, versioning your config names becomes a nuisance very fast.

Dynamic Docker config management

I’ve been working on building a Prometheus monitoring stack in Docker swarm, and I ran into an interesting challenge, namely, how to separate my prometheus configuration update process from my container deployment process. The solution I came up with is one that I think can be adapted for other applications with similar properties.

Configuration repository with polling

Prometheus, like many opensource devops tools, uses configuration files to manage it’s configuration. The challenge was to find a way of connecting these files to the prometheus docker container and also allow for the configuration to be updated from version control. Prometheus already has an external trigger to load an updated configuration:

# Load prometheus with web.enable-lifecycle to allow reload via HTTP POST
prometheus --web.enable-lifecycle [other startup flags...]

# Trigger configuration reload
curl -X POST http://prometheus:9090/-/reload

The trick is to do the following:

  • pass –web.enable-lifecycle as command line parameter to your prometheus container
  • mount an external volume to your prometheus containers as /etc/prometheus (or wherever you have configured prometheus to find it’s configuration.
  • use a second container that also mounts the configuration volume, and does the following:
    • runs an update script on a schedule that does the following:
      • pulls the latest version of the config
      • If there is an update, copy the config to the configuration volume
      • Sends a signal to prometheus to reload the configuration

Here is an example of the update script:



copy_config () {
  rsync -a --exclude='.*' $REPODIR/ $CONFIGDIR
  echo "config deployed"

notify_endpoints () {
  for IP in $(dig $NOTIFY_HOST +short); do curl -X $NOTIFY_METHOD $IP:$NOTIFY_PORT$NOTIFY_PATH; done
  echo "endpoints notified"

git_initial_clone () {
    git clone $REPO $REPODIR
    echo "initial clone"

git_no_repo () {
    git init
    git remote add origin $REPO
    git pull origin master --force
    echo "clone to existing non-repo dir"

if [[ ! -d $REPODIR ]]; then
    cd $REPODIR
    if [ ! -d .git ]; then
    	git remote update
    	LOCAL=$(git rev-parse @)
    	REMOTE=$(git rev-parse "$UPSTREAM")
    	BASE=$(git merge-base @ "$UPSTREAM")

    	if [ $LOCAL = $REMOTE ]; then
            echo "Up-to-date"
    	elif [ $LOCAL = $BASE ]; then
        	git pull --force
        	echo "changes pulled"

And here is an example docker-compose file that puts it all together.

version: '3.7'
    image: nralbers/scheduler:latest
      - source: prom_update_script
        target: /etc/periodic/1min/update_prometheus
        mode: 0555
      - source: ssh_config
        target: /root/.ssh/config
        mode: 0400
      - PROMETHEUS_CONFIG_DIR=/etc/prometheus
      - PROMETHEUS_CONFIG_REPO= <your config git repo>
      - PROMETHEUS_HOST_DNS=tasks.prometheus
      - PROMETHEUS_PORT=9090
      - PROMETHEUS_NOTIFY_PATH=/-/reload
      - source: ssh_key
        target: id_rsa
        mode: 0400
    - prom-config:/etc/prometheus  

    image: prom/prometheus:latest
    - 9090:9090
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'
    - '--storage.tsdb.path=/prometheus'
    - '--web.console.libraries=/usr/share/prometheus/console_libraries'
    - '--web.console.templates=/usr/share/prometheus/consoles'
    - prom_data:/prometheus
    - prom_config:/etc/prometheus
    - configloader
    - cadvisor
    image: google/cadvisor:latest
    - 8080:8080
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    file: ssh_config
    file: ${HOME}/.ssh/id_rsa
  prom-config: {}
  prom-data: {}

The scheduler image consists of an alpine image modified to add some new cron schedules, and to be able to run the update script. It is hosted in docker hub, source code is here:

FROM alpine:latest
LABEL maintainer=""
LABEL version="1.0"
LABEL description="Image running crond with additional schedule options for /etc/periodic/1min and /etc/periodic/5min. \
The image has bash, bind-tools, git & openssh installed. To use: bind mount the scripts you want to schedule to /etc/periodic/<period>"
RUN apk update && apk add bash bind-tools openssh git curl rsync
RUN mkdir -p /etc/periodic/1min && echo "*       *       *       *       *       run-parts /etc/periodic/1min" >> /etc/crontabs/root
RUN mkdir -p /etc/periodic/5min && echo "*/5     *       *       *       *       run-parts /etc/periodic/5min" >> /etc/crontabs/root
ENTRYPOINT ["crond", "-f", "-d", "8"]

General applications

It should be clear that while this example is aimed at dockerised deployments of prometheus, this will also work in other situations as long as the application has the following properties:

  • The application uses configuration files for configuration
  • It has an external means of forcing a configuration reload
  • It is possible to share the configuration storage volume between the application container and the script that pulls the updated config from version control and triggers the reload

Next steps…

  • Add a mechanism to validate the config before pushing to the destination system. Ideally, this should also happen on push to the configuration repository.
  • Show how to run different versions of the config from the same repository using deployment branches for dev, acceptance and production
  • Mechanism to secure configuration in a repository. This could be achieved using ansible-vault to encrypt the configuration pre-checkin and to decrypt on pull. The encryption key can be attached to the scheduler container using a docker secret.