Configuring Prometheus metrics on a containerised Django app

Introduction

Properly setting up the metrics endpoints on a Django app running in a Docker container has a few important gotchas that you should be aware of. This article is aimed specifically at people running the stack described below, but parts of it are general and should also apply in other contexts.

I’m assuming that the person reading this is already comfortable programming Django applications, and knows the basics of how Django settings work.

I’m also assuming a basic knowledge of both Dockerfiles and Kubernetes apply files. If you are working with Kubernetes, I’d strongly advise also looking at Helm, since that makes deploying and managing your applications in Kubernetes so much simpler.

Stack

  • Django, using django-prometheus to expose metrics
  • Gunicorn webserver
  • App + server bundled into Docker container
  • Container deployed into Kubernetes
  • Kubernetes uses prometheus-operator to monitor applications in the cluster

The Issue: Handling multiple workers

Like most Python WSGI servers, Gunicorn spawns worker processes to better handle concurrent requests. However, without further configuration, django-prometheus assumes it is running in a single process. If you expose a metrics endpoint on the app via urls.py as shown in the README (see below), each metrics request is handled by whichever worker happens to pick it up, and each worker only reports its own counters. The result is counters that jump up and down in value between scrapes.

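from django.urls import include, path

urlpatterns = [
    ...
    # url() was removed in Django 4.0; path() is the current equivalent.
    path('', include('django_prometheus.urls')),
]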
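To make the symptom concrete, here is a small simulation. It is only a sketch and does not use Gunicorn or django-prometheus: the function name, the worker count, and the request volume are all made up for illustration. Each simulated worker keeps its own request counter, and every scrape is answered by an arbitrary worker.

```python
import random


def simulate_scrapes(workers=4, requests_per_interval=100, scrapes=6, seed=0):
    """Simulate per-worker request counters scraped through one shared endpoint."""
    rng = random.Random(seed)
    counters = [0] * workers  # one independent counter per worker process
    observed = []
    for _ in range(scrapes):
        # Requests arriving between two scrapes are spread over the workers.
        for _ in range(requests_per_interval):
            counters[rng.randrange(workers)] += 1
        # The scrape itself is also answered by an arbitrary worker.
        observed.append(counters[rng.randrange(workers)])
    return observed, counters


observed, counters = simulate_scrapes()
print("values seen by Prometheus:", observed)
print("true total across workers:", sum(counters))
```

Every value Prometheus sees is one worker's share of the traffic, never the total, and consecutive scrapes can go down as well as up depending on which worker answers.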

The Solution: Each worker exposes a metrics endpoint on its own port

How to handle this is documented in the library, but this documentation is not linked in the README and is easy to miss. The solution is to specify a port range that each worker (running its own copy of the app) will then try to bind a metrics endpoint to.

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8050)

However, there are still some issues that must be solved.

  1. How to match the port range to the number of workers
  2. How to expose those ports on a Kubernetes pod spec & service
  3. How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

Matching the port range to the number of workers

There are a number of ways of doing this, but the easiest is to specify the number of workers by setting the environment variable WEB_CONCURRENCY=<desired number of workers> in the Docker container running your app.

Note that if you do this, you MUST NOT set the number of workers on either the command line or in gunicorn_config.py, as that will override the environment setting.

Then if you also expose the variable METRICS_START_PORT=<desired start port number for worker metrics endpoints>, you can add the following code to your Django settings.py to read in these variables and set the port range appropriately.

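import os

# Both variables must be set in the container environment. A missing one
# raises KeyError at startup instead of an opaque TypeError from int(None).
start_port = int(os.environ["METRICS_START_PORT"])
workers = int(os.environ["WEB_CONCURRENCY"])

# One metrics port per Gunicorn worker.
PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)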
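The snippet above assumes both variables are always set. If you would rather have the settings file fall back to defaults and reject nonsensical values, something along these lines could work. This is a sketch: the helper name metrics_port_range and the default values are my own choices for illustration, not something django-prometheus or Gunicorn provides.

```python
import os


def metrics_port_range(environ=os.environ, default_start=8001, default_workers=1):
    """Build the per-worker metrics port range from environment variables.

    default_start and default_workers are illustrative fallbacks only.
    """
    start_port = int(environ.get("METRICS_START_PORT", default_start))
    workers = int(environ.get("WEB_CONCURRENCY", default_workers))
    if workers < 1:
        raise ValueError("WEB_CONCURRENCY must be at least 1")
    if start_port < 1 or start_port + workers - 1 > 65535:
        raise ValueError("metrics ports fall outside the valid TCP port range")
    return range(start_port, start_port + workers)


PROMETHEUS_METRICS_EXPORT_PORT_RANGE = metrics_port_range()
```

Passing the environment in as a parameter keeps the function easy to test with a plain dict, for example metrics_port_range({"METRICS_START_PORT": "5001", "WEB_CONCURRENCY": "5"}).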


Caveat

Adding this setting to a settings file that is used for development with manage.py runserver will give an error about the autoreloader not working in this configuration. Either use a separate settings file for development, or run your local test server with manage.py runserver --noreload.
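A third option, which I have not described above and offer only as a sketch, is to skip the port range when the settings are loaded by the development server. The helper running_dev_server is hypothetical, and checking sys.argv is a heuristic rather than an official Django mechanism. The port numbers are placeholders for the values read from the environment.

```python
import sys


def running_dev_server(argv):
    """Return True when Django was started via `manage.py runserver`."""
    return len(argv) > 1 and argv[1] == "runserver"


# In settings.py: only configure per-worker ports outside the dev server.
if not running_dev_server(sys.argv):
    start_port, workers = 5001, 5  # placeholders; read these from the environment
    PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(start_port, start_port + workers)
```

With this in place the development server starts without the port range configured, while Gunicorn and other management commands still get the per-worker setting.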

How to expose those ports on a Kubernetes pod spec & service

This is a little cumbersome, but you will need to add each port as a uniquely named port to the pod container spec. While doing so, you can also add the WEB_CONCURRENCY and METRICS_START_PORT settings to your environment.

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: app-django
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/name: app-django
    spec:
      containers:
      - env:
        - name: METRICS_START_PORT
          value: "5001"
        - name: WEB_CONCURRENCY
          value: "5"
        image: mycompany/app:v0.1.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health/
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: django
        ports:
        - containerPort: 5000
          name: http
          protocol: TCP
        - containerPort: 5001
          name: metrics-1
          protocol: TCP
        - containerPort: 5002
          name: metrics-2
          protocol: TCP
        - containerPort: 5003
          name: metrics-3
          protocol: TCP
        - containerPort: 5004
          name: metrics-4
          protocol: TCP
        - containerPort: 5005
          name: metrics-5
          protocol: TCP

Note how we create ports named metrics-1 through metrics-5, to match the 5 workers we’ve passed to our app container running with gunicorn.

service.yml

apiVersion: v1
kind: Service
metadata:
  annotations:
    ...
  labels:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  name: app-django
  namespace: app
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  - name: metrics-1
    port: 5001
    protocol: TCP
    targetPort: metrics-1
  - name: metrics-2
    port: 5002
    protocol: TCP
    targetPort: metrics-2
  - name: metrics-3
    port: 5003
    protocol: TCP
    targetPort: metrics-3
  - name: metrics-4
    port: 5004
    protocol: TCP
    targetPort: metrics-4
  - name: metrics-5
    port: 5005
    protocol: TCP
    targetPort: metrics-5
  selector:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: app-django
  type: ClusterIP

Here again we have to match each port name and port. The prometheus-operator will use the selector + the named ports to find all the endpoints on the pods to scrape.

How to define a prometheus-operator ServiceMonitor that will automatically scrape those ports for metrics

servicemonitor.yml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    ...
  labels:
    ...
  name: app-django
  namespace: app
spec:
  endpoints:
  - interval: 15s
    port: metrics-1
  - interval: 15s
    port: metrics-2
  - interval: 15s
    port: metrics-3
  - interval: 15s
    port: metrics-4
  - interval: 15s
    port: metrics-5
  selector:
    matchLabels:
      app.kubernetes.io/name: app-django

See how the selector matches the labels on the service, and how the port names match the port names on both the deployment and the service.
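Because the three files have to agree on port names, a small consistency check can catch a forgotten port before deployment. The following is a sketch that works on plain Python dicts mirroring the ports and endpoints lists above. The function check_port_names is hypothetical, and it assumes that targetPort always refers to a port by name, as it does in this article.

```python
def check_port_names(container_ports, service_ports, monitor_endpoints):
    """Return a list of mismatches between deployment, service and ServiceMonitor."""
    container_names = {port["name"] for port in container_ports}
    service_names = {port["name"] for port in service_ports}
    problems = []
    for port in service_ports:
        if port["targetPort"] not in container_names:
            problems.append(f"service port {port['name']} targets unknown container port {port['targetPort']}")
    for endpoint in monitor_endpoints:
        if endpoint["port"] not in service_names:
            problems.append(f"ServiceMonitor endpoint {endpoint['port']} has no matching service port")
    return problems


names = [f"metrics-{i}" for i in range(1, 6)]
container_ports = [{"name": n, "containerPort": 5000 + i} for i, n in enumerate(names, start=1)]
service_ports = [{"name": n, "port": 5000 + i, "targetPort": n} for i, n in enumerate(names, start=1)]
monitor_endpoints = [{"port": n, "interval": "15s"} for n in names]

ok = check_port_names(container_ports, service_ports, monitor_endpoints)
# Adding a sixth endpoint without a matching service port is reported.
broken = check_port_names(container_ports, service_ports, monitor_endpoints + [{"port": "metrics-6"}])
print(ok)
print(broken)
```

In practice you would load the dicts from your rendered manifests with a YAML parser of your choice; that step is left out here to keep the example free of third-party dependencies.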

Bonus: easy creation of yml templates with Helm

Since there is a lot of repetition in the Kubernetes apply files, it makes sense to automate their creation using Helm. A full discussion of Helm goes beyond the scope of this article, but I can show how to create a Helm range loop to create the repeated parts of the above files.

First, in your chart’s values.yaml, add the following block:

env:
  config:
    METRICS_START_PORT: 5001
    WEB_CONCURRENCY: 5

In the env block of deployment.yml, use this to add all environment variable values defined in values.yaml or passed to the chart at installation time:

env:
  {{- range $key, $value := .Values.env.config }}
  - name: {{ $key }}
    value: {{ $value | quote }}
  {{- end }}

In the ports block of deployment.yml, use this to auto-generate the numbered metrics-# ports:

ports:
  - name: http
    containerPort: 5000
    protocol: TCP
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - name: metrics-{{ add1 $index }}
    containerPort: {{ $port }}
    protocol: TCP
  {{- end }}
  {{- end }}

In the ports block of service.yml:

ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
    {{- with .Values.env.config }}
    {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
    {{- range $index, $port := $metricsPort }}
    - port: {{ $port }}
      targetPort: metrics-{{ add1 $index }}
      protocol: TCP
      name: metrics-{{ add1 $index }}
    {{- end }}
    {{- end }}

And in the endpoints block of servicemonitor.yml:

endpoints:
  {{- with .Values.env.config }}
  {{- $metricsPort := untilStep (int .METRICS_START_PORT) (int (add (int .METRICS_START_PORT) (int .WEB_CONCURRENCY))) 1 -}}
  {{- range $index, $port := $metricsPort }}
  - port: metrics-{{ add1 $index }}
    interval: 15s
  {{- end }}
  {{- end }}
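If the template expression looks opaque: untilStep behaves like Python's range (start inclusive, stop exclusive, fixed step), and add1 turns the zero-based loop index into the 1-based suffix of the port name. The following Python sketch mirrors what the loops above produce. The function name helm_metrics_ports is made up for this illustration.

```python
def helm_metrics_ports(metrics_start_port, web_concurrency):
    """Mirror the Helm loop: untilStep start (start + workers) 1, named via add1 $index."""
    ports = range(metrics_start_port, metrics_start_port + web_concurrency, 1)
    return [(f"metrics-{index + 1}", port) for index, port in enumerate(ports)]


for name, port in helm_metrics_ports(5001, 5):
    print(name, port)
```

With the values from values.yaml above this yields metrics-1 on 5001 through metrics-5 on 5005, matching the hand-written manifests.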

Closing the loop: managing production stability versus delivery velocity

People who come from a traditional IT environment will recognize the conflicts that occur when a release goes from development into production. In a situation where the development and operations organisations are separate, there is a huge disconnect between the priorities of each party. Development, with their customers breathing down their necks, want to deliver features to production as fast as possible. Operations, keeping a beady eye on the extremely tight Service Level Agreements (SLAs) signed off by their bosses, would rather that nothing changes at all, because changes break things. Both consider the other to be needlessly difficult and contrary.

And then we discovered there was a better way

Breaking this pattern is one of the reasons devops came into being, a fact not lost on my employer when we adopted devops as a standard over 4 years ago. However, there is more to breaking down the barriers between dev and ops than just putting them into one team. The essential dilemma here is that feature delivery does frequently cause production disruption, either due to unreliable delivery mechanisms, or due to technical debt in the code. Spotting these issues, and ensuring they get fixed, is what this blog article is about.

The need for feedback

To decide whether development effort should go into features or fixes, you need feedback on how your application is performing in production. To be able to judge this, there must be agreement on what acceptable performance is. The primary tools for this are the application’s Service Level Objectives (SLOs) [1], and the Service Level Indicators (SLIs) that are linked to these. For every application, there should be agreement on how available the application should be. This is frequently expressed as % availability, and people often talk about how many 9s there are in that figure. However, this can also be expressed as the time that the application can be down. For example, an application with an availability SLO of 99.99% can be down for roughly 52 minutes per year.
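The conversion from an availability percentage to allowed downtime is simple arithmetic. As a sketch, assuming a 365-day year and a function name of my own choosing:

```python
def allowed_downtime_minutes(slo_percent, period_days=365):
    """Minutes of downtime permitted by an availability SLO over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per year")
```

A year has 525,600 minutes, so 99.99% leaves 0.01% of that, which is about 52.6 minutes. Each additional 9 divides the allowance by ten.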

Let’s look at this differently…

We can use this time as an error budget [2]. It might sound a bit awkward, but an SLO of 99.99% gives the devops team a budget of roughly 52 minutes of downtime per year. If the budget is rapidly being used up by incidents, then development effort must be shifted from feature releases to solving the underlying issues causing the downtime, be that in the application code itself or in the Continuous Integration and Delivery infrastructure. The focus is then either on implementing or improving the automation in the pipeline, or on solving performance or reliability issues in the application code.
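As a sketch of how such a budget could be tracked, the function below subtracts recorded incident durations from the yearly budget and suggests where effort should go. The function name, the incident durations, and the warn_fraction threshold (switch to reliability work once less than a quarter of the budget remains) are all assumptions for illustration; the actual policy is something each team has to agree on.

```python
def error_budget_report(slo_percent, incident_minutes, period_days=365, warn_fraction=0.25):
    """Summarise how much of a downtime-based error budget is left.

    warn_fraction is a hypothetical policy knob, not a standard value.
    """
    budget = period_days * 24 * 60 * (100 - slo_percent) / 100
    spent = sum(incident_minutes)
    remaining = budget - spent
    focus = "reliability" if remaining < budget * warn_fraction else "features"
    return {"budget": budget, "spent": spent, "remaining": remaining, "focus": focus}


# Two made-up incidents of 12 and 30 minutes against a 99.99% SLO.
report = error_budget_report(99.99, incident_minutes=[12, 30])
print(report)
```

With 42 of the roughly 52.6 budgeted minutes already spent, the remaining budget is below the threshold and the suggested focus is reliability work rather than new features.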

Too much of a good thing however…

You would think that an application that stays within its error budget with uptime to spare is a good thing. This isn’t necessarily the case. Bad things can happen if a component in a larger whole is perceived to be more reliable than it is required to be. These components have a habit of being reused without taking into account that they might not be available. This is why it is important to artificially generate downtime on components if they are consistently exceeding their SLOs. Doing this will very quickly identify downstream dependencies that are making unwarranted assumptions about availability, and will help the devops team identify where further mitigation is required. Only by doing this is it possible to create a truly resilient production architecture. An example of tooling that supports this is Netflix’s Chaos Monkey.

Error budgets in the real world

Together with my colleagues, I implemented the practices described in this article at one of our clients. The traditional managed services contract with penalty clauses for SLA breaches was replaced. Instead, the client agreed that responsibility for SLA breaches lies with all parties. Our team worked embedded in the development team to implement an error budget per application, and worked with the development specialists to make sure this budget was not breached. We also moved from a single SLA to SLOs per application, and ensured that the state of the SLOs was visible to the development teams responsible for the applications in question. By adjusting these procedures, and using the embedded team to fully automate the CI pipelines, we achieved a feature delivery velocity of 2-3 production releases/day/application, without loss of application reliability.

In conclusion

The use of error budgets and monitoring feedback to control application reliability and delivery velocity is a devops practice that has wide applicability. It is less a technical fix than a best practice for collaboration that leads to faster, more dependable production releases. Using these practices does require a deep commitment from all parties involved in the environment. Without customer buy-in at the contract level, implementing these changes is extremely difficult. Thankfully, the benefits are obvious enough that getting that buy-in should not be an issue.

How to implement:

Do:

  • Set realistic SLOs, SLIs and SLAs for each application in your environment.
  • Have a generic set of SLOs, SLIs and SLAs available for use by new applications.
  • Express the current level of SLO realisation in monitoring as an error budget.
  • Have the current error budget visible on dashboards for the teams responsible for the applications.
  • Have both development and operations made responsible for meeting the error budget.
  • Have the procedure for dealing with error budget breaches set in the standard operating procedures for dev and ops. This should involve using dev resources for ops automation if at all possible.

Don’t:

  • Fail to get management buy-in at all levels.
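For the dashboard item in the list above, the budget does not have to be time-based. If your SLI is the share of successful requests, the consumed budget can be derived from two counters. This is a sketch with made-up numbers and a hypothetical function name; where the counts come from (for example your monitoring system) is left open.

```python
def budget_consumed(total_requests, failed_requests, slo_percent):
    """Fraction of the error budget used, based on a request success SLI."""
    if total_requests == 0:
        return 0.0  # no traffic means nothing has been spent yet
    allowed_failures = total_requests * (100 - slo_percent) / 100
    return failed_requests / allowed_failures


# Made-up figures: one million requests, 250 failures, 99.9% success SLO.
consumed = budget_consumed(1_000_000, 250, 99.9)
print(f"{consumed:.0%} of the error budget used")
```

A 99.9% objective allows about 1,000 failures per million requests, so 250 failures uses up about a quarter of the budget. A value above 1.0 means the budget has been breached.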

Further information

The procedures described in this article are discussed in much greater depth in the excellent book “Site Reliability Engineering: How Google Runs Production Systems” [3]. Chapters 1.3, 3, 4 and 6 describe in some depth the concepts touched upon in this article.

Originally published in a slightly modified form on the Mirabeau Blog. You can read the original article here.


References

[1, 2, 3] B. Beyer, C. Jones, J. Petoff and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems, O’Reilly Media, 2016.