Profiling Go Programs

TLDR

I tested the effectiveness of a cache implementation for a Go API using k6, cgroups and pprof, which pointed the way to a 100x performance improvement.

Aims

I am writing a simple replica of r/place, the old Reddit project created by Josh Wardle of Wordle fame.

Place Example Image

The data for each pixel is stored in a sqlite3 database. When a request is made, the backend currently fetches all pixel data from the database and generates the PNG, which is written to the response.

The frontend polls the backend every second for the latest version of the grid, as a PNG. Because of this polling frequency, there is no reason the image needs to be rebuilt more than once per second.

Furthermore, since the grid is the same for every user, it is a perfect candidate for caching the image data. I wanted to measure the performance benefit of generating the image at most once per second and caching the result. In theory this saves CPU cycles in any scenario above 1 request per second, and since only the latest version of the image needs to be cached, the pressure on memory is low.

Test components

In order to test the performance of the image-fetching API, I wanted to limit the server to 2 CPUs, generate load against the endpoint using k6, and profile the CPU usage using pprof.

cgroups

Control groups (cgroups) are a feature of the Linux kernel that constrains the resources available to a set of processes. In this case I want to constrain the server process to use only two CPUs.

cgroups are managed through a file system which can be found using the mount command.

mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)

In this case, the file system is mounted at /sys/fs/cgroup, and it is of type cgroup2, which is the modern cgroup v2 interface. A new cgroup can be created by making a new directory in the cgroup directory. A number of files will be created under this new directory which can be used to control resource usage.

sudo mkdir /sys/fs/cgroup/place
ls -1 /sys/fs/cgroup/place | head -5
cgroup.controllers
cgroup.events
cgroup.freeze
cgroup.kill
cgroup.max.depth

In order to control CPU usage of a single process there are just two files that need to be considered.

cgroup.procs

This file is a list of the PIDs of the processes which belong to the cgroup. In this scenario, this is the list of processes that will have their CPU cycles constrained.

cpu.max

This file controls the max amount of CPU time that the cgroup can use. The format for the file is

<max> <period>

which indicates that the group may consume up to <max> microseconds of CPU time in each window of <period> microseconds. Once the processes in the control group have exhausted this quota, they are throttled until the beginning of the following period.

50000 100000 

indicates that the group can consume up to 50 milliseconds of CPU time per 100 milliseconds.

Final configuration

To limit the Go process to 2 CPUs, the cpu.max file can be updated to

200000 100000 

That is, I want the process to have access to 200 milliseconds of CPU time per 100 milliseconds of wall-clock time, which is only possible by running on two CPUs in parallel.
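
For example, using the same pattern as the other cgroup writes in this post (and assuming the place cgroup created earlier), the quota can be set with:

sudo sh -c 'echo "200000 100000" > /sys/fs/cgroup/place/cpu.max'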

After starting the Go program, its PID can be found using ps and added to the cgroup.procs file.
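
Whether the limit is actually being hit can later be verified under load by reading cpu.stat in the same directory; the nr_throttled and throttled_usec counters increase whenever the quota is exhausted.

cat /sys/fs/cgroup/place/cpu.stat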

k6

In order to get an idea of how many requests this setup could serve, load needs to be generated for the desired endpoint.

The endpoint in question is an HTTP GET endpoint which returns an image.

The goal was to measure how performant the current setup was before implementing any caching mechanism.

import http from "k6/http";

export const options = {
  scenarios: {
    place: {
      executor: 'ramping-arrival-rate',
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 5,
      maxVUs: 10,
      stages: [
        { target: 100, duration: '3m' },
      ],
    },
  },
};

export default function() {
  http.get("http://localhost:8081/image");
}

This script runs the default function starting at 10 requests (startRate) per second (timeUnit), increasing to 100 requests per second (stages[].target) over 3 minutes (stages[].duration).

The ramping-arrival-rate executor controls the rate at which requests are sent to the service.

The script can be run using the k6 cli.
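
Assuming the script above is saved as script.js:

k6 run script.js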

pprof

In software engineering, profiling is a form of dynamic program analysis which measures, for example, the time complexity of a program or the frequency and duration of particular function calls.

Profiling can be added to a Go application within the code, and the collected data can be read with the pprof CLI.

In my case, my application exposed profiling information over HTTP (this is generally easier when writing a server).

// this package registers some handlers on
// DefaultServeMux in the http package
import _ "net/http/pprof"

// start the server with the profiling handlers
// in a new goroutine so it doesn't block
if config.ProfilingEnabled {
  go func() {
    err := http.ListenAndServe(":6060", nil)
    if err != nil {
      fmt.Println("failed to start pprof server:", err)
    }
  }()
}

pprof can then read the data from this endpoint.

go tool pprof -http=:6061 http://localhost:6060/debug/pprof/profile?seconds=30
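
net/http/pprof registers a number of other profiles under /debug/pprof/ as well (heap, goroutine, block, and so on); for example, a heap profile can be viewed in the same way:

go tool pprof -http=:6061 http://localhost:6060/debug/pprof/heap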

Running the test

Run the Go application

go run cmd/main.go

Find the PID of the application process and add it to the cgroup.procs file.

ps aux | grep main
sudo sh -c 'echo <PID> > /sys/fs/cgroup/place/cgroup.procs'

This will limit the process to 2 CPUs. Finally, run the k6 script.

k6 run script.js

Saturated CPU

Using k6 along with top shows that the process (limited to 2 CPUs by the cgroup) saturates its CPU allowance at around 160 requests per second.

Seeing as the application is instrumented with the profiler, profiling information can be read from it while the load test is running.

go tool pprof -http=:6061 http://localhost:6060/debug/pprof/profile?seconds=30

Once the pprof tool has taken the samples, it is possible to view the UI in the browser on localhost:6061.

Iterating

No Cache

As mentioned in the aims section, the initial iteration of the server had no caching implementation.

The flame graph from pprof for this iteration shows that the program spent a lot of time in the GetGrid and png.Encode functions.

Flame Graph Before Caching

This makes sense because for each request the server needs to fetch the raw pixels from the database and convert those pixels into the bytes of a PNG-encoded image.
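
For illustration, here is a rough sketch of what the uncached handler flow looks like. The handler shape, field names and the gridToImage helper are assumptions on my part; only Server.GetImage, GetGrid and png.Encode are named in the flame graph and this post.

func (s *Server) GetImage(w http.ResponseWriter, r *http.Request) {
	// every request reads all pixel data from sqlite...
	grid, err := s.manager.GetGrid()
	if err != nil {
		http.Error(w, "failed to load grid", http.StatusInternalServerError)
		return
	}

	// ...and re-encodes the PNG from scratch
	img := gridToImage(grid) // hypothetical helper: raw pixels -> *image.RGBA
	w.Header().Set("Content-Type", "image/png")
	if err := png.Encode(w, img); err != nil {
		log.Println("failed to encode image:", err)
	}
}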

Caching the Pixel Data

My first target was the grid.Manager.GetGrid function. This function reaches out to the database for the pixel data, currently on every request. Since the frontend polls for the grid data (transferred as a PNG) every second, there is no real reason for the data to be regenerated more often than once per second.

So I added some fields to the Manager struct.

const ImageRefreshInterval = time.Second

type Manager struct {
	mu               sync.Mutex
	lastImageGen     time.Time
	latestImage      *image.RGBA
	backend          types.GridBackend
	height           int
	width            int
}

Then, I added a check to the code:

func (m *Manager) GetImage() (*image.RGBA, error) {
    m.mu.Lock()
    defer m.mu.Unlock()

    // regenerate the image at most once per ImageRefreshInterval
    if time.Since(m.lastImageGen) > ImageRefreshInterval {
        img, err := m.generateLatestImage()
        if err != nil {
            return nil, err
        }
        m.latestImage = img
        m.lastImageGen = time.Now()
    }

    return m.latestImage, nil
}
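
One thing worth noting about the design: net/http serves each request in its own goroutine, so the mutex ensures that only one request regenerates the image when the cache expires, while concurrent requests block briefly and then reuse the fresh copy.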

At this point the Server.GetImage method receives the image data, encodes the image as a PNG and writes the data to the http.ResponseWriter object. I can now profile the app with the new changes.

Using the same setup as above, the app reached around 500 requests per second. The flame graph now looks like this.

Flame Graph after adding pixel data

Caching the bytes

Now, for the majority of requests, the app fetches the image data from memory, encodes it into a PNG, and writes it to the http.ResponseWriter. Most of the work is being done in the png.Encode function.

So instead of caching the image data and encoding it into a PNG on every request, I can cache the encoded PNG itself as a byte slice. To do this, I updated Manager.GetImage to Manager.WritePng(w io.Writer), which writes the cached bytes directly to the writer.

type Manager struct {
	mu               sync.Mutex
	lastImageGen     time.Time
	// the PNG data is stored as bytes
	latestImageBytes []byte
	backend          types.GridBackend
	maxRate          Rate
	height           int
	width            int
}
func (m *Manager) WritePng(w io.Writer) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	// if the latest image was generated more than 1 second ago
	if time.Since(m.lastImageGen) > ImageRefreshInterval {

		// grab the latest image from the DB
		image, err := m.generateLatestImage()
		if err != nil {
			return err
		}

		// encode as PNG and write to a buffer
		buffer := new(bytes.Buffer)
		err = png.Encode(buffer, image)
		if err != nil {
			return err
		}

		// update the cache
		m.latestImageBytes = buffer.Bytes()
		m.lastImageGen = time.Now()
	}

	// write the cached bytes to the writer
	_, err := w.Write(m.latestImageBytes)
	return err
}
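
The HTTP handler then only needs to set the content type and hand its ResponseWriter to the manager. Again a sketch; the handler shape and field names are assumptions.

func (s *Server) GetImage(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "image/png")
	// http.ResponseWriter satisfies io.Writer, so the cached PNG bytes
	// are written straight into the response
	if err := s.manager.WritePng(w); err != nil {
		log.Println("failed to write image:", err)
	}
}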

This removed the last computationally heavy component from my code. Running another load test, the application was easily able to serve up to 20,000 requests per second on 2 CPUs. The functionality that was using so much CPU before (fetching the image data from the db and encoding it as a PNG) was now using less than 3% of the total CPU.

Flame graph when caching the PNG

Conclusion

Assuming each user of this application would make 1 request per second to fetch the latest data, profiling made it possible to identify the bottlenecks and increase the max potential users from ~160 to ~20,000, over a 100x improvement.