Mastering the Docker cache

March 10th, 2022

The Docker build cache plays an important role in how long it takes to build an image. Getting familiar with how caching works can save you time and avoid surprises when working with Docker locally, or building images on a CI/CD service.

Docker build cache in action

1. Docker caching fundamentals

Dockerfiles define the instructions to create the immutable layers that make up our image. Each instruction is translated to a layer and overlaid on top of the previous one.

When building your image, Docker determines for each instruction if there’s a cached layer that can be used or if it needs to compute and create it.

Docker uses a few basic rules that govern the behavior of the build cache. A layer is in the cache if:

  1. Its parent layers (the previous instructions) are in the cache.
  2. Its instruction is identical to the one saved in the cache.
  3. If the instruction is an ADD or a COPY, the copied files’ contents are identical to those in the cache.

This means that the first layer that doesn’t satisfy these rules, along with all subsequent layers, is invalidated and recomputed.
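As a quick illustration of these rules, consider a Dockerfile like the following (the file names are hypothetical). Editing config.json leaves the first two layers cached, but invalidates the COPY layer and, by rule 1, every layer after it:

```dockerfile
FROM node:16

# Cached as long as the base image and the instruction text are unchanged
RUN apt-get update && apt-get install -y gcc

# Invalidated whenever the contents of config.json change (rule 3)
COPY config.json /app/config.json

# Recomputed whenever the COPY layer above is invalidated (rule 1)
RUN node --version > /app/build-info.txt
```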

2. Optimal Dockerfile structure

It follows directly from these rules that, to get the most out of the cache, the layers that are invalidated most frequently should appear towards the end of our Dockerfile. This ensures that fewer layers need to be recomputed when they change.

In other words, we should structure our Dockerfile’s instructions from least frequently to most frequently changed. In the case of ADD and COPY layers, this means that files that change frequently should be added as late as possible.

This is an example of what that could mean for a Node.js project:

FROM node:16
RUN apt-get install -y gcc
COPY package*.json /
RUN npm install
COPY css/ .
RUN npm run build-css
COPY . .
RUN npm run build
  1. System dependencies are slow to install and are not affected by our code. Putting these instructions early in the Dockerfile means they won’t be invalidated and recomputed as a result of code changes.
  2. Commands like apt-get update are, based on the rules above, cached and aren’t invalidated even if there are new updates available to install.
  3. We only copy files needed to install our app’s dependencies. This guarantees that other code changes don’t trigger reinstallation.
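A common way to avoid the stale apt-get update problem is to combine the update and install steps into a single RUN instruction - a sketch, not part of the example above:

```dockerfile
# Because update and install share one instruction, adding or removing a
# package changes the instruction text, invalidating the layer and
# re-running apt-get update along with it
RUN apt-get update && apt-get install -y gcc make
```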

3. Cache invalidation

"There are only two hard things in Computer Science: cache invalidation and naming things."

- Phil Karlton

Equally important is knowing how, and when, not to use the cache. Common cases where we want to make sure a layer is recomputed:

  • To update our base image - for example, if a vulnerability was fixed in Debian, we want to make sure we pull in the fix.
  • To update system packages - since these layers ideally appear first in our Dockerfile, they aren’t invalidated often, so we need to explicitly invalidate them to bring in up-to-date dependencies.

There are a few techniques for invalidating cache:

 

The no-cache flag

bash-3.2$ docker build --no-cache --pull .
  • --no-cache instructs Docker to rebuild all of the layers in our Dockerfile.

  • --pull forces a fresh pull of the base image.

This is the most straightforward approach and is often the recommended one. The disadvantage, of course, is that all layers are rebuilt with no fine-grained control.

 

Cache busting with build args

If the value of a build argument changes, the ARG instruction and all subsequent layers are invalidated and recomputed. Therefore, if we place ARG instructions strategically, we can control exactly which layers are invalidated and when. For example:

 
ARG SYSTEM_DEPENDENCIES_VERSION
RUN apt-get install -y python
ARG NPM_DEPENDENCIES_VERSION
COPY package.json /
RUN npm install

This way we can trigger a reinstall of system or NPM dependencies by changing the values of SYSTEM_DEPENDENCIES_VERSION and NPM_DEPENDENCIES_VERSION build arguments.

This solution works but requires fine-grained control and management of the build-arg values. These arguments also end up in your image’s metadata (visible in docker history), so they should be used sparingly if at all.
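For example, one way to force a reinstall of NPM dependencies on demand is to pass a fresh value for the argument (using a timestamp here is just an illustration; any changing value works):

```shell
bash-3.2$ docker build --build-arg NPM_DEPENDENCIES_VERSION="$(date +%s)" .
```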

4. BuildKit cache mount

BuildKit, Docker’s new build engine, introduced the concept of cache mounts. This allows you to mount directories to be used as part of the Docker image build process. These directories can be persisted between build runs and act as a cache.

For example, we can persist the apt cache between runs which will improve the performance of apt install.

FROM ubuntu
 
# Disable automatic cache cleanup so we can persist the apt cache between build runs
RUN rm -f /etc/apt/apt.conf.d/docker-clean; \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
 
# Run apt commands with mounted cache directories
RUN --mount=type=cache,target=/var/cache/apt --mount=type=cache,target=/var/lib/apt \
    apt-get update && apt-get install --no-install-recommends -y gcc

Here we define the directories /var/cache/apt and /var/lib/apt as cache directories. Their contents will be persisted between docker builds.

Cache mounts speed up the command itself when the layer isn’t cached and has to run. Persisting the package manager’s cache is just one example: we can persist any directory to speed up subsequent runs - build files, intermediate files, downloaded artifacts, etc.
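The same technique can be applied to other package managers. For example, a sketch for npm (/root/.npm is npm’s default cache directory for the root user):

```dockerfile
# Persist npm's download cache between builds, so npm install can reuse
# previously downloaded packages even when the layer itself is rebuilt
RUN --mount=type=cache,target=/root/.npm \
    npm install
```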

More details on cache mounts can be found in the BuildKit documentation.

5. Beyond your local environment

Caching gets slightly more complicated when you can’t guarantee you’re always running the build on the same machine. This is often the case with CI/CD services, whose shared runners typically don’t persist the Docker build cache between jobs.

To handle this case we can pass in the --cache-from parameter to specify images that act as a build cache:

bash-3.2$ DOCKER_BUILDKIT=1 docker build --cache-from library/my-image:latest .

For an image to serve as a cache it needs to be built and pushed with the appropriate metadata. To enable the caching metadata you must pass in the build argument BUILDKIT_INLINE_CACHE=1:

bash-3.2$ DOCKER_BUILDKIT=1 docker build -t library/my-image:latest \
          --build-arg BUILDKIT_INLINE_CACHE=1 .

This metadata helps Docker decide if an appropriate cache exists without needing to pull the entire image.

Usually, in a CI/CD context, you would combine both BUILDKIT_INLINE_CACHE=1 and --cache-from to create a repeatable process of using previously built images as a cache, and pushing an updated version with the appropriate metadata for next time.
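Put together, such a CI step might look roughly like this (the image name is illustrative):

```shell
bash-3.2$ DOCKER_BUILDKIT=1 docker build -t library/my-image:latest \
          --cache-from library/my-image:latest \
          --build-arg BUILDKIT_INLINE_CACHE=1 .
bash-3.2$ docker push library/my-image:latest
```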

You can include multiple images as possible caches to increase the likelihood of cache hits. To do that, repeat --cache-from for each cached image.
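For example, to consider both the latest image and one built from a main branch (the tags are illustrative):

```shell
bash-3.2$ DOCKER_BUILDKIT=1 docker build \
          --cache-from library/my-image:latest \
          --cache-from library/my-image:main .
```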

Most CI/CD services have specific guides on how to get this or similar behavior - for example, Bitbucket, GitHub, CircleCI and GitLab.

 

The Docker cache is powerful and extremely useful in any setting. Having a firm understanding of its fundamentals is essential to get the best out of Docker and to avoid pitfalls.

Want to learn more about your Docker images? Get started with Contains.dev