The Docker build cache plays an important role in how long it takes to build an image. Getting familiar with how caching works can save you time and avoid surprises when working with Docker locally, or building images on a CI/CD service.
1. Docker caching fundamentals
Dockerfiles define the instructions to create the immutable layers that make up our image. Each instruction is translated to a layer and overlaid on top of the previous one.
When building your image, Docker determines for each instruction if there’s a cached layer that can be used or if it needs to compute and create it.
Docker uses a few basic rules that govern the behavior of the build cache. A layer is in the cache if:
- Its parent images or previous instructions are in the cache.
- Its instruction equals the one saved in the cache.
- If the instruction is an ADD or COPY, the copied files’ contents are equal to those in the cache.
This means that starting from the first layer that doesn’t satisfy these rules, it and all subsequent layers are invalidated and recomputed.
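The content rule for copied files can be illustrated outside of Docker. The sketch below is only an approximation, not Docker's actual implementation: `content_digest` and `demo` are hypothetical helper names, and `cksum` stands in for the checksums Docker computes internally. It shows that changing only a file's modification time leaves the digest untouched, while editing its contents changes it.

```shell
#!/bin/sh
# Illustration only: Docker computes its own checksums internally.
# content_digest (hypothetical helper) prints one checksum line per file,
# based purely on file contents and in a stable order.
content_digest() {
  find "$1" -type f | LC_ALL=C sort | while IFS= read -r file; do
    cksum < "$file"
  done
}

# demo (hypothetical helper) compares digests after a touch and after an edit.
demo() {
  ctx="$(mktemp -d)"
  echo '{"name": "app"}' > "$ctx/package.json"
  before="$(content_digest "$ctx")"

  touch "$ctx/package.json"                      # only the modification time changes
  after_touch="$(content_digest "$ctx")"

  echo '{"name": "app2"}' > "$ctx/package.json"  # the contents change
  after_edit="$(content_digest "$ctx")"

  if [ "$before" = "$after_touch" ]; then echo "touch: cache hit"; fi
  if [ "$before" != "$after_edit" ]; then echo "edit: cache miss"; fi
  rm -rf "$ctx"
}
```

Running `demo` prints "touch: cache hit" followed by "edit: cache miss".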
2. Optimal Dockerfile structure
It follows directly from the rules above that, to make the cache as effective as possible, the layers that are invalidated most frequently should appear towards the end of our Dockerfile. That way, fewer layers need to be recomputed when they change.
In other words, we should order our Dockerfile’s instructions from least frequently to most frequently changed. In the case of COPY layers, this means that files that change frequently should be added as late as possible.
This is an example of what that could mean for a Node.js project:
FROM node:16
RUN apt-get update && apt-get install -y gcc
COPY package*.json /
RUN npm install
COPY css/ .
RUN npm run build-css
COPY . .
RUN npm run build
- System dependencies are slow to install and are not affected by our code. Therefore, putting the instruction early in the Dockerfile means they won’t get invalidated and recomputed as a result of code changes.
- Commands like apt-get update, based on the rules above, are cached and aren’t invalidated even if there are updates to be installed.
- We only copy files needed to install our app’s dependencies. This guarantees that other code changes don’t trigger reinstallation.
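To make the effect of ordering concrete, here is a rough model of the Node.js example, not a real Docker feature. Each line pairs an instruction with a glob for the build-context files it copies; the `layers` table and the helpers `first_invalidated` and `rebuilt_count` are hypothetical names used only for this sketch, and the model ignores everything except the rule that the first changed layer invalidates all later ones.

```shell
#!/bin/sh
# Simplified model of the example Dockerfile: "instruction|glob of copied files".
# An empty glob means the instruction reads nothing from the build context.
layers='FROM node:16|
RUN apt-get update && apt-get install -y gcc|
COPY package*.json /|package*.json
RUN npm install|
COPY css/ .|css/*
RUN npm run build-css|
COPY . .|*
RUN npm run build|'

# Print the 1-based number of the first instruction invalidated by a changed file.
first_invalidated() {
  changed="$1"
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    pattern="${line#*|}"               # text after the first "|"
    [ -n "$pattern" ] || continue
    case "$changed" in
      $pattern) echo "$n"; return 0 ;; # unquoted, so it is matched as a glob
    esac
  done <<EOF
$layers
EOF
  return 1
}

# Print how many instructions are rebuilt when the given file changes.
rebuilt_count() {
  total=$(printf '%s\n' "$layers" | wc -l)
  if first=$(first_invalidated "$1"); then
    echo $(( total - first + 1 ))
  else
    echo 0
  fi
}
```

With this ordering, `rebuilt_count src/index.js` prints 2, while `rebuilt_count package.json` prints 6, which is why rarely changing inputs belong near the top.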
3. Cache invalidation
"There are only two hard things in Computer Science: cache invalidation and naming things."
Equally important is knowing how and when not to use the cache. Common cases where we want to make sure we recompute a layer:
- To update our base image - for example, if there’s a vulnerability that was fixed in Debian, we want to make sure we grab the fix.
- System packages - since these layers would ideally be first in our Dockerfile, they won’t be invalidated frequently, so we often need to rerun them explicitly to bring in up-to-date dependencies.
There are a few techniques for invalidating cache:
bash-3.2$ docker build --no-cache --pull .
--no-cache instructs Docker to rebuild all of the layers in our Dockerfile.
--pull forces a fresh pull of the base image.
This is the most straightforward way to invalidate the cache and is often the simplest and most recommended. The disadvantage of course is that all layers will be rebuilt without any fine-grained control.
Cache busting with build args
When an ARG instruction gets invalidated, all subsequent layers will be recomputed. Therefore, if we place these ARG instructions strategically, we can control exactly which layers are invalidated and when. For example:
FROM debian:bullseye
ARG SYSTEM_DEPENDENCIES_VERSION
RUN apt-get update && apt-get install -y python
ARG NPM_DEPENDENCIES_VERSION
COPY package.json /
RUN npm install
This way we can trigger a reinstall of system or NPM dependencies by changing the values of the SYSTEM_DEPENDENCIES_VERSION or NPM_DEPENDENCIES_VERSION build arguments.
This solution works but requires fine-grained control and management of the build-arg values. These arguments also become a part of your final image and so should be used sparingly if at all.
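One way to keep these values manageable is to derive them instead of editing them by hand. The sketch below assumes a convention that is not part of the original setup: the system dependencies argument is set to the current ISO year and week, so that layer is refreshed at most once a week. `weekly_stamp` and `cache_bust_command` are hypothetical helper names, and the command is only printed so it can be reviewed before running.

```shell
#!/bin/sh
# Assumed convention: the ISO year and week number serve as the cache-busting value.
weekly_stamp() {
  date +%G-W%V
}

# Print (rather than run) a build command that sets both build arguments.
# $1 is the NPM dependencies value, for example a number bumped by hand.
cache_bust_command() {
  printf 'docker build --build-arg SYSTEM_DEPENDENCIES_VERSION=%s --build-arg NPM_DEPENDENCIES_VERSION=%s .\n' \
    "$(weekly_stamp)" "$1"
}
```

For example, `cache_bust_command 3` prints a `docker build` line whose first build argument changes only when a new week starts.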
4. BuildKit cache mount
BuildKit, Docker’s new build engine, introduced the concept of cache mounts. This allows you to mount directories to be used as part of the Docker image build process. These directories can be persisted between build runs and act as a cache.
For example, we can persist the apt cache between runs, which speeds up package installation:
FROM ubuntu
# Disable automatic cache cleanup so we can persist the apt cache between build runs
RUN rm -f /etc/apt/apt.conf.d/docker-clean; \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
# Run apt commands with mounted cache directories
RUN --mount=type=cache,target=/var/cache/apt --mount=type=cache,target=/var/lib/apt \
    apt update && apt-get --no-install-recommends install -y gcc
Here we define the directories /var/cache/apt and /var/lib/apt as cache directories. Their contents will be persisted between docker builds.
Cache mounts speed up the command itself, which still runs whenever the layer isn’t cached. Persisting the package manager’s cache is just one example; we can persist and cache any directory to speed up subsequent runs, such as build files, intermediate files, or downloaded artifacts.
More details on cache mounts can be found in the Dockerfile reference, under RUN --mount=type=cache.
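The same idea carries over to other package managers. As a sketch, the helper below prints a Dockerfile for the earlier Node.js project that mounts npm's download cache. The helper name `npm_cache_dockerfile` and the /app working directory are assumptions made for illustration; /root/.npm is npm's default cache location when the build runs as root.

```shell
#!/bin/sh
# Print a hypothetical Dockerfile that keeps npm's download cache between builds.
# /root/.npm is npm's default cache directory for the root user; /app is an assumed path.
npm_cache_dockerfile() {
  cat <<'EOF'
# syntax=docker/dockerfile:1
FROM node:16
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm npm install
COPY . .
RUN npm run build
EOF
}
```

It could be used as `npm_cache_dockerfile > Dockerfile` followed by `DOCKER_BUILDKIT=1 docker build .`. When package.json changes, npm install still reruns, but packages already in the mounted cache don't have to be downloaded again.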
5. Beyond your local environment
Caching gets slightly more complicated when you can’t guarantee you’re always running the build on the same machine. This is often the case in CI/CD services that use shared resources that often don’t persist the Docker build cache.
To handle this case we can pass in the --cache-from parameter to specify images that act as a build cache:
bash-3.2$ DOCKER_BUILDKIT=1 docker build --cache-from library/my-image:latest .
For an image to serve as a cache it needs to be built and pushed with the appropriate metadata. To enable the caching metadata you must pass in the build argument BUILDKIT_INLINE_CACHE=1:
bash-3.2$ DOCKER_BUILDKIT=1 docker build -t library/my-image:latest \
--build-arg BUILDKIT_INLINE_CACHE=1 .
This metadata helps Docker decide if an appropriate cache exists without needing to pull the entire image.
Usually, in a CI/CD context, you would combine both BUILDKIT_INLINE_CACHE and --cache-from to create a repeatable process of using previously built images as a cache, and pushing an updated version with the appropriate metadata for next time.
You can include multiple images as possible caches to increase the likelihood of cache hits. To do that, repeat --cache-from for each cached image.
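Put together, a CI step might look roughly like the sketch below. It is one possible arrangement under stated assumptions, not a prescribed pipeline: `ci_build` is a hypothetical helper, the `DOCKER` variable exists only so the commands can be printed with `DOCKER=echo` for a dry run, and the second cache image tag in the usage example (`library/my-image:main`) is made up.

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER=echo) to print commands instead of running them.
DOCKER="${DOCKER:-docker}"

# Usage: ci_build <tag> [<cache image>...]
ci_build() {
  tag="$1"
  shift
  # Rewrite the remaining arguments so each image becomes "--cache-from <image>".
  for ref in "$@"; do
    shift
    set -- "$@" --cache-from "$ref"
  done
  # Build with inline cache metadata so this image can itself serve as a cache later.
  DOCKER_BUILDKIT=1 "$DOCKER" build -t "$tag" "$@" \
    --build-arg BUILDKIT_INLINE_CACHE=1 . || return 1
  # Push the result so the next run, possibly on another machine, can reuse it.
  "$DOCKER" push "$tag"
}
```

A call such as `ci_build library/my-image:latest library/my-image:latest library/my-image:main` builds with two candidate caches and then pushes the refreshed image for the next run.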
The Docker cache is powerful and extremely useful in any setting. Having a firm understanding of its fundamentals is essential to get the best out of Docker and to avoid pitfalls.