Optimizing Docker image size and why it matters

January 6th, 2022

Why does size matter? Docker images are a core component of our development and production lifecycles. A large image can make every step of the process slow and tedious. Size affects how long it takes to build the image locally, on CI/CD, or in production, and it affects how quickly we can spin up new instances that run our code. Reducing the size of the image benefits both your developers and your users.

Illustration of pulling a large Docker image layer

So, what can you do about it?

1. Pick an appropriate base image

There are several important considerations that go into picking a base image. In the context of optimizing image size, each base image comes with its own dependencies and footprint.

Usually, the first choice you need to make is which distribution you want. Image sizes vary between them:

Distribution    Image size
Debian          124 MB
Ubuntu          73 MB
Alpine          6 MB
CentOS          231 MB
Fedora          153 MB

It’s not just a matter of image size, though. Each of these images comes with its own philosophy and tools you might prefer. Alpine is lightweight, security-focused, and based on musl libc instead of glibc. Ubuntu has long-term enterprise support, comes bundled with many utilities, supports a vast number of packages, and so on.

Next, you can decide if you want your parent image to come bundled with additional dependencies. Often you need to weigh the convenience of having a base image with all dependencies pre-installed against the size of the resulting image.

For example, if you have a Node.js app you can use the node image, or python if you’re using Python, etc. Within those images, usually you can specify the distribution you’d like using the appropriate tag, for example, node:alpine for Alpine Linux and python:3-bullseye for Debian Bullseye.

The less specific or specialized your parent image is, the more control you have over the image size:

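# Image size: 934 MB

FROM node:16-bullseye

CMD ["node"]

# Image size: 313 MB

# Base image tag assumed; the text only identifies it as the Debian image
FROM debian:bullseye

RUN apt-get update && \
    apt-get install -y curl && \
    (curl -sL https://deb.nodesource.com/setup_16.x | bash -) && \
    apt-get install -y nodejs

CMD ["node"]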

A closer look at node:16-bullseye shows that it has buildpack-deps as its parent image, which comes with lots of dependencies you might not need. So if you’re willing to take care of the Node.js installation, you can do it directly on the Debian image and reduce the image size considerably.
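A middle ground between the two options above, sketched here as a suggestion rather than something measured for this post: many official images, node included, publish a -slim tag that is built on a slim Debian base instead of buildpack-deps. Whether it contains everything your app needs is an assumption you would have to check.

```dockerfile
# Sketch only: assumes the slim variant ships everything your app needs.
# node:16-bullseye-slim keeps the official Node.js install but is built on a
# slim Debian base rather than buildpack-deps.
FROM node:16-bullseye-slim

CMD ["node"]
```

If a native dependency fails to build on the slim image, you can install just the missing packages instead of switching back to the full image.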

2. Add only files you need

Docker makes it especially easy to add files you didn’t mean to add to an image. Each ADD or COPY and even the RUN commands in your Dockerfile can include files you weren’t expecting.

RUN apt-get install -y nodejs
COPY . .
RUN npm install
CMD ["node"]

Files in the project directory:

.git
images
node_modules
src
package.json
package-lock.json
README.md

It isn't easy to see exactly which files are added and where. So the first step is to be able to quickly inspect which files are added to each layer. Each layer corresponds to specific commands in your Dockerfile, and from there we can decide what and how to optimize.

There are 3 easy methods you can use:

Docker CLI

You can save any local image as a tar archive and then inspect its contents.

$ docker save -o image.tar <image-digest>
$ mkdir image
$ tar -xf image.tar -C image
$ cd image
$ tar -xf <layer-digest>/layer.tar
$ ls
etc
tmp
usr
var

Dive

An excellent open-source tool to visualize and analyze local Docker images.

Contains.dev

Contains.dev offers many tools to analyze layers, their contents, and their size, including a navigable treemap of your image.

With these methods, you're set up to assess improvements to your image size. There are a few common areas that have straightforward solutions that improve the overall image size:

.dockerignore

An important way to ensure you’re not bringing in unintended files is to define a .dockerignore file. This file has a similar syntax to .gitignore:

# Ignore git and caches
.git
.cache
 
# Ignore logs
logs
 
# Ignore secrets
.env
 
# Ignore installed dependencies
node_modules
 
...

Then, when you run COPY . ., Docker skips any files matched by your .dockerignore. Defining this file has the added benefit of reducing the size of the Docker build context, which is the set of files Docker gathers when building an image. A smaller build context results in faster build times.
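As a complement to .dockerignore, you can also avoid COPY . . altogether and copy only named paths. The following is a minimal sketch, not the setup used in this post: the base image tag, the /app directory, and the src/index.js entry point are hypothetical, while the file names follow the project listing shown earlier.

```dockerfile
# Sketch only: base image tag, /app, and src/index.js are illustrative assumptions.
FROM node:16-bullseye

WORKDIR /app

# Copy only the manifests first, so the install layer depends on nothing else
COPY package.json package-lock.json ./

# npm ci installs exactly what package-lock.json pins;
# --omit=dev is the newer spelling of --production
RUN npm ci --omit=dev

# Copy only the application source; .git, images, and README.md stay out
COPY src ./src

CMD ["node", "src/index.js"]
```

Listing paths explicitly means a new file in the project directory never ends up in the image by accident, at the cost of updating the Dockerfile when the layout changes.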

Package managers

Depending on the package manager you’re using, you can instruct it to install the minimum needed packages you explicitly defined.

For example:

  • apt-get install -y --no-install-recommends - don’t install optional recommended packages.
  • npm install --production - don’t install development dependencies (newer npm versions prefer --omit=dev).

Caches

Many processes create temporary files, caches, and other files that have no benefit to your specific use case. For example, running apt-get update downloads package lists that you don’t need to persist once the packages you need are installed. So we can add rm -rf /var/lib/apt/lists/* as part of the same layer to remove them (removing them in a separate RUN keeps them in the original layer; see “Avoid duplicating files”). Docker recognizes this issue and went as far as running apt-get clean automatically in its official Debian and Ubuntu images.
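Put together, the package-manager flag and the cache cleanup might look like the sketch below. The base image tag and the choice of curl as the package to install are assumptions made for illustration.

```dockerfile
# Sketch only: base image tag and the installed packages are illustrative.
FROM debian:bullseye

# Update, install, and clean up in ONE RUN instruction, so the package
# lists are never committed to a layer.
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
```

Splitting the rm into its own RUN would leave the package lists in the earlier layer and save nothing.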

Each layer in your image might have a leaner version that is sufficient for your needs. The best way to see that is to audit your layers with the techniques mentioned above.

3. Avoid duplicating files

Docker uses read-only layers of files that are overlaid on top of each other. This means that when you make changes to files that come from previous layers, they’re copied into the new layer you’re creating. It isn’t always obvious that this is happening, for example:

 
COPY somefile.txt .
 
RUN chmod 777 somefile.txt

We’re just chmod'ing an existing file, but Docker can’t change the file in its original layer, so that results in a new layer where the file is copied in its entirety with the new permissions.

Newer versions of Docker that build with BuildKit let you avoid this by setting the permissions as part of the COPY itself:

 
COPY --chmod=777 somefile.txt .

Other non-intuitive cases of file duplication between layers:

 
# 1: the file is added in its own layer
COPY somefile.txt .

# 2: small change, but the entire file is copied into a new layer
RUN echo "more data" >> somefile.txt

# 3: file moved, but this layer now contains another full copy
RUN mv somefile.txt somefile2.txt

# File won't exist in this layer,
# but it still takes up space in the previous ones.
RUN rm somefile2.txt

In this example, we created 3 copies of our file throughout different layers of the image. Despite removing the file in the last layer, the image still contains the file in other layers which contributes to the overall size of the image.

Making a small change to a file or moving it will create an entire copy of the file. Deleting a file will only hide it from the final image, but it will still exist in its original layer, taking up space. This is all a result of how images are structured as a series of read-only layers. This structure makes layers reusable and makes images efficient to store and execute. But it also means we need to be aware of the underlying structure and take it into account when we create our Dockerfile.
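One way to sidestep the duplication, assuming the file can be produced inside the build step rather than copied in, is to do all of the work in a single RUN. This is a minimal sketch; the base image tag and the /app directory are hypothetical.

```dockerfile
# Sketch only: base image tag and /app are illustrative assumptions.
FROM debian:bullseye

WORKDIR /app

# The file is created, changed, renamed, and deleted within one RUN.
# Only the final filesystem state is committed, so this layer contains
# no copy of the file at all.
RUN echo "some data" > somefile.txt && \
    echo "more data" >> somefile.txt && \
    mv somefile.txt somefile2.txt && \
    rm somefile2.txt
```

The same pattern applies to downloaded archives and build artifacts: fetch, use, and delete them in the same instruction.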

4. Multi-stage builds

Multi-stage builds help when a Dockerfile contains steps whose results aren’t needed at runtime.

The Dockerfile might include several steps that take care of setting up an environment for compiling the program that will run at runtime. This is especially common for compiled languages like Go.

To solve this issue Docker introduced multi-stage builds starting from Docker Engine v17.05. This allows us to perform all preparation steps as before, but then copy only the essential files or output from these steps.

As shown in the example below, the effects on image size can be dramatic:

# Image size: 961 MB

# Base image assumed to match the builder stage in the second example
FROM golang:1.17.5

WORKDIR /workspace
COPY . .
RUN go get && go build -o main .

CMD ["/workspace/main"]

# Image size: 7 MB

FROM golang:1.17.5 AS builder

WORKDIR /workspace
COPY . .
ENV CGO_ENABLED=0
RUN go get && go build -o main .

FROM scratch
WORKDIR /workspace
COPY --from=builder /workspace/main /workspace/main

CMD ["/workspace/main"]

This basic example compiles a simple Go program. The naive approach in the first example results in a 961 MB image. With the multi-stage build in the second example, we copy just the compiled binary, which results in a 7 MB image. The first example could be improved by choosing a leaner parent image, but it would still fall short of the multi-stage result.

Multi-stage builds introduce a lot of flexibility with support for advanced cases like multiple FROM statements, copying a single file from an external image, and more. These techniques can be combined to reduce the image size to a minimum. Check out the official Docker docs for more info.
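To illustrate copying a single file from an external image, here is a minimal sketch that extends the Go example. The nginx image and the file paths are purely illustrative and have nothing to do with the program being built.

```dockerfile
# Sketch only: the nginx image and the destination paths are illustrative.
FROM golang:1.17.5 AS builder
WORKDIR /workspace
COPY . .
ENV CGO_ENABLED=0
RUN go get && go build -o main .

FROM scratch
# Copy from a named build stage
COPY --from=builder /workspace/main /main
# Copy a single file straight from an external image; no build stage is needed
COPY --from=nginx:latest /etc/nginx/nginx.conf /nginx.conf

CMD ["/main"]
```

Only the two copied files end up in the final image; nothing else from the builder stage or the external image is carried over.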

 

Keeping your image optimized and small pays huge dividends in the development process and in going to production. The techniques above will help you gain a good understanding of what's going on inside your image, which has benefits beyond the optimization work.
