Introduction
Have you ever stared at your terminal, waiting for a Docker build, and wondered why a tiny code change triggered a 10-minute recompilation of your entire project? Or why your final image is hundreds of megabytes larger than you think it should be? These aren’t quirks of a mysterious system; they are the predictable outcomes of understandable mechanics. The difference between a frustratingly slow workflow and an efficient, fast one often comes down to understanding the engine of docker build.
This article is a guide to that engine room. We will demystify the build process by mastering three pillars of efficiency: the layer caching system, the art of the RUN command, and the role of the .dockerignore file as the gatekeeper to your build. By the end, you will not just know what commands to run, but why they work, empowering you to craft truly professional and optimized containers. As a case study, we’ll use the Dockerfile from our layered_image project, a simple AI application that uses a BERT model for text classification, to illustrate these core principles.
The Foundation: Docker Layers and the Build Cache – The Immutable Ledger
Imagine your Docker image not as a single, monolithic file, but as a stack of precisely defined changes, like an immutable ledger where each transaction is recorded on a new page. This is the essence of Docker’s layered filesystem. Each instruction in your Dockerfile (FROM, COPY, RUN, CMD, etc.) typically creates a new layer. This layer doesn’t contain a full copy of the filesystem; instead, it records only the differences introduced by that specific instruction compared to the layer beneath it. If a RUN apt-get install curl command adds curl, that layer essentially says “+ curl and its dependencies.” If a subsequent COPY my_script.py /app/ adds a script, that new layer says “+ /app/my_script.py.”
This layered approach is ingenious for efficiency. When you pull an image, Docker only downloads layers it doesn’t already have. When you build images that share common base layers (like python:3.10-slim), those base layers are stored once and shared.
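You can see these shared layers for yourself. A quick check (assuming python:3.10-slim has already been pulled locally) lists the content digests of each layer in the base image; any image built FROM it reports these same digests at the bottom of its stack:

# Print the content digest of every layer in the base image
docker image inspect python:3.10-slim --format '{{json .RootFS.Layers}}'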
Building upon this layered filesystem is the Docker build cache. It’s Docker’s memory of past operations. When you issue a docker build command, Docker steps through your Dockerfile instruction by instruction. For each instruction, it checks three things:
- The exact instruction itself (e.g., COPY my_file.txt /dest/).
- The content of any files involved in that instruction (e.g., the checksum of my_file.txt).
- The parent image layer upon which this instruction is based.
If Docker finds an existing layer in its cache that was created from the exact same parent layer using the exact same instruction with the exact same input files, it reuses that cached layer instantly. This is a cache hit.
However, if any of these conditions changes (for example, the instruction is different, a copied file’s content has changed, or the parent layer is different because a previous instruction was a cache miss), then Docker experiences a cache bust. When a cache bust occurs, Docker must execute that instruction from scratch, creating a new layer. Critically, all subsequent instructions in the Dockerfile will also be executed from scratch, regardless of whether they might have matched the cache on their own. The cache is invalidated from that point downwards.
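Occasionally you want that full re-execution on purpose, for instance to pick up updated package versions. The --no-cache flag forces every instruction to be a cache miss:

# Deliberately bypass the build cache for every instruction
docker build --no-cache -t bert-classifier:layered -f layered_image/Dockerfile layered_image/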
This leads to the golden rule of caching: Order instructions from least frequently changed to most frequently changed. Think of it like organizing your desk: things you rarely touch go in the back drawers; things you use constantly stay on top.
Interactive Experiment to Feel the Cache:
- First, build the layered_image (which has a cache-friendly order) using a command like time docker build -t bert-classifier:layered -f layered_image/Dockerfile layered_image/. For us, this initial build took about 23 seconds.
- Now, open layered_image/app/predictor.py and make a trivial change, like adding a comment. Rebuild the image: time docker build -t bert-classifier:layered -f layered_image/Dockerfile layered_image/. The build should complete in less than a second. Why? Docker sees that FROM, WORKDIR, and COPY runtime_requirements.txt are unchanged and reuses their layers. It sees the RUN pip install instruction is the same and its input (runtime_requirements.txt) hasn’t changed its content, so it reuses the massive layer created by pip install. Only when it reaches COPY layered_image/app/ ./app/ does it detect a change (your modified predictor.py), so it rebuilds that layer and the subsequent ones. If you want proof, add the --progress=plain flag to the end of the build command; the Docker CLI will show you the cached layers.
- Next, the crucial test for understanding cache invalidation: edit your layered_image/Dockerfile. Move the line COPY layered_image/app/ ./app/ to before the RUN pip install ... line (a sketch of the result follows this list). Make one more trivial change to layered_image/app/predictor.py and rebuild. What happens? The build takes the full 23 seconds again! The change to app/predictor.py busted the cache at the (now earlier) COPY ./app/ step. Because the pip install step comes after this cache bust, it too is forced to re-run from scratch, even though runtime_requirements.txt didn’t change.
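For reference, here is roughly what that cache-hostile reordering looks like; a sketch of the experiment’s last step, not the file we actually ship:

# Cache-Hostile Order (the reordering from the experiment's last step)
FROM python:3.10-slim AS runtime
WORKDIR /app
# App code now comes first, so every code edit busts the cache here...
COPY layered_image/app/ ./app/
COPY layered_image/runtime_requirements.txt ./runtime_requirements.txt
# ...forcing this slow step to re-run on every build
RUN pip install --no-cache-dir -r runtime_requirements.txt
COPY layered_image/sample_data/ ./sample_data/
CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"]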
This experiment powerfully demonstrates how a cache bust cascades and why the order of your Dockerfile instructions is paramount for a fast development loop. Here’s the cache-friendly structure we advocate from our layered_image project:
# Cache-Friendly Order (from layered_image/Dockerfile runtime stage)
FROM python:3.10-slim AS runtime
WORKDIR /app
# 1. Copy requirements first (changes less often than app code)
COPY layered_image/runtime_requirements.txt ./runtime_requirements.txt
# 2. Install dependencies (slow step, now cached if requirements.txt doesn't change)
RUN pip install --no-cache-dir -r runtime_requirements.txt # (Full command shown later)
# 3. Copy app code last (changes most often)
COPY layered_image/app/ ./app/
COPY layered_image/sample_data/ ./sample_data/
CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"]
The Art of the RUN Command: Chaining for Microscopic Layers
The pursuit of an efficient Dockerfile has a parallel in the physical world: trying to minimize the volume of a collection of items. Each RUN command in your Dockerfile creates a new layer. If you download a tool, use it, and then delete it in separate RUN commands, you’re like someone putting an item in a box, then putting an empty wrapper for that item in another box on top. The original item is still there, in the lower box, taking up space, even if the top box says “it’s gone.”
Specifically, files created in one layer cannot be truly removed from the overall image size by a command in a subsequent layer. The subsequent layer simply records that those files are “deleted” or “hidden,” but the bits comprising those files still exist in the image’s historical layers. This is what tools like dive often report as “wasted space.”
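To see this hidden weight for yourself, the open-source dive tool (assuming you have it installed) lets you browse an image layer by layer and shows its estimated wasted space:

# Interactively explore each layer's files and the image efficiency estimate
dive bert-classifier-layers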
Consider this anti-pattern:
# Anti-Pattern: Separate RUN commands leading to bloat
FROM python:3.10-slim
WORKDIR /app
COPY runtime_requirements.txt .
RUN pip install --no-cache-dir -r runtime_requirements.txt # Step 1: Install
RUN pip cache purge # Step 2: Cleanup attempt 1
RUN rm -rf /tmp/* /var/tmp/* # Step 3: Cleanup attempt 2
# ... (further cleanup attempts)
If you were to build this image and then run docker history bert-classifier-layers, you’d observe the output for each RUN step. The first RUN pip install... step would show a significant amount of data being written (approximately 679MB). The subsequent RUN pip cache purge and RUN rm -rf /tmp/* steps would show very little data written for their layers, perhaps only a few kilobytes. This is because they aren’t removing data from the previous 679MB layer; they are just adding new, small layers on top that mark those files as deleted. The 679MB layer remains part of the image history.
docker history bert-classifier-layers
IMAGE CREATED CREATED BY SIZE COMMENT
f09d44f97ab4 34 minutes ago CMD ["python" "app/predictor.py" "sample_dat… 0B buildkit.dockerfile.v0
<missing> 34 minutes ago COPY layered_image/sample_data/ ./sample_dat… 376B buildkit.dockerfile.v0
<missing> 34 minutes ago COPY layered_image/app/ ./app/ # buildkit 5.51kB buildkit.dockerfile.v0
<missing> 34 minutes ago RUN /bin/sh -c rm -rf /tmp/* /var/tmp/* && … 0B buildkit.dockerfile.v0
<missing> 34 minutes ago RUN /bin/sh -c pip cache purge # buildkit 6.21kB buildkit.dockerfile.v0
<missing> 34 minutes ago RUN /bin/sh -c pip install --no-cache-dir -r… 679MB buildkit.dockerfile.v0
<missing> 34 minutes ago COPY layered_image/runtime_requirements.txt … 141B buildkit.dockerfile.v0
<missing> 3 hours ago WORKDIR /app 0B buildkit.dockerfile.v0
<missing> 11 days ago CMD ["python3"] 0B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; for src in idle3 p… 36B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 46.4MB buildkit.dockerfile.v0
<missing> 11 days ago ENV PYTHON_SHA256=ae665bc678abd9ab6a6e1573d2… 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV PYTHON_VERSION=3.10.18 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8… 0B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; apt-get update; a… 9.17MB buildkit.dockerfile.v0
<missing> 11 days ago ENV LANG=C.UTF-8 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0
<missing> 11 days ago # debian.sh --arch 'arm64' out/ 'bookworm' '… 97.2MB debuerreotype 0.15
The solution is to perform all related operations, especially creation and cleanup of temporary files or tools, within a single RUN command, chaining them with &&. This ensures that any temporary artifacts exist only ephemerally during the execution of that single RUN command and are gone before the layer is finalized and committed.
Let’s look at the aggressive cleanup RUN command from our layered_image/Dockerfile:
RUN pip install --no-cache-dir -r runtime_requirements.txt && \
    pip cache purge && \
    rm -rf /tmp/* /var/tmp/* && \
    find /usr/local/lib/python*/site-packages/ -name "*.pyc" -delete && \
    find /usr/local/lib/python*/site-packages/ -name "__pycache__" -type d -exec rm -rf {} + || true
This command is a carefully choreographed dance:
- pip install --no-cache-dir -r runtime_requirements.txt: Installs Python packages without leaving downloaded wheel files in pip’s HTTP cache.
- pip cache purge: Explicitly clears out any other cache pip might maintain.
- rm -rf /tmp/* /var/tmp/*: Removes files from standard temporary directories.
- find ... -name "*.pyc" -delete: Deletes compiled Python bytecode files.
- find ... -name "__pycache__" -type d -exec rm -rf {} +: Removes the __pycache__ directories.
- || true: Ensures the RUN command succeeds even if find doesn’t locate any files (which can return a non-zero exit code).
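As an aside, if long && chains feel hard to read, newer BuildKit releases support heredoc syntax, which expresses the same single-layer script across multiple lines. This is a sketch assuming the docker/dockerfile:1 syntax frontend, not the file from our project:

# syntax=docker/dockerfile:1
FROM python:3.10-slim
WORKDIR /app
COPY runtime_requirements.txt .
# One RUN, one layer: the heredoc body executes as a single shell script
RUN <<'CLEANUP'
set -eux
pip install --no-cache-dir -r runtime_requirements.txt
pip cache purge
rm -rf /tmp/* /var/tmp/*
find /usr/local/lib/python*/site-packages/ -name '*.pyc' -delete
find /usr/local/lib/python*/site-packages/ -name '__pycache__' -type d -exec rm -rf {} + || true
CLEANUP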
The Impact (Showcased with docker history):
With this single, chained RUN command, the resulting layer for our layered_image project is 572MB. If these steps were unchained, the initial pip install would create a layer of approximately 679MB. The docker history command would reflect this:
docker history bert-classifier-layers
IMAGE CREATED CREATED BY SIZE COMMENT
17d0319094f4 2 minutes ago CMD ["python" "app/predictor.py" "sample_dat… 0B buildkit.dockerfile.v0
<missing> 2 minutes ago COPY layered_image/sample_data/ ./sample_dat… 376B buildkit.dockerfile.v0
<missing> 2 minutes ago COPY layered_image/app/ ./app/ # buildkit 5.51kB buildkit.dockerfile.v0
<missing> 2 minutes ago RUN /bin/sh -c pip install --no-cache-dir -r… 572MB buildkit.dockerfile.v0
<missing> 2 minutes ago COPY layered_image/runtime_requirements.txt … 141B buildkit.dockerfile.v0
<missing> 3 hours ago WORKDIR /app 0B buildkit.dockerfile.v0
<missing> 11 days ago CMD ["python3"] 0B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; for src in idle3 p… 36B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 46.4MB buildkit.dockerfile.v0
<missing> 11 days ago ENV PYTHON_SHA256=ae665bc678abd9ab6a6e1573d2… 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV PYTHON_VERSION=3.10.18 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8… 0B buildkit.dockerfile.v0
<missing> 11 days ago RUN /bin/sh -c set -eux; apt-get update; a… 9.17MB buildkit.dockerfile.v0
<missing> 11 days ago ENV LANG=C.UTF-8 0B buildkit.dockerfile.v0
<missing> 11 days ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0
<missing> 11 days ago # debian.sh --arch 'arm64' out/ 'bookworm' '… 97.2MB debuerreotype 0.15
This direct comparison in layer size demonstrates a saving of 107MB simply by structuring the cleanup correctly within the same RUN instruction.
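If you keep both variants around under separate tags (the :chained and :unchained tags implied below are hypothetical), docker images makes the size comparison easy to eyeball:

# Compare final image sizes for every tag of the repository
docker images bert-classifier-layers --format 'table {{.Tag}}\t{{.Size}}'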
The Gatekeeper: Mastering .dockerignore
Our final principle concerns the very beginning of the build process. When you execute docker build ., the . (or any path you specify) defines the “build context.” Docker meticulously packages everything within this path (respecting the .dockerignore file, of course) into an archive and transmits it to the Docker daemon. The daemon then unpacks this context and uses it as the sole source of local files for any COPY or ADD instructions in your Dockerfile. It has no access to anything on your filesystem outside this context.
The problem, particularly for AI projects, is that our project directories are often treasure troves of files utterly irrelevant to the final runtime image: local datasets, model checkpoints, Jupyter notebooks, Python virtual environments, and the entire .git history. Sending a multi-gigabyte context isn’t just slow (especially if your daemon is remote, like in many CI systems); it’s also a security and cleanliness concern. You risk accidentally COPYing sensitive information or development artifacts into your image.
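Before reaching for a fix, it’s worth measuring the damage. With BuildKit, the plain progress output reports the context transfer size directly; a quick sketch (the context-probe tag is just a throwaway name):

# Surface the context-transfer line from BuildKit's build log
docker build --progress=plain -t context-probe . 2>&1 | grep 'transferring context'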
The .dockerignore file is your vigilant gatekeeper. It’s a simple text file, placed in the root of your build context, that uses patterns (much like .gitignore) to specify which files and directories should be excluded from the context before it’s ever packaged and sent to the daemon.
A comprehensive .dockerignore for an AI project might look like this:
# .dockerignore
# Python virtual environments
.venv/
env/
venv/
# Python caches and compiled files
__pycache__/
*.py[cod] # .pyc, .pyo, .pyd
*.egg-info/
dist/
build/
*.so # Compiled shared objects, unless explicitly needed and copied
# IDE and OS specific
.vscode/
.idea/
*.swp
*.swo
.DS_Store
Thumbs.db
# Notebooks and exploratory artifacts
notebooks/
*.ipynb_checkpoints
# Test-related files (if not run inside the container build)
tests/
.pytest_cache/
htmlcov/
.coverage
# Large data or model files not intended for baking into the image
data/
models/
model_checkpoints/
*.pt
*.onnx
*.h5
# Log files
*.log
# Dockerfile itself (usually not needed to be COPIED into the image)
# Dockerfile
# Version control
# .git
# .gitignore
By meticulously defining what to ignore, you ensure the build context is lean. This speeds up the initial “Sending build context to Docker daemon…” step, reduces the chance of accidental data inclusion, and makes your COPY . . commands safer and more predictable.
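To verify the gatekeeper is doing its job, one simple trick is to build a throwaway image whose only content is your build context, then list what made it in (the context-check names are hypothetical; .dockerignore rules still apply to the COPY):

# Write a minimal Dockerfile that copies the entire context into the image
cat > context-check.Dockerfile <<'EOF'
FROM busybox
COPY . /context
CMD ["find", "/context"]
EOF
# Build quietly, then print every file that survived .dockerignore
docker build -q -f context-check.Dockerfile -t context-check .
docker run --rm context-check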
Conclusion
In some sense, a Dockerfile is just another tool. Yet, by delving into its mechanics, by understanding how it transforms your instructions into an image, you gain a craftsman’s control. We’ve seen that the deliberate ordering of instructions to honor the build cache can turn minutes of waiting into seconds of action. We’ve learned that the artful chaining within RUN commands isn’t just about syntax; it’s about sculpting lean, efficient layers. And we’ve recognized the .dockerignore file not as a minor detail, but as a crucial guardian of our build process’s integrity and speed.
These principles (layers, caching, chaining, and context management) are fundamental. Mastering them is key to moving beyond simply creating Docker images to truly engineering them for efficiency, speed, and cleanliness, especially in the demanding world of AI.
Your Turn
Now that you understand these mechanics, revisit your own Dockerfiles. Can you reorder layers for better caching? Can you chain RUN commands for more aggressive cleanup? Implement a robust .dockerignore. Share your findings or questions in the comments below!