Currently in preview with Docker Desktop 4.40 for macOS on Apple Silicon, Docker Model Runner allows developers to run models locally and iterate on application code against them, without disrupting their container-based workflows.
Using local LLMs for development offers several benefits, including lower costs, improved data privacy, reduced network latency, and greater control over the model.
Docker Model Runner addresses several pain points for developers integrating LLMs into containerized apps, such as juggling different tools, configuring environments, and managing models outside of their containers. Additionally, there is no standard way to store, share, or serve models. To reduce that friction, Docker Model Runner includes an inference engine as part of Docker Desktop, built on top of llama.cpp and accessible through the familiar OpenAI API. No extra tools, no extra setup, and no disconnected workflows: everything stays in one place, so you can test and iterate quickly, right on your machine.
To avoid the typical performance overhead of virtual machines, Docker Model Runner uses host-based execution. This means models run directly on Apple Silicon and take advantage of GPU acceleration, which is crucial for inference speed and a smooth development cycle.
For model distribution, Docker is, unsurprisingly, betting on the OCI standard, the same specification that powers container distribution, aiming to unify both under a single workflow.
Today, you can easily pull ready-to-use models from Docker Hub. Soon, you’ll also be able to push your own models, integrate with any container registry, connect them to your CI/CD pipelines, and use familiar tools for access control and automation.
If you are using Docker Desktop 4.40 for macOS on Apple Silicon, you can use the docker model command, which supports a workflow quite similar to the one you are used to with images and containers. For example, you can pull a model and run it. To specify the exact model version, such as its size or quantization, docker model uses tags, e.g.:
docker model pull ai/smollm2:360M-Q4_K_M
docker model run ai/smollm2:360M-Q4_K_M "Give me a fact about whales."
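Other management tasks follow the same image-like workflow. As a sketch, and assuming the preview CLI also provides list and rm subcommands (check docker model --help for what your version supports), you can inspect and clean up local models:

docker model list
docker model rm ai/smollm2:360M-Q4_K_M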
However, the mechanics behind these commands are particular to models, as they do not actually create a container. Instead, the run command delegates the inference task to an Inference Server running as a native process on top of llama.cpp. The inference server loads the model into memory and keeps it cached, unloading it after a period of inactivity.
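If requests seem to hang or fail, it can help to confirm that the native inference process is actually up. Assuming the preview CLI includes a status subcommand (an assumption; consult docker model --help), a quick check looks like:

docker model status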
You can use Model Runner with any OpenAI-compatible client or framework via its OpenAI endpoint at http://model-runner.docker.internal/engines/v1, available from within containers. You can also reach this endpoint from the host, provided you enable TCP host access by running docker desktop enable model-runner --tcp 12434.
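As a quick smoke test from inside a container, you can call the chat completions route of that endpoint with plain HTTP; this is a sketch that assumes the model field accepts the same ai/smollm2 tag used with the CLI:

curl http://model-runner.docker.internal/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2:360M-Q4_K_M", "messages": [{"role": "user", "content": "Give me a fact about whales."}]}'

With TCP host access enabled, the same request should also work from the host, presumably against http://localhost:12434 with the same path.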
Docker Hub hosts a variety of models ready to use with Model Runner, including smollm2 for on-device applications, as well as llama3.3 and gemma3. Docker has also published a tutorial on integrating Gemma 3 into a comment-processing app using Model Runner. It walks through common tasks like configuring the OpenAI SDK to use local models, using the model itself to generate test data, and more.
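In practice, pointing an existing OpenAI-based app at the local endpoint is often just configuration. As a sketch, assuming your OpenAI SDK honors the standard OPENAI_BASE_URL and OPENAI_API_KEY environment variables, a containerized app could be switched over like this:

export OPENAI_BASE_URL=http://model-runner.docker.internal/engines/v1
export OPENAI_API_KEY=unused  # assumption: the local endpoint ignores the key, but most SDKs require one to be set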
Docker Model Runner isn’t the only option for running LLMs locally. If you’re not drawn to Docker’s container-centric approach, you might also be interested in checking out Ollama. It works as a standalone tool, has a larger model repository and community, and is generally more flexible for model customization. While Docker Model Runner is currently macOS-only, Ollama is cross-platform. However, although Ollama supports GPU acceleration on Apple Silicon when run natively, this isn’t available when running it inside a container.