A new post on Apple’s Machine Learning Research blog shows how much of a leap the M5 chip makes over the M4 when it comes to running local LLMs. Here are the details.
A bit of context
A couple of years ago, Apple released MLX, which the company describes as “an array framework for efficient and flexible machine learning on Apple silicon”.
In practice, MLX is an open-source framework that helps developers build and run machine learning models natively on their Apple silicon Macs, supported by APIs and interfaces that are familiar to the AI world.
Here’s Apple again on MLX:
MLX is an open source array framework that is efficient, flexible, and highly tuned for Apple silicon. You can use MLX for a wide variety of applications ranging from numerical simulations and scientific computing to machine learning. MLX comes with built-in support for neural network training and inference, including text and image generation. MLX makes it easy to generate text with or fine-tune large language models on Apple silicon devices.
MLX takes advantage of Apple silicon’s unified memory architecture. Operations in MLX can run on either the CPU or the GPU without needing to move memory around. The API closely follows NumPy and is both familiar and flexible. MLX also has higher level neural net and optimizer packages along with function transformations for automatic differentiation and graph optimization.
One of the MLX packages available today is MLX LM, which is meant for generating text and for fine-tuning language models on Apple silicon Macs.
With MLX LM, developers and users can download most models available on Hugging Face, and run them locally.
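As a rough illustration (not code from Apple’s post), here is a minimal Python sketch of that workflow using mlx-lm’s load and generate helpers. The Hugging Face repo name is just a placeholder; substitute any MLX-compatible model.

```python
# Minimal sketch of running a Hugging Face model locally with MLX LM
# (assumes `pip install mlx-lm`; the repo name below is only a placeholder).
from mlx_lm import load, generate

# Downloads the weights and tokenizer on first run, then caches them locally.
model, tokenizer = load("mlx-community/Qwen3-1.7B-4bit")

prompt = "Explain Apple silicon's unified memory in two sentences."

# Inference runs entirely on-device; no server or API key involved.
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(text)
```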
The framework also supports quantization, a compression method that lets large models run in less memory. That, in turn, speeds up inference, the step in which the model actually produces an answer to a prompt.
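For context on how that quantization step typically happens, mlx-lm also ships a convert utility that can rewrite a model’s weights at lower precision. The sketch below is illustrative only; the repo name, output path, and 4-bit setting are assumptions rather than anything Apple prescribes.

```python
# Sketch: convert and quantize a Hugging Face model to 4-bit MLX weights
# (assumes `pip install mlx-lm`; repo name and output path are placeholders).
from mlx_lm import convert

convert(
    "Qwen/Qwen3-8B",            # source weights on Hugging Face
    mlx_path="qwen3-8b-4bit",   # where the converted weights are written
    quantize=True,              # store weights at reduced precision
    q_bits=4,                   # 4 bits per weight, as in Apple's 4-bit benchmarks
)
```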
M5 vs. M4
In its blog post, Apple showcases the inference performance gains of the new M5 chip, thanks to the chip’s new GPU Neural Accelerators, which Apple says “provide dedicated matrix-multiplication operations, which are critical for many machine learning workloads.”
To illustrate the performance gains, Apple compared the time it took for multiple open models to generate the first token after receiving a prompt on an M4 and an M5 MacBook Pro, using MLX LM.
Or, as Apple put it:
We evaluate Qwen 1.7B and 8B, in native BF16 precision, and 4-bit quantized Qwen 8B and Qwen 14B models. In addition, we benchmark two Mixture of Experts (MoE): Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision). Evaluation is performed with mlx_lm.generate, and reported in terms of time to first token generation (in seconds), and generation speed (in terms of token/s). In all these benchmarks, the prompt size is 4096. Generation speed was evaluated when generating 128 additional tokens.
These were the results (Apple’s post includes the full benchmark charts):
One important detail here is that LLM inference works differently when producing the very first token than when producing each subsequent one. Generating the first token means processing the entire prompt, which can be done largely in parallel, so it is compute-bound; each token after that is generated one at a time and is limited mostly by how quickly the model’s weights can be read from memory, making it memory-bound.
This is why Apple also evaluated generation speed for 128 additional tokens, as described above. And in general, the M5 showed a 19-27% performance boost compared to the M4.
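If you want to reproduce a rough version of these two measurements on your own Mac, a timing loop around mlx-lm’s streaming API gets you in the ballpark. This is a back-of-the-envelope sketch rather than Apple’s benchmark harness, and the model repo and short prompt are placeholders (Apple used 4,096-token prompts).

```python
# Rough sketch: measure time to first token and decode speed with mlx-lm's
# streaming API. Not Apple's harness; repo name and prompt are placeholders.
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")
prompt = "Write a short history of the Macintosh."

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _ in stream_generate(model, tokenizer, prompt, max_tokens=128):
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # compute-bound prefill ends here
    n_tokens += 1
elapsed = time.perf_counter() - start

print(f"time to first token: {first_token_at:.2f} s")
print(f"decode speed: {(n_tokens - 1) / (elapsed - first_token_at):.1f} tok/s")
```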
Here’s Apple on these results:
On the architectures we tested in this post, the M5 provides [a] 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth (120GB/s for the M4, 153GB/s for the M5, which is 28% higher). Regarding memory footprint, the MacBook Pro 24GB can easily hold a[n] 8B in BF16 precision or a 30B MoE 4-bit quantized, keeping the inference workload under 18GB for both of these architectures.
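A quick back-of-the-envelope calculation (my arithmetic, not Apple’s) shows why decode speed tracks memory bandwidth so closely: each generated token needs roughly the full set of weights streamed from memory, so bandwidth divided by model size puts a ceiling on tokens per second.

```python
# Back-of-the-envelope decode ceiling: tokens/s ≈ bandwidth / bytes of weights per token.
# Illustrative only; real throughput also depends on KV-cache traffic and other overheads.
weights_gb = 8e9 * 2 / 1e9  # ~16 GB for an 8B-parameter model in BF16 (2 bytes per weight)

for chip, bandwidth_gb_s in {"M4": 120, "M5": 153}.items():
    print(f"{chip}: ~{bandwidth_gb_s / weights_gb:.1f} tok/s upper bound")

# 153 / 120 ≈ 1.28, so a bandwidth-bound workload improves by roughly the same
# 19-27% Apple reports, independent of the new Neural Accelerators' compute gains.
```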
Apple also compared the performance difference for image generation, and said that the M5 did the job more than 3.8x faster than the M4.
You can read Apple’s full blog post here, and you can learn more about MLX here.
