In a recent tech report, Apple has provided more details on the performance and characteristics of the new Apple Intelligence Foundation Models that will be part of iOS 26, announced at WWDC 2025.
Apple's foundation models include a 3B-parameter version optimized to run on Apple Silicon-powered devices, as well as a larger model designed to run on Apple's Private Cloud Compute platform. Apple emphasizes that both models were trained on data gathered through responsible web crawling, licensed corpora, and synthetic data. Further training stages included supervised fine-tuning and reinforcement learning.
According to Apple, the 3B-parameter model is designed for efficiency, low latency, and minimal resource usage. The larger model, by contrast, aims to deliver high accuracy and scalability. Apple notes that, given its reduced size, the on-device model isn't intended to serve as a world-knowledge chatbot, but it can support advanced capabilities such as text extraction, summarization, image understanding, and reasoning with just a few lines of code.
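As an example, a summarization call through the Foundation Models framework introduced at WWDC 2025 can look roughly like the sketch below. This is a minimal illustration rather than code from Apple's report, and the exact API surface may change while the framework is in beta.

```swift
import FoundationModels

// Minimal sketch: summarizing text with the on-device model through the
// Foundation Models framework (iOS 26 / Xcode 26 beta). Not code from
// Apple's report; API details may change while the framework is in beta.
func summarize(_ text: String) async throws -> String {
    // A session targets the on-device system language model by default.
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Summarize the following text in one short paragraph:\n\(text)"
    )
    return response.content
}
```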
On the architecture side, the 3B-parameter model uses KV-cache sharing, a technique that reduces the time-to-first-token, and is compressed using 2-bit quantization-aware training. The model is split into two blocks, and sharing the key-value caches between them reduces memory usage by 37.5%, says Apple. Quantization-aware training makes it possible to recover quality by simulating the effect of 2-bit quantization at training time:
Unlike the conventional quantization scheme which derives the scale from weights W, we introduce a learnable scaling factor f that adaptively fine-tunes the quantization range for each weight tensor.
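To illustrate the idea in the quote, the sketch below simulates the forward pass of 2-bit fake quantization on a small weight tensor, once with a scale derived from the weights and once with an externally supplied scale standing in for the learnable factor f. It is not Apple's implementation, and the gradient handling used in actual quantization-aware training (typically a straight-through estimator) is omitted.

```swift
// Illustrative sketch only: forward-pass "fake quantization" to 2 bits, the
// operation simulated during quantization-aware training. A conventional
// scheme derives the scale from the weight tensor itself; per the report,
// Apple instead learns a scaling factor f that adapts the quantization range.
func fakeQuantize2Bit(_ weights: [Float], scale f: Float) -> [Float] {
    let qMin: Float = -2, qMax: Float = 1      // the four signed 2-bit levels
    return weights.map { w in
        let q = (w / f).rounded()              // project onto the integer grid
        return min(max(q, qMin), qMax) * f     // clamp, then de-quantize
    }
}

// Conventional baseline: scale derived from the weights W via a max-abs mapping.
func derivedScale(_ weights: [Float]) -> Float {
    (weights.map { abs($0) }.max() ?? 1) / 2
}

let w: [Float] = [0.31, -0.12, 0.07, -0.45]
let conventional = fakeQuantize2Bit(w, scale: derivedScale(w))
let withLearnedScale = fakeQuantize2Bit(w, scale: 0.2)   // f would be trained, not fixed
```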
For the server-side model, Apple used a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, sparse computation, and interleaved global–local attention. The model comprises multiple transformers, called tracks, that process tokens independently, each with its own set of MoE layers. Apple says that the combination of parallel token processing with the MoE approach reduces synchronization overhead and allows the model to scale more efficiently.
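One way to picture this, under a simplified reading of that description rather than anything published by Apple, is a set of independent tracks, each owning its own layers, including a sparse mixture-of-experts feed-forward, with track outputs combined only at a block boundary. All names in the sketch below are invented.

```swift
typealias Hidden = [Float]

// Schematic sketch (invented names, simplified reading of the PT-MoE idea):
// each track owns its layers, including a mixture-of-experts feed-forward
// with top-1 routing, and runs without cross-track communication.
struct MoEFeedForward {
    var routerWeights: [Hidden]            // one routing vector per expert
    var experts: [(Hidden) -> Hidden]      // simplified expert networks

    func callAsFunction(_ h: Hidden) -> Hidden {
        // Sparse computation: score every expert, run only the best-scoring one.
        let scores = routerWeights.map { r in
            zip(r, h).reduce(0) { $0 + $1.0 * $1.1 }
        }
        let best = scores.indices.max { scores[$0] < scores[$1] }!
        return experts[best](h)
    }
}

struct TrackBlock {
    var layers: [(Hidden) -> Hidden]       // e.g. attention plus MoE layers
    func process(_ h: Hidden) -> Hidden {
        layers.reduce(h) { state, layer in layer(state) }
    }
}

// Tracks process the hidden state independently; their outputs are merged
// (here simply averaged) only once, at the block boundary, which is where the
// reduced synchronization overhead would come from.
func parallelTrackStep(tracks: [TrackBlock], input: Hidden) -> Hidden {
    let outputs = tracks.map { $0.process(input) }
    let n = Float(outputs.count)
    return (0 ..< input.count).map { i in
        outputs.reduce(0) { $0 + $1[i] } / n
    }
}
```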
To evaluate its foundation models, Apple researchers relied on human graders to assess each model’s ability to produce a native-sounding response. The results show that the on-device model performs well against Qwen-2.5-3B across all supported languages, and remains competitive with larger models like Qwen-3-4B and Gemma-3-4B in English. The larger server-side model performs favorably against Llama-4-Scout, but falls short compared to much larger models such as Qwen-3-235B and GPT-4o.
For image understanding, Apple followed the same approach by asking humans to evaluate image-question pairs, including text-rich images like infographics:
We found that Apple’s on-device model performs favorably against the larger InternVL and Qwen and competitively against Gemma, and our server model outperforms Qwen-2.5-VL, at less than half the inference FLOPS, but is behind Llama-4-Scout and GPT-4o.
As a final note, Apple researchers emphasize their approach to Responsible AI, which includes enforcing a baseline of safety guardrails to mitigate harmful model inputs and outputs. These safeguards were also evaluated through a combination of human assessment and auto-grading. Apple has also published educational resources to help developers apply Responsible AI principles.
As mentioned, Apple's AI foundation models require Xcode 26 and iOS 26 and are currently available as beta software.