Meta has officially released the first models in its new Llama 4 family—Scout and Maverick—marking a step forward in its open-weight large language model ecosystem. Designed with a native multimodal architecture and a mixture-of-experts (MoE) framework, these models aim to support a broader range of applications, from image understanding to long-context reasoning.
Llama 4 Scout has 17 billion active parameters distributed across 16 experts and is optimized to run on a single NVIDIA H100 GPU. It supports a 10-million-token context window, suiting it to long-context tasks such as multi-document summarization and reasoning over large codebases. Llama 4 Maverick, by contrast, pairs the same 17 billion active parameters with 128 experts, delivering stronger reasoning and coding capabilities and outperforming several models in its class on Meta's benchmarks.
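The mixture-of-experts design is what lets these models hold far more total parameters than they compute with per token: a learned gate routes each token to only a few experts, whose outputs are blended. A minimal top-k gating sketch (illustrative only, not Meta's implementation; all sizes and variable names here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, gate_w, top_k=2):
    """Route token vector x to the top_k highest-scoring experts
    and return the gate-weighted sum of their outputs."""
    scores = x @ gate_w                        # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the top_k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                   # softmax over the chosen experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

# Toy configuration: 16 experts (as in Scout), tiny 8-dim vectors.
d, n_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)

y = moe_layer(x, experts, gate_w)
print(y.shape)  # (8,)
```

Because only `top_k` of the 16 expert matrices are multiplied per token, the compute cost tracks the active parameters, not the total, which is how a many-expert model can still fit a single-GPU serving budget.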
Both models were distilled from Meta's still-training flagship, Llama 4 Behemoth, which has 288 billion active parameters and nearly two trillion in total. Meta claims Behemoth surpasses GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on multiple STEM benchmarks. Although not yet released, Behemoth served as the teacher model for the smaller Scout and Maverick during training.
Source: Meta AI Blog
Beyond model architecture, Meta emphasized a revamped training and post-training strategy, including lightweight supervised fine-tuning, reinforcement learning, and a new curriculum design for handling multimodal input. These changes aim to improve performance on difficult tasks while maintaining efficiency and reducing model bias.
While benchmark numbers show the Llama 4 models performing competitively with industry leaders such as Gemini 2.0 and GPT-4o, some early users are skeptical. One Reddit user wrote:
Either they are terrible or there is something really wrong with their release/implementations. They seem bad at everything I have tried. Worse than 20-30Bs even and completely lack the most general of knowledge.
Another Reddit user added:
This has been my experience as well. I am genuinely hoping they are being run with the wrong settings right now and with a magic fix, they will perform at the levels their benchmark scores claim.
Some professionals in the field are noting inconsistencies. Uli Hitzel, an AI expert, shared a telling example:
The first results from Llama 4 Maverick are indeed impressive, but look – Maverick has 128 experts and it still tells me there are two T’s in “strawberry.” (We have moved on from counting R’s to T’s now…) This is a good reminder that even the most advanced, bare LLMs can produce utterly stupid results if we do not integrate them into a properly designed agentic workflow with appropriate checks and balances.
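Letter-counting is precisely the kind of claim a deterministic check can verify before an answer reaches the user, which is the "checks and balances" Hitzel alludes to. A toy illustration of such a guard (hypothetical helper names, not any real agent framework):

```python
def count_letter(word: str, letter: str) -> int:
    """Deterministically count occurrences of a letter, case-insensitively."""
    return word.lower().count(letter.lower())

def verify_count_claim(word: str, letter: str, claimed: int) -> bool:
    """Accept a model's stated count only if it matches the ground truth."""
    return count_letter(word, letter) == claimed

# The claim from the quote above: two T's in "strawberry".
print(verify_count_claim("strawberry", "t", 2))  # False
print(count_letter("strawberry", "t"))           # 1
```

In an agentic pipeline, a failed check like this would trigger a retry or a corrected answer rather than surfacing the model's raw output.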
Meta has not yet publicly addressed these performance concerns, but it encourages developers and researchers to evaluate the models themselves. Llama 4 Scout and Maverick are now available for download on llama.com and Hugging Face.