Kimi released K2, a Mixture-of-Experts (MoE) large language model with 1.04 trillion total parameters, of which 32 billion are activated per token, trained on 15.5 trillion tokens. The release introduces MuonClip, a new optimizer that extends Muon with a QK-Clip technique designed to address training instability; the team reports “zero loss spike” during pre-training. The model comes in two variants, a base version and K2 Thinking, with the latter claiming state-of-the-art results on reasoning, coding, and agentic benchmarks, including 44.9% on Humanity’s Last Exam (HLE) with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified. The release positions K2 as a contender in the open-source model space, particularly for software engineering and agentic tasks, where the model aims to demonstrate strong generalization.
The team validated MuonClip through a series of scaling experiments. They first trained a mid-scale model with 9 billion activated and 53 billion total parameters using the standard Muon optimizer, then tested whether QK-Clip affects model quality, finding that MuonClip preserves Muon’s optimization characteristics without degrading the loss trajectory. For the full-scale Kimi K2 run, the team applied MuonClip with a threshold of τ = 100 and tracked the maximum attention logits throughout training. The maximum logits gradually decayed to a normal operating range without manual intervention, which the team presents as evidence of the optimizer’s stability improvements.
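The report describes QK-Clip as rescaling the query and key projection weights after an optimizer step whenever the maximum attention logit observed during training exceeds the threshold τ. The PyTorch sketch below is a minimal illustration of that idea; the function name and single-tensor treatment are simplifications, since the actual implementation operates per attention head and is integrated with Muon’s update rule:

```python
import torch

TAU = 100.0  # logit threshold used for the full-scale K2 run, per the report

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, s_max: float, tau: float = TAU) -> None:
    """Rescale query/key projection weights in place when the observed
    maximum attention logit s_max exceeds tau.

    Applying sqrt(tau / s_max) to both w_q and w_k scales their product,
    and hence future attention logits, down toward the tau bound.
    """
    if s_max > tau:
        gamma = (tau / s_max) ** 0.5
        w_q.mul_(gamma)
        w_k.mul_(gamma)

# In a training loop (sketch): after optimizer.step(), clip using the largest
# pre-softmax logit recorded during the forward pass.
# s_max = attn_logits.detach().amax().item()
# qk_clip_(model.w_q.data, model.w_k.data, s_max)
```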
Source: Kimi K2 Benchmark Results
Kimi trained K2 on a cluster of NVIDIA H800 GPUs, with each node containing 2 TB of RAM and 8 GPUs connected through NVLink and NVSwitch. The cluster uses 8×400 Gbps RoCE interconnects for cross-node communication. The team designed a flexible parallelism strategy that allows training on any number of nodes that is a multiple of 32, addressing what they describe as dynamic resource availability during large language model training.
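As a rough illustration of why the multiple-of-32 constraint buys flexibility, the sketch below assumes, hypothetically, that one model replica spans 32 nodes and that every additional group of 32 nodes adds a data-parallel replica; the actual pipeline and expert parallel split is not detailed here:

```python
def plan_parallelism(num_nodes: int, gpus_per_node: int = 8,
                     nodes_per_replica: int = 32) -> dict:
    """Check a node count against the multiple-of-32 constraint.

    Hypothetical layout: one model replica spans 32 nodes (256 GPUs) via
    some mix of pipeline and expert parallelism; every further multiple
    of 32 nodes adds a data-parallel replica.
    """
    if num_nodes % nodes_per_replica != 0:
        raise ValueError(f"node count must be a multiple of {nodes_per_replica}")
    return {
        "total_gpus": num_nodes * gpus_per_node,
        "data_parallel_replicas": num_nodes // nodes_per_replica,
    }

print(plan_parallelism(96))  # {'total_gpus': 768, 'data_parallel_replicas': 3}
```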
To manage memory usage, the team applied selective recomputation to specific operations including LayerNorm, SwiGLU, and multi-head latent attention (MLA) up-projections, choosing what they characterize as inexpensive but high-footprint stages. The training process also recomputes MoE down-projections to further reduce activation memory requirements.
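In PyTorch, this kind of selective recomputation is typically expressed with activation checkpointing. The sketch below, a simplified block rather than K2’s actual architecture, recomputes a LayerNorm + SwiGLU stage during the backward pass instead of storing its activations:

```python
import torch
from torch.utils.checkpoint import checkpoint

class SwiGLU(torch.nn.Module):
    """Gated feed-forward stage: cheap to recompute, heavy in activations."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden, bias=False)
        self.w_down = torch.nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(torch.nn.functional.silu(self.w_gate(x)) * self.w_up(x))

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim)
        self.ffn = SwiGLU(dim, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompute norm + SwiGLU in the backward pass instead of storing
        # their activations: a small FLOP cost for a large memory saving.
        return x + checkpoint(lambda t: self.ffn(self.norm(t)), x, use_reentrant=False)
```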
The model can execute 200 to 300 sequential tool calls driven by long-horizon planning and adaptive reasoning. K2 Thinking performs cycles that follow a pattern of think → search → browser use → think → code, generating and refining hypotheses while verifying evidence and constructing answers. This approach allows the model to break down ambiguous, open-ended problems into actionable subtasks.
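A loop of this shape is straightforward to sketch against an OpenAI-compatible chat API. The example below is illustrative only: the endpoint, model name, and web_search tool schema are assumptions, and the tool dispatcher is a stub:

```python
import json
from openai import OpenAI

# Endpoint and model name are assumptions for illustration.
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_KEY")

def run_tool(call) -> str:
    """Stub dispatcher; a real agent would wire this to search, a browser, etc."""
    args = json.loads(call.function.arguments)
    if call.function.name == "web_search":
        return f"(stub) top results for: {args['query']}"
    return "unknown tool"

tools = [{"type": "function", "function": {
    "name": "web_search",
    "description": "Search the web and return result snippets.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]}}}]

messages = [{"role": "user", "content": "Investigate the topic and write a summary."}]

for _ in range(300):  # cap near the ~300 sequential tool calls reported for K2 Thinking
    msg = client.chat.completions.create(
        model="kimi-k2-thinking", messages=messages, tools=tools
    ).choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # the model produced a final answer instead of another tool call
    for call in msg.tool_calls:
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_tool(call)})
```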
For deployment, the team addressed inference efficiency challenges specific to thinking models. While low-bit quantization reduces inference latency and GPU memory usage, thinking models generate long output sequences whose quality typically degrades under quantization. Kimi applied Quantization-Aware Training (QAT) during the post-training phase, using INT4 weight-only quantization on the MoE components. This enables K2 Thinking to run native INT4 inference with approximately a 2x improvement in generation speed.
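A common way to implement weight-only QAT is to fake-quantize the weights in the forward pass and let gradients flow through to the full-precision copy via a straight-through estimator. The sketch below shows group-wise symmetric INT4 fake quantization; the group size is an assumption, and per the report K2 applies the quantization only to the MoE components:

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Simulate INT4 weight-only quantization in the forward pass.

    Group-wise symmetric quantization to the int4 range [-8, 7]; assumes
    w.numel() is divisible by group_size. The straight-through estimator
    lets gradients update the underlying full-precision weights.
    """
    wg = w.reshape(-1, group_size)
    scale = wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (wg / scale).round().clamp(-8, 7)
    deq = (q * scale).reshape(w.shape)
    return w + (deq - w).detach()  # forward uses deq; backward sees identity

# During QAT, an expert's matmul would use the fake-quantized weights:
# y = x @ fake_quant_int4(expert.weight).T
```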
The Kimi K2 license includes a commercial use requirement. Organizations using the model or its derivatives for commercial products or services that exceed 100 million monthly active users or generate more than 20 million US dollars in monthly revenue must prominently display “Kimi K2” on the user interface of such products or services. This attribution requirement differentiates K2’s license from standard open-source licenses that typically do not mandate user-facing acknowledgments for high-scale commercial deployments.
Awni Hannun tested K2 Thinking on Apple Silicon, reporting results that demonstrate the model’s accessibility beyond datacenter infrastructure. Hannun stated:
The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format – no loss in quality! The model was quantization aware trained (qat) at int4. Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm.
Artificial Analysis, which provides independent evaluations of AI models, stated:
Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals.
One commenter on Hacker News noted:
the ultimate competition between models will eventually become a competition over energy. China’s open-source models have major advantages in energy consumption, and China itself has a huge advantage in energy resources. They may not necessarily outperform the U.S., but they probably won’t fall too far behind either.
Kimi K2 enters a competitive open-source model landscape that includes DeepSeek-R1, which also focuses on extended reasoning, Alibaba’s Qwen models with QwQ for reasoning tasks, Mistral’s Mixtral MoE series, and Meta’s Llama 3 family.
The K2 Thinking variant is available on kimi.com and through the Moonshot API platform. The team has released the model weights on Hugging Face, where technical details and implementation guidance are accessible. Complete API documentation is available on the Moonshot platform, providing integration specifications for developers looking to incorporate K2 into their applications.
