OpenAI has released gpt-oss-120b and gpt-oss-20b, two open-weight language models designed for high-performance reasoning, tool use, and efficient deployment. These are the company’s first fully open-weight language models since GPT-2, and are available under the permissive Apache 2.0 license.
The gpt-oss-120b model activates 5.1 billion of its 117 billion parameters per token using a mixture-of-experts architecture. It matches or surpasses the proprietary o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The smaller gpt-oss-20b model activates 3.6 billion of its 21 billion parameters and can run on consumer-grade hardware with just 16 GB of memory, making it suitable for on-device inference or rapid iteration without reliance on cloud infrastructure.
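The mixture-of-experts design means that only a small slice of each model's total parameters participates in any single forward pass. The snippet below is a minimal, illustrative sketch of top-k expert routing in PyTorch; the dimensions, expert count, and top-k value are placeholders and do not reflect gpt-oss's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative mixture-of-experts layer: a router picks k experts
    per token, so only those experts' parameters are active."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out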
Both models support advanced use cases, including chain-of-thought reasoning, tool use, and structured outputs. Developers can configure the reasoning effort (low, medium, or high) to trade off latency against accuracy.
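When the models are served behind an OpenAI-compatible endpoint (for example via vLLM or Ollama), the reasoning level is typically stated in the system message. The sketch below assumes a local server at http://localhost:8000/v1 and uses the Hugging Face model ID as the model name; both are placeholders, and the exact setup depends on the serving stack.

```python
from openai import OpenAI

# Assumed local OpenAI-compatible server; URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        # gpt-oss reads the desired reasoning level from the system message.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "What is the derivative of x^3 * ln(x)?"},
    ],
)
print(response.choices[0].message.content)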
Trained using techniques adapted from OpenAI’s internal o-series models, gpt-oss models use rotary positional embeddings, grouped multi-query attention, and support 128k context lengths. They were evaluated on coding, health, math, and agentic benchmarks, including MMLU, HealthBench, Codeforces, and TauBench, showing strong performance even compared to closed models like o4-mini and GPT-4o.
Source: OpenAI Blog
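Grouped multi-query attention reduces key/value memory traffic by letting several query heads share a single key/value head. The toy sketch below shows only that sharing step; head counts and dimensions are illustrative, and details such as causal masking and rotary embeddings are omitted.

```python
import torch

def grouped_query_attention(q, k, v):
    """Toy grouped-query attention: q has more heads than k/v, and each
    k/v head is shared by a group of query heads.

    q: (batch, num_q_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim)
    """
    batch, num_q_heads, seq, head_dim = q.shape
    num_kv_heads = k.shape[1]
    group_size = num_q_heads // num_kv_heads
    # Repeat each k/v head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    attn = scores.softmax(dim=-1)
    return attn @ v

q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # 2 shared key/value heads
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])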
OpenAI released the models without applying direct supervision to their chain-of-thought (CoT) reasoning, enabling researchers to study reasoning traces for potential issues such as bias or misuse.
To assess risk, OpenAI performed worst-case scenario fine-tuning on the models using adversarial data in biology and cybersecurity. Even with strong fine-tuning efforts, the models did not reach high-risk capability levels according to OpenAI’s Preparedness Framework. Findings from external expert reviewers informed the final release. The company has also launched a red teaming challenge with a $500,000 prize pool to further evaluate the models under real-world conditions.
The models are available on Hugging Face and several deployment platforms. The 20B model can be run locally with just 16 GB of RAM. As one Reddit user asked:
Can this model be used on a computer without connecting to the internet locally? What is the lowest-powered computer (Altman says ‘high end’) that can run this model?
Another user clarified:
After downloading, you don’t need the internet to run it. As for specs: you’ll need something with at least 16GB of RAM (VRAM or system) for the 20B to ‘run’ properly. A MacBook Air with 16GB can run this at tens of tokens per second. A modern GPU hits hundreds+.
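For readers who want to experiment locally, a minimal sketch using Hugging Face Transformers is shown below. The model ID openai/gpt-oss-20b matches its Hugging Face listing; memory use and the right dtype or offload settings depend on the hardware, so treat the arguments as a starting point rather than a recipe.

```python
from transformers import pipeline

# Runs entirely offline once the weights have been downloaded and cached.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    device_map="auto",      # place layers on GPU/CPU as memory allows
    torch_dtype="auto",
)

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
result = generator(messages, max_new_tokens=200)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply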
Microsoft is also bringing GPU-optimized versions of the 20B model to Windows via ONNX Runtime, making it available through Foundry Local and the AI Toolkit for VS Code.