In a recent paper, Microsoft researchers described BitNet b1.58 2B4T, the first LLM to be natively trained using “1-bit” (technically, 1-trit) weights, rather than being quantized from a model trained with floating point weights. According to Microsoft, the model delivers performance comparable to full-precision LLMs of similar size at a fraction of the computational cost and with significantly lower hardware requirements.
While LLMs have shown impressive performance, there are still barriers to their broader adoption:
State-of-the-art open LLMs typically require large memory footprints, consume considerable energy, and exhibit notable inference latency, rendering them impractical for many edge devices, resource-constrained environments, and real-time applications.
To overcome these limitations, the LLM community has been exploring quantized models, which are derived from full-precision models by converting their weights to a lower-bit format.
Microsoft trained BitNet b1.58 2B4T from scratch on a 4 trillion token corpus using 1-bit weights, aiming to avoid the precision loss typically caused by quantizing a model originally trained in full precision, while retaining the benefits of smaller weights in terms of memory footprint and computational cost.
Indeed, based on Microsoft benchmarks, the new model performs comparably to leading open-weight, full-precision models of similar size across a wide range of tasks, including language understanding and reasoning, world knowledge, reading comprehension, math and code, and instruction following and conversation. The comparative benchmark results are summarized in the chart below:
Where BitNet b1.58 2B4T stands out compared to quantized models of similar or smaller size is in memory footprint, latency, and energy consumption, as shown in the following table.
Architecturally, BitNet b1.58 2B4T replaces standard full-precision linear layers (i.e., torch.nn.Linear) with custom BitLinear layers, which use 1.58-bit representations to encode weights as ternary values (trits) during the forward pass. This is achieved using an absolute mean (absmean) quantization scheme, which maps weights to the ternary values {−1, 0, +1}. This drastically reduces the model size and enables efficient mathematical operations.
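As a rough illustration, the PyTorch-style sketch below shows one way an absmean scheme can be expressed: weights are scaled by their mean absolute value, then rounded and clipped to {−1, 0, +1}. The function name, the per-tensor scaling granularity, and the epsilon clamp are assumptions for illustration, not details of Microsoft's implementation.

```python
import torch

def absmean_quantize_weights(w: torch.Tensor):
    # Hypothetical sketch of absmean weight quantization (not Microsoft's actual code).
    # Scale the weight tensor by its mean absolute value, then round and clip
    # each weight to the nearest ternary value in {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)        # per-tensor absmean scale (assumed granularity)
    w_ternary = (w / scale).round().clamp(-1, 1)  # ternary weights in {-1, 0, +1}
    return w_ternary, scale                       # the scale is kept to rescale layer outputs
```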
Two additional techniques used in BitLinear layers, activation quantization and normalization, further contribute to reducing the model’s size and improving training stability.
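The article does not detail the activation scheme; in the BitNet papers, activations are quantized to 8-bit integers using a per-token absmax scale, and a minimal sketch under that assumption could look like the following (function name and epsilon clamp are illustrative):

```python
import torch

def absmax_quantize_activations(x: torch.Tensor, bits: int = 8):
    # Hypothetical sketch assuming 8-bit, per-token absmax quantization of activations.
    qmax = 2 ** (bits - 1) - 1                                   # 127 for 8-bit
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5)   # per-token maximum magnitude
    x_q = (x * qmax / scale).round().clamp(-qmax, qmax)          # integer-valued activations
    return x_q, scale                                            # the scale is used to dequantize results
```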
In addition to BitLinear layers, BitNet b1.58 2B4T incorporates several established LLM techniques, such as squared ReLU activation functions, rotary positional embeddings, and bias term removal.
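Of these, the squared ReLU activation is the simplest to illustrate: it is the standard ReLU followed by an element-wise square, as in the sketch below.

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: max(x, 0) squared, applied element-wise.
    return torch.relu(x) ** 2
```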
For training, BitNet b1.58 2B4T relies on three techniques: large-scale pre-training, supervised fine-tuning, and direct preference optimization. The researchers note that more advanced techniques, such as Proximal Policy Optimization or Group Relative Policy Optimization, will be explored in the future to enhance mathematical capabilities and chain-of-thought reasoning.
Given the unique quantization scheme of BitNet b1.58 2B4T, the model cannot be used with standard inference libraries like llama.cpp and requires a specialized kernel. To this end, Microsoft has developed a dedicated open-source inference library, bitnet.cpp. Based on llama.cpp, bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU, with NPU and GPU support coming next.
The researchers note that current GPU hardware is not optimized for 1-bit models and that further performance gains could come from incorporating dedicated logic for low-bit operations. Future research directions include training larger models, adding multi-lingual capabilities and multi-modal integration, and extending the context window length.