ByteDance unveils UltraMem architecture to reduce large model inference costs by up to 83% · TechNode

Last updated: 2025/11/12 at 4:19 PM

News Room Published 12 November 2025

ByteDance’s Doubao Large Model team yesterday introduced UltraMem, a new architecture designed to address the heavy memory-access overhead that Mixture of Experts (MoE) models incur during inference. According to the team, UltraMem speeds up inference by two to six times and can cut inference costs by up to 83%. As large models grow in size, inference costs and memory efficiency have become critical bottlenecks. UltraMem, a sparse model architecture that decouples computation from parameters, aims to tackle these challenges while maintaining model performance. The work has been accepted for presentation at ICLR 2025 (International Conference on Learning Representations, a major AI research conference), and ByteDance says it offers a novel approach to improving the efficiency and scalability of large models. [Doubao Large Model team WeChat account]