New AI Method Lets Models Decide What to Think About | HackerNoon

News Room | Published 22 February 2025 (last updated 2:34 PM)

Authors:

(1) David Raposo, Google DeepMind (equal contribution);

(2) Sam Ritter, Google DeepMind;

(3) Blake Richards, Google DeepMind and McGill University & Mila;

(4) Timothy Lillicrap, Google DeepMind;

(5) Peter Conway Humphreys, Google DeepMind;

(6) Adam Santoro, Google DeepMind (equal contribution).

Editor’s note: this is part 2 of 5 of a study detailing a way to make transformer-based language models more efficient by dynamically allocating computational resources. Read the rest below.

Table of Links

  1. Introduction
  2. Background
  3. Implementing Mixture-of-Depths Transformers
    • 3.1. Defining a compute budget
    • 3.2. Routing around transformer blocks
    • 3.3. Routing schemes
    • 3.4. Routing implementation
    • 3.5. Sampling and 3.6. Training methods

  4. Results
    • 4.1. Training, isoFLOP comparisons
    • 4.2. Auto-regressive Evaluation and 4.3. Mixture-of-Depths-and-Experts (MoDE)
  5. Discussion and References

2. Background

The transformer architecture has become the workhorse of a revolution in practical artificial intelligence, bringing unprecedented capabilities at the cost of expensive training runs and serving procedures. This has spurred tremendous interest in making transformer architectures more efficient (Gupta and Agrawal, 2021; Tay et al., 2020). One of the promising approaches is conditional computation, whereby learned mechanisms determine when and how to expend computation. This terminology was introduced by Bengio (2013), and the concept was explored further over the next several years (Bengio et al., 2016, 2013; Cho and Bengio, 2014; Graves, 2016; Jernite et al., 2017; Wang et al., 2017).

A wide variety of recent work has developed conditional computation methods for transformers. Some of this work focuses on “early exiting”, that is, learning to decide when to end computation on a given token, allowing the token to skip any remaining transformer layers after the exit decision is made (Elbayad et al., 2019; Liu et al., 2021; Schuster et al., 2022). In MoD, unlike in early-exit methods, a token can skip middle layers, then be updated via self-attention with tokens that have gone through all the middle layers. We speculate that this might be a useful property.
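
To make that contrast concrete, here is a toy sketch (our illustration, not the authors’ code) in which each “block” simply mixes a token with the mean of the current sequence: under early exit a halted token is frozen for the rest of the stack, while under MoD-style skipping a bypassed token stays on the residual stream and can still be picked up by a later block and updated against fully processed tokens.

```python
# Toy illustration only: real blocks would be self-attention + MLP.

def early_exit(tokens, blocks, exit_at):
    # exit_at[t] is the number of blocks token t passes through before halting;
    # once halted, a token is never updated again, even indirectly.
    for i, block in enumerate(blocks):
        tokens = [block(x, tokens) if i < exit_at[t] else x
                  for t, x in enumerate(tokens)]
    return tokens

def mod_skipping(tokens, blocks, routed):
    # routed[i] is the set of token positions processed by block i; the rest
    # ride the residual stream but remain visible to later blocks.
    for i, block in enumerate(blocks):
        tokens = [block(x, tokens) if t in routed[i] else x
                  for t, x in enumerate(tokens)]
    return tokens

toy_block = lambda x, ctx: x + 0.1 * sum(ctx) / len(ctx)   # crude stand-in for a block
blocks = [toy_block] * 4
print(early_exit([1.0, 2.0, 3.0], blocks, exit_at=[1, 4, 4]))
print(mod_skipping([1.0, 2.0, 3.0], blocks, routed=[{0, 1}, {2}, {0, 2}, {1, 2}]))
```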

Other work has developed methods for iterating transformer layers with shared weights for an adaptive number of steps (Dehghani et al., 2018; Simoulin and Crabbé, 2021). Bolya et al. (2023) developed a method for choosing tokens to merge when running inference on a trained vision transformer, which notably requires no learning. Lei et al. (2023) use conditional computation in a fine-tuning setting by building on adapter approaches (He et al., 2021) to learn to skip blocks of frozen pre-trained weights in favor of running only a small fine-tuned adapter.

Figure 1 | Mixture-of-Depths Transformer. As in mixture-of-experts (MoE) transformers, we use a router to choose among potential computational paths. But unlike in MoE transformers, the possible choices are a standard block’s computation (i.e., self-attention and MLP) or a residual connection. Since some tokens take this second route, Mixture-of-Depths (MoD) transformers have a smaller total FLOP footprint compared to vanilla or MoE transformers. Depicted on the top right are a trained model’s routing decisions for a short sequence, truncated to 64 tokens for visualization purposes. When examining the choices, one can find tokens processed by later blocks’ layers despite passing through relatively few total blocks throughout the model’s depth. This is a unique feature of MoD compared to conventional halting-based, or "early-exit", conditional computation, which engages blocks serially, and to vanilla transformers, which engage every block.

CoLT5 (Ainslie et al., 2023) uses conditional routing to select whether a given token will pass through a heavy or a light pathway for each feedforward layer. Further, they use the same routing mechanism to select whether a token will attend to all other tokens or to a select few, as in Guo et al. (2022). Like MoD, CoLT5 uses soft top-k for making routing decisions. However, CoLT5 focuses on an encoder-decoder setting, and thus does not need to contend with the problem of efficient sequential decoding given the non-causal nature of the top-k operation. In contrast, our current work with MoD focuses on the decoder-only setting, and so we propose a predictive router to enable efficient inference for conditional computation in transformers.
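
The decoding problem can be seen in a small sketch (our illustration; the function names and the fixed threshold are hypothetical, not the paper’s API): an exact top-k mask needs every token’s router score at once, whereas a per-token predictor, of the kind the predictive router provides, can make a similar decision causally, one token at a time.

```python
# Illustrative sketch of why top-k routing is non-causal at decode time.
# The threshold-based predictor is only a stand-in for a learned predictive router.

import torch

def topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    # Exact routing: requires the scores of *all* tokens in the sequence,
    # including ones that have not been generated yet during decoding.
    idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    rows = torch.arange(scores.size(0)).unsqueeze(-1)
    mask[rows, idx] = True
    return mask

def predicted_mask(scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Causal stand-in: each token's decision depends only on its own score,
    # so it can be made token by token during autoregressive decoding.
    return scores.sigmoid() > threshold

scores = torch.randn(1, 8)           # router scores for one 8-token sequence
print(topk_mask(scores, k=4))        # exact, but needs the whole sequence
print(predicted_mask(scores))        # approximate, but usable while decoding
```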

One successful formulation of conditional computation is the “mixture-of-experts” (MoE) layer introduced by Shazeer et al. (2017). Developed initially in the context of LSTMs, later work showed compelling empirical results for MoE with transformers (Fedus et al., 2022; Lepikhin et al., 2020; Zoph et al., 2022). Unlike other conditional computation approaches that try to conserve or expend additional compute, MoE transformers use conditional logic to route tokens to one of many expert MLPs while keeping total compute expenditure constant. Our mixture-of-depths method can be thought of as using the routing logic from MoE transformers, but rather than having multiple experts, MoD deploys a single expert which can be dynamically skipped.
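
As a rough illustration of that comparison, here is a minimal single-block sketch in PyTorch (our assumptions about shapes and gating, not the authors’ implementation; the paper’s exact routing and weighting are specified in Section 3): a scalar router scores every token, only the top-k tokens are run through self-attention and the MLP, and the rest pass through unchanged on the residual stream, whereas a MoE layer would instead send every token to one of several expert MLPs.

```python
# Minimal sketch of a single MoD-style block (causal masking and other details
# are omitted for brevity).

import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, capacity: int):
        super().__init__()
        self.capacity = capacity                     # k: tokens this block processes
        self.router = nn.Linear(d_model, 1)          # scalar routing weight per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                            # x: [batch, seq, d_model]
        scores = self.router(x).squeeze(-1)          # [batch, seq]
        idx = scores.topk(self.capacity, dim=-1).indices.sort(dim=-1).values
        gather = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        sel = torch.gather(x, 1, gather)             # only the routed tokens

        # The block's computation (self-attention + MLP) runs over the k routed
        # tokens only, which is where the FLOP savings come from.
        h = self.norm1(sel)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        block_out = attn_out + self.mlp(self.norm2(sel + attn_out))

        # Routed tokens get the block output added to their residual, scaled by
        # the router score (a sigmoid gate here, so routing stays differentiable);
        # all other tokens keep their residual value unchanged.
        gate = torch.gather(scores, 1, idx).sigmoid().unsqueeze(-1)
        return x.scatter(1, gather, sel + gate * block_out)

# Toy usage: half of a 16-token sequence is processed, the rest skips the block.
block = MoDBlock(d_model=32, n_heads=4, capacity=8)
y = block(torch.randn(2, 16, 32))
print(y.shape)                                       # torch.Size([2, 16, 32])
```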
