Where Glitch Tokens Hide: Common Patterns in LLM Tokenizer Vocabularies | HackerNoon

News Room · Published 12 May 2025

Table of Links

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References

A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

3.2 Common observations

Although many of our findings are dependent on model-specific details such as tokenizer training and configuration, model architecture, and training data, there are a number of commonalities that appear across many different model families.

3.2.1 Single-byte tokens

Tokens representing a single byte are a common source of untrained tokens. The most common occurrences are the ‘fallback’ bytes 0xF5–0xFF, which are never used in UTF-8 encoded text[2] and are a convenient source for quickly locating reference untrained tokens for indicators that require them. In addition, many tokenizers, including those from the Gemma, Llama2 and Mistral families, include every byte as a token, but additionally assign a duplicate token to many characters in the normal ASCII range 0x00–0x7F. For example, in the Gemma models ‘A’ is both token 282, as an unused byte-fallback token, and token 235280, as a text-based ‘A’. These issues are not universal, and we also find models which include precisely the 243 bytes used in UTF-8 as tokens, including the models by EleutherAI [14]. Untrained single-byte tokens are typically classified as ‘partial UTF-8 sequences’ or ‘unreachable’, and our indicators are effective in revealing which ones are never or rarely seen in training. We publish tables showing the status of each single-byte token for each analyzed model in our repository.

Table 1: Detection of under-trained tokens. #Confirmed gives the confirmed/tested numbers for the tokens tested in verification that are predicted with a maximal probability of <1% across verification prompts. Examples were manually chosen for readability, similarity across models, or for being particularly striking. Note that the leading ‘_’ in tokens such as _SolidGoldMagikarp indicates a leading space. ∗We use an unembedding-based indicator for these models (cf. section 3.3.2).

Figure 2: Under-trained token indicators vs. training data. Shown are the (un)embedding-based indicators for the OLMo v1.7 7B model and the number of times each token appears in the first epoch of the training data.
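To make the byte-level observations above concrete, here is a minimal sketch that enumerates a tokenizer's single-byte fallback tokens, flags the bytes that can never occur in valid UTF-8, and checks whether a printable ASCII character also exists as a separate text token. It assumes the Hugging Face transformers library; the model name and the SentencePiece-style `<0xNN>` byte-token spelling are illustrative assumptions that hold for Gemma/Llama2/Mistral-style tokenizers, not part of the paper's own tooling.

```python
# Sketch only: enumerate single-byte tokens and possible ASCII duplicates.
# Assumes a SentencePiece-style tokenizer that spells byte-fallback tokens
# as "<0xNN>" (true for the Gemma, Llama2 and Mistral tokenizers on the HF hub).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2b")  # example model
vocab = tok.get_vocab()  # maps token string -> token id

# 13 byte values can never appear in valid UTF-8: 0xC0, 0xC1 and 0xF5-0xFF,
# which leaves the 243 bytes mentioned in the text.
never_in_utf8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))

for b in range(256):
    byte_token = f"<0x{b:02X}>"
    if byte_token not in vocab:
        continue  # tokenizer does not use byte fallback for this byte
    # Printable ASCII characters may also have a plain text token,
    # e.g. 'A' as both <0x41> and "A" in Gemma.
    text_dup = chr(b) if 0x20 < b < 0x7F and chr(b) in vocab else None
    if b in never_in_utf8 or text_dup is not None:
        print(
            byte_token,
            "id", vocab[byte_token],
            "unreachable (invalid UTF-8 byte)" if b in never_in_utf8
            else f"duplicate of text token id {vocab[text_dup]}",
        )
```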

3.2.2 Fragments of merged tokens

3.2.3 Special tokens

Many models include untrained special tokens, such as <pad>, <unk>, or <|unused_123|>. In the following discussion we generally omit mentioning them, unless their status as an (un)trained token is particularly surprising, as their inclusion in the tokenizer and training data is typically deliberate, for purposes such as the ability to fine-tune models without changing tokenizers. One common observation is that on many occasions tokens such as <mask>, which we expect to be completely untrained, nevertheless appear to have been seen in training. A likely source for this is code repositories or guides about language models using these tokens in normal text, along with tokenizers allowing such special control tokens in normal input text.
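As a quick way to see how a nominally untrained control token can nonetheless show up in ordinary text, the following sketch checks whether a special token written literally in a prompt is collapsed to its registered control-token id. The model name is only an example and not taken from the paper; any tokenizer with registered special tokens can be substituted.

```python
# Sketch only: does the literal string "<unk>" in running text map to the
# registered special-token id? If so, scraped tutorials and code that mention
# such tokens would expose them during training.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example

text = "Many guides note that the loss on <unk> tokens is usually ignored."
ids = tok(text, add_special_tokens=False).input_ids

unk_id = tok.convert_tokens_to_ids("<unk>")
if unk_id in ids:
    print("'<unk>' in plain text was collapsed to special-token id", unk_id)
else:
    print("'<unk>' was split into ordinary sub-tokens")
```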


[2] See Appendix B for a primer on UTF-8 encoding.

[3] When mentioning fragments of more complete tokens, the tokens in parentheses were not detected or verified as under-trained, unless explicitly mentioned otherwise.
