Where Glitch Tokens Hide: Common Patterns in LLM Tokenizer Vocabularies | HackerNoon

News Room · Published 12 May 2025

Table of Links

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References

A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

3.2 Common observations

Although many of our findings are dependent on model-specific details such as tokenizer training and configuration, model architecture, and training data, there are a number of commonalities that appear across many different model families.

3.2.1 Single-byte tokens

Tokens representing a single byte are a common source of untrained tokens. The most common occurrences are the ‘fallback’ bytes 0xF5–0xFF, which are never used in UTF-8 encoded text[2] and are a convenient source for quickly locating reference untrained tokens for indicators that require them. In addition, many tokenizers, including those from the Gemma, Llama2 and Mistral families, include every byte as a token, but additionally assign a duplicate token to many characters in the normal ASCII range 0x00–0x7F. For example, in the Gemma models ‘A’ is both token 282, as an unused byte-fallback token, and token 235280, as a text-based ‘A’. These issues are not universal, and we also find models which include precisely the 243 bytes used in UTF-8 as tokens, including the models by EleutherAI [14]. Untrained single-byte tokens are typically classified as ‘partial UTF-8 sequences’ or ‘unreachable’, and our indicators are effective in revealing which ones are never or rarely seen in training. We publish tables showing the status of each single-byte token for each analyzed model in our repository.

Table 1: Detection of under-trained tokens. #Confirmed gives the confirmed/tested numbers for the tokens tested in verification that are predicted with a maximal probability of <1% across verification prompts. Examples were manually chosen for readability, similarity across models, or for being particularly striking. Note that the leading ‘_’ in tokens such as _SolidGoldMagikarp indicates a leading space. ∗We use an unembedding-based indicator for these models (cf. section 3.3.2).

Figure 2: Under-trained token indicators vs. training data. Shown are the (un)embedding-based indicators for the OLMo v1.7 7B model and the number of times each token appears in the first epoch of the training data.
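To make the byte-level observations above concrete, here is a minimal sketch that enumerates a tokenizer's single-byte fallback tokens, flags the bytes that can never occur in valid UTF-8, and checks whether a printable ASCII character also exists as a separate text token. It assumes the Hugging Face transformers library; the model name and the SentencePiece-style `<0xNN>` byte-token spelling are illustrative assumptions that hold for Gemma/Llama2/Mistral-style tokenizers, not part of the paper's own tooling.

```python
# Sketch only: enumerate single-byte tokens and possible ASCII duplicates.
# Assumes a SentencePiece-style tokenizer that spells byte-fallback tokens
# as "<0xNN>" (true for the Gemma, Llama2 and Mistral tokenizers on the HF hub).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2b")  # example model
vocab = tok.get_vocab()  # maps token string -> token id

# 13 byte values can never appear in valid UTF-8: 0xC0, 0xC1 and 0xF5-0xFF,
# which leaves the 243 bytes mentioned in the text.
never_in_utf8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))

for b in range(256):
    byte_token = f"<0x{b:02X}>"
    if byte_token not in vocab:
        continue  # tokenizer does not use byte fallback for this byte
    # Printable ASCII characters may also have a plain text token,
    # e.g. 'A' as both <0x41> and "A" in Gemma.
    text_dup = chr(b) if 0x20 < b < 0x7F and chr(b) in vocab else None
    if b in never_in_utf8 or text_dup is not None:
        print(
            byte_token,
            "id", vocab[byte_token],
            "unreachable (invalid UTF-8 byte)" if b in never_in_utf8
            else f"duplicate of text token id {vocab[text_dup]}",
        )
```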

3.2.2 Fragments of merged tokens

3.2.3 Special tokens

Many models include untrained special tokens, such as <pad>, <unk>, or <|unused_123|>. In the following discussion we generally omit mentioning them, unless their status as an (un)trained token is particularly surprising, as their inclusion in the tokenizer and training data is typically deliberate, for purposes such as the ability to fine-tune models without changing tokenizers. One common observation is that on many occasions tokens such as <mask>, which we expect to be completely untrained, nevertheless appear to have been seen in training. A likely source for this is code repositories or guides about language models using these tokens in normal text, along with tokenizers allowing such special control tokens in normal input text.
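As a quick way to see how a nominally untrained control token can nonetheless show up in ordinary text, the following sketch checks whether a special token written literally in a prompt is collapsed to its registered control-token id. The model name is only an example and not taken from the paper; any tokenizer with registered special tokens can be substituted.

```python
# Sketch only: does the literal string "<unk>" in running text map to the
# registered special-token id? If so, scraped tutorials and code that mention
# such tokens would expose them during training.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example

text = "Many guides note that the loss on <unk> tokens is usually ignored."
ids = tok(text, add_special_tokens=False).input_ids

unk_id = tok.convert_tokens_to_ids("<unk>")
if unk_id in ids:
    print("'<unk>' in plain text was collapsed to special-token id", unk_id)
else:
    print("'<unk>' was split into ordinary sub-tokens")
```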


[2] See Appendix B for a primer on UTF-8 encoding.

[3] When mentioning fragments of more complete tokens, the tokens in parentheses were not detected or verified as under-trained, unless explicitly mentioned otherwise.
