Table of Links

- Abstract and 1. Introduction
- Methods
  - 2.1 Tokenizer analysis
  - 2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens
- Results
  - 3.1 Effectiveness of indicators and verification
  - 3.2 Common observations
  - 3.3 Model-specific observations
- Closed-source models
- Discussion, Acknowledgments, and References
- A. Verification details
- B. A short primer on UTF-8 encoding
- C. Outputs for API-based verification
A Verification details
We use three repetitive prompts to induce models to output the candidate token we are testing:
These prompts are all designed to be suitable for base models and do not require specialized instruction tuning or prompt templating. For each prompt we generate three tokens, recording the maximal probability with which our target token is predicted at each step, and then take the maximum over all three prompts. Variation in quoting and spacing helps ensure we do not detect false positives caused by models producing similar tokens without spaces, or by tokens starting with punctuation partially merging with the quotes.
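As an illustration of this verification step, the sketch below runs the probability check with a Hugging Face causal language model. The model name and the repetition prompt templates are placeholders rather than the exact prompts used here; it assumes the candidate token's string form and vocabulary id are already known from the tokenizer analysis.

```python
# Illustrative sketch only; model name and prompt templates are placeholders,
# not the exact prompts used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for the model under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def max_target_probability(token_str: str, token_id: int, n_steps: int = 3) -> float:
    """Highest probability assigned to `token_id` over a few greedy
    generation steps, maximized over several repetitive prompts."""
    prompts = [  # varied quoting/spacing to avoid partial-merge false positives
        f'Please repeat: "{token_str}", "{token_str}", "',
        f"Please repeat: '{token_str}', '{token_str}', '",
        f"Please repeat: {token_str}, {token_str},",
    ]
    best = 0.0
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(n_steps):
            with torch.no_grad():
                logits = model(input_ids).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            best = max(best, probs[token_id].item())
            next_id = probs.argmax().view(1, 1)  # greedily extend the context
            input_ids = torch.cat([input_ids, next_id], dim=-1)
    return best
```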
B A short primer on UTF-8 encoding
UTF-8 is the most prevalent encoding scheme used to represent text in computers and communication protocols worldwide. It efficiently encodes Unicode characters, which encompass a vast range of characters from various writing systems and symbols [32]. Encoding to UTF-8 is often the first step in tokenization.
UTF-8 encoding can be summarized as follows:
• ASCII (Unicode below 128): Single byte, binary 0xxxxxxx representing up to 7 bits.
• 2-byte sequences: 110xxxxx 10xxxxxx representing up to 11 bits.
• 3-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx representing up to 16 bits.
• 4-byte sequences: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx representing up to 21 bits.
The bits indicated by ‘x’ are concatenated to form the Unicode code point.
• 11110101, 11110110 and 11110111 would start 4-byte sequences encoding code points above the Unicode maximum of U+10FFFF, and 111110xx, 1111110x, 11111110 and 11111111 would start sequences of 5-8 bytes; none of these are in use. Together they correspond to decimal 245-255 or hexadecimal F5-FF.
• 11000000 and 11000001 are not in use: a two-byte sequence starting with either of them would encode a code point below 128, which must instead be encoded as a single byte. These are 192/193 in decimal and C0/C1 in hexadecimal.
Additionally, other starting bytes can be covered entirely by other tokens, and also turn out to be unused. A common example is C2/C3, which are only used for code points 128-255. In addition, since code points U+323B0 to U+DFFFF are unassigned, the 0xF1 and 0xF2 bytes are not used in UTF-8 representations of currently defined Unicode characters. Similarly, 0xF4 is only used for the “Supplementary Private Use Area”. However, even if not defined in the current Unicode standard, such characters can easily be inserted into text and are found on web pages.
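The byte layouts above can be reproduced with a few lines of bit arithmetic. The following sketch encodes a code point by hand, following the patterns listed in this appendix; for ordinary (non-surrogate) code points it agrees with Python's built-in UTF-8 encoder.

```python
# Hand-rolled UTF-8 encoder mirroring the bit layouts listed above.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

assert utf8_encode(ord("é")) == "é".encode("utf-8")     # 2 bytes: C3 A9
assert utf8_encode(0x20AC) == "€".encode("utf-8")        # 3 bytes: E2 82 AC
assert utf8_encode(0x1F600) == "😀".encode("utf-8")      # 4 bytes: F0 9F 98 80
```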
C Outputs for API-based verification
We use the following prompt for API-based testing of under-trained tokens.
The strings consist of the problematic token, occasionally prefixed to help identify its source and to avoid leading spaces, as we noticed that models often fail to correctly repeat tokens with leading spaces for unrelated reasons. Although many other prompt formats are effective, we found that this code-based approach avoids false positives more reliably.
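For illustration only, an API-based repetition check along these lines might look like the sketch below. The prompt, the `example: ` prefix, and the model name are hypothetical stand-ins rather than the exact setup used for the results reported here.

```python
# Hedged sketch of API-based repetition testing; the prompt, the "example: "
# prefix, and the model name are illustrative assumptions, not the exact
# setup used in the paper. Requires the `openai` package and an API key.
from openai import OpenAI

client = OpenAI()

def model_can_repeat(candidate: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the model reproduces `candidate` verbatim in its reply."""
    prompt = (
        "Complete the output of this Python snippet exactly:\n\n"
        f'>>> print("example: {candidate}")\n'
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return candidate in (reply.choices[0].message.content or "")
```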
Figure 4 shows the results for Mistral, Anthropic, and OpenAI models.
Figure 4: API prompting results.