Related Articles
- Date, Time & Timestamp
- Integer, Float and Bit
- Enums and Sets
- Strings (this one)
Introduction
Strings are all over databases, from emails and usernames to messages, logs, and data. It appears easy to use strings in MySQL at first: simply choose VARCHAR
or TEXT
and go on. In practice, string handling conceals some of the most imperceptible and harmful dangers. Unlike numbers or dates, string behavior heavily depends on encodings and data/index storage formats.
This article is part of a series about MySQL data types. It covers string types, how they function, what may go wrong, and how to make safe choices.
Overview of String Types
Character types:
CHAR(n)
: fixed-length stringVARCHAR(n)
: variable-length stringTEXT
family:TINYTEXT
,TEXT
,MEDIUMTEXT
,LONGTEXT
Binary types:
BINARY(n)
: fixed-length binaryVARBINARY(n)
: variable-length binaryBLOB
family:TINYBLOB
,BLOB
,MEDIUMBLOB
,LONGBLOB
MySQL has two sorts of strings: character and binary. Character types are read based on a charset and collation, but binary types only store raw bytes without any constraints.
When you use fixed-width types like CHAR(n)
or BINARY(n)
, they always use the full length and fill in the gaps with spaces or zeros. This makes storage predictable, but it can waste capacity.
Variable-width types, such as VARCHAR(n)
and VARBINARY(n)
, only utilize as many bytes as they need plus a little extra space for length. They work better for content with varied lengths, but they take a little longer to update when the values change in size.
The (n)
for variable-length types tells you how many characters (for character types) or bytes (for binary types) can be in the type. The actual storage depends on the content; shorter values take up less space, but no value can be bigger than that. VARCHAR(50)
in utf8mb4
can hold up to 50 characters, but it may need up to 200 bytes of space inside because each character can occupy up to 4 bytes.
Pitfalls and Lessons
Encodings and Collations
utf8
is not full UTF‑8: MySQL’sutf8
stores only 3-byte characters, so emojis and some symbols fail. Always useutf8mb4
.- Case sensitivity (
_ci
vs_cs
) and locale rules can surprise you. The most common examples are different handling ofß
vsss
, or TurkishI
vsi
. It’s better to choose modern collations likeutf8mb4_0900_ai_ci
. - In many collations,
'abc' = 'abc '
evaluates as true, which may affect uniqueness constraints. - MySQL does not normalize Unicode.
é
as one codepoint ande
+´
as two codepoints look the same but compare differently. Normalize strings at the application layer.
-- Example: trailing spaces ignored
table> CREATE TABLE t (val VARCHAR(10) UNIQUE);
table> INSERT INTO t VALUES ('abc');
table> INSERT INTO t VALUES ('abc ');
ERROR 1062 (23000): Duplicate entry 'abc' for key 'val'
Storage and Indexing
CHAR
vsVARCHAR
:CHAR
pads values with spaces and can waste storage. Use it only for fixed-length tokens (like country codes).- Index length limits: In InnoDB, indexes are limited to 3072 bytes. With
utf8mb4
(up to 4 bytes per char),VARCHAR(1000)
cannot be fully indexed.
-- Example: index too long
table> CREATE TABLE users (
name VARCHAR(1000),
INDEX(name)
) CHARSET=utf8mb4;
-- ERROR 1071: Specified key was too long
- Row size limit: A row cannot exceed 65,535 bytes. Multiple wide
VARCHAR
s can hit this limit even if values are usually short. - Off-page storage: Very long
VARCHAR
orTEXT
values live off-page, adding I/O overhead when fetched frequently.
Text vs Binary Semantics
- Storing hashes or UUIDs in
VARCHAR
means collation rules apply, breaking equality. UseBINARY(16)
for UUID bytes orVARBINARY
for hashes. LENGTH()
counts bytes;CHAR_LENGTH()
counts characters. Mixing them causes subtle bugs with multi-byte charsets.
SELECT LENGTH('🙂'), CHAR_LENGTH('🙂');
-- LENGTH = 4, CHAR_LENGTH = 1
Search and Sorting
LIKE 'abc%'
can use an index, butLIKE '%abc%'
cannot.WHERE LOWER(name) = 'x'
kills index use. Instead, choose a case-insensitive collation or use a generated lowercase column with an index.- FULLTEXT indexes are good for natural-language search but don’t behave like exact matches.
- Sorting large text fields with complex collations often spills to on-disk temp tables.
Application–Database Boundary
- If the client connection is
latin1
but data is UTF‑8, characters get corrupted. Always configure connections to useutf8mb4
. - Schema charset, column collations, and client settings must all align.
-- Safe session setup
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Best Practices
- Use
utf8mb4
everywhere. - Default to
utf8mb4_0900_ai_ci
(MySQL 8.0). - Prefer
VARCHAR
for text,CHAR
only for fixed tokens. - Store hashes and UUIDs in binary columns.
- Normalize Unicode in application code.
- Watch index byte limits; use prefix indexes or generated columns when needed.
- Use FULLTEXT for natural language queries.
Conclusion
String types may seem easy, but they can create the worst issues, including text that doesn’t make sense, broken uniqueness, delayed queries, or comparisons that aren’t what you intended. The problems are caused by how MySQL deals with encodings, collations, and storage restrictions. n n You may avoid problems before they happen in production by using safe defaults (utf8mb4
, modern collations, and the right use of VARCHAR
/BINARY
) and knowing what to look out for.