Excel has formulas for just about everything—XMATCH for finding data, IF for logic, and SUMIF for conditional totals. But you’ve probably hit a wall if you’ve ever needed to extract specific patterns from text, validate formats, or manipulate strings in complex ways. That’s where REGEX comes in.
It can handle pattern matching and text manipulation with precision. But if you pair it with the SCAN function, you can solve problems that would otherwise require multiple helper columns or nested formulas. So, if you work with messy data or need to pull specific information from text strings, these functions are worth getting familiar with.
First, let’s understand the REGEX function
It’s pattern matching that makes sense once you try it
REGEX (short for Regular Expressions) is a pattern-matching language used to find, extract, or replace text based on specific rules. Instead of searching for exact text like “apple,” you can search for patterns—like “any word starting with (a) and ending with (e)” or “all email addresses in a cell.”
REGEX is a smarter version of Excel’s FIND function, which works well for basic lookups. However, REGEX handles more complex scenarios. For instance, if you need to pull phone numbers from messy text, or you need to extract everything between parentheses, or validate email formats, then you would use REGEX.
If you’ve worked with pattern matching in other programming languages or tools, Excel’s REGEX function works similarly. They scan text for patterns you define, then perform actions based on what they find. The patterns themselves use special characters and syntax—like d for digits or [A-Z] for uppercase letters—which can look cryptic at first but become intuitive with practice.
Excel has three REGEX functions you can use
They handle different jobs well
Excel rolled out three REGEX functions in 2024: REGEXTEST, REGEXEXTRACT, and REGEXREPLACE. Each handles a different task, and understanding when to use which one makes all the difference.
REGEXTEST
It checks if a pattern exists in text and returns TRUE or FALSE. The syntax is:
=REGEXTEST(text, pattern, [mode])
- text: The cell or string you want to check.
- pattern: The REGEX pattern to search for.
- mode (optional): Controls case sensitivity. Use “i” for case-insensitive matching.
Say you have a column of product codes in the sales data spreadsheet, and you want to flag entries that contain at least three consecutive digits. You can use:
=REGEXTEST(A2, "d{3}")
If cell A2 contains “PRD-12345-X,” the function returns TRUE in column B because it found three digits in a row.
It pulls specific text from a string based on a pattern. The syntax is:
=REGEXEXTRACT(text, pattern, [mode], [instance])
- text: The source text.
- pattern: The REGEX pattern defining what to extract.
- mode (optional): Case sensitivity control (“i” for insensitive).
- instance (optional): Which match to return if there are multiple (1 for first, 2 for second, etc.).
In the sales data, column C contains customer emails mixed with other text. To extract just the email, we’ll use:
=REGEXEXTRACT(B2, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}")
This pattern identifies standard email formats and pulls them out cleanly in column D.
REGEXREPLACE
It swaps text matching a pattern with something else. The syntax is:
=REGEXREPLACE(text, pattern, replacement, [mode], [instance])
- text: The original text.
- pattern: What to find.
- replacement: What to replace it with.
- mode (optional): Case sensitivity.
- instance (optional): Which occurrence to replace (leave blank to replace all).
If column E has phone numbers in various formats—some with dashes, some with parentheses—you can standardize them by using the following formula to strip everything except digits:
=REGEXREPLACE(C2, "[^0-9]", "")
The pattern [^0-9] means “anything that’s not a number,” and replacing it with an empty string leaves just the digits.
These three functions cover most text manipulation needs. However, you can combine them with the SCAN function to gain even more flexibility, especially for processing data across multiple rows or extracting repeating patterns.
The SCAN function makes REGEX even better
I combine both to solve complex problems
SCAN processes arrays row by row and returns cumulative results. It traverses an entire range, applying a function to each value while keeping track of what came before, unlike standard functions that work on single cells.
It uses the following syntax:
=SCAN([initial_value], array, lambda)
Let’s break down the parameters:
- initial_value (optional): To choose the starting point for your calculation. If omitted, SCAN uses the first element of the array as the initial accumulator value, so the lambda function operations start from that element.
- array: The range of cells to process.
- lambda: A custom function that defines what to do with each value. Lambda lets you write formulas like a human and takes two arguments—the accumulated result and the current value.
On its own, SCAN is useful for running totals or conditional processing. But when paired with the REGEX function, you can extract patterns from entire columns and clean messy data in one formula.
Let’s look at a practical example. In the sales data spreadsheet, column G contains transaction descriptions with amounts buried inside text—like “Payment of $250 received” or “Refund $75.50 processed.” You need just the dollar amounts.
A standard REGEXEXTRACT formula works for one cell:
=REGEXEXTRACT(D2, "$d+.?d*")
But to process the entire column, you’d normally drag the formula down hundreds of rows. With SCAN, you handle it all at once, just use:
=SCAN("", G2:G11, LAMBDA(acc, curr, REGEXEXTRACT(curr, "$[d,]+.?d*")))
This formula loops through cells G2 to G11, extracts the dollar amount from each, and returns the whole array. The accumulated value (acc) isn’t needed here since we’re just extracting, but SCAN still requires it in the lambda structure.
It proves handy when you need to build on previous results. Say column I tracks order IDs formatted inconsistently—some have prefixes, some don’t. If you want to extract the numeric portion and create a running count of valid IDs, you would use:
=SCAN(0, E2:E100, LAMBDA(count, id, IF(REGEXTEST(id, "d{4,}"), count+1, count)))
The above formula checks each cell for at least four consecutive digits using the REGEXTEST function. If found, it increments the counter; otherwise, it retains the previous count in column J. The result is a column showing cumulative valid IDs as you move down the list.
SCAN also handles multistep REGEX operations in a single formula. If you need to extract, validate, and transform text all at once, nesting the REGEX function inside the SCAN function does it without the need for helper columns. That’s particularly useful when working with large datasets, where adding extra columns can slow things down.
When to use these functions instead of standard formulas
REGEX and SCAN aren’t replacements for VLOOKUP or IF, but they’re tools for when text patterns matter more than exact matches. If you’re cleaning imported data or extracting specific pieces from unstructured text, they’ll save you time compared to manual edits or nested TEXT functions.
The syntax may look intimidating at first, but once you nail a few patterns, such as email validation, phone number extraction, and splitting names, you can reuse them across projects. Ideally, you should start with REGEXTEST to check for patterns, then move to REGEXEXTRACT when you need the actual text. Add SCAN when you need to process entire columns and need to run the results.