
Regular Expression 101 for Programmers

Most regular expression guides teach beginners. This one is for experienced programmers who are unfamiliar with regular expressions.

This article is part of my web development series 'webDEV101'. You can read the whole series here:

#webDEV101
Rest assured, it’s easy.

Regular Expression


Regular expressions are used to match strings.

At some point in your web dev or general coding journey, you will inevitably encounter a situation where you want to find and replace in bulk. For instance, if you just realised you forgot to enclose all your KaTeX code in \(\) inside your HTML before deployment, you will want to find every KaTeX snippet and enclose them all in one go.

The problem is that KaTeX snippets are not fixed strings. You can't just search for them with literal text replacement. That's where regular expressions come in. They supercharge your find-and-replace so it can search for not just text but also text patterns, taking over the otherwise tedious manual find-and-replace process.

In addition to finding and replacing, Regex can also help you search for information efficiently. For instance, if you have a large amount of text and you want to know which names are mentioned and how many times, you can construct a Regex that searches for the names and returns all matches and their counts in one go.
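As a rough sketch of the name-counting idea, the snippet below tallies matches with String.prototype.matchAll. The sample text and the list of names are made up for illustration, and the pattern uses syntax (\b, groups, and the g flag) that is explained later in this article.

```javascript
// Hypothetical sample text and name list, for illustration only.
const text = "Alice met Bob. Later, Alice called Carol, and Bob called Alice.";
const namePattern = /\b(?:Alice|Bob|Carol)\b/g;

// Tally how many times each name appears.
const counts = new Map();
for (const match of text.matchAll(namePattern)) {
  counts.set(match[0], (counts.get(match[0]) ?? 0) + 1);
}

console.log(Object.fromEntries(counts));
// { Alice: 3, Bob: 2, Carol: 1 }
```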

The variations


Regular expression, or Regex for short, is supported natively in many text editors, programming languages, and file managers, such as the built-in editor in VS Code, JavaScript, and the Total Commander file manager.

There are a handful of variations in circulation today:

Abbreviation | Full name | Used by
POSIX | Portable Operating System Interface | UNIX and associated CLI tools
PCRE | Perl Compatible Regular Expressions | Perl, PHP, R, and many others
RE2 | Regular Expressions 2 | Google's software and libraries
Oniguruma | Oniguruma Regular Expressions | Ruby, PHP, and others
ECMA | ECMAScript® Language Specification | JavaScript and many web applications

The POSIX standard can be further divided into:

Abbreviation | Full name | Used by
ERE | Extended Regular Expressions | Some POSIX-compliant systems
BRE | Basic Regular Expressions | All POSIX-compliant systems

The bad news is that two of your favourite apps may be using different standards.

The good news is most of those standards are similar.

This tutorial focuses on the ECMA standard, which is widely used in web development and modern text editors.

ECMAScript Language Specification - ECMA-262 Edition 5.1

The basics


Simple pattern

A plain string matches itself. For instance:

abc matches the 'abc' in 'Let's sing abc!'.


[list of characters] matches for any ONE of the characters in the list. For instance:

[abc] matches for a, b, or c;

[a-c] is equivalent to [abc];

[ac-] matches a, c, or -;

[^abc] matches any single character except a, b, or c.
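The character lists above can be tried out in JavaScript with a quick sketch like the following. The test strings are arbitrary examples, not from any real data.

```javascript
// Each pattern is checked against a small, made-up test string.
const inList = /[abc]/.test("cat");              // "c" and "a" are in the list
const inRange = /[a-c]/.test("dog");             // no a, b, or c in "dog"
const hyphen = /[ac-]/.test("x-y");              // a trailing hyphen is literal
const firstOther = "cabbage".match(/[^abc]/)[0]; // first character outside the list

console.log(inList, inRange, hyphen, firstOther);
// true false true g
```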


. is the wildcard that matches anything except line terminators like \n and \r;

\d matches for numerals. B\d for instance, matches for B1, B2, B3, ...;

\w matches for any word character: letters A-Z and a-z, \d numerals, and the underscore _;

\s matches for whitespace characters, including tabs and line terminators like \r and \n;

If you just wish to match a single space, type a space in the pattern and it is matched literally. For instance, He is handsome matches for 'He is handsome and young!'

For \d, \w, and \s, you can use the capitalised counterpart to match for anything EXCEPT that class. \D matches for any character except numerals; \W matches for any character except word characters; while \S matches for any character except whitespace.
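Here is a small sketch of these shorthand classes in JavaScript. The sample string is invented for the example; note how the underscore counts as a \w character.

```javascript
// Made-up sample containing a space, a tab, digits, and an underscore.
const sample = "Room B12,\tfloor_3";

const roomCode = sample.match(/B\d/)[0];            // "B1"
const words = sample.match(/\w+/g);                 // ["Room", "B12", "floor_3"]
const whitespaceCount = sample.match(/\s/g).length; // 2 (one space, one tab)
const digitsOnly = sample.replace(/\D/g, "");       // "123"

console.log(roomCode, words, whitespaceCount, digitsOnly);
```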

\t matches for a tab;

\n matches for a line feed, also called a newline character, which is created when you press ENTER in most text editors. It is usually invisible.

For instance, if you wish to match for:

'Hello
World!'

You would search for Hello\nWorld!
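A quick sketch of the same search in JavaScript follows. One caveat worth knowing: files saved on Windows often end lines with \r\n rather than \n, so the second half of the example uses \r?\n (the ? quantifier is covered later) to accept both.

```javascript
// A line feed between the two words.
const unixText = "Hello\nWorld!";
const unixMatch = /Hello\nWorld!/.test(unixText);           // true

// A carriage return plus line feed, as Windows editors often write.
const windowsText = "Hello\r\nWorld!";
const strictMatch = /Hello\nWorld!/.test(windowsText);      // false
const tolerantMatch = /Hello\r?\nWorld!/.test(windowsText); // true

console.log(unixMatch, strictMatch, tolerantMatch);
// true false true
```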

A|B, which is called disjunction, matches for either A or B. For instance, Green|Red matches 'Green Apples' and 'Red Apples'. You can use disjunctions as many times as you like in parallel, like in A|B|C|D|E|F.
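The sketch below tries the Green|Red disjunction in JavaScript. It also shows one thing to watch for: | has the lowest precedence, so Green|Red Apples means 'Green' or 'Red Apples'. Wrapping the alternatives in a group (covered later) limits their scope.

```javascript
const colour = /Green|Red/;
const green = colour.test("Green Apples"); // true
const red = colour.test("Red Apples");     // true
const blue = colour.test("Blue Apples");   // false

// Without a group, the alternatives are "Green" and "Red Apples".
const loose = /Green|Red Apples/.test("Green Pears");      // true
// With a group, " Apples" is required after either colour.
const strict = /(?:Green|Red) Apples/.test("Green Pears"); // false

console.log(green, red, blue, loose, strict);
// true true false true false
```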

Assertions

^String matches the String only when it is at the beginning of the text. The official name for ^ is input boundary beginning assertion. In many systems, a multiline (m) flag changes the condition to the beginning of any line, such as right after a \n. We will discuss flags later.

For instance, ^Hello matches only the first 'Hello' in 'Hello! Hello'

String$ matches the String only when it is at the end of the text. The official name for $ is input boundary end assertion. In many systems, a multiline (m) flag changes the condition to the end of any line, such as right before a \n. We will discuss flags later.

For instance, Hello$ matches only the last 'Hello' in 'Hello! Hello'
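To see the two input boundary assertions at work, here is a minimal JavaScript sketch. The index property reports where each match starts, and the last two lines preview the effect of the m flag on a made-up three-line string.

```javascript
const greeting = "Hello! Hello";
const startIndex = greeting.match(/^Hello/).index; // 0, the first 'Hello'
const endIndex = greeting.match(/Hello$/).index;   // 7, the last 'Hello'

// Without m, ^ only matches at the very start of the text.
const lines = "one\ntwo\nthree";
const withoutM = lines.match(/^t\w+/g); // null
const withM = lines.match(/^t\w+/gm);   // ["two", "three"]

console.log(startIndex, endIndex, withoutM, withM);
```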

\b is called word boundary assertion. \boo would match 'oops' but not '🦘kangaroo'. oo\b would match '🦘kangaroo' but not 'oops'.

\B is the inverse of \b, just as \D, \W, and \S are the inverses of \d, \w, and \s. It matches at any position that is NOT a word boundary.
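The word boundary examples can be checked with a short sketch like this one (the emoji is left out of the test strings to keep them simple):

```javascript
// \b requires a word boundary at that position; \B requires there to be none.
const startsWord = /\boo/.test("oops");        // true: 'oo' begins the word
const insideWord = /\boo/.test("kangaroo");    // false: 'oo' follows 'r'
const endsWord = /oo\b/.test("kangaroo");      // true: 'oo' ends the word
const notAtEnd = /oo\b/.test("oops");          // false: 'p' follows 'oo'
const nonBoundary = /\Boo/.test("kangaroo");   // true: no boundary before 'oo'
const nonBoundaryFail = /\Boo/.test("oops");   // false: 'oo' sits at a boundary

console.log(startsWord, insideWord, endsWord, notAtEnd, nonBoundary, nonBoundaryFail);
// true false true false true false
```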


x(?=y) is a lookahead assertion. It matches only for 'x's followed by a 'y'.

x(?!y) is a negative lookahead assertion. It matches only for 'x's NOT followed by a 'y'.

(?<=y)x is a lookbehind assertion. It matches only for 'x's PRECEDED by a 'y'.

(?<!y)x is a negative lookbehind assertion. It matches only for 'x's NOT preceded by a 'y'.
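The four lookaround assertions can be compared side by side with a sketch like the one below. The helper startsOf is a hypothetical convenience function written for this example; it lists the index of every match. The key point is that the 'y' is only checked, never included in the match.

```javascript
// Positions:      0123456789 10
const letters = "xy xz yx zx";

// Hypothetical helper: start index of every match of a global pattern.
const startsOf = (pattern) => [...letters.matchAll(pattern)].map((m) => m.index);

const followedByY = startsOf(/x(?=y)/g);     // [0]
const notFollowedByY = startsOf(/x(?!y)/g);  // [3, 7, 10]
const precededByY = startsOf(/(?<=y)x/g);    // [7]
const notPrecededByY = startsOf(/(?<!y)x/g); // [0, 3, 10]

// A more practical use: digits after a dollar sign, without the sign itself.
const price = "Cost: $40 for 40kg".match(/(?<=\$)\d+/)[0]; // "40"

console.log(followedByY, notFollowedByY, precededByY, notPrecededByY, price);
```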

Quantifiers

You will most likely wish to match for more than a few characters. Quantifiers can help.

Quantifier | Minimum | Maximum
? | 0 | 1
* | 0 | Infinity
+ | 1 | Infinity
{count} | count | count
{min,} | min | Infinity
{min,max} | min | max

Quantifiers should be specified immediately after a character or character group [1]. For instance, A\S* matches an 'A' together with any run of non-whitespace characters that follows it, such as a single word starting with 'A'.

[1] A character group is a set of characters enclosed in opening and closing parentheses (). (abc){3} will match any 'abcabcabc' in the target text. More on groups later.

For the {count}, {min,}, and {min,max} syntaxes, there should not be any space around the numbers, or they will be matched as literal characters instead.
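The quantifiers can be exercised with a short JavaScript sketch such as the following. All sample strings are invented. The last two lines illustrate the point about spaces: in a JavaScript pattern without the u flag, x{2, 3} is read as literal text.

```javascript
const optional = ["color", "colour"].every((w) => /^colou?r$/.test(w)); // true
const aWords = "Apple Avocado banana".match(/A\S*/g);   // ["Apple", "Avocado"]
const repeated = /(abc){3}/.test("abcabcabc");          // true

// 'A' followed by two or three digits, as a whole word.
const codes = "A1 A12 A123 A1234".match(/\bA\d{2,3}\b/g); // ["A12", "A123"]

// A space inside the braces turns the quantifier into literal characters.
const spacedAsQuantifier = /x{2, 3}/.test("xx");    // false
const spacedAsLiteral = /x{2, 3}/.test("x{2, 3}");  // true

console.log(optional, aWords, repeated, codes, spacedAsQuantifier, spacedAsLiteral);
```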

Flags

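Flags change how a pattern is matched. They are written here in the inline (?flag) notation that PCRE-style engines and some tools accept. In JavaScript, flags go after the closing slash of the pattern instead, as in /Apple/i.

(?i) ignores letter case. (?i)Apple matches for 'apple'.

(?g) matches globally. The default in some Regex systems is to terminate matching after the first match. The global flag ensures all matches are returned. In most engines it is an option of the search itself rather than something written inside the pattern.

(?m) matches multiple lines. As explained before, ^String matches the String only when it is at the beginning of the TEXT, while String$ matches the String only when it is at the end of the TEXT. (?m) changes the condition to the beginning or end of each LINE.

(?s) allows . to match line terminators such as \n.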
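In JavaScript, the four flags are single letters appended after the closing slash of the pattern. The sketch below shows each one changing the result on a small made-up input.

```javascript
// i: ignore letter case.
const ignoreCase = /apple/i.test("Apple");       // true

// g: return every match instead of only the first.
const firstOnly = "a1b2c3".match(/\d/)[0];       // "1"
const allDigits = "a1b2c3".match(/\d/g);         // ["1", "2", "3"]

// m: ^ and $ apply to each line rather than the whole text.
const wholeText = /^b$/.test("a\nb\nc");         // false
const perLine = /^b$/m.test("a\nb\nc");          // true

// s: the dot also matches line terminators.
const dotDefault = /a.c/.test("a\nc");           // false
const dotAll = /a.c/s.test("a\nc");              // true

console.log(ignoreCase, firstOnly, allDigits, wholeText, perLine, dotDefault, dotAll);
```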

Groups

Groups, denoted with round brackets (group), are used to isolate a snippet within the Regex, to apply local flags or to capture it for future reference.

For instance, in a text editor find and replace:

Hello, I am John! john! Johnathan!
Find: (\bJohn\b)
Replace with: Crazy$1!

-> Hello, I am CrazyJohn!! john! Johnathan!

Hello, I am John! john! Johnathan!
Find: (?i:\bJohn\b)
Replace with: Crazy$1!

-> Hello, I am Crazy$1!! Crazy$1!! Johnathan!

Hello, I am John! john! Johnathan!
Find: (?<name>(?i:\bJohn\b))
Replace with: Crazy${name}!

-> Hello, I am CrazyJohn!! Crazyjohn!! Johnathan!

(\bJohn\b) is an unnamed capturing group. () simply tells Regex this is a group. \bJohn\b is the string pattern. Recall \b is a word boundary, so this pattern matches for any 'John' that stands as a whole word. $1 refers back to the text the group captured ('John' here, or 'john' as well when matching ignores case) in the replace-with field. Inside the pattern itself, ECMAScript writes the same backreference as \1. If there are multiple capturing groups, the match from the first group is referred to as $1, the second as $2, and so on.

Note: if the global flag is enabled, and there are multiple matches for a given capturing group, some Regex systems will return $1 as a list of all matches, while some will only return the last match.
For instance, if we apply (\bJohn\b) with the global and ignore-case flags to Hello, I am John! john! Johnathan!, some systems will return $1=['John', 'john'], while other systems will return $1='john'.
For this reason, the Replace All function in text editors, which is essentially Replace with the global flag switched on, can cause unexpected behaviours.

(?i:\bJohn\b) is a modified non-capturing group. As seen, the match has not been registered as $1. The purpose of this group is not to enable future reference, but to specify with the i flag that this group alone should be matched case-insensitively, whatever applies to the rest of the pattern.

(?<name>(?i:\bJohn\b)) is a named capturing group that nests a modified non-capturing group. Modifiers cannot be applied directly to a capturing group, so this nesting is necessary to modify one. <name> is the group name, used to refer back to the group, so in addition to $1, we can also refer to it by ${name}. In JavaScript replacement strings, the named form is written $<name> instead.

Note: Not every text editor's Find and Replace supports named groups.
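For comparison, here is a sketch of the same replacements done with String.prototype.replace in JavaScript. It assumes the i flag is applied to the whole pattern rather than through a (?i:...) group, since support for that group syntax depends on how recent the JavaScript engine is. The last example uses the \1 backreference to collapse a doubled word.

```javascript
const sentence = "Hello, I am John! john! Johnathan!";

// Unnamed group, referred to as $1 in the replacement string.
const unnamed = sentence.replace(/(\bJohn\b)/g, "Crazy$1!");
// "Hello, I am CrazyJohn!! john! Johnathan!"

// Named group, referred to as $<name>, with case ignored for the whole pattern.
const named = sentence.replace(/(?<name>\bJohn\b)/gi, "Crazy$<name>!");
// "Hello, I am CrazyJohn!! Crazyjohn!! Johnathan!"

// \1 inside the pattern repeats whatever the first group captured.
const deduped = "It is is fine".replace(/\b(\w+) \1\b/g, "$1");
// "It is fine"

console.log(unnamed);
console.log(named);
console.log(deduped);
```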

Use in JavaScript


JavaScript subscribes to the ECMA standard.

A constant Regex:

// The space matters: \bA needs a word boundary right before the 'A'.
const myRe = /\bA.+\b/g;
const myArray = myRe.exec("My Apple!");
console.log(`The value of lastIndex is ${myRe.lastIndex}`);

// "The value of lastIndex is 8"

Constant Regexes are written as /Regex/flags. In this example the string pattern is \bA.+\b, and g is the global flag.

Think: do you understand why the last index is 8? (Hint: what is \bA.+\b matching for?)

A dynamic Regex is built with the RegExp constructor function. Use it when the Regex will change during runtime, based on internal conditions or on another source, such as user input:

const re = new RegExp("ab+c");
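When the pattern comes from user input, special characters in that input are interpreted as Regex syntax unless they are escaped first. The sketch below uses a hypothetical helper, escapeRegExp, written for this example (it is not a built-in), and a made-up search term. Note also that a backslash inside a string has to be doubled, so \d becomes "\\d".

```javascript
// Hypothetical helper: put a backslash before every Regex special character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1+1"; // made-up search term
const haystack = "1+1=2, 11";

// Unescaped, "1+1" means "one or more 1s, then a 1".
const unescaped = haystack.match(new RegExp(userInput, "g"));             // ["11"]
// Escaped, it matches the literal text "1+1".
const escaped = haystack.match(new RegExp(escapeRegExp(userInput), "g")); // ["1+1"]

// Backslashes must be doubled inside a string.
const digits = "room 42".match(new RegExp("\\d+"))[0];                    // "42"

console.log(unescaped, escaped, digits);
```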

Find out more:

Regular expressions - JavaScript | MDN
Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String. This chapter describes JavaScript regular expressions. It provides a brief overview of each syntax element. For a detailed explanation of each one’s semantics, read the regular expressions reference.

Remarks

If you followed this tutorial, you should now be familiar with regular expression syntax and basic methods. To learn more, consider reading the MDN Web Docs. GitHub Copilot can also answer specific questions you may have about regular expressions!

Copyright Statement


Much of our content is freely available under the Creative Commons BY-NC-ND 4.0 licence, which allows free distribution and republishing of our content for non-commercial purposes, as long as Ronzz.org is appropriately credited and the content is not being modified materially to express a different meaning than it is originally intended for. It must be noted that some images on Ronzz.org are the intellectual property of third parties. Our permission to use those images may not cover your reproduction. This does not affect your statutory rights.
