How to Read and Use Regular Expressions

Regular expressions, often shortened to RegEx, are a method of pattern matching for strings of text. They are used most often in text searches, find-and-replace operations, and input validation. A regular expression combines ordinary characters with special characters, each with its own meaning, into a pattern that can be matched against other strings of text. Let’s look at some of the most common characters used in regular expressions:

The asterisk ( * ):

The asterisk is known as a repeater symbol, meaning the preceding character can be found 0 or more times. For example, the regular expression ca*t will match the strings ct, cat, caat, caaat, etc.
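As a minimal sketch, the asterisk can be tried out with Python's standard re module. The list of candidate strings is made up for illustration, and fullmatch is used so that the entire string has to fit the pattern.

```python
import re

# ca*t: a "c", then zero or more "a" characters, then a "t".
star_pattern = re.compile(r"ca*t")

# Sample strings chosen for illustration only.
candidates = ["ct", "cat", "caat", "caaat", "cot"]

# fullmatch succeeds only when the whole string fits the pattern.
star_matches = [s for s in candidates if star_pattern.fullmatch(s)]
print(star_matches)  # ['ct', 'cat', 'caat', 'caaat']
```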

The plus symbol ( + ):

The plus symbol is also a repeater symbol, but it means that the preceding character must be found 1 or more times. So, ca+t would match all of the same strings as ca*t, except ct. The string ct is not a match because the a character appears 0 times.
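The difference from the asterisk is easy to check in code. This is a small sketch using Python's re module, with sample strings picked only for illustration.

```python
import re

# ca+t: a "c", then one or more "a" characters, then a "t".
plus_pattern = re.compile(r"ca+t")

candidates = ["ct", "cat", "caat", "caaat"]

# "ct" drops out because it contains no "a" at all.
plus_matches = [s for s in candidates if plus_pattern.fullmatch(s)]
print(plus_matches)  # ['cat', 'caat', 'caaat']
```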

The wildcard ( . ):

The wildcard represents any single character (in most engines, any character except a line break), and is often used in conjunction with a repeater to cover a large range of patterns. One particular case for using the wildcard with a repeater would be to target all emails under a certain domain. For example, the regular expression .*@hallme\.com would match any email with the hallme.com domain, since the wildcard ( . ) means that any character is valid and the asterisk ( * ) means that it can appear any number of times. Combined, the wildcard and the asterisk produce quite a powerful matching tool, essentially saying that any string of any length, including an empty one, is valid. Note that the period in the domain is written as \. so that it matches a literal period rather than acting as a wildcard itself; escaping is covered further down.
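One detail worth seeing in code: a period inside the domain is itself a wildcard unless it is escaped with a backslash. The sketch below compares the two spellings with Python's re module; the addresses are invented for the example.

```python
import re

# Unescaped: the "." before "com" will accept ANY character there.
loose = re.compile(r".*@hallme.com")
# Escaped: "\." only accepts a literal period.
strict = re.compile(r".*@hallme\.com")

# Hypothetical addresses for illustration.
addresses = [
    "jane@hallme.com",
    "info@hallme.com",
    "jane@hallmeXcom",   # not a real domain, but the loose pattern accepts it
    "jane@example.com",
]

loose_hits = [a for a in addresses if loose.fullmatch(a)]
strict_hits = [a for a in addresses if strict.fullmatch(a)]

print(loose_hits)   # ['jane@hallme.com', 'info@hallme.com', 'jane@hallmeXcom']
print(strict_hits)  # ['jane@hallme.com', 'info@hallme.com']
```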

The curly braces ( { } ):

The curly braces are the last of the repeater symbols, and are used to specify exactly how many times the preceding character can appear. A single number sets an exact count, while two numbers separated by a comma set a minimum and a maximum number of repetitions. For example, the regular expression ca{1}t would only match the string cat, while the regular expression ca{1,4}t would match the strings cat, caat, caaat, and caaaat.
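Both forms of the curly braces can be checked side by side. A short sketch with Python's re module, using an illustrative list of strings:

```python
import re

exactly_one = re.compile(r"ca{1}t")    # exactly one "a"
one_to_four = re.compile(r"ca{1,4}t")  # between one and four "a" characters

candidates = ["ct", "cat", "caat", "caaat", "caaaat", "caaaaat"]

exact_hits = [s for s in candidates if exactly_one.fullmatch(s)]
range_hits = [s for s in candidates if one_to_four.fullmatch(s)]

print(exact_hits)  # ['cat']
print(range_hits)  # ['cat', 'caat', 'caaat', 'caaaat']
```

The string with five a characters falls outside the 1 to 4 range, so it is rejected along with ct.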

The optional character ( ? ):

The optional character means that the preceding character is exactly that, optional. This is particularly useful with file extensions or URL protocols. For example, the regular expression docx? means that both doc and docx would be a match, and the regular expression https? means that both http and https would be a match.
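The two examples above can be verified with a few lines of Python. This is only a sketch; the extra strings are there to show what the patterns reject.

```python
import re

doc_pattern = re.compile(r"docx?")     # "doc" with an optional trailing "x"
proto_pattern = re.compile(r"https?")  # "http" with an optional trailing "s"

doc_hits = [s for s in ["doc", "docx", "docxx", "do"] if doc_pattern.fullmatch(s)]
proto_hits = [s for s in ["http", "https", "httpss", "ftp"] if proto_pattern.fullmatch(s)]

print(doc_hits)    # ['doc', 'docx']
print(proto_hits)  # ['http', 'https']
```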

The caret ( ^ ) and the dollar sign ( $ ):

The caret and the dollar sign are both used to indicate position in a string. The caret symbol is used to signify that the match must start at the beginning of the string, while the dollar sign means that the match must occur at the end of the string. For example, if you wished to find all phone numbers with the area code 207, you could use the regular expression ^207. If you instead wished to find all phone numbers ending with 1234, you could use the regular expression 1234$.
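Here is a small sketch of both anchors using Python's re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. The phone numbers are invented for the example.

```python
import re

# Made-up phone numbers for illustration.
numbers = ["207-555-1234", "603-555-0207", "207-555-0199", "603-555-1234"]

# ^207 only matches when "207" sits at the very start of the string.
starts_207 = [n for n in numbers if re.search(r"^207", n)]
# 1234$ only matches when "1234" sits at the very end of the string.
ends_1234 = [n for n in numbers if re.search(r"1234$", n)]
# Without an anchor, "207" is found anywhere, including at the end.
contains_207 = [n for n in numbers if re.search(r"207", n)]

print(starts_207)    # ['207-555-1234', '207-555-0199']
print(ends_1234)     # ['207-555-1234', '603-555-1234']
print(contains_207)  # ['207-555-1234', '603-555-0207', '207-555-0199']
```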

Character classes:

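There are many character classes that can be used for matching common kinds of characters. Each is written with a backslash followed by a letter, and the uppercase version negates the lowercase one. Some of the most common ones are:
\s : any whitespace character (space, tab, line break, etc.)
\S : any non-whitespace character
\w : any word character (letters, digits, and the underscore)
\W : any non-word character
\d : any digit character
\D : any non-digit character
\b : a word boundary, the position between a word character and a non-word character (such as a space, dash, or comma) or the start or end of the string; it matches a position, not a character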
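The classes are written with a backslash (\d, \w, \s, \b) and are easiest to understand by running them over a sample string. The sketch below uses Python's re module, and both sample sentences are made up for illustration.

```python
import re

text = "Order 66 shipped to user_42 on 2024-01-15"

# \d+ : runs of digits
digits = re.findall(r"\d+", text)
# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
# \s+ : split wherever there is a run of whitespace
pieces = re.split(r"\s+", text)

print(digits)  # ['66', '42', '2024', '01', '15']
print(words)   # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2024', '01', '15']
print(pieces)  # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2024-01-15']

# \b marks a position, so \bcat\b finds "cat" only as a standalone word.
whole_cats = re.findall(r"\bcat\b", "cat concat cat-like category")
print(len(whole_cats))  # 2 (the first "cat" and the one in "cat-like")
```

Note that the dash is not a word character, so \w+ splits the date into three pieces while \s+ leaves it intact.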

The square brackets ( [ ] ):

The square brackets are used to match one character out of a set of characters, and are quite versatile. You can simply include the characters you wish to match, such as [aeo]. So with a regular expression such as b[aeo]lt, the pattern would match the words balt, belt, and bolt. Square brackets can also be used to negate a set by placing a caret as the first character inside them (inside brackets the caret no longer means the start of the string). A slightly different expression such as b[^aeo]lt would match b, then any single character other than a, e, or o, then lt: bilt and bult would match, but balt, belt, and bolt would not. The square brackets can also be used for ranges of characters. The expression [a-zA-Z] would match any alphabetic character regardless of case.
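The three uses of square brackets (a set, a negated set, and a range) can be sketched in Python as follows. The candidate words are arbitrary test strings, not real vocabulary.

```python
import re

candidates = ["balt", "belt", "bolt", "bilt", "bult", "b9lt", "blt"]

# [aeo] : exactly one of a, e, or o in that position.
in_set = [s for s in candidates if re.fullmatch(r"b[aeo]lt", s)]
# [^aeo] : exactly one character that is NOT a, e, or o.
not_in_set = [s for s in candidates if re.fullmatch(r"b[^aeo]lt", s)]
# [a-zA-Z] : any single letter, upper or lower case.
letters = re.findall(r"[a-zA-Z]", "R2-D2 & c3po")

print(in_set)      # ['balt', 'belt', 'bolt']
print(not_in_set)  # ['bilt', 'bult', 'b9lt']
print(letters)     # ['R', 'D', 'c', 'p', 'o']
```

The string blt matches neither pattern, because a bracket expression always stands for exactly one character, even when negated.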

The escape symbol ( \ ):

An obvious side effect of giving characters special meanings is that you can no longer use those characters literally. The escape symbol solves this problem. If you want to match a character that also has a special meaning, simply place the escape symbol before it. For example, the expression 2\+2 would match the string 2+2.
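To see what the backslash changes, compare the escaped and unescaped versions of the same pattern. This sketch uses Python's re module; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

escaped = re.compile(r"2\+2")   # a literal "2+2"
unescaped = re.compile(r"2+2")  # one or more "2" characters, then another "2"

escaped_ok = escaped.fullmatch("2+2") is not None        # True
unescaped_plus = unescaped.fullmatch("2+2") is not None  # False: "+" is a repeater here
unescaped_twos = unescaped.fullmatch("222") is not None  # True

# re.escape builds the escaped form from a plain string.
auto_escaped = re.escape("2+2")

print(escaped_ok, unescaped_plus, unescaped_twos)  # True False True
print(auto_escaped)                                # 2\+2
```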

The parentheses ( ( ) ):

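The parentheses are used to group symbols together so that they act as a single unit. A repeater placed after a group applies to the whole group rather than to a single character. For example, (ha)+ would match ha, haha, and hahaha, while ha+ would match ha, haa, and haaa.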
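A quick way to see grouping at work is to put a repeater after a group and compare it with the same pattern without parentheses. The patterns (ha)+ and ha+ below are illustrative choices, checked with Python's re module.

```python
import re

grouped = re.compile(r"(ha)+")  # the + repeats the whole unit "ha"
ungrouped = re.compile(r"ha+")  # the + repeats only the "a"

candidates = ["ha", "haha", "hahaha", "haaa"]

grouped_hits = [s for s in candidates if grouped.fullmatch(s)]
ungrouped_hits = [s for s in candidates if ungrouped.fullmatch(s)]

print(grouped_hits)    # ['ha', 'haha', 'hahaha']
print(ungrouped_hits)  # ['ha', 'haaa']
```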

The vertical bar ( | ):

The vertical bar acts as an OR symbol, and is used to designate multiple options for a match. For example, the regular expression th(e|ere|eir|is|at) would match all of the following: the, there, their, this, and that.
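The alternation above can be checked in Python. The sketch also shows one subtlety: when searching inside a longer string, the alternatives are tried from left to right, so the first one that works wins.

```python
import re

th_pattern = re.compile(r"th(e|ere|eir|is|at)")

words = ["the", "there", "their", "this", "that", "those", "thee"]

# fullmatch requires the whole word to fit one of the alternatives.
th_hits = [w for w in words if th_pattern.fullmatch(w)]
print(th_hits)  # ['the', 'there', 'their', 'this', 'that']

# search stops at the first alternative that succeeds, so only "the"
# is reported even though the word is "there".
first_found = th_pattern.search("there").group()
print(first_found)  # the
```

Listing longer alternatives first, or anchoring the pattern, is a common way to avoid the short alternative winning when searching.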

It is not uncommon to read through a quick article on regular expressions and feel like you have a solid understanding of them. Then you run into a genuinely complex regular expression and become completely overwhelmed. Luckily, with regular expressions we can take the time to work through them character by character to figure out exactly what they mean. Let’s work through one now.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Overwhelmed yet? Don’t be! Keeping a RegEx cheat sheet handy can help as we work through it.

We see the ^ and $ on the ends, so we can conclude that the pattern must match the full string. Next, we have a group that ends just before the @ symbol. That group contains a long expression inside square brackets, which we know are used to match a specific set of characters. So, with [a-zA-Z0-9_\-\.] we can determine that our set includes all lowercase letters (a-z), all uppercase letters (A-Z), all digits (0-9), underscores (_), dashes (\-), and periods (\.). The + after the brackets tells us that characters from that set must appear one or more times. So, any combination of letters, digits, underscores, dashes, and periods will match this first group.

Following the group we have the @ symbol, which has no special meaning and is simply a literal character. After that, we have a repeat of the same group from before, followed by a literal period (\.). Then, we have another group, ([a-zA-Z]{2,5}). This group specifies the set of uppercase and lowercase letters ( [a-zA-Z] ), along with a repetition count meaning that there must be between 2 and 5 of those characters ( {2,5} ).

By now you’ve probably figured out what we’re looking at here: an email address! This regular expression matches most everyday email addresses across a variety of domain types. It is not exhaustive, though: it rejects addresses that contain characters outside the set, such as a plus sign, and top-level domains longer than five letters.
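To make the walkthrough concrete, here is a sketch that runs the same pattern in Python and prints what each of the three groups captured. The addresses are invented for the example, and the rejected ones show where this particular pattern draws its limits.

```python
import re

# The pattern from the walkthrough, unchanged.
EMAIL = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

# Hypothetical address for illustration.
match = EMAIL.match("jane.doe@mail.example.com")
# groups() returns the text captured by each pair of parentheses.
parts = match.groups()
print(parts)  # ('jane.doe', 'mail.example', 'com')

# Strings this pattern does not accept.
rejected = [
    "jane@example",             # no period followed by a 2-5 letter ending
    "jane+news@example.com",    # "+" is not in the allowed character set
    "jane@example.technology",  # ending is longer than 5 letters
]
rejected_results = [EMAIL.match(s) is None for s in rejected]
print(rejected_results)  # [True, True, True]
```

The second group captures mail.example rather than the full domain because the final \. and ([a-zA-Z]{2,5}) claim the last period and the letters after it.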

There is so much more that can be covered regarding regular expressions. The nature of the language allows for a potentially infinite number of combinations for what is essentially the same pattern. We could have written the above expression for matching an email address in several different ways and still have achieved the same result. Regular expressions simply require practice and patience before their true usefulness is revealed. Try working through an expression every time you see one instead of just copying and pasting it into your code. Before you know it, you’ll be writing complex regular expressions yourself. You’ll have yet another tool in your coding arsenal and you’ll be a better developer for it.