Regular Expressions are Better

Writings and Ramblings


Regular Expressions are Better

01 Nov 2022 - Jason Schlesinger

Regular Expressions are Better Than Programming and Here’s Why

The general premise is that Regular Expressions allow users to do incredibly powerful data manipulation without writing a single line of code. Teaching kids anyone to code is less valuable than teaching them regular expressions, or regex. Many modern editors support regular expressions by default, but the benefits aren’t felt by most people.

What are regular expressions?

Regular Expressions are patterns you can pass into most modern editors to support finding and replacing text with much more power and options than just looking for a bunch of characters and replacing them with another bunch of characters.

This isn’t a tutorial on regular expressions, but the gist is that rather than telling the computer exactly what you’re looking for, you’re telling the computer what the thing you’re looking for looks like. You may be familiar with using a wildcard character like asterisk in some search engines. Well, imagine that, but on steroids.

What do they look like?

I’ve been hesitant to get into the details because at first glance they can look really scary, but you’re just telling the computer the pattern you expect to see. The simplest form of this is just the text you are looking for /hello/ matches “hello” as you’d expect.

However there are special characters you can use to tell the computer you want to match something other than just a character. For example . matches any character exactly once. Then * says the preceding pattern could appear more than once, or not at all. Special characters are .*?+()[]^$\{}

If you want to use a special character, just pop a backslash out front. So /./ will match any character, but \. will match a period.

“Find” on Steroids

Knowing that we can match anything we can make a pattern for, let’s try for something a little more difficult. Say you’re looking for phone numbers in a document. You want something that looks like a phone number to match. So, what does a phone number look like?

Since you’re all following the E.123 standard, I’m sure all your phone numbers look like this:

  • +1 928 515 2267
  • (928) 515 2267

But just in case you’ve ever had someone write their number on your hand at the club, you may be familiar with:

  • 928-515-2267
  • 928.515.2267
  • 928 515-2267
  • 928 5152276
  • 1-928-515-2267

If you wanted to search all those, you’d definitely need a programmer, right?
WRONG
I’ve fooled you! You can find every one of those phone numbers using one regular expression! I’m going to go through how to find them in the next section, but if you want to try for yourself here is a site with references and a tool to test your answer to help. ___

Finding the Phone Number

Let’s break down this regular expression into chunks to make it easier to figure out what we need.

Finding the Digits

Regular expressions have “character classes” to help match groups of characters. The one we want is \d to find digits, but here’s a few

  • symbol matches
  • . anything
  • \d digit
  • \s whitespace
  • \w word character
  • \D non-digit
  • \S non-whitespace
  • \W non-word

At this point, the expression /\d\d\d.\d\d\d.\d\d\d\d/ would match most of the examples above. You can search for repeated patterns by specifying a number in {curly braces}. So, a shorter version of the pattern above is /\d{3}.\d{3}.\d{4}/

Optional Parts

Sometimes you know that something may or may not appear, in this case you can use the symbol ? to say it may or may not appear. Alternatively, if you want to skip a pattern of unknown length, you can use * to tell the computer to keep skipping that pattern until it matches anything after it. This means the expression /.*/ will match everything. Additionally, you can match a pattern one or more times using +

We can apply this to our example to get /.*\d{3}.+\d{3}.?\d{4}/ which will now match

Finding the Groups

It can be easy to collect your patterns in to groups. This both allows you to focus your pattern to the contents of the group, and also use the group contents more easily.

Putting parentheses around a regular expression marks it as a group. This makes our regular expression /.*(\d{3}).+(\d{3}).?(\d{4})/

“Replace” on Steroids

So, now you’ve found every phone number in your document, so what? Well, now we can manipulate what we’ve found using what we found. Say we want everything to use the E.123 standard for international numbers. We can use regular expressions to pass what was found into what it’s being replaced with^[The specifics of this vary between platforms, consult your documentation for more details.]

In the replace field we can just put a $ in front of the capture group to specify where to put what we found. The substitution +1 $1 $2 $3 will format all our phone numbers how we want, and we didn’t need to write a single line of code.

Then Test. Test. Test.

Say you take your regular expression and throw it at a document you’re editing, and it encounters the line:

Please contact us at +1 928 515 2276

What will your expression do? Is that the intended behavior?

You have two goals with regular expressions:

  1. Match what you want to match
  2. Don’t match what you don’t want to match Consider precisely what patterns you’re expecting, and write an expression which only matches those patterns.

Capture What you Want

Consider the point of the leading .* only serves to match the leading +1, we could create a group to capture that, and otherwise ignore it.

/(\+?1.)?\(?(\d{3})\)?.(\d{3}).?(\d{4})/

The group (\+?1.)? matches all cases of +1 and 1-, but the ‘?’ at the end says the pattern may not match anything.
In addition, surrounding the area code group with \(?...\)? tells the computer that there may or may not be parentheses around that group.

Finally, the substitution will change to +1 $2 $3 $4 as there is now an additional group.

What Next?

Congratulations! You’re on your way to becoming a regular expression master! I recommend