Regular Expressions

Regular expressions come in handy in many situations, whether you are validating an email address or matching URLs. This post walks through the basics of regex with the help of a cheat sheet.

Setup

  1. Open your favourite code editor and open its search tool.
  2. Turn on the .* (regex) and Aa (match case) options, like this:

search tool bar.jpg

Literal Characters

So first let's search for literal characters.

literal char.jpg

Only the lowercase abc gets matched here, not the uppercase ABC, because match case is on. In the image below, you can see that searching for ABC matches only the uppercase text.

caps ABC.jpg

The order of the characters also matters: if we type bac, it does not match abc.

bac.jpg
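The same behaviour can be reproduced outside the editor. Here is a minimal Python sketch using the standard re module; the sample text is made up for illustration and is not the text from the screenshots.

```python
import re

# Hypothetical sample text (not taken from the screenshots).
text = "abc ABC bac Abc"

# Literal patterns are case-sensitive by default, like a search with "match case" on.
lower_matches = re.findall(r"abc", text)
upper_matches = re.findall(r"ABC", text)

# Order matters: "bac" only matches b, a, c in exactly that order.
bac_matches = re.findall(r"bac", text)

print(lower_matches)  # ['abc']
print(upper_matches)  # ['ABC']
print(bac_matches)    # ['bac']
```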

Meta Characters

meta characters.jpg

The highlighted characters are known as metacharacters. They have a special meaning in a regex, so they need to be escaped when we want to match them literally.

Literal . (dot) and \ (backslash)

So what happens when we simply type the special character itself, without escaping it? Let's see:

all characters.jpg

(The search toolbar is in the top left corner, with . in its search box.) It matches every character in the text area because the dot . is a special character in regular expressions. To search for a literal dot ., we need to escape it with a backslash: \.

literal dot.jpg
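To see the difference between the two searches in code, here is a small Python sketch. The sample string is an assumption chosen to keep the output short.

```python
import re

# Hypothetical sample string.
text = "a.b c"

# An unescaped dot matches every character (except a newline).
any_char = re.findall(r".", text)

# An escaped dot matches only a literal dot.
literal_dot = re.findall(r"\.", text)

print(any_char)     # ['a', '.', 'b', ' ', 'c']
print(literal_dot)  # ['.']
```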

So we have learnt that to match special characters, we need to escape them with a backslash \. Special characters include: .[{()\^$|?*+

Now let us write a very simple regex to match the literal URL neog.camp. As explained above, the . needs to be escaped.

neog.camp.jpg
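Why bother escaping if the unescaped pattern also finds the URL? The Python sketch below shows what goes wrong. The sample sentence is invented, and re.escape is used only as a convenience for escaping a literal string automatically.

```python
import re

# Hypothetical sample text containing the URL and a look-alike.
text = "Visit neog.camp today, not neogXcamp."

# Without escaping, the dot accepts any character, so the look-alike matches too.
unescaped = re.findall(r"neog.camp", text)

# With the dot escaped, only the literal URL matches.
escaped = re.findall(r"neog\.camp", text)

# re.escape builds the escaped pattern from a literal string for us.
auto_escaped = re.findall(re.escape("neog.camp"), text)

print(unescaped)     # ['neog.camp', 'neogXcamp']
print(escaped)       # ['neog.camp']
print(auto_escaped)  # ['neog.camp']
```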

What we have done here is simply type the URL itself (with the special character escaped). In real-world scenarios, though, we usually write regular expressions to search for patterns. For that, we need a more generic regular expression, which means using some metacharacters.

Let us take a look at the table below:

1-1.jpg

  1. . matches any character except a newline, as we saw here:

all characters.jpg

  2. \d matches any digit from 0 to 9.

digits.jpg

  3. \D matches everything that is not a digit (0-9).

not a digit.jpg

A backslash followed by an uppercase letter matches the opposite of what the lowercase version matches.

  4. \w matches word characters: letters, digits and the underscore (a-z, A-Z, 0-9, _).

word character.jpg

  5. \W matches non-word characters such as spaces, punctuation marks and metacharacters.

Not word character.jpg

  6. \s matches whitespace, i.e. space, tab and newline.

white space all.jpg

  7. \S matches everything that is not whitespace.

not whitespace.jpg
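The shorthand classes from the list above can be tried side by side. This is a rough Python sketch on a made-up ASCII string. Note that Python's \d, \w and \s also accept some non-ASCII characters by default, which makes no difference for this sample.

```python
import re

# Hypothetical sample: letters, underscore, digits, a space, a dash and a newline.
text = "a_1 B-2\n"

digits = re.findall(r"\d", text)          # digits 0-9
non_digits = re.findall(r"\D", text)      # everything that is not a digit
word_chars = re.findall(r"\w", text)      # letters, digits, underscore
non_word = re.findall(r"\W", text)        # everything that is not a word character
whitespace = re.findall(r"\s", text)      # space, tab, newline
non_whitespace = re.findall(r"\S", text)  # everything that is not whitespace

print(digits)          # ['1', '2']
print(non_digits)      # ['a', '_', ' ', 'B', '-', '\n']
print(word_chars)      # ['a', '_', '1', 'B', '2']
print(non_word)        # [' ', '-', '\n']
print(whitespace)      # [' ', '\n']
print(non_whitespace)  # ['a', '_', '1', 'B', '-', '2']
```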

Anchors

anchors.jpg

Anchors don't match any characters, but rather they match invisible positions before and after characters.

  1. \b word boundary

word boundary.jpg

Here we searched for \bto, that is, 'to' preceded by a word boundary. The 'to' in tokyo matches because the start of the line is a word boundary. The second 'to' matches because the space before it creates a word boundary. The last 'to', in kyoto, does not match because there is no word boundary before it.

to no word boundary.jpg

In the image above, we have removed the word boundary and simply searched for 'to', so all 3 occurrences of 'to' match.

  2. \B not a word boundary

no word bndry.jpg

\B matches when there is no word boundary. So, the 'to' in kyoto gets matched as it has no word boundary before it.

Let's try wrapping 'to' in word boundaries (\bto\b). Try to guess the output before reading ahead!

๐Ÿฅ๐Ÿฅ๐Ÿฅ

b to b.jpg

Only the middle 'to' gets matched because it has a word boundary both before and after it.
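All four word-boundary searches can be compared in one go. The Python sketch below assumes the sample text is "tokyo to kyoto", which is what the description suggests. The spans helper is a hypothetical convenience that reports where each match starts and ends.

```python
import re

# Assumed sample text, based on the description of the screenshots.
text = "tokyo to kyoto"

def spans(pattern: str, s: str) -> list[tuple[int, int]]:
    """Return the (start, end) position of every match of pattern in s."""
    return [m.span() for m in re.finditer(pattern, s)]

plain = spans(r"to", text)           # no boundary requirement
boundary_before = spans(r"\bto", text)   # boundary before 'to'
no_boundary_before = spans(r"\Bto", text)  # no boundary before 'to'
boundary_both = spans(r"\bto\b", text)   # boundary on both sides

print(plain)               # [(0, 2), (6, 8), (12, 14)]
print(boundary_before)     # [(0, 2), (6, 8)]
print(no_boundary_before)  # [(12, 14)]
print(boundary_both)       # [(6, 8)]
```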

  3. ^ The caret symbol ^ matches the position at the beginning of a string (in an editor's search, that usually means the beginning of each line). Like:

string char.jpg

  4. $ The dollar symbol $ matches the position at the end of the string.

end at the string.jpg
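Here is a short Python sketch of both anchors. The sample strings are assumptions. The re.MULTILINE flag is included to show the per-line behaviour that editor search tools typically have.

```python
import re

# Assumed sample text.
text = "tokyo to kyoto"

# ^ anchors the match to the beginning, $ to the end.
start_span = re.search(r"^to", text).span()
end_span = re.search(r"to$", text).span()

# By default ^ matches only at the start of the whole string;
# with re.MULTILINE it also matches at the start of each line.
two_lines = "to\nto"
default_mode = re.findall(r"^to", two_lines)
multiline_mode = re.findall(r"^to", two_lines, flags=re.MULTILINE)

print(start_span)      # (0, 2)
print(end_span)        # (12, 14)
print(default_mode)    # ['to']
print(multiline_mode)  # ['to', 'to']
```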

Now let us see some practical examples. We will start by writing some regular expressions for matching some Indian as well as international phone numbers.

For phone numbers, we can't type in a literal search as we did before because all the numbers are different. They have a similar pattern but they all have different digits. So in this case we need to use metacharacters instead of literal characters.

Example-

International Phone Numbers:

 123-234-9809 & 321.546.9930

We have a pattern here of 3 digits, and then a - (dash) or a . (period) followed by 3 more digits and then a - (dash) or a . (period) and then 4 digits at the end.

In the cheat sheet we can see that we can match a digit using \d.

digit1.jpg

\d matches all of the digits, as we can see in the image above.

Now let's start by typing \d thrice which will match any 3 digits in a row.

3 digits.jpg

After matching the first 3 digits, we now need to match the - (dash) or . (dot) in the pattern. For now, let's match any character in the position following the first 3 digits. Can you guess what we should type to match any character 🤔? Yep, as we have already learnt, it is a . (period/dot).

ph no. dash dot.jpg

Now let us add the next 3 digits, which should be simple.

ph no. 2nd batch.jpg

We have matched the first 6 digits and the 1st separator (./-) in the phone number series, now we need to match the remaining 4 digits and the 2nd separator(./-).

complete phone number.jpg

As we can see, the regular expression matches all 4 of our phone numbers.

Let's take a look at a more realistic example:

ph no. db.jpg

Here we can see how the regex we wrote comes in handy when searching for phone numbers in a database of information, where a literal search would not work.

Now let's be more specific about the separator. Currently, our regex matches any separator, even * or any other character. Numbers with such separators are not valid, so we need to rewrite our regex, as you can see in the image below:

ph no. hash.jpg

To match only a - (dash) or a . (dot), we need to use a [ ] character set and put the allowed characters inside it. In our case, those are - and .

ph no. only dash dot.jpg

Now that we have replaced . with [-.], the regular expression matches only the first two phone numbers.

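Inside a character set, you don't need to escape the . character. The - is also taken literally here because it is the first character in the set; between two characters, as in [a-z], it would define a range.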
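The two versions of the phone number regex can be compared in Python as follows. This is a sketch: the first two numbers come from the example above, while the last two, with * and # separators, are assumed stand-ins for the invalid numbers in the screenshots.

```python
import re

numbers = [
    "123-234-9809",  # from the example above
    "321.546.9930",  # from the example above
    "123*234*9809",  # assumed invalid separator
    "123#234#9809",  # assumed invalid separator
]

# Version 1: the dot accepts any character as a separator.
any_separator = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")

# Version 2: the character set accepts only a dash or a dot.
dash_or_dot = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")

loose_matches = [n for n in numbers if any_separator.fullmatch(n)]
strict_matches = [n for n in numbers if dash_or_dot.fullmatch(n)]

print(loose_matches)   # all four numbers
print(strict_matches)  # ['123-234-9809', '321.546.9930']
```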

Let's say we want to match phone numbers starting with 800 or 900, then our regular expression will change as follows:

ph no. 8 or 9.jpg
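One way to express "starts with 800 or 900" is a character set for the first digit followed by two literal zeros. The pattern below is an assumption about what the screenshot shows, and the sample numbers are invented.

```python
import re

# Hypothetical sample numbers.
numbers = ["800-555-1234", "900.555.1234", "700-555-1234", "123-234-9809"]

# [89] accepts 8 or 9 as the first digit; 00 are literal zeros.
# Assumed pattern, not copied from the screenshot.
toll_pattern = re.compile(r"[89]00[-.]\d\d\d[-.]\d\d\d\d")

toll_matches = [n for n in numbers if toll_pattern.fullmatch(n)]

print(toll_matches)  # ['800-555-1234', '900.555.1234']
```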

By now we know that \d matches with every digit from 0-9. What if we need to match digits only in a particular range, say 2 to 8? For that, we can use a [ ] (character set). Instead of trying to match [2345678], we can write [2-8] and it will match all digits between 2 and 8 (2 and 8 included).

range of numbers.png

As with digits, the same can be done for letters. Say you want to match only lowercase letters from 'a' to 'z'; you can write [a-z]

range of alphabets.png

In the above example, the uppercase letters don't get matched. To match both lowercase and uppercase, we can simply add the uppercase range, i.e. [a-zA-Z]

lower and uppercase.png

To also match digits, we can write [a-zA-Z0-9]
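The ranges above can be checked with a quick Python sketch. The sample strings are made up, and the matches are joined into one string to keep the output compact.

```python
import re

# Hypothetical sample strings.
all_digits = "0123456789"
text = "Hello World 42"

# [2-8] matches the digits 2 through 8, both ends included.
digits_2_to_8 = "".join(re.findall(r"[2-8]", all_digits))

# [a-z] matches lowercase letters only.
lowercase = "".join(re.findall(r"[a-z]", text))

# Adding A-Z covers uppercase too; adding 0-9 covers digits as well.
letters = "".join(re.findall(r"[a-zA-Z]", text))
alphanumeric = "".join(re.findall(r"[a-zA-Z0-9]", text))

print(digits_2_to_8)  # 2345678
print(lowercase)      # elloorld
print(letters)        # HelloWorld
print(alphanumeric)   # HelloWorld42
```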

^ inside [ ]

We know that outside a character set [ ], ^ matches the beginning of a string. But as the first character inside a character set, it negates the set, which then matches every character that is not in it.

Let us look at an example. Say we want to match every character except lower case a-z, then we can write it inside the [ ] like [^a-z].

image.png

We can see that here it matches everything that is not a lowercase letter.

As another example, say we want to match the words that end with 'at' except the word bat. We can do this by putting b inside a character set [ ], preceded by ^, and following the set with at: [^b]at.

image.png
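Both negated sets can be tried in Python. The sample strings below are assumptions. Keep in mind that [^b] accepts any character other than b, so on other inputs it could also pick up a space or an uppercase B in front of 'at'.

```python
import re

# Hypothetical sample strings.
mixed = "aB3 c!"
words = "cat bat mat rat"

# [^a-z] matches every character that is not a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", mixed)

# [^b]at matches any character except b, followed by 'at'.
not_bat = re.findall(r"[^b]at", words)

print(not_lowercase)  # ['B', '3', ' ', '!']
print(not_bat)        # ['cat', 'mat', 'rat']
```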

Quantifiers

{ }

image.png

Let's take the previous example again. The one with the phone numbers. To match the phone numbers we can write, \d\d\d.\d\d\d.\d\d\d\d (\d is for matching digit and . for matching any separator.)

image.png

Notice how we keep repeating \d. Instead, we can use quantifiers. So to match the digits, we write: \d{3}.\d{3}.\d{4}

image.png

Much cleaner, isn't it? The {3} matches exactly 3 of the preceding token (digits in our case). Here we know the exact count, but sometimes we don't, and then we may need one of the * + ? quantifiers.
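A quick Python sketch can confirm that the quantified pattern behaves like the repeated one, and that {3} really means exactly three. The short test strings are made up for illustration.

```python
import re

numbers = ["123-234-9809", "321.546.9930"]

repeated = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
quantified = re.compile(r"\d{3}.\d{3}.\d{4}")

# Both patterns accept the same phone numbers.
repeated_matches = [n for n in numbers if repeated.fullmatch(n)]
quantified_matches = [n for n in numbers if quantified.fullmatch(n)]

# {3} means exactly 3: two digits are too few, four are too many.
candidates = ["12", "123", "1234"]
exactly_three = [c for c in candidates if re.fullmatch(r"\d{3}", c)]

print(repeated_matches == quantified_matches)  # True
print(exactly_three)                           # ['123']
```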

* + ?

Mr. Sharma
Mr Smith
Ms Devi
Mrs. Brown
Mr. J

Let us try to write a regex for the above names. First, let's handle the names that start with Mr

So we can write Mr

Mr.jpg

Now, some names have a dot after the title, so we need to extend the regex to check for the dot.

2.jpg

The \. checks for a dot after Mr, and the ? allows either 1 dot or 0 dots, i.e. it matches the dot if there is one, but still matches if there is none.
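To see the optional dot in action, here is a small Python sketch that applies Mr\.? to the start of each name from the list above and records what was matched. Using re.match (which only looks at the beginning of the string) is a choice made for this illustration.

```python
import re

names = ["Mr. Sharma", "Mr Smith", "Ms Devi", "Mrs. Brown", "Mr. J"]

# \. is a literal dot; ? makes it optional (0 or 1 occurrence).
title = re.compile(r"Mr\.?")

matched = []
for name in names:
    m = title.match(name)  # match only at the start of the name
    matched.append(m.group() if m else None)

# "Ms Devi" does not start with Mr, and in "Mrs. Brown" only "Mr" is matched.
print(matched)  # ['Mr.', 'Mr', None, 'Mr', 'Mr.']
```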

Now, to match the space after Mr or Mr., we write \s

3.jpg

In the names that are provided, every name starts with a capital letter. So we need to write regex for that.

4.jpg

[A-Z] checks for any uppercase letter

5.jpg

\w checks for any word character (a-z, A-Z, 0-9, _). But we want to match the characters after the second letter as well, right? Can you guess how we can do that?


We can add the * quantifier, which matches 0 or more instances of the preceding token. The regex we have written so far is Mr\.?\s[A-Z]\w*, but it does not match Ms or Mrs., so let's incorporate those as well.

What we need is a group ( ) listing the alternatives. After M, the possible letters in our list of names are r, s and rs, so our group will be (r|s|rs). The | means 'or': either r, or s, or rs. The final regex will look like this:

7.jpg
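Putting the steps together, the final pattern is presumably M(r|s|rs)\.?\s[A-Z]\w*. That is an assumption based on the walkthrough, since the screenshot itself is not reproduced here. The Python sketch below compares it with the earlier Mr-only pattern on the names from the list.

```python
import re

names = ["Mr. Sharma", "Mr Smith", "Ms Devi", "Mrs. Brown", "Mr. J"]

# Earlier pattern: handles only the Mr title.
mr_only = re.compile(r"Mr\.?\s[A-Z]\w*")

# Assumed final pattern: the group (r|s|rs) allows Mr, Ms or Mrs.
any_title = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")

mr_only_matches = [n for n in names if mr_only.fullmatch(n)]
any_title_matches = [n for n in names if any_title.fullmatch(n)]

print(mr_only_matches)    # ['Mr. Sharma', 'Mr Smith', 'Mr. J']
print(any_title_matches)  # all five names
```

"Mr. J" matches because \w* allows zero characters after the capital letter. "Mrs. Brown" matches because the regex engine backtracks from the r alternative to rs when the shorter one leads to a dead end.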

Matching Emails

Example:

NeoGCamp@gmail.com
neog.camp@bootcamp.edu
neog-roc8@my-work.net

These are fake email addresses. But let's try to write a regular expression that will match all of these emails.

Coming soon...
