Regular expressions come in handy in many situations, whether you are validating an email address or matching URLs. This post comes with a cheat sheet to help you understand regex.
Setup
- Open your favourite code editor and open the search tool.
- Now turn on the .* (regex) and Aa (match case) options.
Literal Characters
So first let's search for literal characters.
Searching for abc matches only the lowercase abc and not the uppercase ones, because match case is on. Likewise, searching for ABC matches only the uppercase ABC.
The order in which we write these characters is also important: if we type bac, it doesn't match abc.
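The same literal search can be sketched outside the editor. This is a minimal Python example using the standard re module; the sample text is made up for illustration, and Python's default case-sensitive matching stands in for the editor's match case option.

```python
import re

text = "abc ABC abc"  # hypothetical sample text

# Literal search is case-sensitive by default, like the editor with match case on.
lower_matches = re.findall(r"abc", text)
upper_matches = re.findall(r"ABC", text)

# Character order matters: "bac" never appears in the text.
reordered_matches = re.findall(r"bac", text)

print(lower_matches, upper_matches, reordered_matches)
```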
Meta Characters
The highlighted characters are known as metacharacters. To match one of them literally, it needs to be escaped.
Literal . (dot) and \ (backslash)
So what happens when we simply type the special character itself, without escaping it? Let's see:
(The search toolbar is in the top left corner, with . in its search box.) It matches every character in the text area, because the dot . is a special character in regular expressions. To search for a literal dot, we need to escape it with a backslash, i.e. type \ followed by the dot.
So we have learnt that to match special characters, we need to escape them with a backslash (\). Special characters include: . [ { ( ) \ ^ $ | ? * +
Now let us write a very simple regex to match the literal URL neog.camp. As stated above, the . needs to be escaped, which gives neog\.camp
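To see the difference escaping makes, here is a small Python sketch. The sample sentence is invented for the example; re.escape is a standard-library helper that adds the backslashes for us.

```python
import re

text = "Visit neog.camp, not neogXcamp"  # hypothetical sample text

# Unescaped: "." matches any character, so "neogXcamp" is matched too.
unescaped = re.findall(r"neog.camp", text)

# Escaped: "\." matches only a literal dot.
escaped = re.findall(r"neog\.camp", text)

# re.escape builds the escaped pattern from a plain string.
auto_pattern = re.escape("neog.camp")
auto_escaped = re.findall(auto_pattern, text)

print(unescaped, escaped, auto_pattern)
```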
What we have done here is simply typed the URL itself (and escaped the special character), but in real-world scenarios, we could be writing regular expressions to search for patterns. For that case, we need to write a more generic regular expression which will include using some meta-characters.
Let us take a look at the table below:
- . matches any character except a newline, as we saw earlier.
- \d matches any digit (0-9).
- \D matches anything that is not a digit (0-9).
- \w matches any word character, including the underscore (a-z, A-Z, 0-9, _).
- \W matches non-word characters such as spaces, punctuation marks and metacharacters.
- \s matches whitespace (space, tab, newline).
- \S matches anything that is not whitespace.
When we search with a \ followed by an uppercase letter, it does the opposite of what the lowercase version does.
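The classes above can be tried out on a short string. This is a sketch in Python with a made-up sample; note that Python's \d and \w also accept non-ASCII digits and letters by default, which does not matter for this ASCII-only text.

```python
import re

text = "a_1 B-2"  # hypothetical sample: letters, underscore, digits, space, dash

digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", text)
word_chars = re.findall(r"\w", text)
non_word_chars = re.findall(r"\W", text)
whitespace = re.findall(r"\s", text)
non_whitespace = re.findall(r"\S", text)

print(digits, non_digits, word_chars, non_word_chars, whitespace, non_whitespace)
```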
Anchors
Anchors don't match any characters, but rather they match invisible positions before and after characters.
\b (word boundary)
We searched for to preceded by a word boundary, i.e. \bto. The 'to' in tokyo is matched because there is a word boundary at the start of the line. For the second 'to', the space before the word acts as a word boundary. The last 'to', in kyoto, is not matched because there is no word boundary before it.
If we remove the word boundary and simply search for 'to', all 3 to's match.
\B (no word boundary)
\B matches where there is no word boundary. So, the 'to' in kyoto gets matched, as it has no word boundary before it.
Let's try wrapping 'to' in word boundaries (\bto\b). Guess what the output will be before reading ahead!
🥁🥁🥁
Only the middle 'to' gets matched, because it has a word boundary both before and after it.
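The four searches can be compared side by side. This Python sketch assumes a sample line of "tokyo to kyoto" (an assumption based on the description, not the original text) and records where each match starts.

```python
import re

text = "tokyo to kyoto"  # assumed sample line

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

plain = starts(r"to")            # every occurrence
leading = starts(r"\bto")        # boundary before: tokyo, to
inner = starts(r"\Bto")          # no boundary before: kyoto
whole_word = starts(r"\bto\b")   # boundary on both sides: to

print(plain, leading, inner, whole_word)
```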
^
The caret symbol ^ matches the position at the beginning of a string.
$
The dollar symbol $ matches the position at the end of the string.
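A short Python sketch of both anchors follows; the sample strings are assumptions. One difference from many editors: in Python, ^ and $ anchor to the whole string unless the re.MULTILINE flag is passed, in which case they also match at the start and end of each line.

```python
import re

text = "tokyo to kyoto"  # assumed sample text

at_start = [m.start() for m in re.finditer(r"^to", text)]  # only the first "to"
at_end = [m.start() for m in re.finditer(r"to$", text)]    # only the last "to"

lines = "to be\nto do"  # assumed two-line sample
string_start_only = re.findall(r"^to", lines)
every_line_start = re.findall(r"^to", lines, flags=re.MULTILINE)

print(at_start, at_end, string_start_only, every_line_start)
```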
Now let us see some practical examples. We will start by writing some regular expressions for matching some Indian as well as international phone numbers.
For phone numbers, we can't type in a literal search as we did before because all the numbers are different. They have a similar pattern but they all have different digits. So in this case we need to use metacharacters instead of literal characters.
Example:
International phone numbers: 123-234-9809 and 321.546.9930
We have a pattern here: 3 digits, then a - (dash) or a . (period), followed by 3 more digits, then another - (dash) or . (period), and then 4 digits at the end.
In the cheat sheet we can see that we can match a digit using \d. On its own, \d matches every digit in the text.
Now let's start by typing \d thrice (\d\d\d), which will match any 3 digits in a row.
After matching the first 3 digits, we now need to match the - (dash) or . (period) in the pattern. For now, let's match any character in the position that follows the first 3 digits. Can you guess what we should type to match any character 🤔? Yep, as we have already learnt, it is a . (period/dot).
Now let us add the next 3 digits, and it should be simple.
We have matched the first 6 digits and the 1st separator (. or -) in the phone number series. Now we need to match the 2nd separator (. or -) and the remaining 4 digits, which gives us \d\d\d.\d\d\d.\d\d\d\d
As we can see the regular expression matches all 4 of our phone numbers.
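Here is the same check as a Python sketch. The first two numbers come from the example above; the last two, with * and # as separators, are hypothetical stand-ins for the other numbers in the list.

```python
import re

numbers = [
    "123-234-9809",
    "321.546.9930",
    "123*234*9809",  # hypothetical number with an unusual separator
    "321#546#9930",  # hypothetical number with an unusual separator
]

# Three digits, any character, three digits, any character, four digits.
pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")

matched = [n for n in numbers if pattern.fullmatch(n)]
print(matched)
```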
Let's take a look at a more realistic example:
Here we can see how the regex we wrote comes in handy while searching for phone numbers in a database of information instead of a literal search.
Now let's be a bit more specific about the separator. Currently, our regex matches any separator, even * or anything else. But numbers with such separators are not valid, so we need to rewrite our regex.
To match only a - (dash) or a . (dot), we need to use a character set [ ] and put the required characters inside it. In our case, those are - and .
Now that we have replaced . with [-.], the regular expression matches only the first two phone numbers.
Inside the character set, you don't need to escape the . character.
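As a sketch in Python, reusing the two numbers from the example plus two hypothetical ones with other separators: placing the dash first inside [-.] keeps it from being read as a range.

```python
import re

numbers = [
    "123-234-9809",
    "321.546.9930",
    "123*234*9809",  # hypothetical number with an invalid separator
    "321#546#9930",  # hypothetical number with an invalid separator
]

# [-.] accepts only a dash or a literal dot as the separator.
pattern = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")

matched = [n for n in numbers if pattern.fullmatch(n)]
print(matched)
```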
Let's say we want to match phone numbers starting with 800 or 900, then our regular expression will change as follows:
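One way to express this, as a sketch: a character set [89] for the first digit, followed by a literal 00, with the rest of the pattern unchanged. The sample numbers below are made up for illustration.

```python
import re

numbers = [
    "800-555-1234",  # hypothetical
    "900.555.1234",  # hypothetical
    "123-234-9809",
]

# [89] matches a single 8 or 9; "00" is matched literally.
pattern = re.compile(r"[89]00[-.]\d\d\d[-.]\d\d\d\d")

matched = [n for n in numbers if pattern.fullmatch(n)]
print(matched)
```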
By now we know that \d matches every digit from 0-9. What if we need to match digits only in a particular range, say 2 to 8? For that, we can use a character set [ ].
Instead of writing [2345678], we can write [2-8], and it will match all digits between 2 and 8 (2 and 8 included).
Like digits, the same can be done for letters. Say you want to match only the lowercase letters from 'a' to 'z': you can write [a-z].
With [a-z], uppercase letters don't get matched. To match both lowercase and uppercase, we can simply add the uppercase range, i.e. [a-zA-Z].
To match more characters, e.g. digits, we can write [a-zA-Z0-9].
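The ranges can be compared on one short string. This is a Python sketch with an invented sample.

```python
import re

text = "aZ3 b9-Q"  # hypothetical sample text

digits_2_to_8 = re.findall(r"[2-8]", text)       # 9 falls outside the range
lowercase = re.findall(r"[a-z]", text)
letters = re.findall(r"[a-zA-Z]", text)
alphanumeric = re.findall(r"[a-zA-Z0-9]", text)

print(digits_2_to_8, lowercase, letters, alphanumeric)
```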
^ inside [ ]
We know that outside of the character set [ ], ^ matches the beginning of a string. But as the first character within the character set, it negates the set and matches everything that is not in the set.
Let us look at an example. Say we want to match every character except lowercase a-z; then we can write it inside the [ ] like [^a-z].
We can see that here it matches everything that is not a lowercase letter.
Another example: say we want to match the words that end with 'at', except the word bat. We can do this by putting b inside the character set [ ], preceded by ^, which gives [^b]at
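Both negated sets can be sketched in Python as follows; the sample strings are assumptions. Keep in mind that [^b] still has to match some character, so [^b]at would also match "at" preceded by a space or by any other non-b character.

```python
import re

# Everything that is not a lowercase letter.
non_lowercase = re.findall(r"[^a-z]", "aB1 c")  # hypothetical sample

# Words ending in "at", except "bat".
words = "cat bat mat rat"  # hypothetical sample
not_bat = re.findall(r"[^b]at", words)

print(non_lowercase, not_bat)
```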
Quantifiers
{ }
Let's take the previous example with the phone numbers again. To match the phone numbers we can write \d\d\d.\d\d\d.\d\d\d\d (\d matches a digit and . matches any separator).
Notice how we are repeating \d. Instead, we can use quantifiers. So to match the digits, we write: \d{3}.\d{3}.\d{4}
Much cleaner, isn't it? The {3} matches exactly 3 of whatever precedes it (digits, in our case). Here we are matching exact counts, but sometimes we don't know the exact count, and we may need to use one of the * + ? quantifiers.
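A quick Python sketch to confirm that the two spellings behave the same on the example numbers, and that {3} really insists on exactly three digits:

```python
import re

numbers = ["123-234-9809", "321.546.9930"]

long_form = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
short_form = re.compile(r"\d{3}.\d{3}.\d{4}")

long_hits = [n for n in numbers if long_form.fullmatch(n)]
short_hits = [n for n in numbers if short_form.fullmatch(n)]

# Only two leading digits, so \d{3} cannot be satisfied.
too_short = short_form.fullmatch("12-234-9809")

print(long_hits, short_hits, too_short)
```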
* + ?
Mr. Sharma
Mr Smith
Ms Devi
Mrs. Brown
Mr. J
Let us try to write a regex for the above names. First, let's write a regex for the names starting with Mr, so we can write Mr
Now some names have a dot after the title, so the regex needs to check for the dot: Mr\.?
The \. checks if there is a dot after Mr, and the ? allows either 1 dot or 0 dots, i.e. it checks for a dot but still matches if there is no dot present.
Now to match the space after Mr or Mr., we write \s
In the names that are provided, every name starts with a capital letter, so we need to write regex for that.
[A-Z] checks for any uppercase letter.
\w checks for any word character (a-z, A-Z, 0-9, _). But we want to match the characters after the second letter as well, right? Can you guess how we can do that?
We can add a * quantifier, which allows matching 0 or more instances. (Its sibling + requires 1 or more instances.) The regex we have written so far is
Mr\.?\s[A-Z]\w*
but it does not match Ms or Mrs., so let's try to incorporate those as well.
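To check this intermediate regex against the list of names, here is a minimal Python sketch. It uses fullmatch so that each name has to be matched in its entirety.

```python
import re

names = ["Mr. Sharma", "Mr Smith", "Ms Devi", "Mrs. Brown", "Mr. J"]

# Mr, an optional dot, one whitespace character, a capital letter,
# then zero or more word characters.
pattern = re.compile(r"Mr\.?\s[A-Z]\w*")

matched = [n for n in names if pattern.fullmatch(n)]
unmatched = [n for n in names if not pattern.fullmatch(n)]

print(matched, unmatched)
```

"Mr. J" is matched because \w* is happy with zero characters after the capital letter.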
What we need is a group ( ) containing the possible cases: after M, the letters that can follow in our list are r, s and rs. So our group will be (r|s|rs). Notice that | means "either or": either r or s or rs. The final regex will look like this:
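Putting the pieces together gives a pattern along the lines of M(r|s|rs)\.?\s[A-Z]\w*. The Python sketch below checks it against the names listed earlier; it is one reasonable assembly of the steps above, and other spellings of the group would work too.

```python
import re

names = ["Mr. Sharma", "Mr Smith", "Ms Devi", "Mrs. Brown", "Mr. J"]

# M, then r or s or rs, an optional dot, one whitespace character,
# a capital letter, then zero or more word characters.
pattern = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")

matched = [n for n in names if pattern.fullmatch(n)]

# The group captures which title variant was found.
titles = [pattern.fullmatch(n).group(1) for n in names]

print(matched, titles)
```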
Matching Emails
Example:
NeoGCamp@gmail.com
neog.camp@bootcamp.edu
neog-roc8@my-work.net
These are fake email addresses. But let's try to write a regular expression that will match all of these emails.
Coming soon...