Regular expressions (RegEx for short) are special strings that define patterns for matching specific sets of strings. RegEx questions are a favorite with many interviewers because they let you quickly quiz an interviewee’s ability to break a problem into smaller parts without needing to write a lot of code.
There are some excellent online tools available for testing your regular expression syntax and seeing what it matches. http://regexpal.com/ is an interesting one to play around with. On the desktop, TextMate on Mac and Notepad++ are good alternatives.
In this post, we will review some of the basic regular expressions. In future posts, we will look into constructing more complex patterns.
Question: Develop a regular expression to match a US phone number.
Let us take an example phone number – 425-882-8080 (this also happens to be Microsoft’s main line number so don’t call it unless you absolutely have to).
- The simplest RegEx pattern for this can be the number itself. Yes, that works too. But I don’t think the interviewer would be very happy if you give her this answer.
- Using character classes or sets, we can match a group of characters with or without specifying all of them. For example, [0-9] tells the processor to match any digit in the range of 0 to 9. The square brackets are not literally matched because they are treated specially as meta-characters. A meta-character has special meaning in regular expressions and is reserved. A regular expression in the form [0-9] is called a character class, or sometimes a character set. In addition, you can be more specific and list the digits you want matched. For example, [02468] matches a single character only if it is one of 0, 2, 4, 6 or 8. As a next step solution for our problem using character classes, the following RegEx will work (but don’t tell the interviewer that this is your final solution yet): [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
- Let’s bump up our skills now by using character shorthand. A \d matches any digit. A \D matches any non-digit. Our answer can now be shortened to \d\d\d-\d\d\d-\d\d\d\d (which matches the hyphen (-) exactly) or, to accept any non-digit separator, to \d\d\d\D\d\d\d\D\d\d\d\d. Note that instead of using a \D, we could have used a dot (.), which matches any single character, digits included.
- Note that wrapping a part of a regular expression in parentheses () creates a group. We will learn more about groups in a future post.
- To shorten our RegEx more, we can enlist the help of quantifiers. As the name suggests, quantifiers allow you to specify how many times the preceding expression should match. There are a number of ways to specify a quantifier: \d{3} means match a digit exactly 3 times; the question mark (?) means zero or one; the plus sign (+) means one or more; and the asterisk (*) means zero or more. Given our new knowledge about quantifiers, our answer can be updated to (\d{3}[-]?){2}\d{4}, which matches two non-parenthesized sequences of three digits, each followed by an optional hyphen, and then exactly four digits.
- We are almost there. It is now time to make our RegEx more robust, professional, smart and production ready. Let’s add the following features:
- The area code can be optional
- Literal parentheses can optionally wrap the first sequence of three digits
- The separator character can either be a dot (.) or a hyphen (-)
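Before adding these features, the patterns built up in the steps so far can be compared side by side. The following is a minimal sketch in Python, assuming the standard re module; the dictionary labels and the helper name matching_patterns are made up for this illustration and are not part of the original answer.

```python
import re

# Patterns from the steps above, from most literal to most compact.
# The labels are illustrative names, not standard terminology.
patterns = {
    "literal": r"425-882-8080",
    "character class": r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
    "shorthand": r"\d\d\d-\d\d\d-\d\d\d\d",
    "any separator": r"\d\d\d\D\d\d\d\D\d\d\d\d",
    "quantifiers": r"(\d{3}[-]?){2}\d{4}",
}


def matching_patterns(text):
    """Return the labels of the patterns that match the whole of text."""
    return [name for name, p in patterns.items() if re.fullmatch(p, text)]


print(matching_patterns("425-882-8080"))  # all five patterns match
print(matching_patterns("425.882.8080"))  # ['any separator']
print(matching_patterns("4258828080"))    # ['quantifiers']
```

Only the \D version accepts dots, and only the quantifier version accepts a number with no separators at all, which is why the final pattern needs both an explicit separator set and the ? quantifier.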
Our final answer, which should be good enough for an interview, matches a 10-digit US phone number (or a 7-digit one when the 3-digit area code is left out), with or without parentheses, hyphens, or dots: ^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$
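To see how this pattern behaves, here is a small sketch that runs it against a few sample inputs. It assumes Python's re module; the names PHONE and is_phone are hypothetical helpers chosen for this example.

```python
import re

# The final pattern from the post, copied unchanged.
PHONE = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")


def is_phone(text):
    """Return True if text matches the phone number pattern."""
    return PHONE.search(text) is not None


samples = [
    "425-882-8080",    # hyphen separators
    "425.882.8080",    # dot separators
    "4258828080",      # no separators
    "(425)882-8080",   # parenthesized area code
    "882-8080",        # area code omitted
    "(425)-882-8080",  # separator after the closing parenthesis
    "425-882-808",     # one digit short
]

for s in samples:
    print(s, is_phone(s))
```

The first five samples match and the last two do not. Note that (425)-882-8080 is rejected because the parenthesized alternative does not allow a separator after the closing parenthesis; whether that is acceptable is worth raising with the interviewer.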
Let’s dissect this RegEx on a character by character basis to make sure we are on the right track:
- ^ (caret) at the beginning of the regular expression, or following the vertical bar (|), means that the phone number will be at the beginning of a line.
- ( opens a group.
- \( is a literal open parenthesis.
- \d matches a digit.
- {3} is a quantifier that, following \d, matches exactly three digits.
- \) matches a literal close parenthesis.
- | (the vertical bar) indicates alternation, that is, a given choice of alternatives. In other words, this says “match an area code with parentheses or without them.”
- ^ matches the beginning of a line (redundant here, since the first ^ already anchors the match, but harmless).
- \d matches a digit.
- {3} is a quantifier that matches exactly three digits.
- [.-]? matches an optional dot or hyphen.
- ) closes the capturing group.
- ? makes the group optional, that is, the prefix in the group is not required.
- \d matches a digit.
- {3} matches exactly three digits.
- [.-]? matches another optional dot or hyphen.
- \d matches a digit.
- {4} matches exactly four digits.
- $ matches the end of a line.
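The same breakdown can be kept next to the pattern itself. This is a sketch assuming Python's re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments inside the pattern; the names PHONE_COMPACT and PHONE_VERBOSE are invented for this example.

```python
import re

# Compact form, exactly as written in the post.
PHONE_COMPACT = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")

# The same pattern spread out with one comment per piece.
PHONE_VERBOSE = re.compile(r"""
    ^                   # beginning of the line
    (                   # open the area-code group
        \( \d{3} \)     # three digits wrapped in literal parentheses
      |                 # or
        ^ \d{3} [.-]?   # three bare digits and an optional dot or hyphen
    )?                  # close the group and make it optional
    \d{3}               # exactly three digits
    [.-]?               # optional dot or hyphen
    \d{4}               # exactly four digits
    $                   # end of the line
""", re.VERBOSE)

# Spot check that both forms agree on a few inputs.
for s in ["425-882-8080", "(425)882-8080", "882.8080", "425-88-8080"]:
    compact = PHONE_COMPACT.search(s) is not None
    verbose = PHONE_VERBOSE.search(s) is not None
    print(s, compact, verbose)
```

Writing the pattern this way is a stylistic choice rather than a different answer: both forms are meant to accept and reject the same inputs.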
In future posts, we will look at some more advanced regular expressions with examples.
Hi,
This is very helpful. Very nicely explained. I have a question about the last step. It seems you split the first 3 digits into two alternatives, one checking with parentheses and one without.
Going by your earlier logic, couldn’t we check that using a ?, which would indicate 0 or 1 parenthesis, instead of making two separate alternatives for 425 and (425)?
Thanks.
Nads.
(\-(\d)*).* : Though not perfect, even this would do..?