
Thursday, July 26, 2012

Regular expressions interview questions–Part 1

Regular expressions (RegEx for short) are special strings that define patterns for matching specific sets of strings. They are a favorite interview topic for many developers because they let you quickly gauge an interviewee’s ability to break a problem into smaller parts without needing to write a lot of code.

There are some excellent online tools for testing regular expression syntax and matches. http://regexpal.com/ is an interesting one to play around with. On the desktop, TextMate on Mac and Notepad++ are good alternatives.

In this post, we will review some of the basic regular expressions. In future posts, we will look into constructing more complex patterns.

Question: Develop a regular expression to match a US phone number.

Let us take an example phone number – 425-882-8080 (this also happens to be Microsoft’s main line number, so don’t call it unless you absolutely have to).

  1. The simplest RegEx pattern for this can be the number itself. Yes, that works too. But I don’t think the interviewer would be very happy if you give her this answer.
  2. Using character classes or sets, we can match a group of characters with or without specifying all of them. For example, [0-9] tells the processor to match any digit in the range of 0 to 9. The square brackets are not matched literally because they are meta-characters: characters that are reserved and have special meaning in regular expressions. A regular expression in the form [0-9] is called a character class, or sometimes a character set. You can also be more specific and list exactly the digits you want matched. For example, [02468] matches only one of 0, 2, 4, 6 or 8. As a next step for our problem, the following RegEx using character classes will work (but don’t tell the interviewer that this is your final solution yet): [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
  3. Let’s bump up our skills now by using character shorthand. A \d matches any digit. A \D matches any non-digit. Our answer can now be shortened to \d\d\d-\d\d\d-\d\d\d\d (which matches the hyphen (-) exactly) or, more permissively, to \d\d\d\D\d\d\d\D\d\d\d\d, which accepts any non-digit separator. Note that instead of \D we could have used a dot (.), which matches any character (in most engines, any character except a newline), including a digit.
  4. Note that wrapping a part of a regular expression in parentheses () creates a group. We will learn more about groups in a future post.
  5. To shorten our RegEx more, we can enlist the help of quantifiers. As the name suggests, quantifiers allow you to specify how many times the preceding expression should match. There are a number of ways to specify a quantifier: \d{3} means match a digit exactly 3 times; the question mark (?) means zero or one; the plus sign (+) means one or more; and the asterisk (*) means zero or more. Given our new knowledge about quantifiers, our answer can be updated to (\d{3}[-]?){2}\d{4}, which matches two sequences of three digits, each followed by an optional hyphen, and then exactly four digits.
  6. We are almost there. It is now time to make our RegEx more robust, professional, smart and production ready. Let’s add the following features:
    • The area code can be optional
    • Literal parentheses can optionally wrap the first sequence of three digits
    • The separator character can either be a dot (.) or a hyphen (-)
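Before adding these features, the intermediate patterns from steps 2, 3 and 5 can be compared side by side. The snippet below is a minimal Python sketch, not part of the original answer: the sample inputs and the helper name matching_samples are made up for illustration, and re.fullmatch is used on the assumption that the whole input must be a phone number, since these patterns carry no anchors of their own.

```python
import re

# Intermediate patterns from steps 2, 3 and 5.
PATTERNS = {
    "character classes": r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
    "shorthand, literal hyphen": r"\d\d\d-\d\d\d-\d\d\d\d",
    "shorthand, any non-digit": r"\d\d\d\D\d\d\d\D\d\d\d\d",
    "quantifiers": r"(\d{3}[-]?){2}\d{4}",
}

# Illustrative inputs (assumed for this sketch).
SAMPLES = ["425-882-8080", "425.882.8080", "4258828080"]


def matching_samples(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


for name, pattern in PATTERNS.items():
    print(f"{name}: {matching_samples(pattern)}")
```

The first two patterns accept only the hyphenated form. The \D version also accepts dots but rejects the number written without separators, while the quantifier version accepts the hyphenated and the separator-free forms but not the dotted one.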

Our final answer, which should be good enough for an interview, matches a US phone number of 10 digits (or 7 digits when the optional 3-digit area code is omitted), with or without parentheses around the area code, and with or without hyphens or dots as separators: ^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$
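To try this pattern out, here is a small Python sketch. The function name is_phone and the sample inputs are hypothetical, chosen only to exercise each feature listed above.

```python
import re

# The final pattern from this post, copied unchanged.
PHONE = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")


def is_phone(text):
    """Return True if the text matches the final pattern."""
    return PHONE.search(text) is not None


# Illustrative inputs (assumed for this sketch).
for candidate in [
    "425-882-8080",
    "425.882.8080",
    "4258828080",
    "(425)882-8080",
    "882-8080",
    "(425)-882-8080",
    "425 882 8080",
]:
    print(candidate, is_phone(candidate))
```

The first five inputs match. The last two do not: as written, the pattern allows no separator after a closing parenthesis, and a space is not one of the accepted separators.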

Let’s dissect this RegEx on a character by character basis to make sure we are on the right track:

  • ^ (caret) at the beginning of the regular expression, or following the vertical bar (|), means that the phone number will be at the beginning of a line.
  • ( opens a group.
  • \( is a literal open parenthesis.
  • \d matches a digit.
  • {3} is a quantifier that, following \d, matches exactly three digits.
  • \) matches a literal close parenthesis.
  • | (the vertical bar) indicates alternation, that is, a given choice of alternatives. In other words, this says “match an area code with parentheses or without them.”
  • ^ matches the beginning of a line (redundant here, because the leading ^ already anchors the match).
  • \d matches a digit.
  • {3} is a quantifier that matches exactly three digits.
  • [.-]? matches an optional dot or hyphen.
  • ) closes the capturing group.
  • ? makes the group optional, that is, the prefix in the group is not required.
  • \d matches a digit.
  • {3} matches exactly three digits.
  • [.-]? matches another optional dot or hyphen.
  • \d matches a digit.
  • {4} matches exactly four digits.
  • $ matches the end of a line.
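The same breakdown can be written into the pattern itself. The sketch below uses Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. The names PHONE_VERBOSE, COMPACT and same_verdict are made up for this example, and the helper only checks that the annotated form agrees with the compact one.

```python
import re

# The final pattern, spread out with one comment per element.
PHONE_VERBOSE = re.compile(
    r"""
    ^                   # start of the string
    (                   # open a group for the area code
        \( \d{3} \)     #   three digits wrapped in literal parentheses
      |                 #   or
        ^ \d{3} [.-]?   #   three digits plus an optional dot or hyphen
    )?                  # the whole area-code group is optional
    \d{3}               # three digits
    [.-]?               # optional dot or hyphen
    \d{4}               # four digits
    $                   # end of the string (or just before a trailing newline)
    """,
    re.VERBOSE,
)

# The compact form from this post, for comparison.
COMPACT = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")


def same_verdict(text):
    """Return True if both forms agree on whether the text matches."""
    return bool(PHONE_VERBOSE.search(text)) == bool(COMPACT.search(text))


print(same_verdict("(425)882-8080"))
print(same_verdict("425 882 8080"))
```

Whether the verbose form is worth the extra lines is a matter of taste, but it keeps the explanation next to the part of the pattern it describes.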

In future posts, we will look at some more advanced regular expressions with examples.

4 comments:

  1. Hi,
    This is very helpful and very nicely explained. I have a question about the last step. It seems you split the first 3 digits into two alternatives, one with parentheses and one without.
    Going by your earlier logic, couldn't we use a ? to indicate 0 or 1 parenthesis instead of making two separate alternatives for 425 and (425)?
    Thanks.
    Nads.

  2. (\-(\d)*).* : Though not perfect even it would do..?

  3. Thanks, it's simple and nice.
