Class 22: Regular Expressions

April 5, 2018

Regular expressions are an extremely powerful method of searching and extracting information from strings. A good, basic tutorial is available here.

In the most common use case, you test a string against a regular expression, using the function search from the re module:

In [1]:
import re # we need to import the re module to use it

test_string = "My name is John Doe"

# test whether test_string contains "name"
# (pay attention to the r in front of the string; we need this)
match = re.search(r"name", test_string)
if match: # did we find a match?
    print("Test string matches.")
    print("Match:", match.group(0)) # print out the part of the string that matched
else:
    print("Test string doesn't match.")
Test string matches.
Match: name
In [2]:
test_string = "My email is"

# test whether test_string contains "name"
match = re.search(r"name", test_string)

if match: # did we find a match?
    print("Test string matches.")
else:
    print("Test string doesn't match.")
Test string doesn't match.
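
The comment in the first example says the r prefix is needed. The following sketch illustrates why, using the word-boundary pattern \b, which does not appear in the examples above and is introduced here only for illustration. In a regular Python string, "\b" is the backspace character, so the pattern never reaches the regex engine as intended; the raw string passes the backslash through unchanged.

```python
import re

test_string = "My name is John Doe"

# Regular string: "\b" is the backspace character (ASCII 8), so this
# pattern looks for backspace + "name" + backspace and finds nothing.
plain_match = re.search("\bname\b", test_string)

# Raw string: the regex engine receives backslash + b and treats it
# as a word boundary, so the word "name" is found.
raw_match = re.search(r"\bname\b", test_string)

print(plain_match)          # None
print(raw_match.group(0))   # name
```

For a pattern without backslashes, such as "name", both forms behave the same. Writing every pattern as a raw string avoids having to think about which case applies.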

Much of the power of regular expressions stems from the fact that you can match on general patterns. For example, \S+ will match one or more non-whitespace characters:

In [3]:
test_string = "My age is secret."
match = re.search(r"My \S+ is", test_string)
print("Match:", match.group(0))

test_string = "My mood is good."
match = re.search(r"My \S+ is", test_string)
print("Match:", match.group(0))
Match: My age is
Match: My mood is
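
To make the limits of \S+ concrete, here is a small sketch using the same pattern on two additional, made-up test strings (they are assumptions for illustration, not part of the class material). It shows that \S+ covers exactly one run of non-whitespace characters and needs at least one character to match.

```python
import re

pattern = r"My \S+ is"

# One run of non-whitespace characters between "My " and " is": matches.
one_word = re.search(pattern, "My age is secret.")

# Two words contain a space, which \S cannot match: no match.
two_words = re.search(pattern, "My first name is John")

# Nothing between the two spaces, but + needs at least one character: no match.
no_word = re.search(pattern, "My  is empty")

print(one_word.group(0))  # My age is
print(two_words)          # None
print(no_word)            # None
```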

We can also capture substrings using regular expressions, by encapsulating the parts of interest in parentheses ():

In [4]:
test_string = "My age is secret."
match = re.search(r"My (\S+) is (\S+)", test_string)
print("Match:", match.group(0))
print("Captured group 1:", match.group(1))
print("Captured group 2:", match.group(2))
Match: My age is secret.
Captured group 1: age
Captured group 2: secret.
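
Note that the second group above includes the trailing period, because a period is a non-whitespace character. The sketch below shows one possible way to leave it out, by swapping \S+ for \w+ (word characters only) in the second group. It also uses the groups() method of the match object, which returns all captured groups at once as a tuple. The variable names are illustrative.

```python
import re

test_string = "My age is secret."

# \w+ matches letters, digits, and underscores, so it stops before the period
match = re.search(r"My (\S+) is (\w+)", test_string)

if match:
    # groups() returns all captured groups as a tuple
    topic, value = match.groups()
    print("Captured group 1:", topic)   # age
    print("Captured group 2:", value)   # secret
```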


Problem 1:

Use the online Python regular expression editor available here to explore regular expressions. For each of the given test strings, find regular expressions that achieve the given goals.

1. Test string: "my email is:"

  • Match on: "my email is"
  • Match on any email address
  • Match on: ""
  • Capture the entire email address
  • Capture both the part before the @ sign and the part after the @ sign separately
  • Capture the username of any email address

2. Test string: "phone number: 123-456-7890"

  • Match on "phone number:" and capture the phone number
  • Match on any string of the form of a phone number, with three digits, a hyphen, three more digits, another hyphen, and four digits
  • Use the same match as before, but now capture the area code

3. Invent a few more problems and solutions on your own.
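
As a starting point for the phone number tasks in item 2, here is a minimal sketch of one possible pattern. It assumes the \d character class (a single digit) and the {n} repetition count, neither of which is introduced above; other patterns would work equally well.

```python
import re

test_string = "phone number: 123-456-7890"

# \d matches one digit; {3} and {4} repeat the preceding item exactly
# that many times. The parentheses capture only the area code.
match = re.search(r"(\d{3})-\d{3}-\d{4}", test_string)

if match:
    print("Match:", match.group(0))       # 123-456-7890
    print("Area code:", match.group(1))   # 123
```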

Problem 2:

Write Python code that takes a string of the form "My name is: ...", extracts the name (indicated here by ...), and then prints it. Make sure you get the full name, not just the first name.

In [5]:
test_string = "My name is: John Doe"

# your code goes here

If this was easy

Problem 3:

Write a function that can parse phone numbers in any sort of format and print them out in the standard 123-456-7890 format.

In [6]:
def clean_phone_number(phone_string): # avoid the name "input", which shadows a Python built-in
    # implement your function here
    pass # delete this, it is here just as a placeholder

# all these calls should produce the number 123-456-7890
clean_phone_number("+1 (123) 456-7890")
clean_phone_number("1 123 456 7890")
clean_phone_number("(123) 4567890")
# the function should realize that this is not a valid phone number
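
One possible approach to Problem 3 is sketched below, under several assumptions that the problem statement leaves open: that a valid number has exactly ten digits, optionally preceded by the country code 1, and that all other characters can be discarded. The function name format_phone_number and the invalid example "12345" are hypothetical, and the function returns its result (or None for an invalid number) rather than printing it.

```python
import re

def format_phone_number(text):
    """Return text as 123-456-7890, or None if it is not a valid number."""
    # \D matches any non-digit character; remove all of them
    digits = re.sub(r"\D", "", text)

    # drop a leading country code of 1 (assumption: US-style numbers)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]

    # assumption: anything other than ten digits is invalid
    if len(digits) != 10:
        return None

    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(format_phone_number("+1 (123) 456-7890"))  # 123-456-7890
print(format_phone_number("1 123 456 7890"))     # 123-456-7890
print(format_phone_number("(123) 4567890"))      # 123-456-7890
print(format_phone_number("12345"))              # None
```

This sketch is deliberately permissive: it accepts any separators between digits. A stricter solution could instead match the whole input against a single pattern with capture groups for the three parts of the number.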