Class 22: Regular Expressions

April 16, 2020

Regular expressions are an extremely powerful method of searching and extracting information from strings. A good, basic tutorial is available here.

In the most common use case, you test a string against a regular expression pattern using the search function from the re module:

In [1]:
import re # we need to import the re module to use it

test_string = "My name is John Doe"

# test whether test_string contains "name"
# (pay attention to the r in front of the string; we need this)
match = re.search(r"name", test_string)
if match: # did we find a match?
    print("Test string matches.")
    print("Match:", match.group()) # print out the part of the string that matched
else:
    print("Test string doesn't match.")
Test string matches.
Match: name
In [2]:
test_string = "My email is john@utexas.edu"

# test whether test_string contains "name"
match = re.search(r"name", test_string)

if match: # did we find a match?
    print("Test string matches.")
    print("Match:", match.group())
else:
    print("Test string doesn't match.")
Test string doesn't match.
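
The comment in the first cell says we need the r in front of the pattern string. A minimal sketch of why, assuming a pattern that uses the word-boundary escape \b (an escape not otherwise used in this worksheet, chosen only for illustration):

```python
import re

test_string = "My name is John Doe"

# Raw string: the backslashes are kept, so the re module receives \b,
# which it interprets as a word boundary.
raw_pattern = r"\bname\b"

# Regular string: Python converts \b into a single backspace character
# before the re module ever sees the pattern.
plain_pattern = "\bname\b"

print(len(raw_pattern), len(plain_pattern))    # the two strings differ in length
print(re.search(raw_pattern, test_string))     # a match object
print(re.search(plain_pattern, test_string))   # None, no backspace in test_string
```

Note that search returns None when nothing matches, which is why the cells above check `if match:` before calling match.group().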

Much of the power of regular expressions stems from the fact that you can match on general patterns. For example, \S+ matches one or more consecutive non-whitespace characters:

In [3]:
test_string = "My age is secret."
match = re.search(r"My \S+ is", test_string)
print("Match:", match.group())

test_string = "My mood is good."
match = re.search(r"My \S+ is", test_string)
print("Match:", match.group())
Match: My age is
Match: My mood is
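
Because \S+ cannot cross whitespace, the pattern above only matches when exactly one word sits between "My" and "is". A small sketch of that limit, using a made-up second test string:

```python
import re

pattern = r"My \S+ is"

# One word between "My" and "is": the pattern matches.
one_word = re.search(pattern, "My mood is good.")

# Two words between "My" and "is": \S+ stops at the space after
# "favorite", so the pattern cannot match anywhere in the string.
two_words = re.search(pattern, "My favorite color is blue.")

print("One word:", one_word.group())
print("Two words:", two_words)
```

The second search returns None, so calling .group() on it would raise an AttributeError. When a match is not guaranteed, check the result first.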

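We can also capture substrings using regular expressions, by enclosing the parts of interest in parentheses (). The whole match is then available as group 0, and the captured parts as groups 1, 2, and so on, numbered in the order of their opening parentheses: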

In [4]:
test_string = "My age is secret."
match = re.search(r"My (\S+) is (\S+)", test_string)
print("Match:", match.group(0))
print("Captured group 1:", match.group(1))
print("Captured group 2:", match.group(2))
Match: My age is secret.
Captured group 1: age
Captured group 2: secret.
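
If you want all captured parts at once, the match object also has a groups() method that returns them as a tuple. A short sketch reusing the test string from above (the variable names topic and value are just illustrative):

```python
import re

test_string = "My age is secret."
match = re.search(r"My (\S+) is (\S+)", test_string)

if match:
    # groups() returns every captured group, in order, as a tuple
    print("All groups:", match.groups())

    # the tuple can be unpacked into separate variables
    topic, value = match.groups()
    print("Topic:", topic)
    print("Value:", value)
```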

Problems

Problem 1

Use the online Python regular expression editor at http://pythex.org/ to explore regular expressions. For each of the given test strings, find regular expressions that achieve the given goals.

1. Test string: "my email is: john@utexas.edu"

  • Match on: "my email is"
  • Match on any email address
  • Match on: "@utexas.edu"
  • Capture the entire email address
  • Capture both the part before the @ sign and the part after the @ sign separately
  • Capture the username of any utexas.edu email address

2. Test string: "phone number: 123-456-7890"

  • Match on "phone number:" and capture the phone number
  • Match on any string of the form of a phone number, with three digits, a hyphen, three more digits, another hyphen, and four digits
  • Use the same match as before, but now capture the area code

3. Invent a few more problems and solutions on your own.

Problem 2

Write Python code that takes a string of the form "My name is: ...", extracts the name (indicated here by ...), and then prints it. Make sure you get the full name, not just the first name.

In [5]:
test_string = "My name is: John Doe"

# your code goes here

If this was easy

Problem 3

Write a function that can parse phone numbers in any sort of format and print them out in the standard 123-456-7890 format.

In [6]:
def clean_phone_number(phone_string):
    # implement your function here
    pass # delete this, it is here just as a placeholder

# all these calls should produce the number 123-456-7890
clean_phone_number("1234567890")
clean_phone_number("+1 (123) 456-7890")
clean_phone_number("1 123 456 7890")
clean_phone_number("(123) 4567890")
# the function should realize that this is not a valid phone number
clean_phone_number("123456")