Regular Expressions

Version 1107 (Mon Nov 27 20:46:04 2006)

How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to describe in code
Use regular expressions (REs) instead
- Represent patterns as strings
- Just like the "*" in the shell's *.txt
Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do

You Can Skip This Lecture If...

You know what a regular expression is
You understand the difference between «*» and «+»
You know how and why to compile an RE
You know how to find out which part of a string matched which part of an RE
You know how to get all of an RE's matches with one method call

A Simple Example

The simplest kind of RE matches a fixed string of characters
- Similar to the in operator

import re

dragons = [
    ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]

for (dna, name) in dragons:
    if re.search('ATGCGT', dna):
        print name

Common Welsh Green
Hungarian Horntail


This or That

Modify the regular expression a little

import re

dragons = [
    ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]

for (dna, name) in dragons:
    if re.search('ATGCGT|GCT', dna):
        print name

Common Welsh Green
Hebridean Black
Hungarian Horntail
Norwegian Ridgeback

The vertical bar «|» means “or”
- So this RE matches any string containing either "ATGCGT" or "GCT"

Precedence

What about matching either "ATA" or "ATC" (both of which code for isoleucine)?
- «ATA|C» will not work: it matches either "ATA" or "C"
- «ATA|ATC» will work, but it's a bit redundant

Solution: use parentheses, just as in math

import re

tests = [
    ['ATA',   True],
    ['xATCx', True],
    ['ATG',   False],
    ['AT',    False],
    ['ATAC',  True]
]

for (dna, expected) in tests:
    actual = re.search('AT(A|C)', dna) is not None
    assert actual == expected

Note that there's no output: the asserts will crash the program if any of the tests fail
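
The difference between the two groupings can be checked directly. The following is a minimal sketch in the same assert style as the tests above; the sample string 'GGCGG' is made up for illustration.

```python
import re

# «ATA|C» is read as "ATA" or "C", so a lone "C" is enough to match.
loose = re.search('ATA|C', 'GGCGG')
assert loose is not None

# «AT(A|C)» is read as "AT" followed by "A" or "C", so a lone "C" is not enough.
strict = re.search('AT(A|C)', 'GGCGG')
assert strict is None

# Both patterns match when the text really contains one of the two codons.
for dna in ['ATA', 'xATCx']:
    assert re.search('ATA|C', dna) is not None
    assert re.search('AT(A|C)', dna) is not None
```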


Escaping Special Characters

How to match an actual "|", "(", or ")"?
Solution is to use «\|», «\(», or «\)» in the RE
- And of course «\\» to match a backslash
But in order to put a backslash in a Python string, you have to escape it
- So the written form of the RE is "\\|", "\\(", "\\)", or "\\\\"
What you type in is being compiled twice:
- Once by Python to create a string
- Once by the regular expression library to create the RE
- Figure 17.1: Double Compilation of Regular Expressions
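
A short sketch of the double escaping, assuming ordinary (non-raw) string literals; the sample texts are invented for illustration.

```python
import re

# Each pattern is an ordinary Python string, so every backslash meant
# for the RE library has to be doubled.
cases = [
    ('\\|',  'a|b'),     # RE «\|» matches a literal vertical bar
    ('\\(',  'f(x)'),    # RE «\(» matches a literal opening parenthesis
    ('\\)',  'f(x)'),    # RE «\)» matches a literal closing parenthesis
    ('\\\\', 'C:\\tmp')  # RE «\\» matches a literal backslash
]

for (pattern, text) in cases:
    assert re.search(pattern, text) is not None

# After Python's pass, the string '\\|' holds just two characters: \ and |
assert len('\\|') == 2
```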

Raw Strings

To help keep things readable, Python supports raw strings
- Written as r'abc' or r"this\nand\nthat"
- Inside a raw string, a backslash is just a backslash
- So r'\n' is a string containing the two characters "\" and "n", not a newline
Raw strings are not automatically converted into REs
- But that is their most common use
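
These claims about raw strings are easy to confirm. A minimal sketch, with made-up sample text:

```python
import re

# A raw string keeps the backslash: two characters, not one newline.
assert len(r'\n') == 2
assert len('\n') == 1

# The raw and doubled-backslash spellings produce the same string.
assert r'\|' == '\\|'
assert r'\\' == '\\\\'

# A raw string is still just a string until it is handed to the re module.
pattern = r'\|'
assert re.search(pattern, 'this|that') is not None
assert re.search(pattern, 'this or that') is None
```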

Sequences

In the shell, "*" matches zero or more characters
In an RE, «*» is an operator that means, “match zero or more occurrences of a pattern”
- Comes after the pattern, not before

Example: match any strand of DNA in which "TTA" and "CTA" are separated by any number of "G"

tests = [
    ['TTACTA',    True],  # separated by zero G's
    ['TTAGCTA',   True],  # separated by one G
    ['TTAGGGCTA', True],  # separated by three G's
    ['TTAXCTA',   False], # an X in the way
    ['TTAGCGCTA', False], # a C embedded among the G's
]

for (dna, expected) in tests:
    actual = re.search('TTAG*CTA', dna) is not None
    assert actual == expected

Note that the RE matches "TTACTA" because «G*» can match zero occurrences of "G"
Figure 17.2: Zero or More

«+» matches one or more (i.e., won't match the empty string)
- Figure 17.3: One or More
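
For comparison, here is a sketch of the same kind of test with «+» in place of «*»; only the first case changes its expected result.

```python
import re

tests = [
    ['TTACTA',    False], # zero G's: «G+» needs at least one
    ['TTAGCTA',   True],  # separated by one G
    ['TTAGGGCTA', True],  # separated by three G's
]

for (dna, expected) in tests:
    actual = re.search('TTAG+CTA', dna) is not None
    assert actual == expected
```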

Making Something Optional

The «?» operator means “optional”
- I.e., zero or one occurrences, but no more

assert re.search('AC?T', 'AT')
assert re.search('AC?T', 'ACT')
assert not re.search('AC?T', 'ACCT')

Figure 17.4: Zero or One


Character Sets

Use «[]» to match sets of characters
- The expression «[abcd]» matches exactly one "a", "b", "c", or "d"
- Can be abbreviated as «[a-d]»
Often combined with «*», «+», or «?»
- «[aeiou]+» matches any non-empty sequence of vowels

Example: find lines containing numbers

import re

lines = [
    "Charles Darwin (1809-82)",
    "Darwin's principal works, The Origin of Species (1859)",
    "and The Descent of Man (1871) marked a new epoch in our",
    "understanding of our world and ourselves.  His ideas",
    "were shaped by the Beagle's voyage around the world in",
    "1831-36."
]

for line in lines:
    if re.search('[0-9]+', line):
        print line

Charles Darwin (1809-82)
Darwin's principal works, The Origin of Species (1859)
and The Descent of Man (1871) marked a new epoch in our
1831-36.

Try writing this without using regular expressions…


Abbreviations

Some character sets occur so often that they have abbreviations

Sequence	Equivalent	Explanation
`«\d»`	`«[0-9]»`	Digits
`«\s»`	`«[ \t\n\r\f\v]»`	Whitespace
`«\w»`	`«[a-zA-Z0-9_]»`	Word characters (i.e., those allowed in variable names)
Table 17.1: Regular Expression Escapes in Python
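
A brief sketch of the abbreviations in use. The sample strings are invented, and re.findall (covered later in this lecture) is used only to list the matches.

```python
import re

# Two runs of digits separated by whitespace.
assert re.search(r'\d+\s+\d+', '10   20') is not None
assert re.search(r'\d+\s+\d+', '10-20') is None

# Runs of word characters: letters, digits, and the underscore.
names = re.findall(r'\w+', 'max_temp = reading2 + 4.5')
assert names == ['max_temp', 'reading2', '4', '5']
```

Note that "." is not a word character, so "4.5" is found as two separate runs.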


Special Cases

«[^abc]» means “anything except the characters in this set”
«.» means “any character except the end of line”
- Equivalent to «[^\n]»
«\b» matches the break between word and non-word characters
- Doesn't consume any actual characters
- Figure 17.5: Word/Non-Word Breaks
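
The three special cases can be sketched with a few asserts; all sample strings here are made up.

```python
import re

# «[^abc]» matches one character that is not "a", "b", or "c".
assert re.search('[^abc]', 'abcab') is None
assert re.search('[^abc]', 'abxab').group() == 'x'

# «.» matches any single character except a newline.
assert re.search('a.c', 'abc') is not None
assert re.search('a.c', 'a c') is not None
assert re.search('a.c', 'a\nc') is None

# «\b» matches a position, so the matched text contains no extra characters.
match = re.search(r'\bcat\b', 'the cat sat')
assert match.group() == 'cat'
assert re.search(r'\bcat\b', 'concatenate') is None
```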

Example: find words that end in a vowel

Use string.split to break on spaces and newlines before applying RE

import re

words = '''Born in New York City in 1918, Richard Feynman earned a
bachelor's degree at MIT in 1939, and a doctorate from Princeton in
1942. After working on the Manhattan Project in Los Alamos during
World War II, he became a professor at CalTech in 1951.  Feynman won
the 1965 Nobel Prize in Physics for his work on quantum
electrodynamics, and served on the commission investigating the
Challenger disaster in 1986.'''.split()

end_in_vowel = set()
for w in words:
    if re.search(r'[aeiou]\b', w):
        end_in_vowel.add(w)
for w in end_in_vowel:
    print w

a
Prize
degree
became
doctorate
the
he


Anchoring

How to find blank lines?
- re.search(r'\s*', line) won't work: «\s*» can match zero characters, so it matches even "start end"
Use anchors
- «^» matches the beginning of the string
- «$» matches the end
- Neither consumes any characters
- Figure 17.6: Anchoring Matches

Examples:

Pattern	Text	Result
`«b+»`	`"abbc"`	Matches
`«^b+»`	`"abbc"`	Fails (string doesn't start with `b`)
`«c$»`	`"abbc"`	Matches (string ends with `c`)
`«^a*$»`	`"aabaa"`	Fails (something other than `"a"` between start and end of string)
Table 17.2: Regular Expression Anchors in Python
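
The table rows, and the blank-line question that opened this lecture, can be sketched as follows. The list of lines is hypothetical sample data.

```python
import re

# The four rows of the table.
assert re.search('b+', 'abbc') is not None
assert re.search('^b+', 'abbc') is None
assert re.search('c$', 'abbc') is not None
assert re.search('^a*$', 'aabaa') is None

# Hypothetical sample data: three of these five lines are blank.
lines = ['first', '', '   ', 'second', ' \t ']

# Unanchored, «\s*» matches every line because it can match zero characters.
unanchored = [line for line in lines if re.search(r'\s*', line)]
assert len(unanchored) == 5

# Anchored at both ends, the line must consist of nothing but whitespace.
blank = [line for line in lines if re.search(r'^\s*$', line)]
assert len(blank) == 3
```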


Extracting Matches

Problem: want to find comments in a data file
- A comment starts with a "#", and extends to the end of the line

First try: If the RE matches, split on the "#"

import re

lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')

for line in lines:
    if re.search('#', line):
        comment = line.split('#')[1]
        print comment

 01:30 - 03:00
03:00-04:30
 04:30-06:00

Output is inconsistent: some comments keep a leading space, others don't
Could clean up with strip, but split followed by strip seems clumsy


Match Objects

Result of re.search is actually a match object that records what matched, and where

mo.group() returns the whole string that matched the RE
mo.start() and mo.end() are the indices of the match's location

import re

text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
    match = re.search(pattern, text)
    print '%s / %s => "%s" (%d, %d)' % \
          (pattern, text, match.group(), match.start(), match.end())

b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)


Match Groups

Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched the Nth set of parentheses (counting opening parentheses from the left) is group N
- mo.group(3) is the text that matched the third subexpression, mo.start(3) is where it started

Extracting comments is now easy:

import re

lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')

for line in lines:
    match = re.search(r'#\s*(.+)', line)
    if match:
        comment = match.group(1)
        print comment

01:30 - 03:00
03:00-04:30
04:30-06:00
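
Group numbering and per-group positions can be sketched with a pattern that has three groups. The pattern below is an illustration written for one line of the data above, not part of the original example.

```python
import re

# One line from the data above.
line = 'On duty: HP # 01:30 - 03:00'

# Group 1 is the observer's initials; groups 2 and 3 are the two times.
mo = re.search(r'On duty: (\w+) #\s*(\d+:\d+)\s*-\s*(\d+:\d+)', line)

assert mo.group(0) == 'On duty: HP # 01:30 - 03:00'
assert mo.group(1) == 'HP'
assert mo.group(2) == '01:30'
assert mo.group(3) == '03:00'

# start(N) and end(N) give the position of group N within the original string.
assert mo.start(1) == 9
assert line[mo.start(3):mo.end(3)] == '03:00'
```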


Reversing Columns

REs are the power tools of text processing
- Can do things in one line that would otherwise take many lines of code

Example: reverse two-column data

import re

def reverse_columns(line):
    match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
    if not match:
        return line
    return match.group(2) + ' ' + match.group(1)

tests = [
    ['10 20',    'easy case'],
    [' 30  40 ', 'padding'],
    ['60 70 80', 'too many columns'],
    ['90 end',   'non-numeric']
]

for (fixture, title) in tests:
    actual = reverse_columns(fixture)
    print '%s: "%s" => "%s"' % (title, fixture, actual)

easy case: "10 20" => "20 10"
padding: " 30  40 " => "40 30"
too many columns: "60 70 80" => "60 70 80"
non-numeric: "90 end" => "90 end"


Compiling

The RE library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs
- Figure 17.7: Regular Expressions as Finite State Machines
You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use re.compile(pattern) to get the compiled RE
- Its methods have the same names and behavior as the functions in the re module
- E.g., matcher.search(text) searches text for matches to the RE that was compiled to create matcher
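
A minimal sketch of the calling convention, using made-up sample strings. How much time compiling saves depends on the program; this only shows that the compiled object and the module-level function give the same answer.

```python
import re

# Compile once...
matcher = re.compile(r'\d+')

# ...then reuse the compiled form on many strings.
texts = ['Darwin (1809-82)', 'no digits here', 'Origin of Species (1859)']
found = []
for text in texts:
    match = matcher.search(text)
    if match:
        found.append(match.group())

assert found == ['1809', '1859']

# The module-level function gives the same answer for the same pattern.
assert re.search(r'\d+', texts[0]).group() == matcher.search(texts[0]).group()
```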

Finding Title Case Words

Example: find all Title Case words in a document

import re

# Put pattern outside 'find_all' so that it's only compiled once.
pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')

def find_all(line):
    result = []
    match = pattern.search(line)
    while match:
        result.append(match.group(1))
        match = pattern.search(match.group(2))
    return result

lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
for line in lines:
    print line
    for word in find_all(line):
        print '\t', word

This has several Title Case words
	This
	Title
	Case
on Each Line (Some in parentheses).
	Each
	Line
	Some


Finding All Matches

Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2

Much easier to use the findall method

import re

lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
pattern = re.compile(r'\b([A-Z][a-z]*)\b')
for line in lines:
    print line
    for word in pattern.findall(line):
        print '\t', word

This has several Title Case words
	This
	Title
	Case
on Each Line (Some in parentheses).
	Each
	Line
	Some


Reference Material

Pattern	Matches	Doesn't Match	Explanation
`«a*»`	`""`, `"a"`, `"aa"`, …	`"A"`, `"b"`	`«*»` means “zero or more”; matching is case sensitive
`«b+»`	`"b"`, `"bb"`, …	`""`	`«+»` means “one or more”
`«ab?c»`	`"ac"`, `"abc"`	`"a"`, `"abbc"`	`«?»` means “optional” (zero or one)
`«[abc]»`	`"a"`, `"b"`, or `"c"`	`"ab"`, `"d"`	`«[…]»` means “one character from a set”
`«[a-c]»`	`"a"`, `"b"`, or `"c"`		Character ranges can be abbreviated
`«[abc]*»`	`""`, `"ac"`, `"baabcab"`, …		Operators can be combined: zero or more choices from `"a"`, `"b"`, or `"c"`
Table 17.3: Regular Expression Operators

Method	Purpose	Example	Result
`split`	Split a string on a pattern.	`re.split('\\s*,\\s*', 'a, b ,c , d')`	`['a', 'b', 'c', 'd']`
`findall`	Find all matches for a pattern.	`re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.')`	`['Some', 'Title', 'Case']`
`sub`	Replace matches with new text.	`re.sub('\\d+', 'NUM', 'If 123 is 456')`	`"If NUM is NUM"`
Table 17.4: Regular Expression Object Methods
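
The three methods can be exercised with asserts, here using raw strings instead of doubled backslashes. The split pattern is written as «\s*,\s*» so that whitespace on either side of the comma is optional.

```python
import re

# split: a comma plus any surrounding whitespace is the separator.
parts = re.split(r'\s*,\s*', 'a, b ,c , d')
assert parts == ['a', 'b', 'c', 'd']

# findall: every capitalized word.
words = re.findall(r'\b[A-Z][a-z]*', 'Some words in Title Case.')
assert words == ['Some', 'Title', 'Case']

# sub: replace every run of digits.
replaced = re.sub(r'\d+', 'NUM', 'If 123 is 456')
assert replaced == 'If NUM is NUM'

# The same methods are available on a compiled RE.
digits = re.compile(r'\d+')
assert digits.sub('NUM', 'If 123 is 456') == replaced
```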


But Wait, There's More

We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
For example, use «pat{N}» to match exactly N occurrences of a pattern
- More generally, «pat{M,N}» matches between M and N occurrences
Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
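
As a sketch of both points, the counted forms «pat{N}» and «pat{M,N}» can be tested on their own and then combined. The NNN-NNNN phone number format is a hypothetical target chosen for illustration.

```python
import re

# Step 1: exactly three digits.
assert re.search(r'^\d{3}$', '555') is not None
assert re.search(r'^\d{3}$', '55') is None

# Step 2: between two and four digits.
for text in ['12', '123', '1234']:
    assert re.search(r'^\d{2,4}$', text) is not None
for text in ['1', '12345']:
    assert re.search(r'^\d{2,4}$', text) is None

# Step 3: combine tested pieces into a hypothetical NNN-NNNN format.
phone = re.compile(r'^\d{3}-\d{4}$')
assert phone.search('555-1234') is not None
assert phone.search('5551234') is None
assert phone.search('555-12345') is None
```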

Summary

Regular expressions are available in almost every language
- As a library: C/C++, Java, …
- Built into the language: Perl, Ruby, …
- Syntax varies slightly, but the ideas are the same
For a broader tutorial, see [Wilson 2005]
- And if you're going to be doing serious work, check out [Good 2005] or [Friedl 2002]

Exercises

Exercise 17.1:

By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE «X(.*)X(.*)» to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".

Exercise 17.2:

What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)

Exercise 17.3:

What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.

Exercise 17.4:

What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.
