* in the shell's *.txt
in operator

    import re

    dragons = [
        ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    # Print the name of every dragon whose DNA contains the pattern.
    for (dna, name) in dragons:
        if re.search('ATGCGT', dna):
            print(name)

    Common Welsh Green
    Hungarian Horntail
    import re

    dragons = [
        ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    # Print the name of every dragon whose DNA contains either pattern.
    for (dna, name) in dragons:
        if re.search('ATGCGT|GCT', dna):
            print(name)

    Common Welsh Green
    Hebridean Black
    Hungarian Horntail
    Norwegian Ridgeback
| means “or”: the pattern above matches "ATGCGT" or "GCT"
How do we match "ATA" or "ATC" (both of which code for isoleucine)?
ATA|C will not work: it matches either "ATA" or "C"
ATA|ATC will work, but it's a bit redundant
Use parentheses to group: AT(A|C) matches "AT" followed by either "A" or "C"

    import re

    tests = [
        ['ATA', True],
        ['xATCx', True],
        ['ATG', False],
        ['AT', False],
        ['ATAC', True]
    ]

    for (dna, expected) in tests:
        actual = re.search('AT(A|C)', dna) is not None
        assert actual == expected
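The difference between ATA|C and AT(A|C) can be checked directly. The following is a minimal sketch; the helper name matches and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# Without parentheses, | splits the whole pattern: "ATA" or "C".
ungrouped = 'ATA|C'
# With parentheses, | only applies inside the group: "AT", then "A" or "C".
grouped = 'AT(A|C)'

# 'GGC' contains a lone "C" but neither "ATA" nor "ATC".
assert matches(ungrouped, 'GGC')       # false positive
assert not matches(grouped, 'GGC')

# 'GATCG' contains "ATC", so the grouped pattern finds it.
assert matches(grouped, 'GATCG')
# The ungrouped pattern also reports a match, but only because of the "C".
assert matches(ungrouped, 'GATCG')
```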
The assert statement will crash the program if any of the tests fail
What if we want to match a literal "|", "(", or ")"?
Use \|, \(, or \) in the RE
Use \\ to match a backslash
In an ordinary Python string, these are written "\\|", "\\(", "\\)", or "\\\\"
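The escaping rules above can be tried out in a few lines. This is only a sketch: the strings text and path are invented for the example, and every pattern is written as an ordinary (non-raw) Python string, which is why the backslashes are doubled.

```python
import re

text = 'f(x) = a|b'

# Escaped, | and ( ) match the literal characters.
assert re.search('a\\|b', text)
assert not re.search('a\\|b', 'a or b')
assert re.search('f\\(x\\)', text)

# Without the escapes, the parentheses only group, so 'f(x)' looks for "fx".
assert not re.search('f(x)', text)
assert re.search('f(x)', 'fx')

# A literal backslash needs four backslashes in the Python source:
# the string holds two, and the RE reads those two as one.
path = 'C:\\temp'          # the seven characters C:\temp
assert len('\\\\') == 2
assert re.search('\\\\', path)
assert not re.search('\\\\', 'C:/temp')
```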
Raw strings avoid doubling the backslashes: r'abc' or r"this\nand\nthat"
r'\n' is a string containing the two characters "\" and "n", not a newline
In the shell, "*" matches zero or more characters
In an RE, * is an operator that means “match zero or more occurrences of a pattern”
"TTA"
and "CTA" are separated by any number of "G":

    import re

    tests = [
        ['TTACTA', True],      # separated by zero G's
        ['TTAGCTA', True],     # separated by one G
        ['TTAGGGCTA', True],   # separated by three G's
        ['TTAXCTA', False],    # an X in the way
        ['TTAGCGCTA', False]   # a C embedded among the G's
    ]

    for (dna, expected) in tests:
        actual = re.search('TTAG*CTA', dna) is not None
        assert actual == expected

The pattern matches "TTACTA" because G* can match zero occurrences of "G"
+ matches one or more (i.e., won't match the empty string)

    assert re.search('TTAG*CTA', 'TTACTA')
    assert not re.search('TTAG+CTA', 'TTACTA')

The ? operator means “optional”

    assert re.search('AC?T', 'AT')
    assert re.search('AC?T', 'ACT')
    assert not re.search('AC?T', 'ACCT')
Use [] to match sets of characters
[abcd] matches exactly one "a", "b", "c", or "d"
This can be abbreviated as [a-d]
Sets can be combined with *, +, or ?
[aeiou]+ matches any non-empty sequence of vowels

    import re

    lines = [
        "Charles Darwin (1809-82)",
        "Darwin's principal works, The Origin of Species (1859)",
        "and The Descent of Man (1871) marked a new epoch in our",
        "understanding of our world and ourselves. His ideas",
        "were shaped by the Beagle's voyage around the world in",
        "1831-36."
    ]

    # Print only the lines that contain at least one digit.
    for line in lines:
        if re.search('[0-9]+', line):
            print(line)

    Charles Darwin (1809-82)
    Darwin's principal works, The Origin of Species (1859)
    and The Descent of Man (1871) marked a new epoch in our
    1831-36.
Sequence | Equivalent | Explanation |
---|---|---|
\d | [0-9] | Digits |
\s | [ \t\r\n] | Whitespace |
\w | [a-zA-Z0-9_] | Word characters (i.e., those allowed in variable names) |

Regular Expression Escapes in Python
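The escapes in the table can be exercised with a handful of assertions. This is a sketch using made-up sample strings; it sticks to ASCII text, where the escapes agree with the bracketed sets in the table (in Python 3, \d and \w also accept non-ASCII digits and letters).

```python
import re

# \d matches a digit, like [0-9] for ASCII text.
assert re.search(r'\d', 'born 1809')
assert not re.search(r'\d', 'no digits here')

# \s matches whitespace such as a space, tab, or newline.
assert re.search(r'a\sb', 'a\tb')
assert not re.search(r'a\sb', 'a-b')

# \w matches letters, digits, and underscore, but not punctuation.
assert re.search(r'\w', '_')
assert not re.search(r'\w', '?!')

# Escapes combine with operators: \d+-\d+ matches a range of years.
assert re.search(r'\d+-\d+', 'Charles Darwin (1809-82)')
assert not re.search(r'\d+-\d+', 'The Origin of Species (1859)')
```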
[^abc] means “anything except the characters in this set”
. means “any character except the end of line”, i.e., [^\n]
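A short sketch of negated sets and the dot, using invented test strings, shows that the dot and [^\n] behave the same way:

```python
import re

# [^abc] matches one character that is not "a", "b", or "c".
assert re.search('[^abc]', 'abcd')          # the "d" qualifies
assert not re.search('[^abc]', 'cabbac')    # nothing but a, b, and c

# . matches any single character except a newline.
assert re.search('a.c', 'abc')
assert re.search('a.c', 'a c')
assert not re.search('a.c', 'a\nc')

# [^\n] spells out the same rule explicitly.
assert re.search('a[^\n]c', 'abc')
assert not re.search('a[^\n]c', 'a\nc')
```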
\b matches the break between word and non-word characters
Use the string split method to break on whitespace before applying the RE

    import re

    words = '''Born in New York City in 1918, Richard Feynman earned a
    bachelor's degree at MIT in 1939, and a doctorate from Princeton in
    1942. After working on the Manhattan Project in Los Alamos during
    World War II, he became a professor at CalTech in 1951. Feynman won
    the 1965 Nobel Prize in Physics for his work on quantum
    electrodynamics, and served on the commission investigating the
    Challenger disaster in 1986.'''.split()

    # Collect each distinct word that has a vowel just before a word break.
    end_in_vowel = set()
    for w in words:
        if re.search(r'[aeiou]\b', w):
            end_in_vowel.add(w)

    # Sets are unordered, so the order of the output may vary.
    for w in end_in_vowel:
        print(w)

    a
    Prize
    degree
    became
    doctorate
    the
    he
re.search(r'\s*', line) will match "start end", because \s* can match zero characters
^ matches the beginning of the string
$ matches the end

Pattern | Text | Result |
---|---|---|
b+ | "abbc" | Matches |
^b+ | "abbc" | Fails (string doesn't start with b) |
c$ | "abbc" | Matches (string ends with c) |
^a*$ | "aabaa" | Fails (something other than "a" between start and end of string) |

Regular Expression Anchors in Python
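The rows of the anchor table can be turned into a small test, in the same style as the earlier examples. The last two cases are not in the table; they are added here as assumptions about what ^a*$ should accept.

```python
import re

# Each entry: (pattern, text, whether a match is expected).
cases = [
    ('b+',   'abbc',  True),
    ('^b+',  'abbc',  False),  # string doesn't start with "b"
    ('c$',   'abbc',  True),   # string ends with "c"
    ('^a*$', 'aabaa', False),  # "b" sits between start and end
    ('^a*$', 'aaaa',  True),   # extra case: nothing but "a"
    ('^a*$', '',      True),   # extra case: a* may match nothing at all
]

for (pattern, text, expected) in cases:
    actual = re.search(pattern, text) is not None
    assert actual == expected, (pattern, text)
```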
"#"
starts a comment, which extends to the end of the line
First attempt: split each line on "#"

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    # Keep whatever follows the "#"; any leading space is kept too.
    for line in lines:
        if re.search('#', line):
            comment = line.split('#')[1]
            print(comment)

     01:30 - 03:00
    03:00-04:30
     04:30-06:00
split followed by strip seems clumsy
The result of re.search is actually a match object that records what matched, and where
mo.group() returns the whole string that matched the RE
mo.start() and mo.end() are the indices of the match's location

    import re

    text = 'abbcb'
    for pattern in ['b+', 'bc*', 'b+c+']:
        mo = re.search(pattern, text)
        print('%s / %s => "%s" (%d, %d)' % (pattern, text, mo.group(), mo.start(), mo.end()))

    b+ / abbcb => "bb" (1, 3)
    bc* / abbcb => "b" (1, 2)
    b+c+ / abbcb => "bbc" (1, 4)
mo.group(3) is the text that matched the third subexpression, mo.start(3) is where it started

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    # Group 1 is the comment text with the leading whitespace skipped.
    for line in lines:
        match = re.search(r'#\s*(.+)', line)
        if match:
            comment = match.group(1)
            print(comment)

    01:30 - 03:00
    03:00-04:30
    04:30-06:00
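To see the numbered groups and their positions side by side, here is a small sketch that pulls a date apart. The pattern and the variable names are chosen for illustration only.

```python
import re

text = 'Date: 2006-03-07'

# Three parenthesized subexpressions: year, month, day.
mo = re.search(r'(\d+)-(\d+)-(\d+)', text)

assert mo.group() == '2006-03-07'    # the whole match
assert mo.group(0) == mo.group()     # group 0 is the whole match too
assert mo.group(1) == '2006'
assert mo.group(2) == '03'
assert mo.group(3) == '07'

# start(n) and end(n) give the position of group n in the original string.
assert mo.start(3) == 14
assert mo.end(3) == 16
assert text[mo.start(3):mo.end(3)] == mo.group(3)
```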
    import re

    def reverse_columns(line):
        # Match exactly two columns of digits, with optional padding.
        match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
        if not match:
            return line
        return match.group(2) + ' ' + match.group(1)

    tests = [
        ['10 20', 'easy case'],
        [' 30 40 ', 'padding'],
        ['60 70 80', 'too many columns'],
        ['90 end', 'non-numeric']
    ]

    for (fixture, title) in tests:
        actual = reverse_columns(fixture)
        print('%s: "%s" => "%s"' % (title, fixture, actual))

    easy case: "10 20" => "20 10"
    padding: " 30 40 " => "40 30"
    too many columns: "60 70 80" => "60 70 80"
    non-numeric: "90 end" => "90 end"
Use re.compile(pattern) to get the compiled RE
Its methods have the same names as the functions in the re module
matcher.search(text) searches text for matches to the RE that was compiled to create matcher

    import re

    # Put pattern outside 'find_all' so that it's only compiled once.
    pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')

    def find_all(line):
        result = []
        match = pattern.search(line)
        while match:
            result.append(match.group(1))
            # Search again in whatever followed the word just found.
            match = pattern.search(match.group(2))
        return result

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]

    for line in lines:
        print(line)
        for word in find_all(line):
            print('\t' + word)

    This has several Title Case words
    	This
    	Title
    	Case
    on Each Line (Some in parentheses).
    	Each
    	Line
    	Some
The findall method returns all the matches at once

    import re

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]

    pattern = re.compile(r'\b([A-Z][a-z]*)\b')

    for line in lines:
        print(line)
        for word in pattern.findall(line):
            print('\t' + word)

    This has several Title Case words
    	This
    	Title
    	Case
    on Each Line (Some in parentheses).
    	Each
    	Line
    	Some
Pattern | Matches | Doesn't Match | Explanation |
---|---|---|---|
a* | "", "a", "aa", … | "A", "b" | * means “zero or more”; matching is case sensitive |
b+ | "b", "bb", … | "" | + means “one or more” |
ab?c | "ac", "abc" | "a", "abbc" | ? means “optional” (zero or one) |
[abc] | "a", "b", or "c" | "ab", "d" | […] means “one character from a set” |
[a-c] | "a", "b", or "c" | | Character ranges can be abbreviated |
[abc]* | "", "ac", "baabcab", … | | Operators can be combined: zero or more choices from "a", "b", or "c" |

Regular Expression Operators
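The operator table describes whether a pattern matches a whole string, whereas re.search looks for a match anywhere inside it. The sketch below therefore wraps each pattern in ^(...)$ before testing; the helper matches_whole is a hypothetical name introduced only for this check.

```python
import re

def matches_whole(pattern, text):
    """Return True if pattern matches all of text, not just part of it."""
    return re.search('^(' + pattern + ')$', text) is not None

# Each entry: (pattern, strings that should match, strings that should not).
table = [
    ('a*',     ['', 'a', 'aa'],        ['A', 'b']),
    ('b+',     ['b', 'bb'],            ['']),
    ('ab?c',   ['ac', 'abc'],          ['a', 'abbc']),
    ('[abc]',  ['a', 'b', 'c'],        ['ab', 'd']),
    ('[a-c]',  ['a', 'b', 'c'],        []),
    ('[abc]*', ['', 'ac', 'baabcab'],  []),
]

for (pattern, good, bad) in table:
    for text in good:
        assert matches_whole(pattern, text), (pattern, text)
    for text in bad:
        assert not matches_whole(pattern, text), (pattern, text)
```

Without the anchors, a pattern such as a* would "match" every string, because it can match zero characters at the very start.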
Method | Purpose | Example | Result |
---|---|---|---|
split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |

Regular Expression Object Methods
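The examples in the method table can be run as written. The sketch below repeats them with raw strings in place of doubled backslashes, and adds a compiled pattern (the name number is made up here) to show that the same methods are available on it.

```python
import re

# split: break a string wherever the pattern matches.
assert re.split(r'\s*,\s*', 'a, b ,c , d') == ['a', 'b', 'c', 'd']

# findall: return every non-overlapping match as a list of strings.
found = re.findall(r'\b[A-Z][a-z]*', 'Some words in Title Case.')
assert found == ['Some', 'Title', 'Case']

# sub: replace every match with new text.
assert re.sub(r'\d+', 'NUM', 'If 123 is 456') == 'If NUM is NUM'

# The same methods exist on compiled patterns.
number = re.compile(r'\d+')
assert number.findall('If 123 is 456') == ['123', '456']
assert number.sub('NUM', 'If 123 is 456') == 'If NUM is NUM'
```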
Use pat{N} to match exactly N occurrences of a pattern
pat{M,N} matches between M and N occurrences

By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE X(.*)X(.*) to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".
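The counting operators and the greedy behaviour described above can be confirmed with a few assertions. This sketch only checks the greedy result; the reluctant version is left as the exercise asks. The "G" test strings are invented for the example.

```python
import re

# pat{N}: exactly N occurrences.
assert re.search('^G{3}$', 'GGG')
assert not re.search('^G{3}$', 'GG')
assert not re.search('^G{3}$', 'GGGG')

# pat{M,N}: between M and N occurrences, inclusive.
assert not re.search('^G{2,4}$', 'G')
assert re.search('^G{2,4}$', 'GG')
assert re.search('^G{2,4}$', 'GGGG')
assert not re.search('^G{2,4}$', 'GGGGG')

# Greedy matching: the first .* takes as much as it can.
mo = re.search('X(.*)X(.*)', 'XaX and XbX')
assert mo.group(1) == 'aX and Xb'
assert mo.group(2) == ''
```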
What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)
What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.

What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.