Version 1107 (Mon Nov 27 20:46:04 2006)
"*"
in the shell's *.txt
«*» and «+»
in operator

    import re

    dragons = [
        ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    # Print the name of every dragon whose DNA contains the pattern.
    for (dna, name) in dragons:
        if re.search('ATGCGT', dna):
            print(name)

Output:

    Common Welsh Green
    Hungarian Horntail

Search for either of two patterns:

    import re

    dragons = [
        ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    for (dna, name) in dragons:
        if re.search('ATGCGT|GCT', dna):
            print(name)

Output:

    Common Welsh Green
    Hebridean Black
    Hungarian Horntail
    Norwegian Ridgeback
«|» means “or”
Matches either "ATGCGT" or "GCT"
How to match "ATA" or "ATC" (both of which code for isoleucine)?
«ATA|C» will not work: it matches either "ATA" or "C"
«ATA|ATC» will work, but it's a bit redundant

    import re

    # Parentheses limit the alternation to the third base.
    tests = [
        ['ATA',   True],
        ['xATCx', True],
        ['ATG',   False],
        ['AT',    False],
        ['ATAC',  True]
    ]

    for (dna, expected) in tests:
        actual = re.search('AT(A|C)', dna) is not None
        assert actual == expected
The asserts will crash the program if any of the tests fail
How to match a literal "|", "(", or ")"?
Use «\|», «\(», or «\)» in the RE
Use «\\» to match a backslash
In a Python string, these must be written "\\|", "\\(", "\\)", or "\\\\"
Figure 17.1: Double Compilation of Regular Expressions
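The double compilation can be checked directly. The following is a minimal sketch, using only `re.search`; the variable name `escaped_bar` and the sample strings are illustrative assumptions, not part of the original notes.

```python
import re

# Step 1: Python compiles the literal '\\|' into two characters, "\" and "|".
escaped_bar = '\\|'
assert len(escaped_bar) == 2

# Step 2: the RE engine reads those two characters as "a literal bar".
assert re.search(escaped_bar, 'a|b')
assert not re.search(escaped_bar, 'ab')

# Parentheses and backslashes follow the same two-step rule.
assert re.search('\\(x\\)', 'f(x)')
assert not re.search('\\(x\\)', 'fx')
assert re.search('\\\\', 'C:\\temp')   # four backslashes in source match one in the text
```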
Alternatively, use raw strings such as r'abc' or r"this\nand\nthat"
r'\n' is a string containing the two characters "\" and "n", not a newline
In the shell, "*" matches zero or more characters
In an RE, «*» is an operator that means, “match zero or more occurrences of a pattern”
Example: find sequences in which "TTA" and "CTA" are separated by any number of "G"

    import re

    tests = [
        ['TTACTA',    True],   # separated by zero G's
        ['TTAGCTA',   True],   # separated by one G
        ['TTAGGGCTA', True],   # separated by three G's
        ['TTAXCTA',   False],  # an X in the way
        ['TTAGCGCTA', False],  # an embedded C in the way
    ]

    for (dna, expected) in tests:
        actual = re.search('TTAG*CTA', dna) is not None
        assert actual == expected

The pattern matches "TTACTA" because «G*» can match zero occurrences of "G"
Figure 17.2: Zero or More
«+» matches one or more (i.e., won't match the empty string)

    import re

    assert re.search('TTAG*CTA', 'TTACTA')
    assert not re.search('TTAG+CTA', 'TTACTA')
Figure 17.3: One or More
The «?» operator means “optional”

    import re

    assert re.search('AC?T', 'AT')
    assert re.search('AC?T', 'ACT')
    assert not re.search('AC?T', 'ACCT')
Figure 17.4: Zero or One
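The three repetition operators can be compared side by side on the same inputs. This is a small sketch; the helper name `found` is a hypothetical convenience, and the test strings reuse the "TTA…CTA" sequences from the earlier examples.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# pattern -> (strings it should find, strings it should not find)
cases = {
    'TTAG*CTA': (['TTACTA', 'TTAGCTA', 'TTAGGGCTA'], ['TTAXCTA']),   # zero or more G
    'TTAG+CTA': (['TTAGCTA', 'TTAGGGCTA'], ['TTACTA']),              # one or more G
    'TTAG?CTA': (['TTACTA', 'TTAGCTA'], ['TTAGGGCTA']),              # zero or one G
}

for pattern, (should_match, should_fail) in cases.items():
    for dna in should_match:
        assert found(pattern, dna), (pattern, dna)
    for dna in should_fail:
        assert not found(pattern, dna), (pattern, dna)
```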
Use «[]» to match sets of characters
«[abcd]» matches exactly one "a", "b", "c", or "d"
Can be abbreviated as «[a-d]»
Sets can be followed by «*», «+», or «?»
«[aeiou]+» matches any non-empty sequence of vowels

    import re

    lines = [
        "Charles Darwin (1809-82)",
        "Darwin's principal works, The Origin of Species (1859)",
        "and The Descent of Man (1871) marked a new epoch in our",
        "understanding of our world and ourselves. His ideas",
        "were shaped by the Beagle's voyage around the world in",
        "1831-36."
    ]

    # Print every line that contains at least one digit.
    for line in lines:
        if re.search('[0-9]+', line):
            print(line)

Output:

    Charles Darwin (1809-82)
    Darwin's principal works, The Origin of Species (1859)
    and The Descent of Man (1871) marked a new epoch in our
    1831-36.
Sequence | Equivalent | Explanation |
---|---|---|
«\d» | «[0-9]» | Digits |
«\s» | «[ \t\n\r\f\v]» | Whitespace |
«\w» | «[a-zA-Z0-9_]» | Word characters (i.e., those allowed in variable names) |
Table 17.1: Regular Expression Escapes in Python |
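The escapes in the table can be tried on single characters. The sketch below assumes plain ASCII input (with Unicode text, Python 3 lets «\d» and «\w» match more than the bracketed sets shown); the helper name `classify` is hypothetical.

```python
import re

def classify(ch):
    """Return the escapes from the table that match the single character ch."""
    return [esc for esc in (r'\d', r'\s', r'\w') if re.search(esc, ch)]

assert classify('7') == [r'\d', r'\w']    # digits are also word characters
assert classify('\t') == [r'\s']
assert classify('_') == [r'\w']
assert classify('-') == []                # punctuation matches none of them

# On these ASCII samples, each escape agrees with its bracketed equivalent.
for ch in ['7', 'x', 'Q', '_', ' ', '\n', '-']:
    assert bool(re.search(r'\d', ch)) == bool(re.search(r'[0-9]', ch))
    assert bool(re.search(r'\w', ch)) == bool(re.search(r'[a-zA-Z0-9_]', ch))
```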
«[^abc]» means “anything except the characters in this set”
«.» means “any character except the end of line”, i.e., «[^\n]»
«\b» matches the break between word and non-word characters
Figure 17.5: Word/Non-Word Breaks
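Negated sets, the dot, and word breaks can each be exercised with a few assertions. This is a minimal sketch; the sample strings and the name `whole_word` are made up for illustration.

```python
import re

# A negated set matches any single character that is NOT listed.
assert re.search(r'[^abc]', 'abcd')          # finds the "d"
assert not re.search(r'[^abc]', 'abcabc')    # nothing outside the set

# The dot matches any character except a newline.
assert re.search(r'a.c', 'a-c')
assert not re.search(r'a.c', 'a\nc')

# \b matches only where a word character meets a non-word character
# (or the start or end of the string).
phrases = ['the cat sat', 'concatenate', 'cat', 'tomcat']
whole_word = [p for p in phrases if re.search(r'\bcat\b', p)]
assert whole_word == ['the cat sat', 'cat']
```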
Use string.split to break on spaces and newlines before applying RE

    import re

    words = '''Born in New York City in 1918, Richard Feynman earned a
    bachelor's degree at MIT in 1939, and a doctorate from Princeton
    in 1942. After working on the Manhattan Project in Los Alamos
    during World War II, he became a professor at CalTech in 1951.
    Feynman won the 1965 Nobel Prize in Physics for his work on
    quantum electrodynamics, and served on the commission
    investigating the Challenger disaster in 1986.'''.split()

    # Collect each distinct word whose last character is a vowel.
    end_in_vowel = set()
    for w in words:
        if re.search(r'[aeiou]\b', w):
            end_in_vowel.add(w)

    for w in end_in_vowel:
        print(w)

Output (order may vary, since sets are unordered):

    a
    Prize
    degree
    became
    doctorate
    the
    he
re.search(r'\s*', line) will match "start end", because the pattern may match anywhere in the string
«^» matches the beginning of the string
«$» matches the end
Figure 17.6: Anchoring Matches
Pattern | Text | Result |
---|---|---|
«b+» | "abbc" | Matches |
«^b+» | "abbc" | Fails (string doesn't start with "b") |
«c$» | "abbc" | Matches (string ends with "c") |
«^a*$» | "aabaa" | Fails (something other than "a" between start and end of string) |
Table 17.2: Regular Expression Anchors in Python |
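The rows of the anchor table can be replayed as a quick check. This sketch uses only `re.search`; the names `table` and `outcomes` are illustrative.

```python
import re

# (pattern, text, whether the table says it matches)
table = [
    ('b+',   'abbc',  True),
    ('^b+',  'abbc',  False),   # string does not start with "b"
    ('c$',   'abbc',  True),    # string ends with "c"
    ('^a*$', 'aabaa', False),   # a "b" sits between the two anchors
]

outcomes = [re.search(pattern, text) is not None for (pattern, text, _) in table]
assert outcomes == [expected for (_, _, expected) in table]

# With both anchors, the whole string has to fit the pattern.
assert re.search('^a*$', 'aaaa')
assert re.search('^a*$', '')    # zero occurrences still counts
```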
"#"
, and extends to the end of the line
Simple approach: split on "#"

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    for line in lines:
        if re.search('#', line):
            # Everything after the first "#" (leading spaces included).
            comment = line.split('#')[1]
            print(comment)

Output:

     01:30 - 03:00
    03:00-04:30
     04:30-06:00
Using split followed by strip seems clumsy
The result of re.search is actually a match object that records what matched, and where
mo.group() returns the whole string that matched the RE
mo.start() and mo.end() are the indices of the match's location

    import re

    text = 'abbcb'
    for pattern in ['b+', 'bc*', 'b+c+']:
        match = re.search(pattern, text)
        print('%s / %s => "%s" (%d, %d)' %
              (pattern, text, match.group(), match.start(), match.end()))

Output:

    b+ / abbcb => "bb" (1, 3)
    bc* / abbcb => "b" (1, 2)
    b+c+ / abbcb => "bbc" (1, 4)
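When a pattern does not occur in the text, `re.search` returns `None` rather than a match object, so calling `group()` on the result would fail. A minimal sketch of guarding against that follows; the helper name `describe` is hypothetical.

```python
import re

def describe(pattern, text):
    """Return (matched text, start, end), or None if pattern is not found."""
    match = re.search(pattern, text)
    if match is None:            # no match: there is no match object to query
        return None
    return (match.group(), match.start(), match.end())

assert describe('b+', 'abbcb') == ('bb', 1, 3)
assert describe('b+c+', 'abbcb') == ('bbc', 1, 4)
assert describe('d+', 'abbcb') is None

# start() and end() are slice indices into the original text.
(matched, start, end) = describe('bc*', 'abbcb')
assert 'abbcb'[start:end] == matched
```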
mo.group(3) is the text that matched the third parenthesized subexpression, mo.start(3) is where it started

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    for line in lines:
        # Group 1 is the comment text, with leading whitespace skipped.
        match = re.search(r'#\s*(.+)', line)
        if match:
            comment = match.group(1)
            print(comment)

Output:

    01:30 - 03:00
    03:00-04:30
    04:30-06:00
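Numbered groups can be inspected one at a time. The sketch below applies a three-group pattern to the "Date:" line from the example data; the pattern itself is an illustrative assumption.

```python
import re

# Three parenthesized groups: year, month, and day.
match = re.search(r'(\d+)-(\d+)-(\d+)', 'Date: 2006-03-07')

assert match.group() == '2006-03-07'     # no argument (or 0): the whole match
assert match.group(1) == '2006'
assert match.group(3) == '07'
assert match.start(3) == 14              # index where the third group begins
assert match.groups() == ('2006', '03', '07')
```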
Example: use groups to swap two columns of numbers

    import re

    def reverse_columns(line):
        # Exactly two whitespace-separated numbers, with optional padding.
        match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
        if not match:
            return line
        return match.group(2) + ' ' + match.group(1)

    tests = [
        ['10 20', 'easy case'],
        [' 30 40 ', 'padding'],
        ['60 70 80', 'too many columns'],
        ['90 end', 'non-numeric']
    ]

    for (fixture, title) in tests:
        actual = reverse_columns(fixture)
        print('%s: "%s" => "%s"' % (title, fixture, actual))

Output:

    easy case: "10 20" => "20 10"
    padding: " 30 40 " => "40 30"
    too many columns: "60 70 80" => "60 70 80"
    non-numeric: "90 end" => "90 end"
Figure 17.7: Regular Expressions as Finite State Machines
Use re.compile(pattern) to get the compiled RE
It has the same methods as the re module
matcher.search(text) searches text for matches to the RE that was compiled to create matcher

    import re

    # Put pattern outside 'find_all' so that it's only compiled once.
    pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')

    def find_all(line):
        result = []
        match = pattern.search(line)
        while match:
            result.append(match.group(1))
            # Search again in whatever followed the last match.
            match = pattern.search(match.group(2))
        return result

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]

    for line in lines:
        print(line)
        for word in find_all(line):
            print('\t' + word)

Output:

    This has several Title Case words
        This
        Title
        Case
    on Each Line (Some in parentheses).
        Each
        Line
        Some
Or use the findall method

    import re

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]

    pattern = re.compile(r'\b([A-Z][a-z]*)\b')

    for line in lines:
        print(line)
        for word in pattern.findall(line):
            print('\t' + word)

Output:

    This has several Title Case words
        This
        Title
        Case
    on Each Line (Some in parentheses).
        Each
        Line
        Some
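A compiled RE and the module-level functions should give the same answers; compiling just avoids repeating the work. A small sketch under that assumption follows, with the illustrative names `source` and `matcher`.

```python
import re

text = 'on Each Line (Some in parentheses).'
source = r'\b([A-Z][a-z]*)\b'

matcher = re.compile(source)          # compile once, reuse many times

# Methods on the compiled object mirror the functions in the re module.
assert matcher.search(text).group(1) == re.search(source, text).group(1) == 'Each'

# With a single group in the pattern, findall returns that group's text.
assert matcher.findall(text) == re.findall(source, text) == ['Each', 'Line', 'Some']
```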
Pattern | Matches | Doesn't Match | Explanation |
---|---|---|---|
«a*» | "", "a", "aa", … | "A", "b" | «*» means “zero or more”; matching is case sensitive |
«b+» | "b", "bb", … | "" | «+» means “one or more” |
«ab?c» | "ac", "abc" | "a", "abbc" | «?» means “optional” (zero or one) |
«[abc]» | "a", "b", or "c" | "ab", "d" | «[…]» means “one character from a set” |
«[a-c]» | "a", "b", or "c" | | Character ranges can be abbreviated |
«[abc]*» | "", "ac", "baabcab", … | | Operators can be combined: zero or more choices from "a", "b", or "c" |
Table 17.3: Regular Expression Operators |
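The table treats each pattern as describing a whole string, so a check needs both anchors. The sketch below wraps each pattern in «^(…)$»; the helper name `matches_whole` is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """True if pattern matches all of text (anchored at both ends)."""
    return re.search('^(' + pattern + ')$', text) is not None

# (pattern, strings that match, strings that do not), taken from the table
rows = [
    ('a*',     ['', 'a', 'aa'],       ['A', 'b']),
    ('b+',     ['b', 'bb'],           ['']),
    ('ab?c',   ['ac', 'abc'],         ['a', 'abbc']),
    ('[abc]',  ['a', 'b', 'c'],       ['ab', 'd']),
    ('[a-c]',  ['a', 'b', 'c'],       []),
    ('[abc]*', ['', 'ac', 'baabcab'], []),
]

for (pattern, good, bad) in rows:
    for text in good:
        assert matches_whole(pattern, text), (pattern, text)
    for text in bad:
        assert not matches_whole(pattern, text), (pattern, text)
```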
Method | Purpose | Example | Result |
---|---|---|---|
split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |
Table 17.4: Regular Expression Object Methods |
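The three calls in the table can be run as written. This sketch uses raw strings in place of the doubled backslashes shown in the table; the two spellings produce the same pattern. The variable names are illustrative.

```python
import re

# A raw string and a doubled backslash spell the same two characters.
assert r'\d+' == '\\d+'

parts = re.split(r'\s*,\s*', 'a, b ,c , d')
titles = re.findall(r'\b[A-Z][a-z]*', 'Some words in Title Case.')
masked = re.sub(r'\d+', 'NUM', 'If 123 is 456')

assert parts == ['a', 'b', 'c', 'd']
assert titles == ['Some', 'Title', 'Case']
assert masked == 'If NUM is NUM'
```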
Use «pat{N}» to match exactly N occurrences of a pattern
«pat{M,N}» matches between M and N occurrences
Exercise 17.1:
By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE «X(.*)X(.*)» to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.
It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".
Exercise 17.2:
What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)
Exercise 17.3:
What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.
Exercise 17.4:
What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.
Copyright © 2005-06 Python Software Foundation.