Please send questions to st10@humboldt.edu.
CIS 480 - Python - Week 9 Lecture
10-18-05
Miscellaneous "white board" projections
re = the name of Python's regular expression module
RE = an acronym for Regular Expression
(and in Python, you write an RE in the form of a string)
* to USE the re module, you of course first import it:
import re
* when you call re.compile with a string,
you compile that regular expression string to get a regular expression object
* you can THEN call that regular expression object's METHODS as you wish...
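A minimal sketch of the import-then-compile step. The pattern 'abc' and the name my_re_obj are illustrative choices only:

```python
import re

# 'abc' is a hypothetical pattern, chosen only for illustration.
my_re_obj = re.compile('abc')

# The compiled object remembers the pattern string it was built from.
print(my_re_obj.pattern)   # abc
```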
* here are a few of the methods available for a regular expression object:
* match()
my_re_obj.match("hello")
determine if the RE matches at the beginning of the argument string
* search()
determine if the RE matches ANYWHERE in the argument string
(BOTH match() and search() return a "match object" if a match is found,
and return None (the NoneType value, remember) if NO match is found.)
* findall()
finds all the substrings where the RE matches, and returns them as a list
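To see the difference between the three methods side by side, here is a small sketch. The pattern 'abc' and the test strings are assumptions made up for this example:

```python
import re

my_re_obj = re.compile('abc')   # hypothetical pattern

# match() only looks at the BEGINNING of the string.
at_start = my_re_obj.match('abcdabc')     # a match object
not_at_start = my_re_obj.match('xabc')    # None

# search() looks ANYWHERE in the string.
anywhere = my_re_obj.search('xabc')       # a match object

# findall() returns every matching substring as a list.
all_matches = my_re_obj.findall('abcdabc')

print(at_start is not None)   # True
print(not_at_start)           # None
print(anywhere is not None)   # True
print(all_matches)            # ['abc', 'abc']
```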
* And since match() and search() return a match object, which has methods of its own,
here are a few of them:
* group() - returns the string matched by the RE
>>> my_re_obj = re.compile('abc')    # assumed pattern; any RE matching 'abc' here would do
>>> my_match_obj = my_re_obj.match('abcdabc')
>>> my_match_obj.group()
'abc'
* start() - returns the starting position of the match
* end() - returns the ending position of the match
* span() - returns a tuple containing the start and end positions of the match
...these END positions are SLICE-friendly: end position is one PAST the last character
matched!
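A short sketch of the match object methods, including why the end position is "slice-friendly". The pattern and the string 'xxabcxx' are made up for illustration:

```python
import re

text = 'xxabcxx'                       # hypothetical input string
m = re.compile('abc').search(text)     # hypothetical pattern

print(m.group())   # abc
print(m.start())   # 2
print(m.end())     # 5   (one PAST the last matched character)
print(m.span())    # (2, 5)

# Because end() is one past the last character, start() and end()
# can be used directly as slice bounds to recover the matched text.
print(text[m.start():m.end()])   # abc
```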
* a COMPLETE list of the re module RE metacharacters (acc. to Kuchling, at least):
. ^ $ * + ? { } [ ] \ | ( )
* [ ] specify a character class ---
you are saying you want to match ONE character in this class
[aeiou]
* you can indicate a range of characters with a dash
[a-f] is the same as [abcdef]
[0-9] is the same as [0123456789]
[a-z] is all the lowercase letters
[A-Z] is all the uppercase letters
* what if you want to match a backslash, or a square bracket?
put a backslash in front of it:
\\ matches a backslash
\[ matches [
\] matches ]
* ^ as the FIRST THING in a character class means match any character NOT in this set
[^aeiou] --- means any character that isn't a lowercase vowel
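A quick sketch of character classes in action. The sample strings are invented for this example, and the r'...' (raw string) notation is used here only so that the backslashes reach the re module unchanged:

```python
import re

vowel = re.compile('[aeiou]')
non_vowel = re.compile('[^aeiou]')
a_to_f = re.compile('[a-f]')

# A class that matches ONE character that is [, ], or a backslash.
bracket_or_backslash = re.compile(r'[\[\]\\]')

print(vowel.findall('regular'))       # ['e', 'u', 'a']
print(non_vowel.findall('regular'))   # ['r', 'g', 'l', 'r']
print(a_to_f.findall('big cafe'))     # ['b', 'c', 'a', 'f', 'e']

# The sample string holds the six characters: a [ 0 ] \ n
sample = r'a[0]\n'
print(bracket_or_backslash.findall(sample))   # ['[', ']', '\\']
```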
* some predefined special sequences, for your convenience
(in both RE's "by themselves" and within character classes)
\d matches any decimal digit, [0-9]
\D matches any NON-digit, [^0-9]
\s matches any whitespace character, [ \t\n\r\f\v]
\S matches any non-whitespace character, [^ \t\n\r\f\v]
\w matches any alphanumeric character (or underscore), [a-zA-Z0-9_]
\W matches any non-alphanumeric character, [^a-zA-Z0-9_]
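A small sketch of the special sequences, both by themselves and inside a character class. The sample strings are assumptions for illustration (plain ASCII, so the bracketed equivalents listed above apply exactly):

```python
import re

text = 'Room 42, Floor 3'   # hypothetical sample text

digits = re.compile(r'\d').findall(text)
spaces = re.compile(r'\s').findall(text)

# Special sequences also work INSIDE a character class:
# this matches one character that is a digit OR whitespace.
digit_or_space = re.compile(r'[\d\s]').findall('a1 b')

# \W: anything that is not a letter, digit, or underscore.
non_word = re.compile(r'\W').findall('a_b-c!')

print(digits)           # ['4', '2', '3']
print(spaces)           # [' ', ' ', ' ']
print(digit_or_space)   # ['1', ' ']
print(non_word)         # ['-', '!']
```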
* metacharacter . matches ANY single character
'd.g' - d, then any single character, then g
* . doesn't match newline, however, UNLESS you specify a particular
mode (DOTALL) --- see Kuchling tutorial for more on this.
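A brief sketch of the . metacharacter with and without DOTALL mode. The test string is invented for this example; re.DOTALL is passed as the second argument to re.compile:

```python
import re

text = 'dog dig d\ng'   # hypothetical sample; note the newline between d and g

plain = re.compile('d.g')
dotall = re.compile('d.g', re.DOTALL)

# By default . does not match the newline, so 'd\ng' is skipped.
print(plain.findall(text))    # ['dog', 'dig']

# In DOTALL mode . matches the newline too.
print(dotall.findall(text))   # ['dog', 'dig', 'd\ng']
```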
* metacharacter *
repeat WHAT PRECEDES IT 0 or more times!!
'ca*t' --- matches 'ct', 'cat', 'caaat', 'caaaaaaat'
c, followed by 0 or more a's, followed by t
+ : 1 or more of what precedes it
? : 0 or 1 of what precedes it ("optional")
'home[ -]?brew' --- matches 'homebrew', 'home-brew', 'home brew'
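The three repetition metacharacters can be compared with a short sketch. The 'ca+t' pattern and the test words are assumptions added for illustration; 'ca*t' and 'home[ -]?brew' come from the notes above:

```python
import re

star = re.compile('ca*t')    # 0 or more a's
plus = re.compile('ca+t')    # 1 or more a's (hypothetical variant)
optional = re.compile('home[ -]?brew')

words = ['ct', 'cat', 'caaat']

# Which words does each pattern match from the beginning?
star_hits = [w for w in words if star.match(w)]
plus_hits = [w for w in words if plus.match(w)]

print(star_hits)   # ['ct', 'cat', 'caaat']
print(plus_hits)   # ['cat', 'caaat']

brews = optional.findall('homebrew home-brew home brew')
print(brews)       # ['homebrew', 'home-brew', 'home brew']
```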
* note that *-matching is greedy --- matches the largest substring it can
(and then back-tracks if necessary, to satisfy the overall RE...)
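Greedy matching and back-tracking can be seen in a tiny sketch. The pattern 'a.*b' and both test strings are made up for this example:

```python
import re

greedy = re.compile('a.*b')   # hypothetical pattern

# .* grabs as much as it can: the match runs to the LAST b, not the first.
longest = greedy.match('aXbXb').group()
print(longest)      # aXbXb

# Here .* first swallows 'XbXc', then backs off one character at a time
# until the final b in the pattern can match.
backed_off = greedy.match('aXbXc').group()
print(backed_off)   # aXb
```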