480lect09_stuff

Please send questions to st10@humboldt.edu .

CIS 480 - Python - Week 9 Lecture
10-18-05
Miscellaneous "white board" projections

re = the name of Python's regular expression module
RE = an acronym for Regular Expression

(and in python, you write an RE in the form of a string)

* to USE the re module, you of course first import it:

    import re

* when you call re.compile with a string,
  you compile that regular expression string to get a regular expression object

* you can THEN call that regular expression object's METHODS as you wish...

* here are a few of the methods available for a regular expression object:

    * match()

      my_re_obj.match("hello")

      determine if the RE matches at the beginning of the argument string

    * search()

      determine if the RE matches ANYWHERE in the argument string

    (BOTH of these return "match objects" if a match is found,
     and return None (the NoneType literal, remember) if NO match is found.)

    * findall()

      finds all the substrings where the RE matches, and returns them as a list
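
A minimal sketch of the three methods above, side by side. The pattern 'abc' and the
test strings are made up for illustration; they are not from the lecture.

```python
import re

# assumed pattern for illustration: the literal characters a, b, c
my_re_obj = re.compile('abc')

# match() only succeeds if the RE matches at the START of the string
at_start = my_re_obj.match('abcdabc')      # a match object
not_at_start = my_re_obj.match('xabc')     # None

# search() looks for the first place ANYWHERE in the string where the RE matches
found = my_re_obj.search('xabc')           # a match object

# findall() returns every non-overlapping matching substring as a list
all_hits = my_re_obj.findall('abcdabc')    # ['abc', 'abc']
```

Because None is false in a boolean context, a common idiom is
"if my_re_obj.search(s):" to test whether a match was found.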

* And since an RE object's match() and search() return a match object, which has
  methods of its own, here are a few of them:

    * group() - returns the string matched by the RE

      >>> my_re_obj = re.compile('abc')    # assumed pattern, not shown on the board
      >>> my_match_obj = my_re_obj.match('abcdabc')
      >>> my_match_obj.group()
      'abc'

    * start() - returns the starting position of the match
    * end()   - returns the ending position of the match
    * span()  - returns a tuple containing the start and end positions of the match

    ...these END positions are SLICE-friendly: end position is one PAST the last character
       matched!
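
To see why the end position is called slice-friendly, here is a small sketch.
The pattern 'abc' and the string being searched are assumptions chosen for the example.

```python
import re

my_re_obj = re.compile('abc')      # assumed pattern for illustration
text = 'xxabcdabc'

# search() finds the FIRST match, which begins at index 2
my_match_obj = my_re_obj.search(text)

matched = my_match_obj.group()     # 'abc'
first = my_match_obj.start()       # 2
last = my_match_obj.end()          # 5, one PAST the final matched character
both = my_match_obj.span()         # (2, 5)

# since end() is one past the last character, it drops straight into a slice
same_text = text[first:last]       # 'abc'
```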

* a COMPLETE list of the re module RE metacharacters (acc. to Kuchling, at least):
 
      .  ^   $  *  +  ?  { }   [ ]   \  |  ( )

* [ ] specify a character class ---

  you are saying you want to match ONE character in this class 

  [aeiou]

    * you can indicate a range of characters with a dash

      [a-f]  is the same as [abcdef]

      [0-9]  is the same as [0123456789]

      [a-z]  is all the lowercase letters
 
      [A-Z]  is all the uppercase letters
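
A quick sketch of character classes in action, using re.findall so each single-character
match shows up as its own list element. The test strings are made up for the example.

```python
import re

# each match is exactly ONE character drawn from the class
vowels = re.findall('[aeiou]', 'regular')      # ['e', 'u', 'a']

# a range written with a dash behaves the same as listing every character
ranged = re.findall('[a-f]', 'feedback')
listed = re.findall('[abcdef]', 'feedback')    # same result as ranged

digits = re.findall('[0-9]', 'CIS 480')        # ['4', '8', '0']
capitals = re.findall('[A-Z]', 'CIS 480')      # ['C', 'I', 'S']
```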

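* what if you want to match a literal backslash, or a square bracket?
  put a backslash in front of it:

  \\   matches a backslash
  \[   matches a left square bracket
  \]   matches a right square bracket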
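
A small sketch of these escapes. One detail not on the board: the example writes the
REs as Python raw strings (r'...'), so that Python's own string-literal backslash
handling does not get in the way. The test strings are assumptions for illustration.

```python
import re

# raw strings (r'...') keep Python itself from interpreting the backslashes,
# so the re module sees exactly what is typed between the quotes
backslash_re = re.compile(r'\\')

path = 'C:\\temp'                  # the 7-character string C:\temp
slash_pos = backslash_re.search(path).start()        # 2

# an escaped bracket is matched literally instead of starting a character class
open_brackets = re.findall(r'\[', 'a[0] + b[1]')     # ['[', '[']

# escaped brackets can also appear INSIDE a character class
any_bracket = re.findall(r'[\[\]]', 'a[0]')          # ['[', ']']
```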

* ^ as the FIRST THING in a character class means match any character NOT in this
  set

  [^aeiou]  -- means any character that isn't a lowercase vowel
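
A short sketch of a negated class, along with the contrasting case where ^ is NOT the
first thing in the class and so is just an ordinary character. Test strings are made up.

```python
import re

# ^ first: match any single character that is NOT a lowercase vowel
consonants = re.findall('[^aeiou]', 'python')    # ['p', 'y', 't', 'h', 'n']

# ^ anywhere else in the class is a literal caret
caret_or_a = re.findall('[a^]', 'a^b')           # ['a', '^']
```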

* some predefined special sequences, for your convenience

  (in both RE's "by themselves" and within character classes)

    \d      any decimal digit, [0-9]

    \D      any NON digit, [^0-9]

    \s      matches any whitespace character, [ \t\n\r\f\v]

    \S      matches any non-whitespace character  [^ \t\n\r\f\v]

    \w      matches any alphanumeric character or underscore  [a-zA-Z0-9_]

    \W      matches any character that is NOT alphanumeric or underscore  [^a-zA-Z0-9_]
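
A sketch that tries each special sequence with re.findall, both by itself and inside a
character class. The sample strings are assumptions for illustration, and the REs are
written as raw strings so the backslashes reach the re module untouched.

```python
import re

sample = 'CIS 480\tweek_9!'

digits = re.findall(r'\d', sample)          # ['4', '8', '0', '9']
spaces = re.findall(r'\s', sample)          # [' ', '\t']
non_word = re.findall(r'\W', sample)        # [' ', '\t', '!']

word_chars = re.findall(r'\w', 'a_1!')      # ['a', '_', '1']
non_digits = re.findall(r'\D', '4a8')       # ['a']
non_spaces = re.findall(r'\S', ' a b')      # ['a', 'b']

# inside a character class: any digit OR a literal period
in_class = re.findall(r'[\d.]', '3.14 pi')  # ['3', '.', '1', '4']
```

Note that with Python 3 string patterns these sequences also cover non-ASCII characters
(other alphabets' letters and digits, for example); passing the re.ASCII flag restricts
them to the bracketed sets listed above.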

* metacharacter . matches ANY single character

    'd.g'   - d, then any single character, then g

    *   . doesn't match newline, however, UNLESS you specify a particular
        mode (DOTALL) --- see Kuchling tutorial for more on this.
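
A minimal sketch of the . metacharacter with and without DOTALL mode. The test strings
are made up for the example.

```python
import re

dot_re = re.compile('d.g')

# . stands in for exactly one character
hits = dot_re.findall('dog dig dg')             # ['dog', 'dig'] ('dg' has nothing between d and g)

# by default, . refuses to match a newline
plain = dot_re.search('d\ng')                   # None

# compiling with the re.DOTALL flag lets . match a newline too
dotall_re = re.compile('d.g', re.DOTALL)
with_dotall = dotall_re.search('d\ng')          # a match object
```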
   
* metacharacter *

  repeat WHAT PRECEDES IT 0 or more times!!

  'ca*t'    matches   ct  cat    caaat caaaaaaat

  c, followed by 0 or more a's, followed by t

  + : 1 or more of what precedes it
  ? : 0 or 1 of what precedes it     ("optional")

     'home[ -]?brew' --- matches 'homebrew', 'home-brew', 'home brew'
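
A sketch comparing the three repetition metacharacters. The candidate strings beyond
those listed above ('cot', 'home--brew') are additions chosen to show a non-match.

```python
import re

star = re.compile('ca*t')                 # c, 0 or more a's, t
plus = re.compile('ca+t')                 # c, 1 or more a's, t
optional = re.compile('home[ -]?brew')    # space or dash is optional

candidates = ['ct', 'cat', 'caaat', 'cot']

# keep only the strings each RE matches (None is false, a match object is true)
star_hits = [s for s in candidates if star.match(s)]    # ['ct', 'cat', 'caaat']
plus_hits = [s for s in candidates if plus.match(s)]    # ['cat', 'caaat']

brews = ['homebrew', 'home-brew', 'home brew', 'home--brew']
brew_hits = [s for s in brews if optional.match(s)]     # all but 'home--brew'
```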
 
* note that *-matching is greedy --- matches the largest substring it can
  (and then back-tracks if necessary, to satisfy the overall RE...)
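
Two small sketches of greediness. The first shows * grabbing as much as it can; the
second shows the back-tracking step. Both patterns and strings are examples chosen for
illustration, not from the lecture.

```python
import re

# greedy: .* runs to the LAST '>' it can reach, not the first one
tag_match = re.match('<.*>', '<b>bold</b>')
greedy_text = tag_match.group()            # '<b>bold</b>', the whole string

# back-tracking: [bcd]* first swallows 'bcbd', but then the final b in the RE
# has nothing left to match, so the engine gives characters back one at a time
# until [bcd]* holds just 'bc' and the trailing b can match
backtrack_match = re.match('a[bcd]*b', 'abcbd')
backtrack_text = backtrack_match.group()   # 'abcb'
```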