#######################################################
#
# display-notes related to Intro to Perl, Class 6
#
#######################################################

# last modified: 4-28-03, in-class

# today's topic - regular expressions, part 1 (ref: "Learning Perl", ch. 7-9)

#-------------------------------------------------------------------
# regular expressions

*   this isn't *exactly* how a theoretician would define it --- but,
    in Perl, ("Learning Perl", p. 98):
    
    *   "A regular expression, often called a pattern in Perl,
        is a template that either matches or doesn't match a
	given string."

*   you write a regular expression --- then, any given string either
    matches that regular expression, or it doesn't.

*   (Perl's regular expressions are *not* the same as UNIX shell-style
    *filename*-matching patterns, also called globs. More on those 
    later...)

#----------------------------------------------------------------------
# using simple patterns

*   assume that variable $_ (remember? Perl's favorite default?) holds
    a string, and you want to see if another string is within the
    contents of $_;

    *   if you type that string inside of // (forward slashes),
        then that is a regular expression that is true if the
        letters inside // appear, together and in that order,
        somewhere in $_ --- it is false otherwise.

    *   (since it is an expression that is true or false ---
        where do you suppose it is often used?

        *   you got it --- it is often used as the
            conditional expression for an if or while statement!)

# see class6_01_patt1

# using simple patterns --- see which lines passed to
# it contain the string 'perl' within them, and display them
# to the screen.

while (<>)
{
    chomp;

    # see if $_ contains the letters 'perl' within it
    if (/perl/)
    {
        printf "PERL LINE:<%s>\n", $_;
    }
}

*   oh, yes --- and, ("Learning Perl", p. 100) "All of the usual backslash
escapes that you can put into double-quoted strings are [also] available
in patterns";

        *   \n for newline, \t for tab, all those lovely ones from
	table 2-1 on p. 24, "Learning Perl";

#   see class6_02_patt2

# using simple patterns with backslash escapes, to demo that they
# work --- see which lines passed to it contain a tab character,
# and display them to the screen.

while (<>)
{
    chomp;

    # see if $_ contains at least one tab character within it
    if (/\t/)
    {
        printf "LINE W/TAB:<%s>\n", $_;
    }
}

#------------------------------------------------------------------------
# metacharacters

*   ...because just matching particular characters wouldn't be *that*
useful...

*   metacharacters are characters that have special meanings in
regular expressions; (characters "describing" characters!)

#------------------------------------------------------------------------
# . (dot wildcard)

*   dot (.) is a wildcard --- it matches any SINGLE character EXCEPT for
a newline ("\n")

*   (you can match JUST a dot by putting a backslash before it --- \. )

#   see class6_03_wildcard1

# using simple patterns with wildcard .
#
# see which lines contain mist followed by any other character followed
# by r --- also see which contain 9.9, and which contain a backslash

while (<>)
{
    chomp;

    # see if $_ contains mist, any character, r
    if (/mist.r/)
    {
        printf "line with mist.r:<%s>\n", $_;
    }

    # see if $_ contains 9.9
    if (/9\.9/)
    {
        printf "line with 9.9:<%s>\n", $_;
    }

    # see if $_ contains a backslash
    if (/\\/)
    {
        printf "line with a backslash:<%s>\n", $_;
    }
}


#-------------------------------------------------------------
# * (star quantifier)

*   beware! it has a DIFFERENT meaning in a Perl regular expression
than in globbing!!!

*   it does NOT mean match 0-or-more-characters ---

*   INSTEAD: it means "match the PRECEDING item 0 or more times"!
                                 
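*   /oil change\t12.99/ matches "oil change" followed by a tab
followed by 12.99 ---
    *   /oil change\t*12.99/ matches "oil change" followed by
    ANY number of tabs --- including 0! --- followed by 12.99

    *   (strictly, the unescaped . in 12.99 is the wildcard, so these
    patterns also match 12x99; write 12\.99 to insist on a real dot)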

#   see class6_04_quantifier1

# using simple patterns with quantifier *
#
# see which lines contain "oil change" followed by ANY number of
# tabs followed by 12.99

while (<>)
{
    chomp;

    # see if $_ contains "oil change", any number of tabs, 12.99
    # (the dot is escaped so that it matches only a literal .)
    if (/oil change\t*12\.99/)
    {
        printf "line with oil change tabs 12.99:<%s>\n", $_;
    }

}

*   (see class6_05_pattgrabber for a FLAKY test-script to let you enter
a desired pattern --- special characters behave erratically, however!!)

*   star isn't the ONLY so-called QUANTIFIER, either ---
   
#-----------------------------------------------------------------------
# + (plus quantifier)

*   + after an item? That means, match the preceding item ONE or
more times;
                                       
*   /oil change\t+12.99/ means at LEAST one tab must appear between
"oil change" and 12.99.
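
*   a small sketch to try, comparing * and + side by side (the sample
lines are made up for this illustration, not taken from a class script;
the dot is escaped so that only a literal . matches):

```perl
use strict;
use warnings;

# made-up sample lines, only for illustration
my @samples = (
    "oil change12.99",        # no tab at all
    "oil change\t12.99",      # one tab
    "oil change\t\t12.99",    # two tabs
);

my $star_count = 0;
my $plus_count = 0;

foreach (@samples)
{
    # * allows zero tabs, so all three lines match
    $star_count++ if (/oil change\t*12\.99/);

    # + demands at least one tab, so the first line fails
    $plus_count++ if (/oil change\t+12\.99/);
}

printf "matched with *: %d, matched with +: %d\n", $star_count, $plus_count;
```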

#--------------------------------------------------------------------
# "any old junk" pattern: .*

*   so, if . matches ANY single character, 

*   ...and * means, 0 or more of the previous thing,

*   ...guess what .* means?
    *   happily, it means match 0 or more of any characters!

    *   (they don't have to be the same)
                                    
*   /oil change.*12.99/ will match ANY string that has "oil change"
earlier and 12.99 later, no matter WHAT is in between (and even if nothing
is in-between);

    *   (and /oil change.+12.99/ says at LEAST ONE character is between
    the oil change and the 12.99)
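
*   a sketch of .* versus .+ (the sample strings are invented for this
illustration; the dot in 12.99 is escaped so it stays a literal dot):

```perl
use strict;
use warnings;

# made-up sample lines, only for illustration
my @samples = (
    "oil change12.99",                  # nothing in between
    "oil change, filter, total 12.99",  # junk in between
    "12.99 for the oil change",         # wrong order
);

my $dotstar_count = 0;
my $dotplus_count = 0;

foreach (@samples)
{
    # .* accepts anything (or nothing) between the two pieces
    $dotstar_count++ if (/oil change.*12\.99/);

    # .+ needs at least one character between the two pieces
    $dotplus_count++ if (/oil change.+12\.99/);
}

printf "matched with .*: %d, matched with .+: %d\n",
       $dotstar_count, $dotplus_count;
```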

#-----------------------------------------------------------------------
# ? (question-mark quantifier)

*   and THIS means that the previous item is OPTIONAL;

*   (that is, once or not at all;)

*   nice if you are not SURE that two things are separated by something;
I'm not sure if text I am looking for says elseif or else-if:
                  
    /else-?if/	     # will match EITHER else-if OR elseif
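
*   a sketch to check that claim (the sample strings are made up for
this illustration):

```perl
use strict;
use warnings;

# made-up sample strings, only for illustration
my @samples = ("elseif", "else-if", "else--if", "else if");

my @hits;

foreach (@samples)
{
    # -? means: one hyphen, or no hyphen at all
    if (/else-?if/)
    {
        push @hits, $_;
    }
}

print "matched: @hits\n";
```

*   two hyphens, or a space, are NOT "once or not at all", so the last
two samples are skipped.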

#----------------------------------------------------------------------
# matching more than 1 character

*   ...because what if you don't want to match ha, haa, haaa, etc., but
ha, haha, hahaha, etc.?

*   like math: can use parentheses for grouping!
      
/ha*/     # matches any string including h, ha, haa, haaa, etc.

/(ha)*/   # matches any string including "", ha, haha, hahaha, etc.
          # (will end up matching ANY string, note!!)

/ha+/     # matches any string including ha, haa, haaa, etc.
          # (DOESN'T match a string with just h)
                  
/(ha)+/   # matches any string including ha, haha, hahaha, etc.
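
*   a sketch showing what the parentheses change; it borrows the ^ and $
anchors (see the anchors section of these notes) so that the WHOLE string
has to fit the pattern, and the sample strings are made up for this
illustration:

```perl
use strict;
use warnings;

# made-up sample strings, only for illustration
my @samples = ("ha", "haaa", "hahaha", "h");

my %whole_ha_plus;       # 1 if the whole string fits ha+
my %whole_group_plus;    # 1 if the whole string fits (ha)+

foreach (@samples)
{
    # + applies only to the a: h, then one or more a's
    $whole_ha_plus{$_}    = /^ha+$/   ? 1 : 0;

    # + applies to the group: one or more copies of ha
    $whole_group_plus{$_} = /^(ha)+$/ ? 1 : 0;

    printf "%-7s ha+: %d   (ha)+: %d\n",
           $_, $whole_ha_plus{$_}, $whole_group_plus{$_};
}
```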

#------------------------------------------------------------------------
# binding operator: =~

*   ...for when you want to compare something besides $_

*   $str =~ /patt/
    ...is true if /patt/ matches $str, and false otherwise;
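
*   a minimal sketch (the variable names and strings are made up for
this illustration):

```perl
use strict;
use warnings;

# made-up strings, only for illustration
my $course = "I am learning perl this semester";
my $other  = "I am learning awk this semester";

# =~ points the pattern at the named variable instead of at $_
my $course_has_perl = ($course =~ /perl/) ? 1 : 0;
my $other_has_perl  = ($other  =~ /perl/) ? 1 : 0;

if ($course =~ /perl/)
{
    printf "PERL STRING:<%s>\n", $course;
}

printf "course: %d, other: %d\n", $course_has_perl, $other_has_perl;
```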


#-------------------------------------------------------------------------
# split operator

*   yeah, this can be used for HW #5, #3 (now, amazingly enough, 
the *slightly-adapted* HW #6, #1!)
                                   
@fields = split /separator/, $string;

# @line_vals gets one element for each tab-separated field in $whole_line
@line_vals = split /\t/, $whole_line;

# see class6_07_split_ex

# @line_vals gets one element for each :-separated field
@line_vals = split /:/, "the rain:in Spain:stays mainly:in the plain";

foreach $val (@line_vals)
{
    print "NEXT LINE VAL:$val\n";
}
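
*   the tab version works the same way; a sketch, assuming a made-up
line with three tab-separated fields (the contents of $whole_line are
invented for this illustration):

```perl
use strict;
use warnings;

# made-up tab-separated line, only for illustration
my $whole_line = "oil change\t12.99\tApril";

# one array element per tab-separated field
my @line_vals = split /\t/, $whole_line;

printf "number of fields: %d\n", scalar(@line_vals);

foreach my $val (@line_vals)
{
    print "NEXT LINE VAL:$val\n";
}
```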


#-------------------------------------------------------------------
# alternatives

*   | (vertical bar)

*   /fred|barney/
    *   left side may match, OR right side may match

    *   if has fred, it matches!
    *   if has barney, it matches!

*   /Perl|perl/ would match perl starting with either P or p
(but there are "better" ways to write that...)
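
*   a sketch of | in action (the sample lines are made up for this
illustration):

```perl
use strict;
use warnings;

# made-up sample lines, only for illustration
my @samples = ("fred went bowling", "barney stayed home", "wilma drove");

my @hits;

foreach (@samples)
{
    # true if EITHER fred OR barney shows up somewhere in $_
    if (/fred|barney/)
    {
        push @hits, $_;
    }
}

printf "lines with fred or barney: %d\n", scalar(@hits);
```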

#------------------------------------------------------------------
# [] (character classes)

*   matches ANY single character within the []

    *   so, here's a "better" way for /Perl|perl/

    /[Pp]erl/

    *   any vowel followed by an x:

    /[aeiouAEIOU]x/

*   CAN have a range --- a-f

    *  match a letter w, x, y, or z

    /[w-z]/

    *   any letter?

    /[A-Za-z]/

*   \d - any digit

    *   [\d] is the same as [0-9]

*   \w - shortcut for "word characters": [A-Za-z0-9_]

    *   so, that's [\w]

*   \s - shortcut for "space characters": [\f\t\n\r ]

    *   so, that's [\s]

*   NOT a digit? [^\d]

*   NOT whitespace? [^\s]     NOT a word char? [^\w]
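
*   a sketch that tries a few of these on made-up strings (the samples
are invented for this illustration):

```perl
use strict;
use warnings;

# made-up sample strings, only for illustration
my @samples = ("Perl", "perl", "pearl", "180", "class 6");

my $perl_count     = 0;
my $digit_count    = 0;
my $nondigit_count = 0;

foreach (@samples)
{
    # P or p, then erl
    $perl_count++     if (/[Pp]erl/);

    # at least one digit somewhere
    $digit_count++    if (/\d/);

    # at least one character that is NOT a digit
    $nondigit_count++ if (/[^\d]/);
}

printf "Perl or perl: %d, has a digit: %d, has a non-digit: %d\n",
       $perl_count, $digit_count, $nondigit_count;
```

*   note that "180" is the only sample made up entirely of digits, so it
is the only one that fails [^\d].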

#-------------------------------------------------------------------
# anchors

*   ^ --- must match at BEGINNING of string

    *   /^hello/ only matches strings that BEGIN with hello!!!

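*   $ --- must match at END of string (or just before a newline that
    ends the string)

    *   /hello[\s]*$/ only matches strings that end in hello with
    optional whitespace after

*   /^hello$/ only matches strings that are exactly "hello", NOTHING
before OR after! (one trailing newline is tolerated, so "hello\n" also
matches; chomp first if that matters)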
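
*   a sketch that runs the three anchored patterns over made-up strings
(the samples are invented for this illustration):

```perl
use strict;
use warnings;

# made-up sample strings, only for illustration
my @samples = ("hello world", "say hello", "hello", "say hello   ");

my $begins_count = 0;
my $ends_count   = 0;
my $exact_count  = 0;

foreach (@samples)
{
    # hello must be at the very beginning
    $begins_count++ if (/^hello/);

    # hello, then optional whitespace, then the end
    $ends_count++   if (/hello[\s]*$/);

    # hello and nothing else
    $exact_count++  if (/^hello$/);
}

printf "begins: %d, ends: %d, exact: %d\n",
       $begins_count, $ends_count, $exact_count;
```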

#-------------------------------------------------------------------
# substitutions

s///

s/old/new/

[CORRECTED 5-2-03!!!]
*    the FIRST instance of old is replaced with new

*   ... in $_
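
*   a sketch showing that only the first instance changes (the sample
text is made up for this illustration):

```perl
use strict;
use warnings;

# made-up sample text, only for illustration
$_ = "old hat, old shoes";

# only the FIRST old in $_ is replaced; the second is left alone
s/old/new/;

my $after = $_;
print "after s/old/new/: $after\n";
```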

#------------------------------------------------------------------
# and: MORE fun with regular expressions on Friday!

# end of 180class6_notes.txt