#######################################################
#
# display-notes related to Intro to Perl, Class 7
#
#######################################################

# last modified: 5-2-03, pre-class

# today's topic - regular expressions, part 2 (ref: "Learning Perl", ch. 7-9)

#-------------------------------------------------------------------
# s/// - substitution operator - CORRECTION and more!

*   FIRST: please note the following CORRECTION, also now made to the
electronic version of 180class6_notes.txt:

s///

s/old/new/

[CORRECTED 5-2-03!!!]
*    the FIRST instance of old is replaced with new

*   ... in $_

*   see class7_01_subst1

while (<>)
{
    chomp;

    #   substitute the *first* 'perl' in each line with 'PERL!'
    s/perl/PERL!/;

    printf "NEW VERSION:<%s>\n", $_;
}

*   consider fodder01:

perl is a language.
a language is perl.
perl perl perl.

*   'class7_01_subst1 fodder01' results in:

NEW VERSION:<PERL! is a language.>
NEW VERSION:<a language is PERL!.>
NEW VERSION:<PERL! perl perl.>

********
* SO: what if you DO want to substitute for ALL of them?
*
* /g modifier to s///
********

*   you can put the /g modifier at the end of s/// --- that is,

s/perl/PERL!/g

...will replace EVERY perl in the line with PERL!, *globally*

    * (yes, the g stands for global, here)
 
*   see class7_02_subst_global and fodder01

*   THREE classic uses of s/// (only the first one needs /g):

    *   "collapse" multiple consecutive whitespace characters in $_ 
    into a single space:     
                    
    s/\s+/ /g;

    *   replace "leading" whitespace (at the beginning of $_)
    with nothing: (remember the use of anchor ^ to mean the beginning
    of something? it can only match once here, so /g is not needed)
                  
    s/^\s+//;

    *   replace "trailing" whitespace (at the end of $_)
    with nothing: (remember the use of anchor $ to mean the end of
    something?)
            
    s/\s+$//;

    *   see class7_03_subst_whitespace
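
*   a minimal sketch of the three whitespace substitutions working
together, assuming the text lives in a named variable instead of $_ (so
it uses the binding operator =~); clean_whitespace and the sample string
are made up for illustration and are NOT one of the class7 scripts:

```perl
use strict;
use warnings;

# clean_whitespace is a hypothetical helper name, not a class example
sub clean_whitespace
{
    my ($text) = @_;

    $text =~ s/^\s+//;     # strip leading whitespace
    $text =~ s/\s+$//;     # strip trailing whitespace
    $text =~ s/\s+/ /g;    # collapse each remaining run into one space

    return $text;
}

printf "<%s>\n", clean_whitespace("   perl   is \t a   language.  ");
# prints: <perl is a language.>
```

    *   trimming both ends FIRST means the collapse step never leaves a
    stray single space at the start or end.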

*   another useful thing: s/// returns a true value (the number of
substitutions made) if it changed anything, and false otherwise. That
means you can branch or loop based on whether a substitution was done:

    *   see class7_04_subst_count and fodder04

my $num_changed = 0;    # start at 0 so the final report is never blank

while (<>)
{
    chomp;

    # let's keep count of how many lines actually have substitutions
    # made to them (let's replace trailing white space with nothing)
    if (s/\s+$//)
    {
        $num_changed++;
    }

    #   substitute EVERY 'perl' in each line with 'PERL!'
    s/perl/PERL!/g;

    printf "VERSION AFTER SUBST:<%s>\n", $_;
}

print "\nNumber of lines with trailing whitespace removed: $num_changed\n\n";
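
*   a small sketch of the return value used as a COUNT rather than just
as true/false; with /g it reports how many replacements were made. The
variable names and sample text here are made up:

```perl
use strict;
use warnings;

my $line = "perl perl perl.";

# with /g, the return value counts every replacement in the line
my $count = ($line =~ s/perl/PERL!/g);

print "replacements made: $count\n";    # prints: replacements made: 3
print "line is now: $line\n";           # prints: line is now: PERL! PERL! PERL!.

# when nothing matches, the return value is false and the text is untouched
my $other   = "no match here";
my $changed = ($other =~ s/perl/PERL!/g) ? "yes" : "no";

print "changed? $changed\n";            # prints: changed? no
```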

*   and, yes, you can use delimiters other than / with s/// --- BUT
"paired" delimiters (such as {} or <>) must be paired around BOTH the
pattern and the replacement. Consider: these four perform the SAME
substitution, replacing the FIRST perl in $_ with PERL!:

s/perl/PERL!/;

s#perl#PERL!#;

s{perl}{PERL!};

s<perl>{PERL!};
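
*   one reason to bother: when the pattern or replacement contains
slashes, another delimiter saves you from backslashing each one. A
sketch with a made-up path (both statements make the same change):

```perl
use strict;
use warnings;

my $with_slashes = "/usr/bin/perl";
my $with_braces  = "/usr/bin/perl";

# with / as the delimiter, every literal / needs a backslash
$with_slashes =~ s/\/usr\/bin/\/usr\/local\/bin/;

# with paired braces, the slashes can be written as-is
$with_braces =~ s{/usr/bin}{/usr/local/bin};

print "$with_slashes\n";    # prints: /usr/local/bin/perl
print "$with_braces\n";     # prints: /usr/local/bin/perl
```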

#--------------------------------------------------------------------
# "word" anchors

*   ...because sometimes you want to match "words" that begin or end
with something, not just lines that do so;
       
*   \b is the word boundary anchor --- it matches at the start or end
of a group of \w characters (ordinary letters, digits, and underscores,
NOTE!)
             
   /\bperl\b/   #   will match perl, but not perlish or unperl or unperlish

   *   it will also match the perl inside "perl" or 'perl', because
   quote characters are not \w "word" characters (ordinary letters,
   digits, underscores), so a word boundary falls between each quote
   and the word ... so be careful! 8-)
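
*   a sketch that tries /\bperl\b/ against a few made-up strings,
including one with an underscore; has_word_perl is a hypothetical helper
name used only for this illustration:

```perl
use strict;
use warnings;

# true (1) if perl appears as a whole "word", false (0) otherwise
sub has_word_perl
{
    my ($text) = @_;
    return $text =~ /\bperl\b/ ? 1 : 0;
}

foreach my $sample ("perl", "perlish", "unperl", "I like perl.", "'perl'", "perl_5")
{
    printf "%-14s %s\n", $sample, has_word_perl($sample) ? "match" : "no match";
}
# perl, I like perl. and 'perl' match;
# perlish, unperl and perl_5 do not (the _ counts as a word character)
```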

#-------------------------------------------------------------------
# more on | (vertical bar)

*   noted last time that | means "or"; if the pattern on either side of |
matches, then it is considered matched;

*   /Perl|perl|PERL/ matches Perl or perl or PERL...

*   can "nest" this inside a pattern within parentheses, too ---

    *   what if you want to match "ham and eggs" OR "ham plus eggs"?

    /ham (and|plus) eggs/     # ...should do it;
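
*   the parentheses matter here: without them, | splits the WHOLE
pattern in two. A sketch comparing both forms on made-up strings:

```perl
use strict;
use warnings;

my $order = "plus eggs, no ham";

# without parentheses the alternatives are "ham and" OR "plus eggs"
my $loose   = ($order =~ /ham and|plus eggs/)   ? "match" : "no match";

# with parentheses only the middle word varies
my $grouped = ($order =~ /ham (and|plus) eggs/) ? "match" : "no match";

# and the grouped form still accepts the phrase we wanted
my $wanted  = ("ham plus eggs" =~ /ham (and|plus) eggs/) ? "match" : "no match";

print "ungrouped on odd order: $loose\n";      # prints: ungrouped on odd order: match
print "grouped on odd order:   $grouped\n";    # prints: grouped on odd order:   no match
print "grouped on real order:  $wanted\n";     # prints: grouped on real order:  match
```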

#--------------------------------------------------------------------
# simple *UNIX* output to a file...

*   consider: so far, we've been "dumping" our output to the screen;

*   what if you'd like it in a file, instead?

*   SIMPLE UNIX WAY: when you CALL the script, you can redirect
its output to a file with > (output redirection)
                                     
class7_04_subst_count fodder04	# result goes to screen
                                             
class7_04_subst_count fodder04 > my_result  # result goes into file my_result

    *   NOTES:

    *   if you do > on an EXISTING file, the new results OVERWRITE the
    old;

    class7_04_subst_count fodder04 > my_result  # results from fodder04
    class7_04_subst_count fodder01 > my_result  # JUST results from fodder01
                                                           
    *   to APPEND new results to the END of a file, use >> instead:

    class7_04_subst_count fodder04 > my_result  # results from fodder04
    class7_04_subst_count fodder01 >> my_result # fodder01 results APPENDED

*   what --- you want file output but you aren't in UNIX? 
    *   that's a topic for Monday! 8-)

#----------------------------------------------------------------------
# more on character classes []

*   caret (^) can be used for negation for any characters in a character
class, not just the character class shortcuts (such as \w for word,
\d for digit, etc.) mentioned last time;
                                          
*   BUT note: what do you think /[^def]/ is saying?
    *   ..."match[] any single character EXCEPT one of those three"
    ("Learning Perl", p. 106)

    *   that is, it negates ALL the characters within that character
    class;

*   example: match if $_ DOESN'T begin with d, e, or f, and has a vowel
right AFTER that first non-d-e-or-f character (note that [^def] still has
to match SOME character, so an empty $_ fails):

    /^[^def][aeiouAEIOU]/ 

*   additional shortcuts for negating the shortcuts...!
    *   \D is the same as [^\d]
    *   \S is the same as [^\s]
    *   etc...!
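
*   a quick sketch of both ideas, a negated class and a negated
shortcut, on made-up strings (starts_ok is a hypothetical helper name):

```perl
use strict;
use warnings;

# true if the text starts with a non-d/e/f character followed by a vowel
sub starts_ok
{
    my ($text) = @_;
    return $text =~ /^[^def][aeiouAEIOU]/ ? 1 : 0;
}

foreach my $word ("cat", "dog", "eat", "tree")
{
    printf "%-5s %s\n", $word, starts_ok($word) ? "match" : "no match";
}
# cat matches; dog and eat start with d or e; tree has no vowel in 2nd place

# \D is "anything that is not a digit", so this keeps only the digits
my $phone = "(707) 555-0100";
$phone =~ s/\D//g;
print "$phone\n";    # prints: 7075550100
```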

*   and remember: these [anything] match a SINGLE character (that is
one of those included in the [], or NOT included, if ^ is within those
[]'s);
    *   to get more? good old +, * (and ? for 0 or 1...)

    \s*   # any amount of whitespace --- including none at all!

    \s+   # at least one whitespace character --- maybe more;

    \s?   # none or one whitespace character   

#------------------------------------------------------------------------
# general quantifiers

*   paraphrasing a question from last time: if I want to match an exact
number of something, do I have to repeat it that many times in a pattern?

*   example: I want to match "words" of exactly 5 letters --- do I need
to type this (to match the first such word within $_)?

   /\b[A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]\b/

*   no --- a general quantifier, a comma-separated pair of numbers inside
of curly braces, can say exactly how few and how many repetitions are allowed;

   /\b[A-Za-z]{5,5}\b/

*   examples:

    /\b[A-Za-z]{3,5}\b/   # words with between 3 and 5 letters

    /\b[A-Za-z]{6,}\b/	# words with 6 or more letters (no upper limit)

    /\b[A-Za-z]{6}\b/	# shortcut for {6,6} --- exactly 6 letters

*   so, note that * is really just a shortcut for {0,},
                  + is really just a shortcut for {1,},
                  ? is really just a shortcut for {0,1}
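
*   a sketch showing why the \b anchors matter with a general
quantifier: without them, {5} happily matches 5 letters INSIDE a longer
word (the sample text is made up):

```perl
use strict;
use warnings;

my $text = "a big elephant";

# no anchors: the first 5 letters of elephant are enough
my $unanchored = ($text =~ /[A-Za-z]{5}/)       ? "match" : "no match";

# with \b on both sides, the whole word must be exactly 5 letters
my $exact      = ($text =~ /\b[A-Za-z]{5}\b/)   ? "match" : "no match";

# 3 to 5 letters: big qualifies
my $range      = ($text =~ /\b[A-Za-z]{3,5}\b/) ? "match" : "no match";

print "5 letters, no anchors: $unanchored\n";    # prints: 5 letters, no anchors: match
print "exactly 5 letters:     $exact\n";         # prints: exactly 5 letters:     no match
print "3 to 5 letters:        $range\n";         # prints: 3 to 5 letters:        match
```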

#-----------------------------------------------------------------------
# memory parentheses

*   () are for more than just grouping (for repetition, or around |)...

*   they ALSO tell the regular expression engine to REMEMBER what was
matched!

*   /./   # match a single character
    /(.)/ # match a single character, and REMEMBER it!

*   you get one "regular expression memory" for each pair of parentheses
in a pattern;

#-----------------------------------------------------------------------
# backreferences

* ...and you can access that regular expression memory with backreferences
(that's one way, anyway...)

*   \1 contains the first regular expression memory ---
*   \2 contains the second regular expression memory ---
*   etc.! (up to how many ()'s are in the regular expression...)

*   so:   

    /../        # matches ANY two characters in a row (same or different)
    /(.)\1/     # matches a character followed by the SAME character

*   (and, okay --- AFTER the regular expression is done, you can access
these regular expression memories with: $1, $2, etc.)

*   so, this finds the first 3-letter "word" in each line of a file:

*   see class7_06_find_first_threes

while (<>)
{
    chomp;

    if (/(\b[A-Za-z]{3}\b)/)
    {
        print "$1\n";
    }

}
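
*   a sketch contrasting the two ways to reach a memory: \1 INSIDE the
pattern, $1 AFTER the match has succeeded. The sample lines are made up:

```perl
use strict;
use warnings;

foreach my $line ("I need a book", "this is is a typo", "no repeats")
{
    # \1 must match the same character that (.) just remembered
    if ($line =~ /(.)\1/)
    {
        printf "doubled character <%s> in <%s>\n", $1, $line;
    }

    # the same idea for a whole word repeated after whitespace
    if ($line =~ /\b(\w+)\s+\1\b/)
    {
        printf "doubled word <%s> in <%s>\n", $1, $line;
    }
}
# prints:
#   doubled character <e> in <I need a book>
#   doubled word <is> in <this is is a typo>
```

    *   $1 is only read inside the if, because a FAILED match leaves $1
    holding whatever an earlier successful match put there.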

#----------------------------------------------------------------------
# matches with m///

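*   our matches so far are a shortcut, too --- for the match operator, m//!

*   (you don't need the m if you use /'s, but WITH the m you can choose
other delimiters, just as with s///)

    *   so this is fine, too, and handy when the pattern itself contains
    slashes (here it matches the literal text /hello/):

    m#/hello/#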

*   and there are more option modifiers for it, too! (like /g for s/// ---
and they WORK for s///, too...)

    *   for example:

    *   /i  -- case-insensitive matching;
                     
        /\byes\b/i  #  will match the word yes written in ANY case!
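
*   a sketch of /i on both a match and a substitution, using made-up
answers (the pattern has \b on BOTH sides of yes):

```perl
use strict;
use warnings;

foreach my $answer ("yes", "YES", "Yes please", "yesterday")
{
    # \b on both sides, so yesterday does not count as a yes
    printf "%-11s %s\n", $answer, ($answer =~ /\byes\b/i) ? "yes" : "not yes";
}
# yes, YES and Yes please are a yes; yesterday is not

# /i combines with /g on s///, too
my $line = "Perl, perl, and PERL";
$line =~ s/perl/Perl/gi;
print "$line\n";    # prints: Perl, Perl, and Perl
```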

#------------------------------------------------------------------
# more on the binding operator, =~

*   oh yes --- and =~ can be used with s///, too:

$file_name =~ s/\.pl$/.plx/;    # subst in $file_name instead of $_
                                # (the replacement is plain text, not a
                                # pattern, so its . needs no backslash)
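
*   a sketch of why both the backslash and the $ anchor in that pattern
matter; the file names are made up:

```perl
use strict;
use warnings;

foreach my $original ("report.pl", "simple.plan", "apl")
{
    my $file_name = $original;

    # only a literal ".pl" at the very END of the name is replaced
    $file_name =~ s/\.pl$/.plx/;

    printf "%s becomes %s\n", $original, $file_name;
}
# prints:
#   report.pl becomes report.plx
#   simple.plan becomes simple.plan
#   apl becomes apl

# without the backslash, . matches ANY character, including the a in apl
my $oops = "apl";
$oops =~ s/.pl$/.plx/;
print "$oops\n";    # prints: .plx
```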

# end of 180class7_notes.txt