#######################################################
#
# display-notes related to Intro to Perl, Class 7
#
#######################################################
# last modified: 5-2-03, pre-class

# today's topic - regular expressions, part 2
#    (ref: "Learning Perl", ch. 7-9)

#-------------------------------------------------------------------
# s/// - substitution operator - CORRECTION and more!

*   FIRST: please note the following CORRECTION, also now made to the
    electronic version of 180class6_notes.txt:

    s///
        s/old/new/     [CORRECTED 5-2-03!!!]
        *   the FIRST instance of old is replaced with new
        *   ... in $_

    *   see class7_01_subst1

        while (<>)
        {
            chomp;

            # substitute the *first* 'perl' in each line with 'PERL!'
            s/perl/PERL!/;

            printf "NEW VERSION:<%s>\n", $_;
        }

    *   consider fodder01:

        perl is a language.
        a language is perl.
        perl perl perl.

    *   'class7_01_subst1 fodder01' results in:

        NEW VERSION:<PERL! is a language.>
        NEW VERSION:<a language is PERL!.>
        NEW VERSION:<PERL! perl perl.>

********
*   SO: what if you DO want to substitute for ALL of them?
*
*   /g modifier to s///
********

*   you can put the /g modifier at the end of s/// --- that is,

        s/perl/PERL!/g

    ...will replace EVERY perl in the line with PERL!, *globally*

    *   (yes, the g stands for global, here)

*   see class7_02_subst_global and fodder01

*   THREE classic uses of s///g:

    *   "collapse" multiple consecutive whitespace characters in $_
        into a single space:

        s/\s+/ /g;

    *   replace "leading" whitespace (at the beginning of $_) with
        nothing:
        (remember the use of anchor ^ to mean the beginning of something?)

        s/^\s+//g;

    *   replace "trailing" whitespace (at the end of $_) with nothing:
        (remember the use of anchor $ to mean the end of something?)

        s/\s+$//;

    *   see class7_03_subst_whitespace

*   another useful thing: s///'s return value is true if a substitution
    was made, and false otherwise. That means you can branch or loop
    based on whether a substitution was done:

    *   see class7_04_subst_count and fodder04

        # start the count at 0, so the final print shows 0 (not an
        # empty string) if no line is changed
        my $num_changed = 0;

        while (<>)
        {
            chomp;

            # let's keep count of how many lines actually have
            # substitutions made to them (let's replace trailing
            # white space with nothing)
            if (s/\s+$//)
            {
                $num_changed++;
            }

            # substitute EVERY 'perl' in each line with 'PERL!'
            s/perl/PERL!/g;

            printf "VERSION AFTER SUBST:<%s>\n", $_;
        }

        print "\nNumber of lines with trailing whitespace removed: $num_changed\n\n";

*   (and, yes, you can use something besides /// --- BUT if "paired",
    you've got to pair them here, too.)

    Consider: these four perform the SAME substitution, replacing the
    FIRST perl in $_ with PERL!:

        s/perl/PERL!/;
        s#perl#PERL!#;
        s{perl}{PERL!};
        s<perl>{PERL!};

#--------------------------------------------------------------------
# "word" anchors

*   ...because sometimes you want to match "words" that begin or end
    with something, not just lines that do so;

*   \b is the word boundary anchor --- it matches at the start or end
    of a group of \w characters (ordinary letters, digits, and
    underscores, NOTE!)

        /\bperl\b/   # will match perl, but not perlish or unperl
                     # or unperlish

    *   it will also match "perl" or 'perl', because it is looking for
        \w "word" characters (ordinary letters, digits, underscores),
        which quotes are not ... so be careful! 8-)

#-------------------------------------------------------------------
# more on | (vertical bar)

*   noted last time that | means "or"; if the pattern on either side
    of | matches, then it is considered matched;

    *   /Perl|perl|PERL/ matches Perl or perl or PERL...

*   can "nest" this inside a pattern within parentheses, too ---

    *   what if you want to match "ham and eggs" OR "ham plus eggs"?

        /ham (and|plus) eggs/    # ...should do it;

#--------------------------------------------------------------------
# simple *UNIX* output to a file...

*   consider: so far, we've been "dumping" our output to the screen;

*   what if you'd like it in a file, instead?
*   SIMPLE UNIX WAY: when you CALL the script, you can redirect its
    output to a file with > (output redirection)

        class7_04_subst_count fodder04              # result goes to screen
        class7_04_subst_count fodder04 > my_result  # result goes into
                                                    # file my_result

*   NOTES:

    *   if you do > on an EXISTING file, the new results OVERWRITE
        the old;

        class7_04_subst_count fodder04 > my_result  # results from fodder04
        class7_04_subst_count fodder01 > my_result  # JUST results from
                                                    # fodder01

    *   to APPEND new results to the END of a file, use >> instead:

        class7_04_subst_count fodder04 > my_result   # results from fodder04
        class7_04_subst_count fodder01 >> my_result  # fodder01 results
                                                     # APPENDED

*   what --- you want file output but you aren't in UNIX?

    *   that's a topic for Monday! 8-)

#----------------------------------------------------------------------
# more on character classes []

*   caret (^) can be used for negation for any characters in a
    character class, not just the character class shortcuts (such as
    \w for word, \d for digit, etc.) mentioned last time;

*   BUT note: what do you think /[^def]/ is saying?

    *   ..."match[] any single character EXCEPT one of those three"
        ("Learning Perl", p. 106)

    *   that is, it negates ALL the characters within that character
        class;

    *   match a $_ that DOESN'T begin with d, e, or f and has a vowel
        right AFTER that first non-d-e-or-f character:

        /^[^def][aeiouAEIOU]/

*   additional shortcuts for negating the shortcuts...!

    *   \D is the same as [^\d]
    *   \S is the same as [^\s]
    *   etc...!

*   and remember: these [anything] match a SINGLE character
    (that is one of those included in the [], or NOT included, if ^
    is the first thing within those []'s);

    *   to get more? good old +, * (and ? for 0 or 1...)

        \s*    # any amount of whitespace --- including none at all!
        \s+    # at least one whitespace character --- maybe more;
        \s?    # none or one whitespace character

#------------------------------------------------------------------------
# general quantifiers

*   paraphrasing a question from last time: if I want to match an
    exact number of something, do I have to repeat it that many times
    in a pattern?

    *   example: I want to match "words" of exactly 5 letters --- do I
        need to type: (to match the first such word within $_)

        /\b[A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]\b/

*   no --- a general quantifier, a comma-separated pair of numbers
    inside of curly braces, can say exactly how few and how many
    repetitions are allowed;

        /\b[A-Za-z]{5,5}\b/

*   examples:

        /\b[A-Za-z]{3,5}\b/   # words with between 3 and 5 letters
        /\b[A-Za-z]{6,}\b/    # words with 6 or more letters
                              # (no upper limit)
        /\b[A-Za-z]{6}\b/     # shortcut for {6,6} --- exactly 6 letters

*   so, note that:
        * is really just a shortcut for {0,},
        + is really just a shortcut for {1,},
        ? is really just a shortcut for {0,1}

#-----------------------------------------------------------------------
# memory parentheses

*   () is for more than grouping for repetitions...

    *   they ALSO tell the regular expression engine to REMEMBER what
        was matched!

    *   /./      # match a single character
        /(.)/    # match a single character, and REMEMBER it!

*   you get one "regular expression memory" for each pair of
    parentheses in a pattern;

#-----------------------------------------------------------------------
# backreferences

*   ...and you can access that regular expression memory with
    backreferences (that's one way, anyway...)

    *   \1 contains the first regular expression memory ---
    *   \2 contains the second regular expression memory ---
    *   etc.! (up to how many ()'s are in the regular expression...)

*   so:

        /../       # matches ANY two characters in a row (the second
                   # may or may not be the same as the first)
        /(.)\1/    # matches a character followed by the SAME character

*   (and, okay --- AFTER the regular expression is done, you can access
    these regular expression memories with: $1, $2, etc.)
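*   a minimal sketch of the two ideas above (\1 INSIDE the pattern, $1
    AFTER the match); this is NOT one of the class scripts --- the sub
    name find_doubled and the sample strings are made up here, just for
    illustration:

```perl
use strict;
use warnings;

# hypothetical helper (not from the class files): returns the first
# character in $text that is immediately followed by itself, or
# undef if there is no such doubled character
sub find_doubled {
    my ($text) = @_;

    # (.) remembers one character; \1 demands that SAME character again
    if ($text =~ /(.)\1/) {
        return $1;    # after the match, $1 holds the first memory
    }
    return;           # no doubled character found
}

# made-up sample strings, just to exercise the helper
for my $sample ("hello world", "perl", "a bookkeeper") {
    my $doubled = find_doubled($sample);

    if (defined $doubled) {
        print "<$sample> has a doubled <$doubled>\n";
    }
    else {
        print "<$sample> has no doubled character\n";
    }
}
```

    *   note that only the FIRST doubled character in each string is
        reported, since the match stops at the first place it succeeds.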
*   so, this finds the first 3-letter "word" in each line of a file:

    *   see class7_06_find_first_threes

        while (<>)
        {
            chomp;

            # the () remember the first 3-letter "word", if there is one
            if (/(\b[A-Za-z]{3}\b)/)
            {
                print "$1\n";
            }
        }

#----------------------------------------------------------------------
# matches with m//

*   our matches so far are a shortcut, too --- for the match
    operator, m//!

    *   (but you don't need the m if you use /'s...)

    *   yes, other delimiters are ok too:

        m#/hello/#    # matches the literal text /hello/, slashes and all

*   and there are more option modifiers for it, too!
    (like /g for s/// --- and they WORK for s///, too...)

    *   for example:

        *   /i -- case-insensitive matching;

            /\byes\b/i    # will match yes written in ANY case!

#------------------------------------------------------------------
# more on the binding operator, =~

*   oh yes --- and =~ can be used with s///, too:

        $file_name =~ s/\.pl$/\.plx/;   # subst in $file_name
                                        # instead of $_

# end of 180class7_notes.txt