#######################################################
#
# display-notes related to Intro to Perl, Class 6
#
#######################################################

# last modified: 4-28-03, in-class

# today's topic - regular expressions, part 1
#    (ref: "Learning Perl", ch. 7-9)

#-------------------------------------------------------------------
# regular expressions

* this isn't *exactly* how a theoretician would define it --- but,
  in Perl, ("Learning Perl", p. 98):
* "A regular expression, often called a pattern in Perl, is a
  template that either matches or doesn't match a given string."
* you write a regular expression --- then, any given string either
  matches that regular expression, or it doesn't.
* (Perl's regular expressions are *not* the same as UNIX shell-style
  *filename*-matching patterns, also called globs. More on those
  later...)

#----------------------------------------------------------------------
# using simple patterns

* assume that variable $_ (remember? Perl's favorite default?) holds
  a string, and you want to see if another string is within the
  contents of $_;
* if you type that string inside of // (forward slashes), then that
  is a regular expression that is true if the letters inside // are
  in $_ --- it is false otherwise.
* (since it is an expression that is true or false --- where do you
  suppose it is often used?
* you got it --- it is often used as the conditional expression for
  an if or while statement!)

# see class6_01_patt1

# using simple patterns --- see which lines passed to
# it contain the string 'perl' within them, and display them
# to the screen.

while (<>) {
    chomp;

    # see if $_ contains the letters 'perl' within it
    if (/perl/) {
        printf "PERL LINE:<%s>\n", $_;
    }
}

* oh, yes --- and, ("Learning Perl", p. 100) "All of the usual
  backslash escapes that you can put into double-quoted strings are
  [also] available in patterns";
* \n for newline, \t for tab, all those lovely ones from table 2-1 on p.
  24, "Learning Perl";

# see class6_02_patt2

# using simple patterns with backslash escapes, to demo that they
# work --- see which lines passed to it contain a tab character,
# and display them to the screen.

while (<>) {
    chomp;

    # see if $_ contains at least one tab character within it
    if (/\t/) {
        printf "LINE W/TAB:<%s>\n", $_;
    }
}

#------------------------------------------------------------------------
# metacharacters

* ...because just matching particular characters wouldn't be *that*
  useful...
* metacharacters are characters that have special meanings in regular
  expressions; (characters "describing" characters!)

*****
* . (dot wildcard)
*****

* dot (.) is a wildcard --- it matches any SINGLE character EXCEPT
  for a newline ("\n")
* (you can match JUST a dot by putting a backslash before it --- \. )

# see class6_03_wildcard1

# using simple patterns with wildcard .
#
# see which lines contain mist followed by any other character followed
# by r --- also see which contain 9.9, and which contain a backslash

while (<>) {
    chomp;

    # see if $_ contains mist, any character, r
    if (/mist.r/) {
        printf "line with mist.r:<%s>\n", $_;
    }

    # see if $_ contains 9.9
    if (/9\.9/) {
        printf "line with 9.9:<%s>\n", $_;
    }

    # see if $_ contains a backslash
    if (/\\/) {
        printf "line with a backslash:<%s>\n", $_;
    }
}

#-------------------------------------------------------------
* (star quantifier)

* beware! it has a DIFFERENT meaning in a Perl regular expression
  than in globbing!!!
* it does NOT mean match 0-or-more-characters ---
* INSTEAD: it means "match the PRECEDING item 0 or more times"!
* /oil change\t12\.99/ matches "oil change" followed by a tab followed
  by 12.99 (note the \. --- a bare . would match ANY character there) ---
* /oil change\t*12\.99/ matches "oil change" followed by ANY number of
  tabs --- including 0!
  --- followed by 12.99

# see class6_04_quantifier1

# using simple patterns with quantifier *
#
# see which lines contain "oil change" followed by ANY number of
# tabs followed by 12.99

while (<>) {
    chomp;

    # see if $_ contains "oil change", any number of tabs, 12.99
    # (the dot is escaped so that it matches only a literal dot)
    if (/oil change\t*12\.99/) {
        printf "line with oil change tabs 12.99:<%s>\n", $_;
    }
}

* (see class6_05_pattgrabber for a FLAKY test-script to let you enter
  a desired pattern --- special characters behave erratically, however!!)
* star isn't the ONLY so-called QUANTIFIER, either ---

#-----------------------------------------------------------------------
# + (plus quantifier)

* + after a character? That means, match the preceding character ONE
  or more times;
* /oil change\t+12\.99/ means at LEAST one tab must appear between
  "oil change" and 12.99.

#--------------------------------------------------------------------
"any old junk" pattern: .*

* so, if . matches ANY single character,
* ...and * means, 0 or more of the previous thing,
* ...guess what .* means?
* happily, it means match 0 or more of any characters!
* (they don't have to be the same)
* /oil change.*12\.99/ will match ANY string that has "oil change"
  earlier and 12.99 later, no matter WHAT is in between (and even if
  nothing is in-between);
* (and /oil change.+12\.99/ says at LEAST ONE character is between the
  oil change and the 12.99)

#-----------------------------------------------------------------------
? (question-mark quantifier)

* and THIS means that the previous item is OPTIONAL;
* (that is, once or not at all;)
* nice if you are not SURE that two things are separated by something;

I'm not sure if the text I am looking for says elseif or else-if:

/else-?if/    # will match EITHER else-if OR elseif

#----------------------------------------------------------------------
matching more than 1 character

* ...because what if you don't want to match ha, haa, haaa, etc., but
  ha, haha, hahaha, etc.?
* like math: can use parentheses for grouping!
/ha*/      # matches any string including h, ha, haa, haaa, etc.
/(ha)*/    # matches any string including "", ha, haha, hahaha, etc.
           # (will end up matching ANY string, note!!)
/ha+/      # matches any string including ha, haa, haaa, etc.
           # (DOESN'T match a string with just h)
/(ha)+/    # matches any string including ha, haha, hahaha, etc.

#------------------------------------------------------------------------
binding operator: =~

* ...for when you want to compare something besides $_
* $str =~ /patt/
  ...is true if /patt/ matches $str, and false otherwise;

#-------------------------------------------------------------------------
split operator

* yeah, this can be used for HW #5, #3 (now, amazingly enough, the
  *slightly-adapted* HW #6, #1!)

@fields = split /separator/, $string;

# @line_vals gets one element for each tab-separated piece of $whole_line
@line_vals = split /\t/, $whole_line;

# see class6_07_split_ex

# @line_vals gets one element for each :-separated piece of the string
@line_vals = split /:/, "the rain:in Spain:stays mainly:in the plain";

foreach $val (@line_vals) {
    print "NEXT LINE VAL:$val\n";
}

#-------------------------------------------------------------------
alternatives

* | (vertical bar)
* /fred|barney/
* left side may match, OR right side may match
* if the string has fred, it matches!
* if the string has barney, it matches!
* /Perl|perl/ would match perl starting with either P or p
  (but there are "better" ways to write that...)

#------------------------------------------------------------------
[]

* matches ANY single character within the []
* so, here's a "better" way for /Perl|perl/
  /[Pp]erl/
* any vowel followed by an x: /[aeiouAEIOU]x/
* CAN have a range --- a-f
* match a letter w, x, y, or z
  /[w-z]/
* any letter? /[A-Za-z]/
* \d - any digit
* [\d] is the same as [0-9]
* \w - shortcut for "word characters": [A-Za-z0-9_]
* so, that's [\w]
* \s - shortcut for "space characters": [\f\t\n\r ]
* so, that's [\s]
* NOT a digit? [^\d]
* NOT whitespace? [^\s]
* NOT a word char?
  [^\w]

#-------------------------------------------------------------------
anchors

* ^ --- must match at BEGINNING of string
* /^hello/ only matches strings that BEGIN with hello!!!
* $ --- must match at END of string
* /hello[\s]*$/ only matches strings that end in hello with optional
  whitespace after
* /^hello$/ only matches strings that are exactly "hello", NOTHING
  before OR after!
* (well --- $ also tolerates ONE trailing newline, which is one more
  reason to chomp first)

#-------------------------------------------------------------------
substitutions s///

s/old/new/        [CORRECTED 5-2-03!!!]

* the FIRST instance of old is replaced with new
* ... in $_

#------------------------------------------------------------------
and: MORE fun with regular expressions on Friday!

# end of 180class6_notes.txt
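
* a quick recap sketch for the anchors and s/// sections above --- this
  is NOT one of the class6_NN scripts; the sub names anchor_report and
  replace_first, and the sample strings, are made up here purely for
  illustration. It just tries the three anchored patterns from the notes
  against a few strings, then shows s/old/new/ changing only the FIRST
  old in $_:

```perl
use strict;
use warnings;

# anchor_report: hypothetical helper (not from the class scripts) ---
# returns a list saying which of the three anchored patterns
# from the notes match the given string
sub anchor_report {
    my ($str) = @_;
    my @hits;

    push @hits, 'begins with hello' if $str =~ /^hello/;
    push @hits, 'ends with hello'   if $str =~ /hello[\s]*$/;
    push @hits, 'is exactly hello'  if $str =~ /^hello$/;

    return @hits;
}

# replace_first: hypothetical helper --- puts a copy of the string
# into $_, applies s/old/new/ (which works on $_ by default), and
# returns the changed copy
sub replace_first {
    my ($str) = @_;
    local $_ = $str;
    s/old/new/;
    return $_;
}

# made-up sample strings
foreach my $s ('hello world', 'say hello  ', 'hello', 'oh, hello there') {
    my @hits = anchor_report($s);
    @hits = ('no anchored match') unless @hits;
    printf "<%s>: %s\n", $s, join(', ', @hits);
}

# only the FIRST old is replaced
print replace_first('old hat, old shoes'), "\n";   # new hat, old shoes
```

* note that 'oh, hello there' contains hello but matches NONE of the
  anchored patterns --- the anchors care about WHERE the match sits,
  not just whether the letters appear.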