279lect08-2-projected


CS 279 - Week 8 Lecture 2 - 2022-10-11

TODAY WE WILL:
*   announcements/reminders
*   a bit more on grep
*   start discussion of UNIX/Linux BRE
    (basic regular expressions)
*   prep for next class

=====
a bit more on grep
=====
*   DEFAULT behavior:
    *   if you don't provide any of the options -n, -c, -l, or -q and
        provide just ONE file, grep's output is JUST the matching lines

    *   if you provide more than one file, each matching line in
        grep's output is preceded by the pathname of its file, as
        that pathname was specified on the command line
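
A quick sketch of the default behavior described above. The files
farm.txt and pond.txt and their contents are made up for this demo
(an assumption, not something we used in class).

```shell
# Hypothetical scratch files, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'the pig says oink' 'the cow says moo' > "$demo_dir/farm.txt"
printf '%s\n' 'oink oink' 'quack' > "$demo_dir/pond.txt"

# ONE file: the output is just the matching line.
one_file=$(cd "$demo_dir" && grep oink farm.txt)
echo "$one_file"

# TWO files: each matching line is preceded by the pathname of its
# file, exactly as that pathname was typed on the command line.
two_files=$(cd "$demo_dir" && grep oink farm.txt pond.txt)
echo "$two_files"

rm -rf "$demo_dir"
```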

*   no surprise, grep has LOTS of options -- here are just a
    few:
    *   -n  precede each matching line by its line number
            (and by its file name too, if more than one file is given)

    *   -c  only show a count of the matching lines
            (you'll get 0's for files with no matches!)

    *   -l  only show the names of the files containing matching
            strings (nice for building a file list for a for loop!)

    *   -s  suppress error messages for nonexistent or
            unreadable files

    *   -q  run quietly -- don't write ANYTHING to standard
            output (!!), but exit with exit status 0 if any
            input lines are selected (so, testable after the
            fact)

    *   -v  select the lines that DON'T match!

    *   -i  ignore the case of letters in making comparisons
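
To try several of these options side by side, here is a small sketch.
The files and their contents are hypothetical, invented for the demo;
the variable names are just labels for the results.

```shell
# Hypothetical scratch files, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'Oink' 'oink oink' 'moo' > "$demo_dir/farm.txt"
printf '%s\n' 'quack' > "$demo_dir/pond.txt"

# -n: line number in front of each matching line
numbered=$(grep -n oink "$demo_dir/farm.txt")

# -c: count of matching LINES (not of individual matches)
count=$(grep -c oink "$demo_dir/farm.txt")

# -i combined with -c: Oink now counts as a match too
count_nocase=$(grep -ci oink "$demo_dir/farm.txt")

# -v: the lines that DON'T match
others=$(grep -v oink "$demo_dir/farm.txt")

# -l: just the names of the files that contain a match
files=$(cd "$demo_dir" && grep -l oink farm.txt pond.txt)
for f in $files
do
    echo "match found in: $f"
done

# -q: no output at all, so test the exit status instead
if grep -q oink "$demo_dir/farm.txt"
then
    found=yes
else
    found=no
fi

rm -rf "$demo_dir"
```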

=====
Basic Regular Expressions (BREs) (UNIX/Linux style)
=====
*   regular expression: defines a pattern of text to be matched

*   several UNIX/Linux utilities expect you to specify patterns
    as regular expressions (REs)
    *   grep, sed, ed, several others

    *   2 basic categories:
        Basic Regular Expressions (BREs)
        *   understood by "older" UNIX programs (such as ed, grep, sed)

        Extended Regular Expressions (EREs)
        *   an extension of BREs recognized by egrep (same as grep -E)

=====
BREs
=====
*   first: in general, any non-special character in a BRE matches that
    character in the text

    grep oink *.txt   # find lines with o then i then n then k within them

*   SOME special characters are special ANYWHERE they appear in a pattern

    .   *   [   \

*   SOME special characters are ONLY special under particular conditions
    *   ^ is special only if it appears at the beginning of a pattern

    *   $ is special at the END of a pattern

    *   the character that terminates a pattern is special throughout
        that pattern

*   you CAN escape a special character's meaning -- and just match
    that character -- by preceding it with a backslash

    'cheap at $9.98'   # but the . is special! How do we match . specifically?

    'cheap at $9\.98'  # now can match . specifically

    (and the $ is only special if it is at the END of a pattern)
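
A sketch of the difference the backslash makes. The file prices.txt
and its two lines are hypothetical, made up so that one line has a
real dot and the other has some other character in that position.

```shell
# Hypothetical scratch file, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'cheap at $9.98' 'cheap at $9398' > "$demo_dir/prices.txt"

# Unescaped dot: matches ANY single character, so both lines match.
unescaped=$(grep -c 'cheap at $9.98' "$demo_dir/prices.txt")

# Escaped dot: matches only a literal dot, so just the first line matches.
escaped=$(grep -c 'cheap at $9\.98' "$demo_dir/prices.txt")

echo "unescaped: $unescaped   escaped: $escaped"

rm -rf "$demo_dir"
```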

*   \ - escapes the special meaning of the character following it
        IF it is special

        what if the following character is not special?
	... yikes, backslash behavior is UNDEFINED in that case,
	so please try to avoid that in BREs!

*   . - the dot matches any single non-null character
        (so, the BRE version of globbing's ?)

        IN CLASS, this BRE needed to be quoted (we used double-quotes)
        to work as expected (within grep, at least) -- unquoted, the
        shell removes the backslash before grep ever sees it:

        grep "o\.n" animals.txt  # to JUST match o, then dot, then n
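
One way to see why the quotes matter is to look at what the shell
actually hands to grep. This is only a sketch; the contents given to
animals.txt here are an assumption for the demo.

```shell
# Hypothetical contents for animals.txt, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'o.n' 'own' > "$demo_dir/animals.txt"

# What the command receives as its argument in each case:
unquoted_arg=$(printf '%s' o\.n)     # shell strips the backslash
quoted_arg=$(printf '%s' "o\.n")     # backslash survives

# Unquoted: grep sees o.n, so the dot matches any character.
unquoted_hits=$(grep -c o\.n "$demo_dir/animals.txt")

# Quoted: grep sees o\.n, so only a literal dot matches.
quoted_hits=$(grep -c "o\.n" "$demo_dir/animals.txt")

echo "unquoted: $unquoted_arg ($unquoted_hits lines)"
echo "quoted:   $quoted_arg ($quoted_hits lines)"

rm -rf "$demo_dir"
```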

*   ^ - this character at the BEGINNING of the OUTERMOST RE matches
        the BEGINNING of a line (anywhere else, ^ matches ^)

        ...that is, we want lines that START with some pattern

*   $ - this character at the END of an OUTERMOST RE matches
        the END of a line (anywhere else, $ matches $)

        ...that is, we want lines that END with some pattern
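
A sketch of both anchors in action, using a hypothetical file zoo.txt
whose contents are invented for the demo.

```shell
# Hypothetical scratch file, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'moose tracks' 'a lone moose' 'moose' 'a^b' > "$demo_dir/zoo.txt"

# ^ at the beginning: lines that START with moose
starts=$(grep -c '^moose' "$demo_dir/zoo.txt")

# $ at the end: lines that END with moose
ends=$(grep -c 'moose$' "$demo_dir/zoo.txt")

# both: lines that are EXACTLY moose
whole=$(grep -c '^moose$' "$demo_dir/zoo.txt")

# ^ anywhere else in a BRE just matches a literal ^
literal=$(grep -c 'a^b' "$demo_dir/zoo.txt")

echo "starts: $starts  ends: $ends  whole: $whole  literal: $literal"

rm -rf "$demo_dir"
```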

*   * - * has a slightly DIFFERENT MEANING in REs than in globbing!

        in REs, * goes with the character preceding it --
	matches 0 or more instances of THAT character

        grep 'ab*c' *.txt   # matches lines with an a
                            #    followed by 0 or more bs
                            #    followed by a c
                            # (quoted so the shell doesn't glob the *)

        * can also follow a set of characters in square brackets

        [moxie]  - matches one of m or o or x or i or e
        [moxie]* - matches 0 or more m's o's x's i's or e's in ANY combination

        we tried:

        grep "^m[moxie]*$" moxie-play.t

	...and a line:
	mmmmmmmoxmxxeeie
	...DID match it
	
	NOTE: beware of a pattern that is JUST one character
	followed by *

	a*

	...that will match ANY line!!!! (asks for a line with 0 or more a's)

        Do you really want 1 or more?

	aa* will match a line with 1 or more lowercase a's
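
A sketch that tries the * patterns above. The file star.txt and its
lines are hypothetical, chosen so that each pattern has something to
accept and something to reject; the moxie line is the one from class.

```shell
# Hypothetical scratch file, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'ac' 'abc' 'abbbc' 'axc' 'xyz' > "$demo_dir/star.txt"

# a, then 0 or more b's, then c
ab_star_c=$(grep 'ab*c' "$demo_dir/star.txt")

# a* asks for 0 or more a's, so EVERY line matches (even xyz)
zero_or_more=$(grep -c 'a*' "$demo_dir/star.txt")

# aa* asks for 1 or more a's, so xyz drops out
one_or_more=$(grep -c 'aa*' "$demo_dir/star.txt")

# * after a bracketed set: an m, then 0 or more of m/o/x/i/e, and nothing else
moxie=$(printf '%s\n' 'mmmmmmmoxmxxeeie' 'moxie!' | grep '^m[moxie]*$')

echo "$ab_star_c"
echo "a*: $zero_or_more lines   aa*: $one_or_more lines"
echo "$moxie"

rm -rf "$demo_dir"
```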

*   [set] - a set of characters in square brackets matches any
            single character from that set
	    ...my reference for this called this a BRACKET EXPRESSION

            *   ranges are allowed

                [c1-c2] matches any one of the set of characters
                        in the range c1 to c2, inclusive

            *   there are some special classes similar to those we
	        saw in globbing, written as [[:desired_class:]]

		[[:lower:]] - matches one lowercase letter
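
A sketch of the three kinds of bracket expression above. The file
sets.txt and its lines are hypothetical, invented for the demo. A
digit range is used because letter ranges can behave differently
from one locale to another.

```shell
# Hypothetical scratch file, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' 'cat' 'cot' 'cut' 'c9t' 'cAt' > "$demo_dir/sets.txt"

# a plain set: c, then a or o, then t
plain_set=$(grep 'c[ao]t' "$demo_dir/sets.txt")

# a range: c, then any one digit 0 through 9, then t
digit_range=$(grep 'c[0-9]t' "$demo_dir/sets.txt")

# a class: c, then any one lowercase letter, then t
lower_class=$(grep 'c[[:lower:]]t' "$demo_dir/sets.txt")

echo "$plain_set"
echo "$digit_range"
echo "$lower_class"

rm -rf "$demo_dir"
```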


*   NOTE: the matching mechanism for BREs in UNIX/Linux is clever enough
    to consider the whole line when testing for a match --

    ^a.*b.c$   # line starts with a
                 then 0 or more of anything
		 then b
		 then any one character
		 then c at the end of the line

    axybbcc will match this (a, then xyb for the .*, then b,
    then c for the ., then the final c)

    general rule: when a BRE can be matched in more than one way,
                  the longest possible matching sequence will be used
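
One way to SEE the longest-match rule is to have sed (which also uses
BREs) put brackets around whatever the pattern matched. This is just
a sketch; the shorter pattern a.*b is an assumption chosen for the
demo, since it could stop at either b in axybbcc.

```shell
line=axybbcc

# The whole-line pattern from above does match this line.
if printf '%s\n' "$line" | grep -q '^a.*b.c$'
then
    whole_line=match
else
    whole_line=no-match
fi

# a.*b could stop at the first b (axyb) or the second (axybb);
# the longest possible match is used. In sed, & in the replacement
# stands for the matched text.
longest=$(printf '%s\n' "$line" | sed 's/a.*b/[&]/')

echo "$whole_line"
echo "$longest"
```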