CS 279 - Week 9 Lecture 1 - 2022-10-17

TODAY WE WILL
*   announcements/reminders
*   ASIDE: one of several ways to read all the lines in a file
*   ASIDE: another place to use REs: the =~ operator
*   continuing Linux/UNIX REs:
    *   a few more BRE options
    *   a few Extended RE (ERE) options
*   prep for next class

*   Should be working on Homework 5!

*   Current Reading:
    *   LDP Bash Beginners' Guide - Chapter 4 - Regular Expressions
    *   2021 course text: Section II - Chapter 19 - Sections 19.1, 19.2
        *   has SOME of the BRE material

=====
ASIDE: one of several ways to read all the lines in a given file
====
*   Bash has an odd while loop version that can
    read all the lines from a file:

    while read desired_line_variable
    do
        ... $desired_line_variable ...
    done < desired_file_name

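    *   note: a plain read strips leading and trailing blanks from
        each line (and treats backslashes as escape characters);
        to keep each line exactly as it appears in the file, write
        the loop header as:
        while IFS= read -r desired_line_variable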
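
*   a minimal sketch of the loop above, run two ways so the handling
    of leading and trailing blanks is visible; the temporary file and
    its two lines are made up for this illustration and are not from
    the lecture:

```shell
#!/bin/bash
# Hypothetical sample file, created only for this sketch.
sample_file=$(mktemp)
printf '  indented line  \nplain line\n' > "$sample_file"

# Plain read: leading and trailing blanks are dropped from each line.
stripped=""
while read line
do
    stripped="${stripped}[${line}]"
done < "$sample_file"

# IFS= read -r: each line is kept exactly as it appears in the file.
preserved=""
while IFS= read -r line
do
    preserved="${preserved}[${line}]"
done < "$sample_file"

echo "$stripped"     # [indented line][plain line]
echo "$preserved"    # [  indented line  ][plain line]

rm -f "$sample_file"
```

    the square brackets are only there to make the blanks easy to see.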

=====
ASIDE: another place to use REs: the =~ operator
=====
*   FUN FACT: you can use an RE in an if-statement's test
    by using the =~ operator!

*   NOTES:
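    *   this is a test expression that needs to be in [[ ]]
        (not single [ ] )

    *   syntax:
        [[ given_string =~ desired_re ]]

        *   this will be true if given_string matches
            the given desired_re

        *   bash treats desired_re here as an Extended RE (ERE)

        *   do NOT put quotes around the desired_re here --
            bash matches a quoted pattern as a literal string,
            not as an RE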
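
*   a small sketch of =~ in an if-statement, assuming a made-up
    variable course_code and a made-up pattern (neither is from the
    lecture); it runs the same test unquoted and quoted to show the
    difference:

```shell
#!/bin/bash
# Hypothetical value to test; not from the lecture.
course_code="CS279"

# Keep the RE in a variable, then expand it UNQUOTED after =~
desired_re='^CS[0-9][0-9][0-9]$'

if [[ $course_code =~ $desired_re ]]
then
    unquoted_result="match"
else
    unquoted_result="no match"
fi

# Quoting the pattern makes bash compare it as a literal string,
# so the brackets, ^ and $ lose their RE meaning.
if [[ $course_code =~ "$desired_re" ]]
then
    quoted_result="match"
else
    quoted_result="no match"
fi

echo "unquoted: $unquoted_result"    # unquoted: match
echo "quoted:   $quoted_result"      # quoted:   no match
```

    storing the RE in a variable first is one common way to avoid
    quoting problems; it is a design choice of this sketch, not a
    requirement.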

=====
a few more BRE options
=====
subexpressions!
=====
*   answer the question: what if you want to find
    a pattern with a REPEATED bit within it?

*   you can enclose a portion of an RE between the markers
    \(   \)

    *   this construct -- the \( and pattern and \)
        is called a SUBEXPRESSION

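    *   LATER in the enclosing pattern you can MATCH the same text
        that the subexpression matched by writing a BACKREFERENCE \n,
        where n is a digit between 1 and 9
        (\1 refers to the 1st subexpression, \2 to the 2nd, and so on)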

*   examples:

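    g\([a-z]*\)\&\1

    *   (& is an ordinary character in a BRE, so the backslash
        before it is not actually needed)

    grep "g\([a-z]*\)\&\1" play.txt
    *   matches:
        g&                       (the subexpression matched nothing)
        gnat&nat
        oooogoober&ooberahhhh

    *   will NOT match:
        ga&b                     (the a is not repeated after the &)
        nat&nat                  (there is no g before the nat)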
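
*   the match / no-match lists above can be checked with a short
    script; this is only a sketch, using a temporary file in place
    of play.txt and writing the & without a backslash (an assumption
    of this sketch, since & is not special in a BRE):

```shell
#!/bin/bash
# Temporary stand-in for play.txt, holding the five sample lines.
play_file=$(mktemp)
printf '%s\n' 'g&' 'gnat&nat' 'oooogoober&ooberahhhh' 'ga&b' 'nat&nat' > "$play_file"

# \( \) marks the subexpression; \1 must repeat whatever it matched.
matched_lines=$(grep 'g\([a-z]*\)&\1' "$play_file")
match_count=$(grep -c 'g\([a-z]*\)&\1' "$play_file")

echo "$matched_lines"
echo "lines matched: $match_count"    # lines matched: 3

rm -f "$play_file"
```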

====
interval expressions
====
*   these are good for when you want to match a definite
    number of things (not just 0, 1, or many...)

    ...this works for 0-or-1 also, written as \{0,1\}

*   You can follow a single character,
    or an RE denoting a single character,
    by one of the following forms, called an INTERVAL expression:

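    \{m\}   \{m,\}  \{m,n\}

    *   here, m and n must be NON-NEGATIVE integers LESS THAN 256
        (the limit POSIX guarantees; some implementations allow more),
        and in \{m,n\}, m must not be larger than n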

    *   If S is the set containing EITHER the single character
        OR the characters that match the RE,

        \{m\} - denotes EXACTLY m occurrences of characters belonging
                to S

                [0-9]\{2\} - matches a sequence of exactly 2 digits

        \{m,\} - denotes AT LEAST m occurrences of characters belonging
                 to S

                [0-9]\{2,\} - matches a sequence of 2 OR MORE digits

        \{m,n\} - denotes BETWEEN m and n occurrences (inclusive)
                  of characters belonging to S

                [0-9]\{2,4\} - matches a sequence of 2, 3, or 4 digits
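
*   a sketch of the three interval forms with grep, using a made-up
    temporary file (its contents are an assumption for illustration);
    the patterns are anchored with ^ and $ because, without anchors,
    [0-9]\{2\} would also match 2 digits INSIDE a longer run of digits:

```shell
#!/bin/bash
# Hypothetical sample lines: one letter followed by 1, 2, 3, or 5 digits.
sample_file=$(mktemp)
printf '%s\n' 'a7' 'b42' 'c123' 'd98765' > "$sample_file"

# -c makes grep print the NUMBER of matching lines.
exactly_two=$(grep -c '^[a-z][0-9]\{2\}$' "$sample_file")      # b42
two_or_more=$(grep -c '^[a-z][0-9]\{2,\}$' "$sample_file")     # b42 c123 d98765
two_to_four=$(grep -c '^[a-z][0-9]\{2,4\}$' "$sample_file")    # b42 c123

echo "exactly 2: $exactly_two"    # exactly 2: 1
echo "2 or more: $two_or_more"    # 2 or more: 3
echo "2 to 4:    $two_to_four"    # 2 to 4:    2

rm -f "$sample_file"
```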

=====
EREs - EXTENDED regular expressions
=====
*   note: these are extensions on the BRE syntax --
    they do not work everywhere that BREs work,
    so beware!

*   for example, to use them with grep,
    you can use grep -E (or the older egrep, which recent
    versions of GNU grep warn is obsolescent)

*   an ERE follows the rules for a BRE with the following
    ADDITIONS and CHANGES:
    *   two REs separated by a | match an occurrence of
        EITHER of them (that is, this acts like or)

    *   UNQUOTED parentheses -- plain ( and ), with NO backslashes --
        are used for GROUPING subexpressions

        for example, to match any of:
        catfish catfight dogfish dogfight

        I can use the ERE:   (cat|dog)(fish|fight)

        for example:
        egrep "(cat|dog)(fish|fight)" play.txt
        grep -E "(cat|dog)(fish|fight)" play.txt

    *   e+ - matches 1 or more occurrences of an ERE e

        where e must be either a parenthesized subexpression
        OR an ERE that always matches exactly one character

        [A-Z][0-9]+ - matches strings containing an uppercase
                      letter followed by 1 or more digits

    *   e? - matches zero or one occurrences of the ERE e

        [abc]?[0-9] - matches zero or one of a or b or c
                      followed by exactly one digit
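
*   a sketch that tries the ERE additions above with grep -E; the
    temporary file and its lines are made up for illustration, and
    LC_ALL=C is set as an assumption of this sketch so that [A-Z]
    means only the 26 uppercase ASCII letters:

```shell
#!/bin/bash
# Assumption of this sketch: the C locale, so ranges like [A-Z] are ASCII-only.
export LC_ALL=C

# Hypothetical sample lines, one per line.
play_file=$(mktemp)
printf '%s\n' 'catfish' 'dogfight' 'dogfish' 'catdog' 'goldfish' \
              'A7' 'B2022' 'a5' 'ab5' '7' 'Zebra' > "$play_file"

# | and grouping: (cat or dog) immediately followed by (fish or fight)
grouped_count=$(grep -E -c '(cat|dog)(fish|fight)' "$play_file")    # catfish dogfight dogfish

# + : an uppercase letter, then 1 or more digits (whole line)
plus_count=$(grep -E -c '^[A-Z][0-9]+$' "$play_file")               # A7 B2022

# ? : zero or one of a, b, c, then exactly one digit (whole line)
optional_count=$(grep -E -c '^[abc]?[0-9]$' "$play_file")           # a5 7

echo "grouped:  $grouped_count"     # grouped:  3
echo "plus:     $plus_count"        # plus:     2
echo "optional: $optional_count"    # optional: 2

rm -f "$play_file"
```

    the + and ? patterns are anchored with ^ and $ so that a line
    such as ab5 is not counted just because part of it (b5) matches.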