CS 279 - Week 8 Lecture 2 - 2022-10-11
TODAY WE WILL:
* announcements/reminders
* a bit more on grep
* start discussion of UNIX/Linux BRE
(basic regular expressions)
* prep for next class
=====
a bit more on grep
=====
* DEFAULT behavior:
* if you don't provide any of the options -n, -c, -l, or -q and provide
just ONE file, grep's output is JUST the matching lines
* if you provide more than one file,
each matching line is preceded by the pathname of the
file as it was specified on the command line
* no surprise, grep has LOTS of options -- here are just a
few:
* -n precede each matching line by its line number within
its file (the file name is added only when more than one
file is given)
* -c only show a count of the matching lines
(you'll get 0's for files with no matches!)
* -l only show the names of the files containing matching
strings (nice for building a file list for a for loop!)
* -s suppress error messages for nonexistent or
unreadable files
* -q run quietly -- don't write ANYTHING to standard
output (!!), but exit with exit status 0 if
any input lines are selected (so, testable after the
fact)
* -v select the lines that DON'T match!
* -i ignore the case of letters in making comparisons
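A minimal sketch of the default output and the options above. The files
pets.txt and farm.txt and their contents are made up for illustration;
they are not files from class.

```shell
#!/bin/sh
# Hypothetical sample files, created in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'cat\nDog\npig says oink\n' > pets.txt
printf 'cow\npig says oink\noink oink\n' > farm.txt

# One file, no options: just the matching lines.
one_file=$(grep oink pets.txt)

# Two files: each matching line is preceded by its file's pathname.
two_files=$(grep oink pets.txt farm.txt)

# -n: precede each matching line by its line number.
numbered=$(grep -n oink pets.txt)

# -c: a count of matching lines per file (0 for files with no matches).
counts=$(grep -c Dog pets.txt farm.txt)

# -l: only the names of the files that contain a match.
names=$(grep -l cow pets.txt farm.txt)

# -q: no output at all; test the exit status instead.
if grep -q oink farm.txt; then quiet=found; else quiet=missing; fi

# -v: the lines that do NOT match; -i: ignore case.
inverted=$(grep -v oink pets.txt)
nocase=$(grep -i dog pets.txt)

printf '%s\n' "$one_file" "$two_files" "$numbered" "$counts" \
    "$names" "$quiet" "$inverted" "$nocase"
```

Because -l prints one file name per line, its output can feed a for loop,
as long as the file names contain no whitespace.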
=====
Basic Regular Expressions (BREs) (UNIX/Linux style)
=====
* regular expression: defines a pattern of text to be matched
* several UNIX/Linux utilities expect you to specify patterns
as regular expressions (REs)
* grep, sed, ed, several others
* 2 basic categories:
Basic Regular Expressions (BREs)
* understood by "older" UNIX programs (such as ed, grep, sed)
Extended Regular Expressions (EREs)
* an extension of BREs, recognized by egrep (same as grep -E)
=====
BREs
=====
* first: in general, any non-special character in a BRE matches that
character in the text
grep oink *.txt # find lines with o then i then n then k within them
* SOME special characters are special ANYWHERE they appear in a pattern
. * [ \
* SOME special characters are ONLY special under particular conditions
* ^ is special only if it appears at the beginning of a pattern
* $ is special at the END of a pattern
* the character that terminates a pattern is special throughout
that pattern
* you CAN turn off a special character's meaning -- and just match
that character -- by escaping it with a backslash
'cheap at $9.98' # but the . is special! How match . specifically?
'cheap at $9\.98' # now can match . specifically
(and the $ is only special if it is at the END of a pattern)
* \ - escapes the special meaning of the character following it
IF it is special
what if the following character is not special?
... yikes, backslash behavior is UNDEFINED in that case,
so please try to avoid that in BREs!
* . - the dot matches any single non-null character
(so, the BRE version of globbing's ?)
IN CLASS, this BRE needed to be in quotes to work
as expected (within grep, at least) -- unquoted, the shell
removes the backslash before grep ever sees it:
grep "o\.n" animals.txt # to JUST match o, then dot, then n
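A small sketch of the escaping and quoting points above. The file names
and contents here are hypothetical sample data (this animals.txt is NOT
the one used in class).

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf 'o.n\noxn\n' > "$dir/animals.txt"
printf 'cheap at $9.98\ncheap at $9x98\n' > "$dir/prices.txt"

# Unescaped dot: matches any single character, so both lines are selected.
any_char=$(grep 'o.n' "$dir/animals.txt")

# Escaped dot inside quotes: grep receives o\.n and matches only a real dot.
literal=$(grep 'o\.n' "$dir/animals.txt")

# No quotes: the shell strips the backslash, so grep receives o.n again.
unquoted=$(grep o\.n "$dir/animals.txt")

# $ in the MIDDLE of a pattern is an ordinary character; only the dot
# needs escaping here. Single quotes keep the shell from touching $9.
price=$(grep 'cheap at $9\.98' "$dir/prices.txt")

printf '%s\n' "$any_char" "$literal" "$unquoted" "$price"
```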
* ^ - this character at the BEGINNING of the OUTERMOST RE matches
the BEGINNING of a line (anywhere else, ^ matches ^)
...that is, we want lines that START with some pattern
* $ - this character at the END of an OUTERMOST RE matches
the END of a line (anywhere else, $ matches $)
...that is, we want lines that END with some pattern
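A quick sketch of the two anchors, using a made-up file words.txt
(hypothetical name and contents):

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf 'pig\nbig pig\npigpen\na^b\n' > "$dir/words.txt"

# ^ at the beginning: lines that START with pig.
starts=$(grep '^pig' "$dir/words.txt")

# $ at the end: lines that END with pig.
ends=$(grep 'pig$' "$dir/words.txt")

# Both anchors: the line must be exactly pig.
exact=$(grep '^pig$' "$dir/words.txt")

# ^ anywhere else in the pattern just matches a literal ^.
caret=$(grep 'a^b' "$dir/words.txt")

printf '%s\n' "$starts" "$ends" "$exact" "$caret"
```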
* * - * has a slightly DIFFERENT MEANING in REs than in globbing!
in REs, * goes with the character preceding it --
matches 0 or more instances of THAT character
grep ab*c *.txt # matches lines with an a
# followed by 0 or more bs
# followed by a c
* can also follow a set of characters in square brackets
[moxie] - matches one of m or o or x or i or e
[moxie]* - matches 0 or more m's o's x's i's or e's in ANY combination
we tried:
grep "^m[moxie]*$" moxie-play.t
...and a line:
mmmmmmmoxmxxeeie
...DID match it
NOTE: beware of a pattern that is JUST one character
followed by *
a*
...that will match ANY line!!!! (asks for a line with 0 or more a's)
Do you really want 1 or more?
aa* will match a line with 1 or more lowercase a's
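A sketch of the * behavior described above, including the a* versus aa*
warning. The file sample.txt and its contents are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data: six lines in a scratch directory.
dir=$(mktemp -d)
printf 'ac\nabc\nabbbc\nxyz\nmmmmmmmoxmxxeeie\nmoxie2\n' > "$dir/sample.txt"

# a, then 0 or more b's, then c: selects ac, abc, and abbbc.
star=$(grep 'ab*c' "$dir/sample.txt")

# Whole line must be m followed by 0 or more characters from the set;
# moxie2 is rejected because 2 is not in the set.
set_star=$(grep '^m[moxie]*$' "$dir/sample.txt")

# a* is satisfied by ZERO a's, so every line is selected.
zero_ok=$(grep -c 'a*' "$dir/sample.txt")

# aa* needs at least one a, so only lines containing an a are selected.
one_plus=$(grep -c 'aa*' "$dir/sample.txt")

printf '%s\n' "$star" "$set_star" "$zero_ok" "$one_plus"
```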
* [set] - a set of characters in square brackets matches any
single character from that set
...my reference for this called this a BRACKET EXPRESSION
* ranges are allowed
[c1-c2] matches any one of the set of characters
in the range c1 to c2, inclusive
* there are some special classes similar to those we
saw in globbing, written as [[:desired_class:]]
[[:lower:]] - matches one lowercase letter
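A sketch of the three kinds of bracket expression above. The file
cats.txt is made up, and setting LC_ALL=C is an assumption added here so
that ranges and classes behave the same on any machine.

```shell
#!/bin/sh
# Assumption: use the C locale so ranges and classes are predictable.
LC_ALL=C
export LC_ALL

# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf 'cat\ncot\ncut\nc4t\nCat\n' > "$dir/cats.txt"

# Plain set: c, then ONE of a or o, then t.
set_match=$(grep 'c[ao]t' "$dir/cats.txt")

# Range: c, then one digit from 0 through 9, then t.
range_match=$(grep 'c[0-9]t' "$dir/cats.txt")

# Class: lines that start with one lowercase letter (so not Cat).
lower_match=$(grep '^[[:lower:]]' "$dir/cats.txt")

printf '%s\n' "$set_match" "$range_match" "$lower_match"
```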
* NOTE: the matching mechanism for BREs in UNIX/Linux is clever enough
to consider the whole line when testing for a match --
^a.*b.c$ # line starts with a
then 0 or more of anything
then b
then any one character
then c at the end of the line
axybbcc will match this;
general rule: when a BRE can be matched in more than one way,
the longest possible matching sequence will be used
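One way to SEE the longest-match rule is with sed, which also uses BREs:
in a substitution, & stands for exactly the text the pattern matched. This
is only an illustrative sketch; sed substitution was not part of this
lecture.

```shell
#!/bin/sh
# The whole-line pattern from above does select axybbcc (count is 1).
whole=$(printf 'axybbcc\n' | grep -c '^a.*b.c$')

# a.*b could stop at the first b (axyb) or the second (axybb);
# the brackets around & show that the LONGER match is the one used.
longest=$(printf 'axybbcc\n' | sed 's/a.*b/[&]/')

printf '%s\n' "$whole" "$longest"
```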