CSC 4630 Meeting 21 April 4, 2007
Return to Perl • Where are we? • What is confusing? • What practice do you need?
Ray’s Problem Given a string of the form: 1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 b 9 = 100 replace the 8 b’s with • one plus sign • two minus signs • five empty strings (an empty string means close up the spacing so that the adjacent digits form one number) and find which replacements yield a true statement.
Ray’s Problem (2) Thoughts on the answer: • 1234-56-78+9 = 100 is an example of a candidate string (this one is false, since the left side evaluates to 1109) • How many possible strings are there? • Proof by exhaustion may be the best approach
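Proof by exhaustion is cheap here. Choosing which of the 8 slots get the plus sign, the two minus signs and the five empty strings gives 8!/(1! 2! 5!) = 168 candidate strings. The sketch below is one way to do the search, not necessarily the one intended in class: it walks all 3^8 operator assignments, keeps those with the right counts, and evaluates each one. The names `ray_candidates` and `ray_value` are made up for this example.

```perl
use strict;
use warnings;

# Return every candidate left-hand side with exactly one '+',
# two '-' and five empty strings between the digits 1..9.
sub ray_candidates {
    my @ops = ('', '+', '-');
    my @found;
    for my $code (0 .. 3**8 - 1) {
        # Read $code as an 8-digit base-3 number: one digit per slot.
        my ($n, $expr, $plus, $minus) = ($code, '1', 0, 0);
        for my $digit (2 .. 9) {
            my $op = $ops[$n % 3];
            $n = int($n / 3);
            $plus++  if $op eq '+';
            $minus++ if $op eq '-';
            $expr .= $op . $digit;
        }
        push @found, $expr if $plus == 1 && $minus == 2;
    }
    return @found;
}

# Evaluate a string of digits, '+' and '-' by summing its signed numbers.
sub ray_value {
    my ($expr) = @_;
    my $total = 0;
    $total += $1 while $expr =~ /([+-]?\d+)/g;
    return $total;
}

my @candidates = ray_candidates();
my @solutions  = grep { ray_value($_) == 100 } @candidates;
print scalar(@candidates), " candidates\n";
print "$_ = 100\n" for @solutions;
```

Filtering all 6561 assignments is wasteful compared with generating only the 168 valid ones, but it keeps the code short and the search space is tiny.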
Regular Expressions Revisited Returning to a fundamental structure • Theoretically defined • Implemented in grep, egrep, • Implemented in awk, gawk, nawk • Implemented in Perl
RE(2) • Theoretically, an RE defines a set of strings on an alphabet • In implementation, matching with an RE checks whether the current string is an element of a set of strings that is constructed from the strings defined theoretically.
RE(3) • A single character c • Theoretically defines the set of strings {c} • Which generates the set of matching lines {ScT}, where S and T are arbitrary, possibly empty strings. • In implementation, • grep c somelines returns ______________ • awk '/c/' somelines returns ______________ • if (/c/) {print $_;} returns ______________
RE(4) so grep c somelines is equivalent to perl re1 < somelines where re1 is the Perl program while (<STDIN>) { if (/c/) {print $_;} }
RE(5) • Theoretically if r and s are regular expressions defining languages L and M respectively, then • rs defines the language LM, meaning concatenate a string in L with a string in M • Hence, • grep abc somelines • awk '/abc/' somelines • while (<STDIN>) { if (/abc/) {print $_;} }
RE(6) all return the lines that are contained in the set {SabcT} where S and T are arbitrary, possibly empty strings. Details: /a/ defines {a}, /b/ defines {b}, /c/ defines {c} /abc/ defines {abc} by concatenation Lines matching /abc/ are in {SabcT}
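The grep, awk and Perl forms can be compared without a separate input file. A minimal sketch, assuming a hypothetical helper `lines_matching` and made-up sample lines:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match a pattern, which is
# what grep, awk and the while (<STDIN>) loop each do with their input.
sub lines_matching {
    my ($pattern, @lines) = @_;
    return grep { /$pattern/ } @lines;
}

my @somelines = ("cat\n", "dog\n", "abacus\n", "drabcloth\n", "\n");

print "Lines in {ScT}:\n",    lines_matching('c',   @somelines);
print "Lines in {SabcT}:\n",  lines_matching('abc', @somelines);
```

Note that "abacus" contains a, b and c but not the concatenation abc, so it is in {ScT} and not in {SabcT}.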
RE(7) • The * operator shows that the previous simple regular expression is repeated 0 or more times. • /ab*c/ defines the language formed as the union of the languages defined by /ac/, /abc/, /abbc/, /abbbc/, etc. This is the set {a b^n c | n = 0, 1, 2, …}, where b^n means b repeated n times (an infinite set) • Hence /ab*c/ matches any string of the form S a b^n c T
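A quick check of the first few members of that infinite set. This is only a sketch; `ab_n_c` is a name invented here, and S and T are stand-ins for arbitrary surrounding text.

```perl
use strict;
use warnings;

# Build the string a, then n copies of b, then c (n = 2 gives "abbc").
sub ab_n_c {
    my ($n) = @_;
    return 'a' . ('b' x $n) . 'c';
}

for my $n (0 .. 3) {
    my $line = 'S' . ab_n_c($n) . 'T';    # S and T stand in for arbitrary text
    print "$line matches\n" if $line =~ /ab*c/;
}
print "adc does not match\n" unless 'adc' =~ /ab*c/;
```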
RE(8) • The symbol . designates any character in the alphabet (What is the alphabet we’re using?) except \n which stands for newline. (A Perl definition, check for the various shells and the various awks). • Thus . defines the language A-{\n} • And . matches any line that contains at least one character. Officially an empty line looks like \n and every line ends with \n
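The claim about empty lines is easy to test in Perl. A small sketch, with a helper name (`has_a_character`) chosen just for this example:

```perl
use strict;
use warnings;

# True (1) when the line contains at least one character other than newline.
sub has_a_character {
    my ($line) = @_;
    return $line =~ /./ ? 1 : 0;
}

print has_a_character("x\n"), "\n";    # a line holding one character
print has_a_character(" \n"), "\n";    # a blank counts as a character
print has_a_character("\n"),  "\n";    # an empty line: only the newline
```

This reflects Perl's default behavior; other tools may treat the newline differently, as the slide warns.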
RE(9) Exercise: Construct all possible lines of text that will not be matched by /a./ Exercise: Construct all possible lines of text that will be matched by /.a.b./ Exercise: Regardless of their content, what lines of text will not be matched by /.a.b./?
RE(10) Character Classes • Any set of characters enclosed in brackets • The vowels [aeiou] • Any range of consecutive ASCII coded characters enclosed in brackets • The lower case letters [a-z] • The digits [0-9] • The hex digits [0-9A-F]
RE(12) • Including special characters in the set • To get ], use \] or []a-z] (Think about reading this string character by character to learn its meaning.) • To get -, use \- or [a-z-] • Complementing (not complimenting) a set • Use ^ as leading character, [^0-9] or [^aeiou] • More special characters • To get ^, use \^ or place it away from the first position [a-z^_]
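One way to see how these special characters behave is to pull out of a test string exactly the characters a class accepts. The sketch below assumes Perl's reading of the bracket rules; the helper `chars_in_class` and the test string are invented for illustration.

```perl
use strict;
use warnings;

# Return, in order, the characters of $text that belong to the class.
sub chars_in_class {
    my ($class, $text) = @_;
    return join '', $text =~ /($class)/g;
}

my $text = 'a-b]c^d_7';
print chars_in_class(qr/[]a-z]/,  $text), "\n";    # ] placed first is literal
print chars_in_class(qr/[a-z-]/,  $text), "\n";    # - placed last is literal
print chars_in_class(qr/[^0-9]/,  $text), "\n";    # leading ^ complements
print chars_in_class(qr/[a-z^_]/, $text), "\n";    # ^ elsewhere is literal
```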
RE(13) The Matching Game: • [0123456789] • [0-9] • [0-9\-] • [a-z0-9] • [a-zA-Z0-9_] • [^0-7] • [^A-M.,;] • [^\^] • [0 - 9] • [.]
RE(14) Short character set names • \d means [0-9] • \D means [^0-9] • \w means [a-zA-Z0-9_] (identifier characters) • \W means [^a-zA-Z0-9_] • \s means [ \r\t\n\f] • \S means [^ \r\t\n\f]
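Counting matches on a short ASCII string shows how each short name and its complement divide up the characters. A sketch only; `count_matches` and the sample text are assumptions made for this example.

```perl
use strict;
use warnings;

# Count how many times the pattern matches in the text.
sub count_matches {
    my ($pattern, $text) = @_;
    my $count = () = $text =~ /$pattern/g;
    return $count;
}

my $text = "id_42 = 7;\tok\n";    # 14 characters, all ASCII
for my $name ('\d', '\D', '\w', '\W', '\s', '\S') {
    printf "%s matches %d characters\n", $name, count_matches(qr/$name/, $text);
}
```

Each class and its complement together account for all 14 characters of the sample.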
RE(15) More repetition symbols • b* means zero or more repetitions of b, as does b{0,} • b+ means one or more repetitions of b, as does b{1,} • b? means zero or one repetitions of b, as does b{0,1} • b{5,8} means five, six, seven or eight repetitions of b • b{4} means exactly four repetitions of b
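The repetition symbols can be compared by asking which run lengths of b each one accepts as a complete string. In this sketch the helper `whole_match` is hypothetical; it anchors the pattern at both ends so that only the whole string counts.

```perl
use strict;
use warnings;

# True (1) when the pattern matches the entire string, not just part of it.
sub whole_match {
    my ($pattern, $text) = @_;
    return $text =~ /\A(?:$pattern)\z/ ? 1 : 0;
}

for my $pattern ('b*', 'b+', 'b?', 'b{5,8}', 'b{4}') {
    my @lengths = grep { whole_match($pattern, 'b' x $_) } 0 .. 9;
    print "$pattern accepts lengths: @lengths\n";
}
```

Without the anchors every one of these patterns would find a match inside a long enough run of b's, which is why the anchors are needed for this comparison.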
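RE(16) • Splitting a string split(/:/,$line) divides $line into substrings at the colons and places the substrings in a list (array) Note: Two adjacent colons :: produce an empty string. split(/:+/,$line) treats a run of colons as one separator, so it divides $line into nonempty substrings (except that a line starting with a colon still yields a leading empty string). In both forms, trailing empty strings are dropped unless a negative limit is given as a third argument.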
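The difference between the two patterns shows up as soon as the fields are printed with visible delimiters. A small sketch; `show_fields` and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Wrap each field in brackets so that empty strings are visible.
sub show_fields {
    my (@fields) = @_;
    return join ' ', map { '[' . $_ . ']' } @fields;
}

my $line = 'root::0:0';
print show_fields(split(/:/,  $line)), "\n";    # the :: yields an empty field
print show_fields(split(/:+/, $line)), "\n";    # a run of colons is one separator
```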
Andy’s Problem Lines from a text file look like • 105028|Adam Mrugalski|AJM Residential|1067 Shoecraft rd|Webster|NY|14580||||||ajmresidential@yahoo.com||No||No|||Thu Dec 21 21:23:23 2006| • 105029|robert ritchey|robert industries|po box 472|crockett |ca|94525|510-787-7290|||||send2rr@gmail.com||No||No|||Fri Dec 22 02:54:54 2006| • 105030|Jack Still|WISE TV|PO BOX 280|Coeburn|VA|24230|2763959339|||||wisetv19@msn.com||No||No||9feet 1inch floor to floor. Connects to balcony. Need oak 4 feet round with landing at top. Send me a quote. J. Still WISE TV |Fri Dec 22 03:18:19 2006|
Andy (2) The lines need to be cleaned and parsed into several reports: • Phone contact information • Email contact information • Address labels • Full database, checking for unique entries
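A possible first step is to split each record on the vertical bar and trim stray blanks. The sketch below is not a full solution: the field positions are guesses read off the sample lines (no layout is documented), and `parse_record` is a name invented here. Because | is a special character in a regular expression, it is escaped in the split pattern.

```perl
use strict;
use warnings;

# Split one record on '|' and trim blanks around each field.
# The limit -1 keeps trailing empty fields, which split would otherwise drop.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\|/, $line, -1;
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

# Assumed positions, inferred from the samples: name, city, phone, email.
my ($NAME, $CITY, $PHONE, $EMAIL) = (1, 4, 7, 12);

my $line = '105029|robert ritchey|robert industries|po box 472|crockett |ca|94525|510-787-7290|||||send2rr@gmail.com||No||No|||Fri Dec 22 02:54:54 2006|';
my @fields = parse_record($line);
print "Phone: $fields[$NAME], $fields[$PHONE]\n" if $fields[$PHONE] ne '';
print "Email: $fields[$NAME], $fields[$EMAIL]\n" if $fields[$EMAIL] ne '';
```

For the uniqueness check, one option would be a hash keyed on a cleaned field such as the email address, but which field identifies a duplicate is a decision the sample data alone does not settle.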