LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538 Lecture 7 Sandiway Fong

Today's Topics
• A note on Unicode
• Homework 2 review
• Homework 3: due next Monday 11:59pm
  • usual rules: one PDF file etc.
• Perl regex contd.

Unicode and \w
• Recall: \w is [0-9A-Za-z_]
• Experiment:
    use utf8;
    use open qw(:std :utf8);
    my $a = "school écoleÉcolešolatrườngस्कूलškoleโรงเรียน";
    @words = ($a =~ /(\w+)/g);
    foreach $word (@words) { print "$word\n" }
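To contrast Unicode-aware \w with the ASCII-only class recalled above, here is a minimal sketch. It assumes Perl 5.14 or later (for the /a modifier). The sample string is a shortened, hypothetical variant of the one in the experiment, written with \x{...} escapes so that the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# "school école škola", with the accented letters written as escapes.
# The character \x{161} (š) is above 255, so Perl stores the whole
# string as Unicode characters and applies Unicode rules to it.
my $text = "school \x{e9}cole \x{161}kola";

# Default on a Unicode string: \w matches any Unicode word character.
my @unicode_words = ($text =~ /(\w+)/g);

# With /a, \w is restricted to [0-9A-Za-z_].
my @ascii_words = ($text =~ /(\w+)/ga);

binmode(STDOUT, ':encoding(UTF-8)');
print "Unicode \\w: @unicode_words\n";
print "ASCII \\w:   @ascii_words\n";
```

With /a the accented letters act as word boundaries, so the second list contains "cole" and "kola" instead of the full words.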

Homework 2 Review • Sample data file: • First try.. just try to detect a repeated word

Homework 2 Review • Sample data file: • Sample output:

Homework 2 Review
• Key: think algorithmically…
• think of a specific example first
    w1 w2 w3 w4 w5
    Compare w1 with w2
    Compare w2 with w3
    Compare w3 with w4
    Compare w4 with w5

Homework 2 Review
• Generalize specific example, then code it up
• array @words: words_0, words_1, …, words_{n-1} (array indices start from 0…)
• “for” loop implementation:
    Compare w_1 with w_{1+1}
    Compare w_2 with w_{2+1}
    …
    Compare w_{n-2} with w_{n-2+1}
    Compare w_{n-1} with w_n
• Array indices (of the first word in each comparison) end just before $#words…
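The generalized loop can be sketched as follows. This is only an illustration under stated assumptions, not the code shown in class: the helper name repeated_words, the use of \w+ to pick out words, and the case-insensitive comparison are all hypothetical choices.

```perl
use strict;
use warnings;

# Return the words that are immediately followed by an identical word.
# (Hypothetical helper; comparison is case-insensitive by assumption.)
sub repeated_words {
    my ($line) = @_;
    my @words = ($line =~ /(\w+)/g);
    my @repeats;
    # $i stops just before $#words so that $words[$i+1] always exists.
    for (my $i = 0; $i < $#words; $i++) {
        push @repeats, $words[$i]
            if lc($words[$i]) eq lc($words[$i+1]);
    }
    return @repeats;
}

print join(", ", repeated_words("this is is a test of the the code")), "\n";
```

For the sample sentence this prints "is, the". A run such as "the the the" would be reported twice by this version, because each adjacent pair is tested separately.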

Homework 2 Review

Homework 2 Review a decent first pass …

Homework 2 Review • Sample data file: • Output:

Homework 2 Review • Second try.. merging multiple occurrences
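One possible way to merge multiple occurrences is to treat a run of identical adjacent words as a single unit. The sketch below is an assumption about what "merging" means here; the helper name repeated_runs and the "word xN" output format are hypothetical.

```perl
use strict;
use warnings;

# Report each run of identical adjacent words once, with its length.
# (Hypothetical helper; returns strings such as "the x3".)
sub repeated_runs {
    my ($line) = @_;
    my @words = ($line =~ /(\w+)/g);
    my @runs;
    my $i = 0;
    while ($i < @words) {
        my $j = $i;
        # Extend the run while the next word equals the first word of the run.
        $j++ while $j < $#words && lc($words[$j+1]) eq lc($words[$i]);
        push @runs, "$words[$i] x" . ($j - $i + 1) if $j > $i;
        $i = $j + 1;    # continue after the end of the run
    }
    return @runs;
}

print join(", ", repeated_runs("the the the cat sat sat")), "\n";
```

For the sample sentence this prints "the x3, sat x2", so the triple "the" is reported once.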

Homework 2 Review • Second try.. merging multiple occurrences • Sample data file: • Output:

Homework 2 Review • Third try.. implementing a simple table of exceptions
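A simple table of exceptions can be kept in a hash and consulted before a repeat is reported. In this sketch the exception entries (had, that) and the helper name repeated_words_except are illustrative assumptions, not the table used in class.

```perl
use strict;
use warnings;

# Words whose repetition is allowed (illustrative entries only).
my %exception = map { $_ => 1 } qw(had that);

# Like a plain adjacent-word check, but skips words listed in %exception.
sub repeated_words_except {
    my ($line) = @_;
    my @words = ($line =~ /(\w+)/g);
    my @repeats;
    for (my $i = 0; $i < $#words; $i++) {
        my $w = lc $words[$i];
        next if $exception{$w};           # repetition of this word is fine
        push @repeats, $words[$i] if $w eq lc($words[$i+1]);
    }
    return @repeats;
}

print join(", ", repeated_words_except("he said that that was the the end")), "\n";
```

For the sample sentence only "the" is reported; "that that" is accepted because "that" is in the table.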

Homework 2 Review • Third try.. table of exceptions • Sample data file: • Output:

Homework 3
Corpus file WSJ9_00x.txt (Tipster Vol 1.):
<DOC>
<DOCNO> WSJ891102-0170 </DOCNO>
<DD> = 891102 </DD>
<AN> 891102-0170. </AN>
<HL> International: @ Australian Firm's Purchase </HL>
<DD> 11/02/89 </DD>
<SO> WALL STREET JOURNAL (J) </SO>
<CO> A.FHF MBIO </CO>
<IN> TENDER OFFERS, MERGERS, ACQUISITIONS (TNM) </IN>
<DATELINE> SYDNEY </DATELINE>
<TEXT>
F.H. Faulding & Co., an Australian pharmaceuticals company, said its Moleculon Inc. affiliate acquired Kalipharma Inc. for $23 million. Kalipharma is a New Jersey-based pharmaceuticals concern that sells products under the Purepac label. Faulding said it owns 33% of Moleculon's voting stock and has an agreement to acquire an additional 19%. That stake, together with its convertible preferred stock holdings, gives Faulding the right to increase its interest to 70% of Moleculon's voting stock.
</TEXT>
</DOC>
Skip

Homework 3
• Write a Perl program with regular expressions to extract all the dollar amounts from the file within the <TEXT> </TEXT> markups.
• Examples (actual):
  • $23 million
  • $38.3 billion
  • C$1,000
  • $95,142
  • $3.01
  • $38.375
  • $45
  • 16.125 Canadian dollars

Homework 3
• Submit:
  • your program
  • document your regex
  • how many dollar amounts you extracted
  • The largest dollar amount
  • The smallest dollar amount
  • The median dollar amount (assume, for simplicity, that C$ and US$ are worth the same)
  • Appendix: list all the dollar amounts
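For the median, a minimal sketch is given below. It assumes the extracted amounts have already been converted to plain numbers (for example, "$23 million" to 23000000), which is left to the regex part of the homework. The sub name median is a hypothetical choice.

```perl
use strict;
use warnings;

# Median of a list of numbers: the middle value after numeric sorting,
# or the mean of the two middle values when the count is even.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;                 # no values, no median
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Four of the sample amounts from the slide, as plain numbers.
print median(45, 3.01, 95142, 38.375), "\n";
```

The numeric comparison { $a <=> $b } matters here: the default sort compares strings, which would order 95142 before 45.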

Shortest vs. Greedy Matching
from http://www.perl.com/doc/manual/html/pod/perlre.html
• Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) { print "matched <$1>\n"; }
• Output:
  • matched <d is under the bar in the >
• Notes:
  • default variable $_ is also the default variable for matching: /foo(.*)bar/ is implicitly $_ =~ /foo(.*)bar/
  • variable $1 refers to the parenthesized part of the match (.*)

Shortest vs. Greedy Matching
• default behavior in Perl RE match:
  • take the longest possible matching string
  • aka greedy matching
• This behavior can be changed, see next slide

Shortest vs. Greedy Matching
from http://www.perl.com/doc/manual/html/pod/perlre.html
• Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*?)bar/ ) { print "matched <$1>\n"; }
• Output:
  • matched <d is under the >
• Notes:
  • ? immediately following a repetition operator like * (or +) makes the operator work in non-greedy mode

Shortest vs. Greedy Matching
from http://www.perl.com/doc/manual/html/pod/perlre.html
• Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*?)bar/ ) { print "matched <$1>\n"; }
• Output:
  • greedy (.*): matched <d is under the bar in the >
  • shortest (.*?): matched <d is under the >
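The two modes can be run side by side on the same sentence. This is a small sketch that uses an explicit variable ($sentence, a hypothetical name) instead of the default variable $_.

```perl
use strict;
use warnings;

my $sentence = "The food is under the bar in the barn.";

# Greedy: .* extends as far as possible, so the match ends at the last "bar".
my ($greedy) = $sentence =~ /foo(.*)bar/;

# Shortest: .*? stops at the first position where "bar" can follow.
my ($shortest) = $sentence =~ /foo(.*?)bar/;

print "greedy:   matched <$greedy>\n";
print "shortest: matched <$shortest>\n";
```

In list context a match returns its captured groups, which is why each result is assigned through a parenthesized my ($var).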

Shortest vs. Greedy Matching
• RE search is supposed to be fast
  • but search time is not necessarily proportional to the length of the input being searched
  • in fact, Perl RE matching can take exponential time (in length)
• non-deterministic
  • may need to backtrack (revisit) if it matches incorrectly part of the way through
[Figure: two plots of time against length, one linear and one exponential]

Global Matching: scalar context g flag in the condition of a while-loop
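A minimal sketch of the g flag in the condition of a while-loop, using a hypothetical sample string. Each evaluation of the condition finds the next match, and pos() reports the offset where that match ended.

```perl
use strict;
use warnings;

my $text = "cat dog house";   # hypothetical sample string
my @found;

# In scalar context, each /g match resumes where the previous one ended.
# The loop stops when no further match is found.
while ($text =~ /(\w+)/g) {
    push @found, "$1 ends at " . pos($text);
}

print "$_\n" for @found;
```

This prints one line per word: "cat ends at 3", "dog ends at 7" and "house ends at 13".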

Global Matching: list context g flag in list context
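In list context, a match with the g flag returns all matches at once. The sketch below uses a hypothetical sample string and shows how the result depends on the number of capture groups.

```perl
use strict;
use warnings;

my $text = "x=1, y=22, z=333";   # hypothetical sample string

# No capture groups: the list holds every whole match.
my @numbers = ($text =~ /\d+/g);

# One capture group: the list holds what the group captured each time.
my @names = ($text =~ /(\w)=/g);

# Two capture groups: captures come back pairwise, which fits a hash.
my %value_of = ($text =~ /(\w)=(\d+)/g);

print "@numbers\n";
print "@names\n";
print "$_ => $value_of{$_}\n" for sort keys %value_of;
```

The first two lines of output are "1 22 333" and "x y z"; the hash maps x to 1, y to 22 and z to 333.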

Split
• @array = split /re/, string
  • splits string into a list of substrings, using matches of re as the separators. Each substring is stored as an element of @array.
• Examples (from perlrequick tutorial):

Split
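A few uses of split can be sketched as follows. The sample strings are hypothetical and are not taken from the perlrequick tutorial.

```perl
use strict;
use warnings;

# Split on runs of whitespace.
my @words = split /\s+/, "the quick  brown fox";

# Split on a comma followed by optional whitespace.
my @numbers = split /,\s*/, "1.5, 2.25,3";

# A capturing group in the pattern keeps the separators in the result.
my @pieces = split /([-+])/, "10+20-3";

print join("|", @words), "\n";
print join("|", @numbers), "\n";
print join("|", @pieces), "\n";
```

The three output lines are "the|quick|brown|fox", "1.5|2.25|3" and "10|+|20|-|3".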

Matched Positions
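One way to obtain match positions in Perl is through the built-in arrays @- and @+, which hold the start and end offsets of the last successful match ($-[0] and $+[0]) and of each capture group ($-[1], $+[1], and so on). A minimal sketch with a hypothetical sample string:

```perl
use strict;
use warnings;

my $text = "The food is under the bar";   # hypothetical sample string
my ($start, $end, $g2_start, $g2_end);

if ($text =~ /(foo)d is (under)/) {
    # Offsets of the whole match: start, and one past the last character.
    ($start, $end) = ($-[0], $+[0]);
    # Offsets of the second capture group.
    ($g2_start, $g2_end) = ($-[2], $+[2]);

    print "match at $start..$end: ",
          substr($text, $start, $end - $start), "\n";
    print "group 2 at $g2_start..$g2_end: ",
          substr($text, $g2_start, $g2_end - $g2_start), "\n";
}
```

Offsets count from 0, so the whole match "food is under" spans 4..17 and the second group "under" spans 12..17.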
