LING/C SC/PSYC 438/538 Lecture 9 9/22 Sandiway Fong
Administrivia • Review • Paragraph/sentence homework • Set of states construction exercises
Prime Number Testing using Perl Regular Expressions [Thanks to Mark Tokutomi…]
• I mentioned that regular expressions are equivalent to regular grammars and finite state automata
• But Perl has added so many bells and whistles to its flavor of regular expressions that this equivalence no longer holds.
• In fact, you can code up prime number testing using Perl regular expressions… lots of references on the web.
• If we represent a number in unary notation, e.g. 5 = “11111”
  • /^(11+?)\1+$/ matches any number greater than 1 that is not prime
  • Key to making this work: the \1 backreference
• L = {1^n | n is prime} is not a regular language
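The unary trick can be tried directly. What follows is a minimal sketch, not code from the slides: the helper name `is_prime_unary` is an assumption made for illustration. It builds the unary string and treats a failed match as "prime".

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the slides): returns 1 if $n is prime.
sub is_prime_unary {
    my ($n) = @_;
    return 0 if $n < 2;          # 0 and 1 are not prime
    my $unary = '1' x $n;        # e.g. 5 becomes "11111"
    # The regex matches when the string is two or more copies of a block
    # of two or more 1s, i.e. when $n is composite.
    return $unary =~ /^(11+?)\1+$/ ? 0 : 1;
}

print join(' ', grep { is_prime_unary($_) } 1 .. 30), "\n";
# prints: 2 3 5 7 11 13 17 19 23 29
```

The backreference \1 has to repeat whatever the group captured, which is exactly what a finite state automaton cannot do for arbitrarily long blocks.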
Homework Review
• From lecture 6
• Task 1 438/538 (15pts)
  • write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt
• “Raw text”
  • Blank lines separate paragraphs
  • Sentences are not in 1-to-1 correspondence with lines
• We can do fairly well. But a perfect solution would require “solving” language.
Background
• Article
  • Article247_499.txt comes from the American National Corpus (ANC) (Release 2)
• ANC sentence boundary labeling
  • All sentence markup was automatically produced by the sentence splitter included in the GATE system, and there are occasional errors, usually due to the presence of unrecognized abbreviations in mid-sentence.
  • The sentence splitter also puts punctuation appearing after the terminating period (e.g., closing quotation mark, closing parenthesis) outside the sentence boundary.
• Open ANC (OANC)
  • All annotations were originally produced automatically using GATE's ANNIE system.
  • Some of the texts in the OANC include manually validated sentence boundaries.
Background Stand-off annotation:
Background
• p1 has 4 sentences according to the official automatic sentence boundary detector
Homework Review
Let’s develop this program step-by-step…
• We want to initialize and maintain two counters
  • $p: number of paragraphs
  • $s: number of sentences
• Paragraph Counting
  • Blank lines separate paragraphs
  • Regular expression: /^\s*\n$/
  • i.e. a line containing only a newline (possibly preceded by superfluous whitespace)
• Start with the file reading code template:
    open(my $txt, '<', $ARGV[0]) or die "can't open $ARGV[0]!\n";
    my $p = 0;
    my $s = 0;
    while (my $line = <$txt>) {
    }
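Homework Review
• Paragraph counting
  • blank line: /^\s*\n$/
  • consecutive blank lines count as one
• Programming technique:
  • use a flag to signal whether last line was blank
• Code:
    open(my $txt, '<', $ARGV[0]) or die "can't open $ARGV[0]!\n";
    my $p = 0;
    my $s = 0;
    my $lastblank = 1;                 # flag: was the previous line blank?
    while (my $line = <$txt>) {
        if ($line =~ /^\s*\n$/) {
            $p++ unless $lastblank;    # blank line right after text ends a paragraph
            $lastblank = 1;
        } else {
            $lastblank = 0;
        }
    }
    $p++ unless $lastblank;            # last paragraph, if the file has no final blank line
    print "Number of paragraphs: $p\n";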
Homework Review
• Sentence counting
Basics
• In English, periods, question marks and exclamation marks (not present in supplied article) are used to mark the end of a sentence
  • [.!?]
• followed optionally by closing double or single quotes and/or right parentheses, square brackets, curly braces etc. (not all present in supplied article)
  • []"')}]*
• followed by white space or end of the line
  • ($|\s)
First cut
• Increment sentence counter if we find a sentence ending period…
    $s++ if $line =~ /[.!?][]"')}]*($|\s)/;
• Examples:
  • U.N. Security
  • U.N. inspectors
  • R.J. Reynolds
  • James Q. Wilson,
  • Georgia Gov. Zell
  • strategy.
  • teens. This
  • Times . The
  • created."
  • paraphernalia." And
  • already?"
  • thrilled.")
  • Really Juvenile Reynolds
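To see what the first-cut pattern accepts, it can be run over a few of the examples listed on this slide. This is only a sketch; the helper name `has_sentence_end` is an assumption, not something from the homework.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the first-cut pattern matches anywhere in $line.
sub has_sentence_end {
    my ($line) = @_;
    # terminator, optional closing quotes/brackets, then whitespace or end of line
    return $line =~ /[.!?][]"')}]*($|\s)/ ? 1 : 0;
}

for my $ex ('teens. This', 'created."', 'thrilled.")', 'U.N. Security',
            'Really Juvenile Reynolds') {
    printf "%-26s %s\n", $ex, has_sentence_end($ex) ? 'match' : 'no match';
}
```

The pattern also fires on U.N. Security, which is a false positive, and never fires on the headline, which has no period at all.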
Homework Review
• Debugging help
  • Useful to print out what preceded the matching regular expression
    if ($line =~ /[.!?][]"')}]*($|\s)/) {
        $s++;
        print "<$`>\n";
    }
• Output:
    <R.J>
    <persistently attempted to market cigarettes to teens>
    <national story at the Los Angeles Times >
    <leads with the U.N>
    <promises to allow U.N>
    <sites>
    Paragraph 2, 6 sentences
    <other paraphernalia>
    <introduction of "Joe Camel" fits in to this strategy>
    Paragraph 3, 2 sentences
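The debugging idea relies on Perl's prematch variable $`, which holds the text to the left of the most recent successful match. A small sketch follows; the helper name `text_before_end` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text preceding the first candidate
# sentence end in $line, or undef if the pattern does not match.
sub text_before_end {
    my ($line) = @_;
    if ($line =~ /[.!?][]"')}]*($|\s)/) {
        return $`;    # prematch: everything to the left of the match
    }
    return undef;
}

print "<", text_before_end('persistently attempted to market cigarettes to teens. This is also the top'), ">\n";
# prints: <persistently attempted to market cigarettes to teens>
print "<", text_before_end('leads with the U.N. Security'), ">\n";
# prints: <leads with the U.N>
```

The second call shows why the printout helps: the prematch ends in U.N, which exposes the abbreviation being counted as a sentence end.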
Homework Review
• Examples:
  • U.N. Security
  • U.N. inspectors
  • R.J. Reynolds
  • James Q. Wilson,
  • Georgia Gov. Zell
  • strategy.
  • teens. This
  • Times . The
  • created."
  • paraphernalia." And
  • already?"
  • thrilled.")
  • Really Juvenile Reynolds
• Heuristics
  • Inexact methods (not foolproof)
  • Spot
    • Initials (Q.)
    • Abbreviations (U.N.) and titles (Mr., Ms., Ms)
    • headlines (the title, Really Juvenile Reynolds, has no period)
• Be careful… Consider
  • Call IBM.
  • I like the letter J.
• Suppose the headline counts as one sentence. At the end of each paragraph:
    $p++;
    $s++ if ($s == 0);
    print "Paragraph $p, $s sentence(s)\n";
    $s = 0;
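The headline rule can be combined with the paragraph flag into a per-paragraph report. The sketch below makes several assumptions: the helper name `paragraph_report` is invented, it works on a list of lines rather than a file handle, and it still uses the first-cut pattern, so it counts at most one sentence per line.

```perl
use strict;
use warnings;

# Hypothetical helper: takes text lines, returns one report string per paragraph.
sub paragraph_report {
    my @lines = @_;
    my @report;
    my ($p, $s) = (0, 0);
    my $lastblank = 1;                      # start of input acts like a blank line
    for my $line (@lines, "\n") {           # extra blank line closes the last paragraph
        if ($line =~ /^\s*\n$/) {
            if (!$lastblank) {
                $p++;
                $s++ if $s == 0;            # no sentence end seen: count a headline as one
                push @report, "Paragraph $p, $s sentence(s)";
                $s = 0;
            }
            $lastblank = 1;
        } else {
            $s++ if $line =~ /[.!?][]"')}]*($|\s)/;   # first cut, one match per line
            $lastblank = 0;
        }
    }
    return @report;
}

print "$_\n" for paragraph_report(
    "Really Juvenile Reynolds\n", "\n", "It ended.\n", "So did this.\n");
# prints:
# Paragraph 1, 1 sentence(s)
# Paragraph 2, 2 sentence(s)
```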
Homework Review
• Examples:
  • U.N. Security
  • U.N. inspectors
  • R.J. Reynolds
  • James Q. Wilson,
  • Georgia Gov. Zell
• Heuristic
  • A period doesn’t end a sentence if immediately preceded by a capital letter
  • works for all the outstanding cases in the sample article except Gov.
  • Be careful: the heuristic misses sentences such as Call IBM. and I like the letter J.
• Implementation:
  • /[^A-Z][.!?][]"')}]*($|\s)/
  • or /(?<![A-Z])[.!?][]"')}]*($|\s)/
  • (?<!re) is a negative lookbehind assertion (see perlretut)
  • unlike [^A-Z], the lookbehind consumes no characters, so $` still holds all the text before the punctuation (better compatibility with $`)
• Block Gov. explicitly:
    if ($line =~ /(?<![A-Z])[.!?][]"')}]*($|\s)/) {
        $s++ if $` !~ /\bGov$/;
    }
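The capital-letter heuristic and the explicit Gov. check can be tried on the examples from this slide. This is a sketch only; the helper name `ends_sentence` is an assumption, and like the slide's code it looks only at the first candidate on a line.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the first candidate ending on $line
# counts as a sentence end under the heuristic.
sub ends_sentence {
    my ($line) = @_;
    if ($line =~ /(?<![A-Z])[.!?][]"')}]*($|\s)/) {
        my $before = $`;                       # text before the punctuation
        return $before =~ /\bGov$/ ? 0 : 1;    # block "Gov." explicitly
    }
    return 0;
}

for my $ex ('U.N. Security', 'James Q. Wilson,', 'Georgia Gov. Zell',
            'teens. This', 'Call IBM.') {
    printf "%-20s %s\n", $ex, ends_sentence($ex) ? 'sentence end' : 'no sentence end';
}
```

Call IBM. comes out as "no sentence end", which is the known weakness of the heuristic rather than a bug in the sketch.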
Homework Review
• Examples
    "There is nothing in the [Unabomber] manifesto that looks at
    all like the work of a madman. The language is clear, precise and calm. The
• Problem
  • multiple sentence ending periods on one line
• One solution:
  • use global matching //g with a while loop
  • see perlretut
• E.g.
    while ($line =~ /[.!?][]"')}]*($|\s)/g) {
        $s++;
    }
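Global matching combines naturally with the earlier heuristics. In the sketch below the helper name `count_sentence_ends` is an assumption. In a while loop with //g, each iteration resumes where the previous match stopped, and $` always holds everything before the current match.

```perl
use strict;
use warnings;

# Hypothetical helper: counts all sentence ends on one line.
sub count_sentence_ends {
    my ($line) = @_;
    my $s = 0;
    while ($line =~ /(?<![A-Z])[.!?][]"')}]*($|\s)/g) {
        my $before = $`;                     # text before this particular match
        $s++ unless $before =~ /\bGov$/;     # skip the abbreviation "Gov."
    }
    return $s;
}

print count_sentence_ends(
    'all like the work of a madman. The language is clear, precise and calm. The'), "\n";
# prints: 2
```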
Homework Review
• Answer:
  • Paragraph 1, 1 sentence(s)
  • Paragraph 2, 3 sentence(s)
  • Paragraph 3, 2 sentence(s)
  • Paragraph 4, 2 sentence(s)
  • Paragraph 5, 1 sentence(s)
  • Paragraph 6, 2 sentence(s)
  • Paragraph 7, 2 sentence(s)
  • Paragraph 8, 3 sentence(s)
  • Paragraph 9, 1 sentence(s)
  • Paragraph 10, 2 sentence(s)
  • Paragraph 11, 5 sentence(s)
  • Paragraph 12, 1 sentence(s)
Homework Review
• Task 2 438/538 (15pts)
  • Modify your Perl program to produce XML paragraph and sentence boundary markup for Article247_499.txt
  • i.e. produces reformatted raw text as
      <p>
      <s>sentence 1</s>
      <s>sentence 2</s>
      </p>
      …
• Revisions
  • No need to maintain counter for paragraphs or sentences
  • Re-use variable $s to accumulate bits of current sentence
  • Print </p> whenever we have a blank line that wasn’t preceded by another blank line
      while (my $line = <$txt>) {
          if ($line =~ /^\s*\n$/) {
              print "</p>\n" if (!$lastblank);
              $lastblank = 1;
          } else {
              …
  • Use a function trim to take out leading and trailing spaces before printing string inside <s> … </s>
      sub trim {
          my $str = shift;
          $str =~ s/^\s+//;
          $str =~ s/\s+$//;
          return $str;
      }
Homework Review
• Revisions
  • Print <p> whenever we see a first non-blank line
      while (my $line = <$txt>) {
          if ($line =~ /^\s*\n$/) {
              print "</p>\n" if (!$lastblank);
              $lastblank = 1;
          } else {
              …
              print "<p>\n" if $lastblank == 1;
  • Split the line according to the sentence-ending regular expression but add extra parentheses to retain the split pattern in the array
      $line =~ s/\n/ /;
      @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
  • Notes:
    • convert newline into space first
    • ($|\s) is also stored! (discard it)
• Example:
    persistently attempted to market cigarettes to teens. This is also the top
  gets split into
  • a[0] <persistently attempted to market cigarettes to teens>
  • a[1] <. >
  • a[2] < >
  • a[3] <This is also the top >
• Example:
    sites.
  gets split into
  • a[0] <sites>
  • a[1] <. >
  • a[2] < >
• Use $s .= shift @a; to append a[0] and a[1] onto the end of $s
• Discard a[2] using shift @a;
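The effect of the extra parentheses is easy to check in isolation. In this sketch the helper name `split_line` is an assumption. Perl's split returns the text captured by each group in the separator pattern along with the fields, which is why every sentence end contributes two extra array items.

```perl
use strict;
use warnings;

# Hypothetical helper: converts the newline to a space, then splits on
# sentence ends while keeping the captured separator text.
sub split_line {
    my ($line) = @_;
    $line =~ s/\n/ /;
    return split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
}

my @a = split_line("sites.\n");
printf "a[%d] <%s>\n", $_, $a[$_] for 0 .. $#a;
# prints:
# a[0] <sites>
# a[1] <. >
# a[2] < >
```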
Homework Review
• Revisions
• Example:
    all like the work of a madman. The language is clear, precise and calm. The
  gets split into
  • a[0] <all like the work of a madman>
  • a[1] <. >
  • a[2] < >
  • a[3] <The language is clear, precise and calm>
  • a[4] <. >
  • a[5] < >
  • a[6] <The >
• Code (shift and discard):
    @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
    if ($#a == 0) {
        $s .= $line;                    # no sentence end on this line
    } else {
        while ($#a >= 0) {
            if ($#a == 0) {
                $s .= shift @a;         # trailing fragment, continued on the next line
            } elsif ($#a >= 2) {
                my $gov = $a[0] =~ /\bGov$/;
                $s .= shift @a;         # sentence text
                $s .= shift @a;         # punctuation (plus following space)
                shift @a;               # superfluous ($|\s)
                next if $gov;           # "Gov." does not end the sentence
                print "<s>" . trim($s) . "</s>\n";
                $s = "";
            } else {
                die "unexpected split result\n";
            }
        }
    }
    $lastblank = 0;
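The fragments from the Task 2 slides can be assembled into one runnable sketch. Several things here are assumptions rather than the course solution: the helper name `mark_up`, the choice to take a list of lines and return a list of output lines instead of reading a file and printing, and the rule that leftover text with no sentence end (such as the headline) is flushed as its own <s> element when the paragraph closes.

```perl
use strict;
use warnings;

# Remove leading and trailing whitespace.
sub trim {
    my $str = shift;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

# Hypothetical helper: raw text lines in, markup lines out.
sub mark_up {
    my @lines = @_;
    my @out;
    my $s = "";            # pieces of the current sentence
    my $lastblank = 1;     # start of input acts like a blank line
    for my $line (@lines) {
        if ($line =~ /^\s*\n$/) {
            if (!$lastblank) {
                # assumption: leftover text (e.g. a headline) is one sentence
                push @out, "<s>" . trim($s) . "</s>" if trim($s) ne "";
                push @out, "</p>";
                $s = "";
            }
            $lastblank = 1;
            next;
        }
        push @out, "<p>" if $lastblank;
        $lastblank = 0;
        $line =~ s/\n/ /;
        my @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
        while (@a >= 3) {
            my $gov = $a[0] =~ /\bGov$/;
            $s .= shift @a;    # sentence text
            $s .= shift @a;    # punctuation
            shift @a;          # discard the inner ($|\s) capture
            next if $gov;      # "Gov." does not end the sentence
            push @out, "<s>" . trim($s) . "</s>";
            $s = "";
        }
        $s .= join "", @a;     # fragment that continues on the next line
    }
    if (!$lastblank) {         # input did not end with a blank line
        push @out, "<s>" . trim($s) . "</s>" if trim($s) ne "";
        push @out, "</p>";
    }
    return @out;
}

print "$_\n" for mark_up(
    "Really Juvenile Reynolds\n",
    "\n",
    "the company has persistently attempted to market cigarettes to\n",
    "teens. This is also the top national story. The U.N. Security\n",
    "Council voted.\n",
);
```

Returning a list instead of printing keeps the logic easy to check on small inputs before running it on the full article.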
Homework Review
• Output
    <p>
    <s>Really Juvenile Reynolds</s>
    </p>
    <p>
    <s>USA Today and the Washington Post lead with revelations from newly disclosed R.J. Reynolds internal documents that seem to show that the company has persistently attempted to market cigarettes to teens.</s>
    <s>This is also the top national story at the Los Angeles Times .</s>
    <s>The New York Times leads with the U.N. Security Council's vote telling Iraq to honor previous promises to allow U.N. inspectors complete access to suspected weapons sites.</s>
    </p>
    <p>
    <s>Social scientist James Q. Wilson, in a Times op-ed piece, makes the following argument for the sanity of Ted Kacszynski: "There is nothing in the [Unabomber] manifesto that looks at all like the work of a madman.</s>
    <s>The language is clear, precise and calm.</s>
    <s>The argument is subtle and carefully developed, lacking anything even faintly resembling the wild claims or irrational speculation that a lunatic might produce."</s>
    <s>Wilson also observes that besides the Unabomber's manifesto, his "skill in manufacturing bombs and the clever ways in which he concealed his identity suggest to me that he was clearly sane."</s>
    <s>Of course, the legal usefulness of these observations is somewhat dubious, because this attempt to show that Theodore Kacszynski is fit to stand trial depends on the assumption that he is the Unabomber, which in turn requires a trial first, which first requires showing that he is fit to stand trial, and so on.</s>
    </p>
    <p>
    <s>With NBC's deal to retain "ER" at a cost of $13 million an episode getting front-page coverage at the NYT , LAT , and the WSJ , maybe the next big domestic policy issue will be controlling television health care costs.</s>
    </p>
    …
• Your output should resemble this…
Homework Review • That’s all folks! • (If you were a non-programmer before this class, and you made it this far – Congratulations! You are entitled to call yourself a programmer now…)