LING/C SC/PSYC 438/538


  1. LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong

  2. Administrivia • Homework • out today • Due next Monday (September 20th) by midnight

  3. Shortest vs. Greedy Matching • default behavior • in Perl RE match: longest possible matching string • aka “greedy matching” • This behavior can be changed, see following slide • RE search is supposed to be fast • but search time is not necessarily proportional to the length of the input being searched • in fact, Perl RE matching can take exponential time (in the length of the input) • non-deterministic • may need to backtrack (revisit earlier choices) if a partial match fails part of the way through • [Figure: two plots of time against length, one linear and one exponential]

  4. Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*)bar/ ) { print "got <$1>\n"; } • Notes: • $_ is the default variable for matching • $1 refers to the parenthesized part of the match (.*) • Output: • got <d is under the bar in the >

  5. Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print "got <$1>\n"; } • Notes: • ? immediately following a repetition operator like * makes the operator work in non-greedy mode • Output: • got <d is under the >
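The two examples above can be run side by side. This is a minimal sketch; the helper name `capture` is hypothetical and exists only to make the comparison easy to check.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the first capture group matched, or undef.
sub capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $text = "The food is under the bar in the barn.";

my $greedy = capture(qr/foo(.*)bar/,  $text);   # .*  takes as much as it can
my $lazy   = capture(qr/foo(.*?)bar/, $text);   # .*? takes as little as it can

print "greedy: got <$greedy>\n";   # got <d is under the bar in the >
print "lazy:   got <$lazy>\n";     # got <d is under the >
```

Both patterns start matching at the same `foo`; they differ only in which later `bar` ends the match.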

  6. Split • @array = split /re/, string • splits string into a list of substrings split by re. Each substring is stored as an element of @array. • Examples (from perlrequick tutorial):
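The examples for this slide did not survive in the transcript. The following sketch is written in the style of the perlrequick tutorial and may not match the ones originally shown; the variable names are illustrative.

```perl
use strict;
use warnings;

# Split on runs of whitespace.
my @words = split /\s+/, "Calvin and Hobbes";
print join("|", @words), "\n";     # Calvin|and|Hobbes

# Split on a comma followed by optional whitespace.
my @consts = split /,\s*/, "1.618,2.718,   3.142";
print join("|", @consts), "\n";    # 1.618|2.718|3.142
```

In both cases the text matched by the RE is discarded and only the pieces between matches end up in the array.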

  7. Split • More examples: m!re! (using ! – or some other character - as a RE delimiter) Is equivalent to /re/
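An alternative delimiter is most useful when the RE itself contains slashes. A small sketch, assuming a made-up path string as input:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";   # illustrative input, not from the slides

# With / as the delimiter, each slash inside the RE must be escaped.
my $with_slash = $path =~ /^\/usr\/local\// ? 1 : 0;

# With ! as the delimiter (the leading m is then required), no escaping is needed.
my $with_bang  = $path =~ m!^/usr/local/! ? 1 : 0;

# The same notation works for split; grep drops the empty leading field.
my @parts = grep { length } split m!/!, $path;

print "$with_slash $with_bang @parts\n";   # 1 1 usr local bin perl
```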

  8. Words and Lines • Range abbreviations: • period (.) stands for any character (except newline) • \d (digit) = [0-9] • \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF) • \w (word character) = [0-9a-zA-Z_] • uppercase versions, e.g. \D and \W, denote negation • Line-oriented metacharacters: • caret (^) at the beginning of a regexp string matches the “beginning of a line” • dollar sign ($) at the end of a regexp string matches the “end of the line” • Word-oriented metacharacters: • a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z] • \b matches a word boundary: the position between a word character and a non-word character, e.g. at the beginning of a line or next to a whitespace character
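The abbreviations and anchors on this slide can be tried out on a single string. A minimal sketch; the sample line and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "Lecture 6 is on 9/13 in room_101";

my @digits = $line =~ /\d+/g;                 # runs of digits
my @words  = $line =~ /\b\w+\b/g;             # words, delimited by \b
my $starts = $line =~ /^Lecture/ ? 1 : 0;     # ^ anchors at the beginning
my $ends   = $line =~ /101$/     ? 1 : 0;     # $ anchors at the end
my @fields = split /\W+/, "room_101: ok";     # \W is the negation of \w

print "digits: @digits\n";                    # digits: 6 9 13 101
print "words:  ", scalar(@words), "\n";       # words:  8
print "anchors: $starts $ends\n";             # anchors: 1 1
print "fields: @fields\n";                    # fields: room_101 ok
```

Note that `room_101` counts as one word because the underscore is a word character, while `9/13` yields two words because `/` is not.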

  9. Homework • Theme: dealing with raw text • File: data/written_1/journal/slate/3/Article247_499.txt • (ANC – American National Corpus: 100 million words) • Genre: journal, (Slate Magazine article from 1998) • Sample: • Really Juvenile Reynolds • USA • Today and the Washington Post lead with revelations from newly disclosed • R.J. Reynolds internal documents that seem to show that the company has • persistently attempted to market cigarettes to teens. This is also the top • national story at the Los Angeles Times . The New York Times • leads with the U.N. Security Council's vote telling Iraq to honor previous • promises to allow U.N. inspectors complete access to suspected weapons • sites. • The new tobacco documents (many of them marked "Secret"), released as part • of a lawsuit settlement, show a company strategy of attracting teenagers • through advertising and various youth-oriented promotions such as, according to • USAT , "NASCAR sponsorship," "inner city activities," and "T-shirts and • other paraphernalia." And says USAT , the documents show that RJR's • introduction of "Joe Camel" fits in to this strategy.

  10. Homework • One of the first steps in processing raw text is to clean and mark it up (xml) • Task 1 438/538 (15pts) • write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt (download from class webpage) • See next slide for output format • Discuss what the technical problems are with sentence boundary markup and describe your solution. • e.g. what regular expressions you are going to use • Submit your program and its output on Article247_499.txt

  11. Homework Help • Useful code fragment • use previously described template: open($txtfile,$ARGV[0]) or die "$ARGV[0] not found!\n"; while ($line = <$txtfile>) { do RE stuff with $line } • Example: perl processfile.pl Article247_499.txt
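The template above can be filled in as follows. This is a sketch rather than a homework solution: the helper `count_matching_lines`, the in-memory fallback sample, and the choice of RE are all assumptions made so that the script runs without the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines read from $fh that match $re.
sub count_matching_lines {
    my ($fh, $re) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ $re;      # "do RE stuff with $line"
    }
    return $count;
}

# Read the file named on the command line if there is one; otherwise
# fall back to a small in-memory sample so the sketch runs as is.
my $sample = "R.J. Reynolds internal documents\n\nsites.\n";
my $source = @ARGV ? $ARGV[0] : \$sample;
open(my $txtfile, '<', $source) or die "cannot open input: $!\n";
print count_matching_lines($txtfile, qr/\.\s*$/), " line(s) end in a period\n";
close($txtfile);
```

The three-argument form of `open` with a lexical filehandle is used here in place of the two-argument form on the slide; both read the file the same way.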

  12. Homework Help • <$txtfile> reads in a line of text including the newline (\n) character • so lines are one character longer than you might think • The real world is messy • Article247_499.txt is not quite uniform: sentences are split across lines, and it may contain extra whitespace and invisible characters that a regular text editor will not show. • The file Article247_499.txt you are given is actually not quite raw text • I’ve pre-converted it to ASCII (UTF-8) for you to make life a bit easier • Original was in UTF-16 (big-endian) with nasty non-printable BOM (U+FEFF) and null characters
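The points above suggest tidying each line before matching against it. A minimal sketch, assuming the only debris is a BOM, null characters, line endings and repeated blanks; which invisible characters the actual file contains is not stated here, and the helper name `clean_line` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one raw input line.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;      # drop a leading BOM, if one is present
    $line =~ s/\0//g;            # drop null characters
    $line =~ s/\r?\n\z//;        # drop the line ending (LF or CRLF)
    $line =~ s/[ \t]+/ /g;       # collapse runs of spaces and tabs
    return $line;
}

my $raw = "sites.  \r\n";
printf "%d -> %d characters\n", length($raw), length(clean_line($raw));   # 10 -> 7 characters
```

The built-in `chomp` removes a trailing newline as well; the substitution is used here because it also handles a carriage return before the newline.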

  13. Homework Help • You will need to determine how you’re going to pattern match paragraph separators and end of sentences. Input Delimiter http://www.bayview.com/blog/2002/07/29/input-delimiter/
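One way to match paragraph separators is to change Perl's input record separator `$/`. Setting it to the empty string selects "paragraph mode", in which one or more blank lines end a record. This is a sketch of that one option, not the required approach; the helper name and the sample text are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in paragraph mode and return the paragraphs.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";                 # paragraph mode: blank line(s) end a record
    my @paragraphs;
    while (my $para = <$fh>) {
        $para =~ s/\s+\z//;        # strip the trailing newlines
        push @paragraphs, $para if length $para;
    }
    return @paragraphs;
}

my $text = "Really Juvenile Reynolds\n\nUSA\nToday and the Washington Post\n\n\nsites.\n";
open(my $fh, '<', \$text) or die "cannot open in-memory sample: $!\n";
my @paras = read_paragraphs($fh);
close($fh);
print scalar(@paras), " paragraphs\n";    # 3 paragraphs
```

Paragraph mode only recognizes completely empty lines. A line that holds nothing but spaces or tabs does not separate paragraphs, so such lines would need to be normalized first.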

  14. Homework • Sample: • Really Juvenile Reynolds • USA • Today and the Washington Post lead with revelations from newly disclosed • R.J. Reynolds internal documents that seem to show that the company has • persistently attempted to market cigarettes to teens. This is also the top • national story at the Los Angeles Times . The New York Times • leads with the U.N. Security Council's vote telling Iraq to honor previous • promises to allow U.N. inspectors complete access to suspected weapons • sites. • The new tobacco documents (many of them marked "Secret"), released as part • of a lawsuit settlement, show a company strategy of attracting teenagers • through advertising and various youth-oriented promotions such as, according to • USAT , "NASCAR sponsorship," "inner city activities," and "T-shirts and • other paraphernalia." And says USAT , the documents show that RJR's • introduction of "Joe Camel" fits in to this strategy. • Note: Assume blank lines separate paragraphs • Output Format: • Paragraph 1: No. of sentences: 1 • Paragraph 2: No. of sentences: 3 • Paragraph 3: No. of sentences: 3 • etc.

  15. Homework • Task 2 438/538 (15pts) • Modify your Perl program to produce xml paragraph and sentence boundary markup for Article247_499.txt • i.e. produces reformatted raw text as • <p> • <s>sentence 1</s> • <s>sentence 2</s> • </p> … • Each <s>..</s> should occupy exactly one line of your output. • Leading and trailing spaces of a sentence should be deleted, e.g. • <s> The new tobacco … • vs. <s>The new tobacco … • Submit your program and its output on Article247_499.txt (Cut and paste everything from both tasks into one file for submission)
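The formatting requirements above (one `<s>..</s>` per line, no leading or trailing spaces) can be sketched as follows. The sketch assumes the sentences have already been separated, which is the hard part of the task and is not shown; the helper name is hypothetical, and XML special characters such as `&` and `<` are not escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap already-separated sentences in <p>/<s> markup.
sub mark_up_paragraph {
    my (@sentences) = @_;
    my $out = "<p>\n";
    for my $s (@sentences) {
        $s =~ s/\s+/ /g;          # join text that was split across lines
        $s =~ s/^\s+|\s+$//g;     # delete leading and trailing spaces
        $out .= "<s>$s</s>\n";
    }
    return $out . "</p>\n";
}

print mark_up_paragraph("  The new tobacco\ndocuments show a strategy. ",
                        "And says USAT.");
```

Collapsing internal whitespace first guarantees that a sentence spread over several input lines still occupies exactly one line of output.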
