LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong
Administrivia • Homework 2: graded, review today • Extra Credit Exercise 2: optional, due tonight by midnight • Homework 3: out today, due next Tuesday
Homework 2 Review • Output: • Sample data file: • First try.. just try to detect a repeated word
Homework 2 Review • Key: think algorithmically… • think of a specific example first: w1 w2 w3 w4 w5 • Compare w1 with w2 • Compare w2 with w3 • Compare w3 with w4 • Compare w4 with w5
Homework 2 Review • Generalize the specific example, then code it up • Compare w_1 with w_(1+1), w_2 with w_(2+1), …, w_(n-1) with w_n • Array indices start from 0: array @words holds words_0, words_1, …, words_(n-1) • “for” loop implementation: the loop index ends just before $#words, so the last comparison is words_(n-2) with words_(n-2+1)
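The generalized loop can be sketched in Perl as follows. This is a minimal sketch, not the code from the lecture: the sub name repeated_words is hypothetical, and it assumes words are separated by whitespace and compared case-insensitively.

```perl
use strict;
use warnings;

# Return every word that is immediately followed by a copy of itself.
# Assumptions: whitespace tokenization, case-insensitive comparison.
sub repeated_words {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @repeats;
    # $i stops just before $#words so that $words[$i+1] always exists
    for (my $i = 0; $i < $#words; $i++) {
        push @repeats, $words[$i] if lc($words[$i]) eq lc($words[$i+1]);
    }
    return @repeats;
}

print join(", ", repeated_words("this is is a test test of the the code")), "\n";
```

Note that a word occurring three times in a row is reported twice by this sketch, once per adjacent pair.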
Homework 2 Review a decent first pass …
Homework 2 Review • Sample data file: • Output:
Homework 2 Review • Second try.. merging multiple occurrences
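One way to merge multiple occurrences is to scan for runs rather than pairs. The sketch below is an illustration under the same assumptions as before (whitespace tokenization, case-insensitive comparison); the sub name repeated_runs and the [word, count] return shape are hypothetical choices, not the lecture's code.

```perl
use strict;
use warnings;

# Report each run of a repeated word once, as [word, run length].
sub repeated_runs {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @runs;
    my $i = 0;
    while ($i <= $#words) {
        my $j = $i;
        # advance $j to the last word of the run that starts at $i
        $j++ while $j < $#words && lc($words[$j+1]) eq lc($words[$i]);
        push @runs, [ $words[$i], $j - $i + 1 ] if $j > $i;
        $i = $j + 1;    # continue after the run
    }
    return @runs;
}

for my $run (repeated_runs("the the the cat sat sat down")) {
    my ($word, $count) = @$run;
    print "$word occurs $count times in a row\n";
}
```

Here "the the the" produces a single report with a count of 3 instead of two separate pair reports.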
Homework 2 Review • Second try.. merging multiple occurrences • Sample data file: • Output:
Homework 2 Review • Third try.. implementing a table of exceptions
Homework 2 Review • Third try.. table of exceptions • Sample data file: • Output: • Stepping outside the exception table: three or more occurrences
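A table of exceptions can be sketched as a hash lookup on top of the run detection. Everything specific here is an assumption made for illustration: the entries in %exception, the sub name flagged_runs, and the reading that an exception excuses exactly two occurrences while three or more are still flagged.

```perl
use strict;
use warnings;

# Hypothetical exception table: words that may legitimately occur twice in a row.
my %exception = map { $_ => 1 } qw(had that);

# Return [word, count] for each run that should still be flagged.
sub flagged_runs {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @flagged;
    my $i = 0;
    while ($i <= $#words) {
        my $j = $i;
        $j++ while $j < $#words && lc($words[$j+1]) eq lc($words[$i]);
        my $count = $j - $i + 1;
        # an exception excuses exactly two occurrences, never three or more
        my $excused = $count == 2 && $exception{ lc $words[$i] };
        push @flagged, [ $words[$i], $count ] if $count > 1 && !$excused;
        $i = $j + 1;
    }
    return @flagged;
}

print "$_->[0] x $_->[1]\n" for flagged_runs("he had had that that that is is");
```

In this example "had had" is excused, while "that that that" and "is is" are flagged.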
Today’s Topic • Named Entity Recognition (NER) • Homework 3 • using Perl regular expression matching on corpora to extract Named Entities
Named Entity Recognition (NER) • Named-Entity Recognition (NER), also known as identification and extraction, tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values*, percentages, etc. • *see Extra Credit Exercise 2 • [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]
Example WSJ9_002.txt
Illinois NER System • http://cogcomp.cs.illinois.edu/demo/ner/ • NLP systems might also compute: anaphora reference
Textbook • See JM Chapter 22: Information Extraction • 22.1 Named Entity Recognition • 22.2 Relation Detection and Classification • also Chapter 21 for Anaphora Resolution
Example 2 • use the corpus WSJ9_018.txt from the course webpage for your homework • consider only the body of articles delimited by <TEXT> … </TEXT>
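Restricting attention to article bodies can be sketched with a single non-greedy match. This is a minimal sketch: the sub name text_bodies is hypothetical, and the <DOC> and <HL> tags in the sample string are made up for illustration; only <TEXT> … </TEXT> comes from the slide.

```perl
use strict;
use warnings;

# Return the body of every <TEXT> ... </TEXT> region in a string.
sub text_bodies {
    my ($corpus) = @_;
    # /s lets . match newlines; .*? stops at the nearest closing tag
    my @bodies = $corpus =~ m{<TEXT>(.*?)</TEXT>}gs;
    return @bodies;
}

# Made-up sample input; the real corpus layout may differ.
my $sample = "<DOC>\n<HL> A headline </HL>\n<TEXT>\nFirst body.\n</TEXT>\n</DOC>\n"
           . "<DOC>\n<TEXT>\nSecond body.\n</TEXT>\n</DOC>\n";
print "[$_]\n" for text_bodies($sample);
```

To apply this to the corpus file, the whole file would first be read into one string, for example by setting `local $/;` before reading from the filehandle.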
Homework 3 • Task: • Write a Perl program to extract Named Entities from corpus WSJ9_018.txt and put them in a table with corpus frequency information. • Hint: use a hash table • Design regex(s) to distinguish between (1) persons and (2) all other proper noun sequences, e.g. organizations. • Hint: implement this by storing persons (PER) and other named entities (ONE) in separate tables
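The two-table hint can be illustrated with a deliberately crude sketch. The heuristics are assumptions, not part of the assignment: a person is taken to be a capitalized sequence introduced by one of a few titles, and any other sequence of two or more capitalized words counts as an other named entity. The sub name count_entities is hypothetical.

```perl
use strict;
use warnings;

# Count entities in $text; returns (\%per, \%one).
sub count_entities {
    my ($text) = @_;
    my (%per, %one);
    my $cap   = qr/[A-Z][a-z]+/;          # one capitalized word
    my $title = qr/(?:Mr|Mrs|Ms|Dr)\./;   # assumed list of person titles
    my $entity = qr/
        \b ( $title \s+ $cap (?: \s+ $cap )* )   # group 1: title plus name
      | \b ( $cap (?: \s+ $cap )+ )              # group 2: capitalized sequence
    /x;
    while ($text =~ /$entity/g) {
        if (defined $1) { $per{$1}++ }
        else            { $one{$2}++ }
    }
    return (\%per, \%one);
}

my ($per, $one) = count_entities(
    "Mr. John Smith met Ms. Jones at General Motors. Mr. John Smith left General Motors.");
print "PER $_: $per->{$_}\n" for sort keys %$per;
print "ONE $_: $one->{$_}\n" for sort keys %$one;
```

A sketch this simple misclassifies a lot: sentence-initial words such as "The" get absorbed into the following entity, and persons mentioned without a title land in the ONE table.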
Homework 3 • Submit • just one PDF file • your program • description of your regexps • any other discussion… • tables arranged in order of descending frequency • (submit only the top 20 for the PER and ONE categories) • If you have read http://perldoc.perl.org/perlretut.html, part of the code there may help with sorting based on frequency
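Sorting a frequency table in descending order and keeping the top entries can be sketched with Perl's built-in sort. The sub name top_n, the alphabetical tie-break, and the counts in %count are illustrative assumptions.

```perl
use strict;
use warnings;

# Return up to $n keys of %$freq, most frequent first.
# Ties are broken alphabetically so the order is reproducible.
sub top_n {
    my ($freq, $n) = @_;
    my @sorted = sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
    $n = @sorted if $n > @sorted;   # fewer than $n entries available
    return @sorted[0 .. $n - 1];
}

my %count = (IBM => 7, 'General Motors' => 12, Ford => 7);   # made-up counts
printf "%-20s %d\n", $_, $count{$_} for top_n(\%count, 20);
```

Swapping $a and $b in the numeric comparison is what turns the default ascending order into a descending one.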
Aside • Sample code is rather interesting: • More verbose but easier to read perhaps:
Aside • Output for: • "This is a slightly simplified version of a rather complicated piece Perl code."