LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong

Administrivia • Homework 2 • graded • review today • Extra Credit Exercise 2 • Optional • due tonight by midnight • Homework 3 • out today • due next Tuesday

Homework 2 Review • Output: • Sample data file: • First try.. just try to detect a repeated word

Homework 2 Review • Key: think algorithmically… • think of a specific example first w1 w2 w3 w4 w5 Compare w1 with w2 Compare w2 with w3 Compare w3 with w4 Compare w4 with w5

Homework 2 Review • Generalize the specific example, then code it up • Array indices start from 0: array @words holds words[0], words[1], …, words[n-1] • “for” loop implementation: compare words[i] with words[i+1] for i = 0, 1, …, n-2 • The loop index ends just before $#words (the last index, n-1), so the final comparison is words[n-2] with words[n-1]
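The adjacent-word comparison above can be sketched in Perl roughly as follows. This is only an illustration, not the graded solution: the helper name repeated_words and the sample sentence are made up here, and the comparison is assumed to be case-sensitive.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each word that is immediately repeated.
sub repeated_words {
    my ($line) = @_;
    my @words = split ' ', $line;    # split on whitespace
    my @found;
    # $#words is the last valid index, so the loop index stops one short
    # of it: the final comparison is $words[$#words - 1] vs $words[$#words].
    for my $i (0 .. $#words - 1) {
        push @found, $words[$i] if $words[$i] eq $words[$i + 1];
    }
    return @found;
}

print join(", ", repeated_words("this is is a test test")), "\n";
# prints: is, test
```

Note that a triple such as "the the the" is reported twice by this first pass, once for each adjacent pair.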

Homework 2 Review

Homework 2 Review a decent first pass …

Homework 2 Review • Sample data file: • Output:

Homework 2 Review • Second try.. merging multiple occurrences

Homework 2 Review • Second try.. merging multiple occurrences • Sample data file: • Output:
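One possible way to merge multiple occurrences is to walk over runs of identical words instead of individual pairs. The sketch below assumes that reading of "merging"; the helper name repeated_runs and the sample sentence are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [word, count] pairs, one for each run of
# two or more identical adjacent words.
sub repeated_runs {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @runs;
    my $i = 0;
    while ($i < @words) {
        my $j = $i;
        # Advance $j to the last word of the run that starts at $i.
        $j++ while $j < $#words && $words[$j + 1] eq $words[$i];
        push @runs, [ $words[$i], $j - $i + 1 ] if $j > $i;
        $i = $j + 1;    # continue after the run
    }
    return @runs;
}

for my $run (repeated_runs("the the the cat sat sat down")) {
    my ($word, $count) = @$run;
    print "$word x $count\n";
}
# prints:
# the x 3
# sat x 2
```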

Homework 2 Review • Third try.. implementing a table of exceptions

Homework 2 Review • Third try.. table of exceptions • Sample data file: • Output: Stepping outside the exception table: three or more occurrences
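A table of exceptions is naturally a hash. The sketch below assumes that runs have already been collected as word/count pairs, and that an exception covers exactly two occurrences, so three or more fall outside the table and are still reported. The table entries (had, that) and the helper name should_report are placeholders, not the entries used in class.

```perl
use strict;
use warnings;

# Hypothetical exception table: words assumed to be acceptable when doubled.
my %exception = map { $_ => 1 } qw(had that);

# Hypothetical helper: given a word and the length of its run,
# returns 1 if the run should be reported and 0 otherwise.
sub should_report {
    my ($word, $count) = @_;
    return 0 if $count < 2;                            # not a repetition
    return 0 if $count == 2 && $exception{ lc $word }; # allowed doubling
    return 1;                                          # everything else
}

for my $run ([ "had", 2 ], [ "had", 3 ], [ "the", 2 ]) {
    my ($word, $count) = @$run;
    printf "%s x %d: %s\n", $word, $count,
        should_report($word, $count) ? "report" : "ok";
}
# prints:
# had x 2: ok
# had x 3: report
# the x 2: report
```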

Today’s Topic • Named Entity Recognition (NER) • Homework 3 • using Perl regular expression matching on corpora to extract Named Entities

Named Entity Recognition (NER) • Named-Entity Recognition (NER), also known as Identification and Extraction, tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values*, percentages, etc. • *see Extra Credit Exercise 2 • [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]

Example WSJ9_002.txt

Illinois NER System NLP systems might also compute: anaphora reference http://cogcomp.cs.illinois.edu/demo/ner/

Textbook • See JM Chapter 22: Information Extraction • 22.1 Named Entity Recognition • 22.2 Relation Detection and Classification • also Chapter 21 for Anaphora Resolution

Example 2 • use the corpus WSJ9_018.txt from the course webpage for your homework • consider only the body of articles delimited by <TEXT> … </TEXT>
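Restricting attention to the article bodies can be done with a single non-greedy match over the whole file contents. The sketch below works on a string; the sample markup (the DOC and HL tags) is invented for illustration and may not match the real corpus layout, and the helper name article_bodies is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every <TEXT> ... </TEXT> region.
sub article_bodies {
    my ($corpus) = @_;
    # /s lets . match newlines; .*? is non-greedy, so each match stops at
    # the first closing tag; /g in list context returns every capture.
    my @bodies = $corpus =~ m{<TEXT>(.*?)</TEXT>}sg;
    return @bodies;
}

# Made-up sample input, standing in for the contents of the corpus file.
my $sample = "<DOC>\n<HL> Headline </HL>\n<TEXT>\nFirst body.\n</TEXT>\n</DOC>\n"
           . "<DOC>\n<TEXT>\nSecond body.\n</TEXT>\n</DOC>\n";

my @bodies = article_bodies($sample);
print scalar(@bodies), " bodies found\n";
# prints: 2 bodies found
```

To apply this to a file, the contents can first be read into one string, for example by setting local $/ to undef before reading from the filehandle.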

Example 2

Homework 3 • Task: • Write a Perl program to extract Named Entities from corpus WSJ9_018.txt and put them in a table with corpus frequency information. • Hint: use a hash table • Design regex(s) to distinguish between (1) persons and (2) all other proper noun sequences, e.g. organizations. • Hint: implement this by storing persons (PER) and other named entities (ONE) in separate tables
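To make the two-table hint concrete, here is a deliberately crude sketch. It assumes, purely for illustration, that a person is any capitalized word sequence preceded by a title (Mr., Mrs., Ms., Dr.) and that every other capitalized sequence is an "other named entity". That heuristic, the helper name count_entities, and the sample sentence are assumptions, not the expected homework answer; a real solution needs better regexes (for example, sentence-initial words are miscounted here).

```perl
use strict;
use warnings;

# Hypothetical helper: returns references to two frequency tables,
# one for persons (PER) and one for other named entities (ONE).
sub count_entities {
    my ($text) = @_;
    my (%per, %one);
    # Optional title in $1, then one or more capitalized words in $2.
    while ($text =~ /\b((?:Mr|Mrs|Ms|Dr)\.\s+)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)/g) {
        my ($title, $name) = ($1, $2);
        if (defined $title) { $per{$name}++ }
        else                { $one{$name}++ }
    }
    return (\%per, \%one);
}

my ($per, $one) = count_entities(
    "Mr. John Smith joined General Motors. "
  . "Mr. John Smith left General Motors in May."
);
print "PER: $_ ($per->{$_})\n" for sort keys %$per;
print "ONE: $_ ($one->{$_})\n" for sort keys %$one;
```

Using the entity string as the hash key means that incrementing the value is all the bookkeeping needed to obtain corpus frequencies.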

Homework 3 • Submit • just one PDF file • your program • description of your regexps • any other discussion… • tables arranged in order of descending frequency • (submit only the top 20 for the PER and ONE categories) If you have read http://perldoc.perl.org/perlretut.html, part of this code may help with sorting based on frequency
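Ordering a frequency table by descending count and keeping only the top entries might look like the sketch below. The helper name top_n and the counts in %freq are made up; ties are broken alphabetically here, which is a choice of this sketch and not a homework requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $n keys of the table, ordered by
# descending count, with ties broken alphabetically.
sub top_n {
    my ($table, $n) = @_;
    my @sorted = sort { $table->{$b} <=> $table->{$a} or $a cmp $b }
                 keys %$table;
    $n = @sorted if $n > @sorted;    # fewer entries than requested
    return @sorted[ 0 .. $n - 1 ];
}

# Made-up counts, for illustration only.
my %freq = ("General Motors" => 7, "Ford" => 3, "IBM" => 7, "May" => 1);
printf "%-20s %d\n", $_, $freq{$_} for top_n(\%freq, 3);
```

For the homework tables, the same call with 20 as the second argument would give the top 20 entries of a PER or ONE table.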

Aside • Sample code is rather interesting: • More verbose but easier to read perhaps:

Aside • Output for: • "This is a slightly simplified version of a rather complicated piece Perl code."
