
LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538, Lecture 9. Sandiway Fong. Administrivia: Homework 2 graded, review today; Extra Credit Exercise 2 optional, due tonight by midnight; Homework 3 out today, due next Tuesday. Homework 2 Review.


Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong

  2. Administrivia • Homework 2: graded, review today • Extra Credit Exercise 2: optional, due tonight by midnight • Homework 3: out today, due next Tuesday

  3. Homework 2 Review • Output: • Sample data file: • First try.. just try to detect a repeated word

  4. Homework 2 Review • Key: think algorithmically… • think of a specific example first: w1 w2 w3 w4 w5 • Compare w1 with w2 • Compare w2 with w3 • Compare w3 with w4 • Compare w4 with w5

  5. Homework 2 Review • Generalize the specific example, then code it up • array @words: words[0], words[1], …, words[n-1] (array indices start from 0…) • "for" loop implementation: • Compare w1 with w(1+1) • Compare w2 with w(2+1) • … • Compare w(n-2) with w(n-2+1) • Compare w(n-1) with wn • The loop index stops just before $#words, the last array index, because each step compares a word with its right-hand neighbor
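
The generalized comparison can be sketched as a short Perl loop. This is a minimal sketch, not the code shown on the following slides: the sub name `find_repeats`, the sample sentence, and the case-insensitive comparison are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the words that are immediately followed by an identical word.
# Comparison is case-insensitive (an assumption, not stated on the slide).
sub find_repeats {
    my (@words) = @_;
    my @repeats;
    # $#words is the last valid index, so the loop stops one short of it:
    # the final comparison is $words[$#words - 1] against $words[$#words].
    for (my $i = 0; $i < $#words; $i++) {
        push @repeats, $words[$i]
            if lc($words[$i]) eq lc($words[$i + 1]);
    }
    return @repeats;
}

# Made-up sample input.
my @words = split ' ', 'this is is a test of the the program';
print join(', ', find_repeats(@words)), "\n";
```

Note that a word occurring three times in a row is reported twice by this loop, once per adjacent pair.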

  6. Homework 2 Review

  7. Homework 2 Review

  8. Homework 2 Review

  9. Homework 2 Review

  10. Homework 2 Review

  11. Homework 2 Review a decent first pass …

  12. Homework 2 Review • Sample data file: • Output:

  13. Homework 2 Review • Second try.. merging multiple occurrences

  14. Homework 2 Review • Second try.. merging multiple occurrences • Sample data file: • Output:
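
One way to merge multiple occurrences is to treat a run of identical adjacent words as a single unit and report it once, with its length. The sketch below assumes that reading of "merging"; the sub name `find_runs` and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# Collapse each run of identical adjacent words into one [word, count] pair,
# keeping only runs of length two or more.
sub find_runs {
    my (@words) = @_;
    my @runs;
    my $i = 0;
    while ($i < @words) {
        my $j = $i;
        # Advance $j to the last word of the run that starts at $i.
        $j++ while $j < $#words && lc($words[$j + 1]) eq lc($words[$i]);
        push @runs, [ $words[$i], $j - $i + 1 ] if $j > $i;
        $i = $j + 1;    # continue after the run
    }
    return @runs;
}

# Made-up sample input.
for my $run (find_runs(split ' ', 'it is is is what it it was')) {
    printf "%s x%d\n", @$run;
}
```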

  15. Homework 2 Review • Third try.. implementing a table of exceptions

  16. Homework 2 Review • Third try.. table of exceptions • Sample data file: • Output: Stepping outside the exception table: three or more occurrences
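
A table of exceptions maps naturally onto a Perl hash. The sketch below assumes one reading of the slide: a doubled word is excused if it appears in the table, but three or more occurrences are flagged regardless. The table contents (`had`, `that`) and the sub name `flag_repeats` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical exception table: words whose doubling can be grammatical.
my %exception = map { $_ => 1 } qw(had that);

# Return the words to flag: any run of two or more identical words, except a
# run of exactly two whose word is in the exception table.
sub flag_repeats {
    my (@words) = @_;
    my @flagged;
    my $i = 0;
    while ($i < @words) {
        my $j = $i;
        # Advance $j to the last word of the run that starts at $i.
        $j++ while $j < $#words && lc($words[$j + 1]) eq lc($words[$i]);
        my $count   = $j - $i + 1;
        my $excused = $count == 2 && $exception{ lc $words[$i] };
        push @flagged, $words[$i] if $count >= 2 && !$excused;
        $i = $j + 1;
    }
    return @flagged;
}

# Made-up sample input.
print join(', ', flag_repeats(split ' ', 'he had had the the that that that')), "\n";
```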

  17. Today’s Topic • Named Entity Recognition (NER) • Homework 3 • using Perl regular expression matching on corpora to extract Named Entities

  18. Named Entity Recognition (NER) • Named-Entity Recognition (NER) • (also Identification and Extraction) tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values*, percentages, etc. *see Extra Credit Exercise 2 [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]

  19. Example WSJ9_002.txt

  20. Illinois NER System NLP systems might also compute: anaphora reference http://cogcomp.cs.illinois.edu/demo/ner/

  21. Textbook • See JM Chapter 22: Information Extraction • 22.1 Named Entity Recognition • 22.2 Relation Detection and Classification • also Chapter 21 for Anaphora Resolution

  22. Example 2 • use the corpus WSJ9_018.txt from the course webpage for your homework • consider only the body of articles delimited by <TEXT> … </TEXT>
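
Restricting attention to article bodies can be done with a single non-greedy regex. This is a sketch under assumptions: the sub name `text_bodies` is hypothetical, and the sample string is made up (only the `<TEXT>` … `</TEXT>` delimiters come from the slide; the other tags are placeholders).

```perl
use strict;
use warnings;

# Return the body of every <TEXT> ... </TEXT> region in a string.
# Call this in list context.
sub text_bodies {
    my ($corpus) = @_;
    # /s lets . match newlines; .*? is non-greedy so each match stops at the
    # nearest closing tag; /g in list context returns every captured group.
    return $corpus =~ m{<TEXT>(.*?)</TEXT>}gs;
}

# A made-up two-article sample standing in for the corpus file.
my $sample = "<DOC>\n<HL> Headline one </HL>\n<TEXT>\nFirst body.\n</TEXT>\n</DOC>\n"
           . "<DOC>\n<TEXT>\nSecond body.\n</TEXT>\n</DOC>\n";
print "$_\n---\n" for text_bodies($sample);
```

To apply this to the real corpus, the whole file has to be read into one string first, for example by undefining `$/` locally before reading from the filehandle.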

  23. Example 2

  24. Homework 3 • Task: • Write a Perl program to extract Named Entities from corpus WSJ9_018.txt and put them in a table with corpus frequency information. • Hint: use a hash table • Design regex(s) to distinguish between (1) persons and (2) all other proper noun sequences, e.g. organizations. • Hint: implement this by storing persons (PER) and other named entities (ONE) in separate tables
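
The two-table idea can be sketched as follows. This is a deliberately crude illustration, not a solution: the heuristic (an honorific such as "Mr." marks the following capitalized sequence as a person), the honorific list, the sub name `count_entities`, and the sample text are all assumptions.

```perl
use strict;
use warnings;

# Count candidate named entities in a string, returning two hash references:
# persons (PER) and other named entities (ONE).
sub count_entities {
    my ($text) = @_;
    my (%per, %one);
    # A capitalized word that is not itself an honorific...
    my $cap = qr/(?!(?:Mr|Mrs|Ms|Dr)\.)[A-Z][a-z]+\b/;
    # ...and a whitespace-separated sequence of one or more of them.
    my $seq = qr/$cap(?:\s+$cap)*/;
    # Assumed heuristic: an honorific marks the following sequence as a person.
    while ($text =~ /\b(?:(Mr|Mrs|Ms|Dr)\.\s+)?($seq)/g) {
        if (defined $1) { $per{$2}++ } else { $one{$2}++ }
    }
    return (\%per, \%one);
}

# Made-up sample text.
my ($per, $one) = count_entities(
    'Mr. Smith joined General Motors. Mr. Smith later left General Motors for Ford.');
print "PER $_: $per->{$_}\n" for sort keys %$per;
print "ONE $_: $one->{$_}\n" for sort keys %$one;
```

A heuristic this simple misclassifies a great deal: any capitalized sentence-initial word lands in the ONE table, and persons mentioned without an honorific are not recognized as persons.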

  25. Homework 3 • Submit • just one PDF file • your program • description of your regexps • any other discussion… • tables arranged in order of descending frequency • (submit only the top 20 for the PER and ONE categories) • Hint: if you have read http://perldoc.perl.org/perlretut.html, part of its sample code may help with sorting by frequency
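
Ordering a frequency table by descending count is a standard use of Perl's `sort` with a custom comparison block. The sketch below is illustrative only: the sub name `top_n`, the alphabetical tie-break, and the counts are assumptions, not part of the assignment.

```perl
use strict;
use warnings;

# Return up to $n keys of a frequency table, most frequent first.
# Ties are broken alphabetically so the order is reproducible.
sub top_n {
    my ($freq, $n) = @_;
    my @sorted = sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
    $#sorted = $n - 1 if @sorted > $n;    # truncate to the first $n entries
    return @sorted;
}

# Made-up counts for illustration only.
my %per = ('Smith' => 7, 'Jones' => 12, 'Lee' => 7, 'Garcia' => 1);
printf "%-10s %d\n", $_, $per{$_} for top_n(\%per, 20);
```

Swapping `$a` and `$b` in the numeric comparison is what turns the default ascending order into a descending one.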

  26. Aside • Sample code is rather interesting: • More verbose but easier to read perhaps:

  27. Aside • Output for: • "This is a slightly simplified version of a rather complicated piece Perl code."
