
Extracting Academic Affiliations Status Report

Develop an algorithm to extract academic affiliations, degrees, and positions from web pages. Generate queries, extract and assess patterns, and fill slots for accurate information retrieval. Improve queries for detail accuracy.

sdillon

Presentation Transcript


  1. Extracting Academic Affiliations: Status Report
     Alicia Tribble, Einat Minkov, Andy Schlaikjer, Laura Kieras

  2. The Problem
     • Identify people who are affiliated with an academic institution
       • Degrees earned
       • Positions held (student, post-doc, faculty)
       • Current position
     • Class of beliefs to be learned:
       • affiliated(<person>,<degree>,<institution>)

  3. The System / Algorithm
     (Diagram. Component labels: Query Generator, Search Engine Interface, Query relation, Query pattern, HTML files, Extract patterns, Extract relations, Patterns, Relations (facts), Assess patterns, Assess relations)

  4. Algorithm Details
     • Pattern query formulation
       • Replace each <arg> in the pattern string with the '*' operator
       • Remove leading and trailing '*'s
       • Wrap the query string in quotes
     • Example: "<PERSON> received his <DEGREE> from <UNIVERSITY>" becomes '"received his * from"'
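
The three formulation steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `pattern_to_query` and the assumption that slots are written as upper-case names in angle brackets are ours.

```python
import re

# Assumption: a slot placeholder looks like <PERSON>, <DEGREE>, or <UNIVERSITY>.
SLOT = re.compile(r"<[A-Z_]+>")


def pattern_to_query(pattern: str) -> str:
    """Turn an extraction pattern into a quoted wildcard search query."""
    # Step 1: replace every slot with the '*' wildcard operator.
    tokens = SLOT.sub("*", pattern).split()
    # Step 2: drop leading and trailing wildcards.
    while tokens and tokens[0] == "*":
        tokens.pop(0)
    while tokens and tokens[-1] == "*":
        tokens.pop()
    # Step 3: wrap the remaining phrase in double quotes.
    return '"' + " ".join(tokens) + '"'


print(pattern_to_query("<PERSON> received his <DEGREE> from <UNIVERSITY>"))
# prints: "received his * from"
```

Only interior wildcards survive, so the search engine is asked for the literal phrase with a gap where the degree would appear.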

  5. Algorithm Details
     • Relation Extraction (Slot filling)
       • Find the relevant sentence(s) on a page
       • Alignment: slot filling
       • Some cleanup: “he”, capitalization
     • Examples:
       Robertson, Ph.D. in ecology and evolutionary biology, Indiana University
       Jeff, B.S., Bucknell University
       Rex Jung, degree, University of New Mexico
       Alavosius, BA in psychology, Clark University
       Jacobs, B.E.E. degree, Cornell University
       He, Associates Degree in Livestock Production, Northeast Community College
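
One way to picture the alignment step is to compile a pattern into a regular expression with one named group per slot and match it against a sentence. The sketch below makes several assumptions that are not in the slides: the per-slot sub-expressions in `SLOT_REGEX` are crude heuristics of our own, the function names are hypothetical, and the test sentence is made up from the first example above.

```python
import re
from typing import Dict, Optional

# Assumed, deliberately crude shapes for each slot value.
SLOT_REGEX = {
    "PERSON": r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*",               # capitalized words
    "DEGREE": r".+?",                                            # shortest text that fits
    "UNIVERSITY": r"[A-Z][\w&'-]*(?: (?:of\b|[A-Z][\w&'-]*))*",  # capitalized words or "of"
}


def compile_pattern(pattern: str) -> re.Pattern:
    """Compile e.g. '<PERSON> received his <DEGREE> from <UNIVERSITY>'."""
    # re.split with one capture group alternates literal text and slot names.
    parts = re.split(r"<([A-Z]+)>", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            pieces.append(f"(?P<{part.lower()}>{SLOT_REGEX[part]})")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def fill_slots(pattern: str, sentence: str) -> Optional[Dict[str, str]]:
    """Align the pattern with the sentence and return the slot values."""
    match = compile_pattern(pattern).search(sentence)
    if match is None:
        return None
    # Cleanup: collapse runs of whitespace inside each value.
    return {slot: " ".join(value.split()) for slot, value in match.groupdict().items()}


sentence = ("Robertson received his Ph.D. in ecology and evolutionary biology "
            "from Indiana University.")
print(fill_slots("<PERSON> received his <DEGREE> from <UNIVERSITY>", sentence))
```

A sentence that begins with "He" would fill the person slot with the pronoun, which is the kind of output the cleanup bullet refers to.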

  6. Algorithm Details
     • Relation query formulation
       • All argument values become query terms
     • Example: (William Cohen, Ph.D., Rutgers) becomes 'William Cohen Ph.D. Rutgers'
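
Relation query formulation is simple enough to show directly. In this minimal sketch the function name `relation_to_query` is our own, and we assume the argument values are joined with single spaces and left unquoted, as in the example on the slide.

```python
def relation_to_query(person: str, degree: str, institution: str) -> str:
    """Join all argument values of a relation into one unquoted query string."""
    parts = (person, degree, institution)
    # Skip empty arguments so the query never contains doubled spaces.
    return " ".join(part.strip() for part in parts if part.strip())


print(relation_to_query("William Cohen", "Ph.D.", "Rutgers"))
# prints: William Cohen Ph.D. Rutgers
```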

  7. Algorithm Details
     • Pattern Extraction
       • Build a regex from a relation, one per argument
         (Mr\.|Mr|MR|M\.?+r\.?+|Dr\.?+|Mrs\.?+|MRS|Ms|MS)* ?+(Scott Fahlman|Scott|Fahlman)
         ([a-zA-Z]*? [dD]egree|[Dd]octoral [Dd]egree|PhD|Ph\.D\.|Doctorate|PHD)
         (MIT)
       • Apply the regexes to the input and, for every match, extract the intermediate string and generalize
         <PERSON> received her <DEGREE> from <UNIVERSITY>
         <PERSON> received his <DEGREE> from <UNIVERSITY>
         <PERSON> earned a <DEGREE> from <UNIVERSITY>
         <PERSON>s, MD <UNIVERSITY>
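
The pattern extraction step can be sketched as follows. This is a simplified illustration under stated assumptions, not the regexes shown on the slide: each argument becomes a plain alternation of known surface forms, the arguments are assumed to occur in the order person, degree, university, the gap between arguments is capped at 40 characters, and the function names and sample text are hypothetical.

```python
import re
from typing import Iterable, List


def argument_regex(value: str, variants: Iterable[str] = ()) -> str:
    """Build an alternation that matches any known surface form of an argument."""
    # Longest forms first, so "Scott Fahlman" is tried before "Scott".
    forms = sorted({value, *variants}, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(form) for form in forms) + ")"


def extract_patterns(text: str, person: str, degree: str, university: str) -> List[str]:
    """Find the arguments in order and generalize the text between them."""
    # Assumption: at most 40 characters separate consecutive arguments.
    regex = re.compile(
        f"{person}(?P<left>.{{1,40}}?){degree}(?P<right>.{{1,40}}?){university}"
    )
    return [
        f"<PERSON>{m['left']}<DEGREE>{m['right']}<UNIVERSITY>"
        for m in regex.finditer(text)
    ]


person = argument_regex("Scott Fahlman", ["Scott", "Fahlman"])
degree = argument_regex("Ph.D.", ["PhD", "Doctorate"])
university = argument_regex("MIT")

text = "Scott Fahlman received his Ph.D. from MIT. Fahlman earned a PhD from MIT."
for pattern in extract_patterns(text, person, degree, university):
    print(pattern)
# prints:
# <PERSON> received his <DEGREE> from <UNIVERSITY>
# <PERSON> earned a <DEGREE> from <UNIVERSITY>
```

The regexes on the slide use possessive quantifiers such as `?+`; the sketch avoids them so that it runs on older Python versions as well.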

  8. Experimental Settings
     • Initial seeds
       • Relations
         affiliated('William Cohen', 'Ph.D.', 'Duke University')
         affiliated('Tom Mitchell', 'Ph.D.', 'Stanford')
         affiliated('Scott Fahlman', 'Ph.D.', 'MIT')
       • Patterns
         <PERSON> received his <DEGREE> from <UNIVERSITY>
         <PERSON> earned his <DEGREE> from <UNIVERSITY>
         <PERSON> earned a <DEGREE> from <UNIVERSITY>
     • Testing and development performed with 2 bootstrap iterations, using only Google snippets

  9. Results!
     initial:
       • patterns: 3
       • relations: 3
     iteration 0:
       • patterns: 6 (+3)
       • relations: 13 (+3)
     iteration 1:
       • patterns: 14 (+9)
       • relations: 0
     total:
       • patterns: 23
       • relations: 16

  10. Interim Conclusions
     • Issue I: over-specificity of query arguments
       Q: "Oren Etzioni" "Ph.D" "CMU"
       But what if the actual relevant mention includes:
       A: "Oren Etzioni", "doctorate", "Carnegie Mellon University"?
     • Possible avenues:
       • Larger dictionaries
       • Unquote query arguments? (allow for some variation)
       • Allow argument values to include arbitrary terms: "Oren * Etzioni"
       This might introduce more noise and require additional queries to be issued per relation.
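
The avenues listed for Issue I all amount to issuing looser or additional queries for the same relation. The sketch below illustrates them; the `SYNONYMS` dictionary, its contents, and both function names are assumptions made for this example only.

```python
from itertools import product
from typing import Dict, List

# Hypothetical dictionary of alternative surface forms.
SYNONYMS: Dict[str, List[str]] = {
    "Ph.D": ["Ph.D", "doctorate"],
    "CMU": ["CMU", "Carnegie Mellon University"],
}


def dictionary_queries(person: str, degree: str, institution: str) -> List[str]:
    """One quoted query per combination of known surface forms."""
    degrees = SYNONYMS.get(degree, [degree])
    institutions = SYNONYMS.get(institution, [institution])
    return [f'"{person}" "{d}" "{i}"' for d, i in product(degrees, institutions)]


def relaxed_queries(person: str, degree: str, institution: str) -> List[str]:
    """An unquoted query, and one that allows extra terms inside the name."""
    unquoted = f"{person} {degree} {institution}"
    first, _, rest = person.partition(" ")
    name = f'"{first} * {rest}"' if rest else f'"{person}"'
    return [unquoted, f'{name} "{degree}" "{institution}"']


for query in dictionary_queries("Oren Etzioni", "Ph.D", "CMU"):
    print(query)
for query in relaxed_queries("Oren Etzioni", "Ph.D", "CMU"):
    print(query)
```

Two dictionary entries with two forms each already turn one query into four, which is the cost in additional queries that the slide anticipates.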

  11. Interim Conclusions
     • Issue II: name and pronoun resolution
       Q: "Oren Etzioni" "Ph.D" "CMU"
       But what if the actual relevant mention includes:
       A: "He received his Ph.D from CMU in..."
     • Rate of occurrence of "S/he..." in extracted relations
       • 1 pattern, 50 queries: 56.8% (96/169)
     • Possible avenues:
       • Identify homepages and extract names from titles or other unambiguous sources on the page
       • Simple pronoun resolution techniques? (For example, identify the immediately preceding name mention. This may require NER.)
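
The "immediately preceding name mention" idea can be sketched without an NER system, at the cost of a very naive notion of what a name is. Everything here is an assumption for illustration: a name is taken to be two or more consecutive capitalized words, sentences are split on end punctuation followed by whitespace, and the sample text is made up.

```python
import re

# Naive stand-in for NER: two or more consecutive capitalized words.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
PRONOUN = re.compile(r"\b(?:He|She)\b")


def resolve_pronouns(text: str) -> str:
    """Replace a sentence-initial He/She with the most recent name mention."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    last_name = None
    resolved = []
    for sentence in sentences:
        # Only substitute when a name has been seen and the sentence starts with a pronoun.
        if last_name and PRONOUN.match(sentence):
            sentence = PRONOUN.sub(last_name, sentence, count=1)
        names = NAME.findall(sentence)
        if names:
            last_name = names[-1]
        resolved.append(sentence)
    return " ".join(resolved)


text = "This is the homepage of Oren Etzioni. He received his Ph.D from CMU."
print(resolve_pronouns(text))
# prints: This is the homepage of Oren Etzioni. Oren Etzioni received his Ph.D from CMU.
```

The heuristic also treats multi-word institution names as person names, which is one reason the slide notes that proper NER may be required.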

  12. Interim Conclusions
     • Issue III: compound sentences
       Q: "Oren Etzioni" "Ph.D" "CMU"
       But what if the actual relevant mention includes:
       A: "Oren Etzioni received his MS from <UNIVERSITY>, and his Ph.D from CMU"
     • Possible avenues:
       • Extensions to the pattern extraction technique
       • May require dependency parsing

  13. Software / Resources
     • A generic search framework that allows asynchronous processing of search tasks, as well as "filter" tasks (processing of resulting URLs)
     • A URL-caching implementation of Java 1.5's java.net.ResponseCache using Hibernate, supporting centralized caching and remote access
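
The split between search tasks and filter tasks can be illustrated with a thread pool. The framework described on this slide is written in Java; the Python sketch below only mirrors the idea, and every name in it (`FAKE_INDEX`, `FAKE_PAGES`, `fake_search`, `extraction_filter`, `run_search`) is a hypothetical stand-in, with an in-memory dictionary replacing the search engine and the fetched pages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical stand-ins for a search engine index and for fetched pages.
FAKE_INDEX = {
    '"received his * from"': ["http://example.org/a", "http://example.org/b"],
}
FAKE_PAGES = {
    "http://example.org/a": "Jacobs received his B.E.E. degree from Cornell University",
    "http://example.org/b": "No relevant sentence here",
}


def fake_search(query: str) -> List[str]:
    """Search task: return the URLs that match a query."""
    return FAKE_INDEX.get(query, [])


def extraction_filter(url: str) -> List[str]:
    """Filter task: process one resulting URL and return its extractions."""
    text = FAKE_PAGES.get(url, "")
    return [text] if "received his" in text else []


def run_search(
    query: str,
    search: Callable[[str], List[str]],
    filters: Sequence[Callable[[str], List[str]]],
    max_workers: int = 4,
) -> List[str]:
    """Run the search task, then fan every URL out to every filter task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        urls = pool.submit(search, query).result()
        futures = [pool.submit(f, url) for url in urls for f in filters]
        # Collect in submission order so results are deterministic.
        return [item for future in futures for item in future.result()]


print(run_search('"received his * from"', fake_search, [extraction_filter]))
# prints: ['Jacobs received his B.E.E. degree from Cornell University']
```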

  14. Generic Search Framework
     (Diagram. Labels: Search, SearchProcessor, Search Tasks, Filter Tasks, Result, Filter, URL, Extraction)
     • Test run: 1 search, 50 URLs, 169 extractions, 15 seconds

  15. Search Framework System Flow
     (Diagram. Labels: Relation, Validate Relation, SearchProcessor, Pattern, Validate Pattern)

  16. Extensions
     • Dictionaries (see next slide)
     • Simple pronoun resolution
     • Extraction validation metrics
     • URL of professor’s personal home page
     • Clustering of people / universities, or normalization of names
     • Identify biography section of personal home pages
     • Links incoming and outgoing from personal home page

  17. Additional information
     • Dictionary of institution names
     • Tiny dictionary of degrees
       • E.g. Ph.D., B.S., B. Tech., etc.
     • Map of domain names to institution names
       • E.g. cmu.edu : Carnegie Mellon University
       • This could be learned, but we will leave that for another group!
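
A map from domain names to institution names could be consulted as sketched below. The single table entry comes from the slide; the function name, the suffix-matching strategy, and the sample URL are assumptions of ours.

```python
from typing import Dict, Optional
from urllib.parse import urlparse

# The one entry given on the slide; a real table would be much larger.
DOMAIN_TO_INSTITUTION: Dict[str, str] = {
    "cmu.edu": "Carnegie Mellon University",
}


def institution_for_url(url: str) -> Optional[str]:
    """Look up the institution for a URL by trying ever shorter host suffixes."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # "www.cs.cmu.edu" is tried first, then "cs.cmu.edu", then "cmu.edu", then "edu".
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in DOMAIN_TO_INSTITUTION:
            return DOMAIN_TO_INSTITUTION[candidate]
    return None


print(institution_for_url("http://www.cs.cmu.edu/people/example"))
# prints: Carnegie Mellon University
```

Matching on suffixes lets departmental hosts resolve to the parent institution without listing each subdomain.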

  18. Example extracted relations
