Develop an algorithm to extract academic affiliations, degrees, and positions from web pages. Generate queries, extract and assess patterns, and fill slots for accurate information retrieval. Improve queries for detail accuracy.
Extracting Academic Affiliations: Status Report
Alicia Tribble, Einat Minkov, Andy Schlaikjer, Laura Kieras
The Problem • Identify people who are affiliated with an academic institution • Degrees earned • Positions held (student, post-doc, faculty) • Current position • Class of beliefs to be learned: • affiliated(<person>,<degree>,<institution>)
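One way to hold a belief of the form affiliated(<person>,<degree>,<institution>) in code is a small named tuple. This is a minimal sketch; the class name `Affiliation` and the method `as_fact` are hypothetical and do not come from the slides.

```python
from typing import NamedTuple


class Affiliation(NamedTuple):
    """One affiliated(<person>, <degree>, <institution>) belief."""
    person: str
    degree: str
    institution: str

    def as_fact(self) -> str:
        # Render the tuple in the affiliated('...', '...', '...') notation.
        return "affiliated({!r}, {!r}, {!r})".format(
            self.person, self.degree, self.institution
        )


seed = Affiliation("Scott Fahlman", "Ph.D.", "MIT")
print(seed.as_fact())
```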
The System / Algorithm
[Diagram labels: Query Generator; Search Engine Interface; Query relation; Query pattern; HTML files; Extract patterns; Extract relations; Patterns; Relations (facts); Assess patterns; Assess relations]
Algorithm Details • Pattern query formulation • Replace <arg> in pattern string with '*' operator • Remove leading and trailing '*'s • Wrap query string in quotes • Example: "<PERSON> received his <DEGREE> from <UNIVERSITY>" -becomes- '"received his * from"'
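The three formulation steps can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name `pattern_to_query` is an assumption, and placeholders are assumed to be upper-case names in angle brackets.

```python
import re


def pattern_to_query(pattern: str) -> str:
    """Turn an extraction pattern into a quoted wildcard search query."""
    # Step 1: replace every <ARG> placeholder with the '*' operator.
    wildcarded = re.sub(r"<[A-Z]+>", "*", pattern)
    # Step 2: remove leading and trailing '*'s (and the spaces next to them).
    trimmed = wildcarded.strip("* ")
    # Step 3: wrap the query string in quotes.
    return '"{}"'.format(trimmed)


print(pattern_to_query("<PERSON> received his <DEGREE> from <UNIVERSITY>"))
# prints "received his * from"
```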
Algorithm Details
• Relation Extraction (Slot filling)
• Find the relevant sentence(s) on a page
• Alignment – slot filling
• Some cleanup – “he”, capitalization
• Examples:
Robertson, Ph.D. in ecology and evolutionary biology, Indiana University
Jeff, B.S., Bucknell University
Rex Jung, degree, University of New Mexico
Alavosius, BA in psychology, Clark University
Jacobs, B.E.E. degree, Cornell University
He, Associates Degree in Livestock Production, Northeast Community College
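The alignment step might look roughly like this: the pattern is compiled into a regular expression with one named group per slot and matched against a candidate sentence. The function names are hypothetical, the sentence is assumed to be already isolated, and the cleanup step (pronouns, capitalization) is reduced to whitespace trimming.

```python
import re


def pattern_to_regex(pattern):
    """Compile '<PERSON> received his <DEGREE> from <UNIVERSITY>' style
    patterns into a regex with one lazy named group per slot."""
    regex = ""
    for piece in re.split(r"(<[A-Z]+>)", pattern):
        if piece.startswith("<") and piece.endswith(">"):
            regex += "(?P<%s>.+?)" % piece[1:-1].lower()
        else:
            regex += re.escape(piece)  # literal text between slots
    return re.compile(regex + "$")


def fill_slots(pattern, sentence):
    """Return a slot -> value dict, or None if the pattern does not align."""
    text = sentence.strip().rstrip(".")  # drop the sentence-final period
    match = pattern_to_regex(pattern).search(text)
    if match is None:
        return None
    return {slot: value.strip() for slot, value in match.groupdict().items()}


print(fill_slots(
    "<PERSON> received his <DEGREE> from <UNIVERSITY>",
    "Rex Jung received his degree from University of New Mexico.",
))
```

Because the last slot runs to the end of the sentence, trailing material such as a year would end up inside the university value; a real system would need a stricter boundary.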
Algorithm Details • Relation query formulation • All argument values become query terms • Example: (William Cohen, Ph.D., Rutgers) -becomes- 'William Cohen Ph.D. Rutgers'
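A minimal sketch of this step, assuming a relation is a (person, degree, university) tuple of strings; the function name `relation_to_query` is hypothetical.

```python
def relation_to_query(relation):
    """Join all argument values of a relation into one search query."""
    # Every argument value becomes a query term, in argument order.
    return " ".join(value.strip() for value in relation)


print(relation_to_query(("William Cohen", "Ph.D.", "Rutgers")))
# prints William Cohen Ph.D. Rutgers
```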
Algorithm Details
• Pattern Extraction
• Build a regex from a relation, one per argument
• (Mr\.|Mr|MR|M\.?+r\.?+|Dr\.?+|Mrs\.?+|MRS|Ms|MS)* ?+(Scott Fahlman|Scott|Fahlman)
• ([a-zA-Z]*? [dD]egree|[Dd]octoral [Dd]egree|PhD|Ph\.D\.|Doctorate|PHD)
• (MIT)
• Apply the regexes to the input and, for every match, extract the intermediate string and generalize
<PERSON> received her <DEGREE> from <UNIVERSITY>
<PERSON> received his <DEGREE> from <UNIVERSITY>
<PERSON> earned a <DEGREE> from <UNIVERSITY>
<PERSON>s, MD <UNIVERSITY>
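The generalization step can be sketched as below. This is a simplified illustration rather than the system's code: each argument is given as a plain list of surface variants instead of the hand-built regexes above, the arguments are assumed to appear in person, degree, university order, and the names `argument_regex` and `generalize` are hypothetical. The regexes on the slide use possessive quantifiers such as `\.?+`, which Python's `re` module accepts only from version 3.11; the sketch avoids them.

```python
import re


def argument_regex(variants):
    """Alternation over the surface forms of one argument."""
    # Longest form first, so 'Scott Fahlman' wins over 'Scott'.
    ordered = sorted(variants, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(v) for v in ordered) + ")"


def generalize(sentence, person, degree, university):
    """Replace the matched arguments with placeholders, keeping the
    intermediate strings; return None if the relation is not found."""
    regex = re.compile(
        "(?P<person>" + argument_regex(person) + ")(?P<mid1>.+?)"
        "(?P<degree>" + argument_regex(degree) + ")(?P<mid2>.+?)"
        "(?P<university>" + argument_regex(university) + ")"
    )
    match = regex.search(sentence)
    if match is None:
        return None
    return "<PERSON>" + match.group("mid1") + "<DEGREE>" + match.group("mid2") + "<UNIVERSITY>"


print(generalize(
    "Scott Fahlman received his Ph.D. from MIT.",
    person=["Scott Fahlman", "Scott", "Fahlman"],
    degree=["Ph.D.", "PhD", "Doctorate"],
    university=["MIT"],
))
# prints <PERSON> received his <DEGREE> from <UNIVERSITY>
```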
Experimental Settings
• Initial seeds
• Relations
affiliated('William Cohen', 'Ph.D.', 'Duke University')
affiliated('Tom Mitchell', 'Ph.D.', 'Stanford')
affiliated('Scott Fahlman', 'Ph.D.', 'MIT')
• Patterns
<PERSON> received his <DEGREE> from <UNIVERSITY>
<PERSON> earned his <DEGREE> from <UNIVERSITY>
<PERSON> earned a <DEGREE> from <UNIVERSITY>
• Testing and development performed with 2 bootstrap iterations, using only Google snippets
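The bootstrap loop implied by these settings could be organized roughly as follows. This is a sketch under stated assumptions: `find_relations` and `find_patterns` are hypothetical callables standing in for the query, search, and extraction steps, and the assessment of patterns and relations is left out entirely.

```python
def bootstrap(seed_relations, seed_patterns, find_relations, find_patterns,
              iterations=2):
    """Alternate between pattern->relation and relation->pattern extraction."""
    relations = set(seed_relations)
    patterns = set(seed_patterns)
    for _ in range(iterations):
        # Query with every known pattern and collect the relations found.
        new_relations = set()
        for pattern in patterns:
            new_relations.update(find_relations(pattern))
        # Query with every known relation and collect the patterns found.
        new_patterns = set()
        for relation in relations:
            new_patterns.update(find_patterns(relation))
        relations |= new_relations
        patterns |= new_patterns
    return relations, patterns


# Stub extractors, for illustration only; real ones would call a search engine.
def stub_find_relations(pattern):
    if "received" in pattern:
        return [("Tom Mitchell", "Ph.D.", "Stanford")]
    return []


def stub_find_patterns(relation):
    return ["<PERSON> earned a <DEGREE> from <UNIVERSITY>"]


relations, patterns = bootstrap(
    [("Scott Fahlman", "Ph.D.", "MIT")],
    ["<PERSON> received his <DEGREE> from <UNIVERSITY>"],
    stub_find_relations,
    stub_find_patterns,
)
print(len(relations), len(patterns))
```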
Results!
initial:
• patterns: 3
• relations: 3
iteration 0:
• patterns: 6 (+3)
• relations: 13 (+3)
iteration 1:
• patterns: 14 (+9)
• relations: 0
total:
• patterns: 23
• relations: 16
Interim Conclusions
• Issue I: over-specificity of query arguments
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention includes:
A: "Oren Etzioni", "doctorate", "Carnegie Mellon University"?
• Possible avenues:
• Larger dictionaries
• Unquote query arguments? (allow for some variation)
• Allow argument values to include arbitrary intervening terms, e.g. "Oren * Etzioni"
This might introduce more noise and require additional queries to be issued per relation.
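The "larger dictionaries" avenue, and its cost in extra queries, can be illustrated with a small sketch. The alias table and the function name `expand_queries` are assumptions made for this example; only the surface forms themselves come from the slide.

```python
from itertools import product

# Hypothetical alias dictionary: canonical value -> known surface forms.
ALIASES = {
    "Ph.D": ["Ph.D", "doctorate"],
    "CMU": ["CMU", "Carnegie Mellon University"],
}


def expand_queries(person, degree, university, aliases=ALIASES):
    """Issue one quoted query per combination of known surface forms."""
    degree_forms = aliases.get(degree, [degree])
    university_forms = aliases.get(university, [university])
    return [
        '"{}" "{}" "{}"'.format(person, d, u)
        for d, u in product(degree_forms, university_forms)
    ]


for query in expand_queries("Oren Etzioni", "Ph.D", "CMU"):
    print(query)
```

With two forms per argument, one relation already produces four queries, which is the additional query load mentioned above.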
Interim Conclusions
• Issue II: name and pronoun resolution
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention includes:
A: "He received his Ph.D from CMU in..."
• Rate of occurrence of "S/he..." in extracted relations
• 1 pattern, 50 queries: 56.8% (96/169)
• Possible avenues:
• Identify homepages and extract names from titles, or other unambiguous sources on the page
• Simple pronoun resolution techniques? (For example, identify the immediately preceding name mention. This may require NER.)
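The "immediately preceding name mention" idea might be prototyped as below. This is a deliberately naive sketch: a run of two or more capitalized words stands in for real NER, so it would also pick up institution names, and the first sentence in the example is made up for illustration. The names `NAME_RE` and `resolve_person` are hypothetical.

```python
import re

# Naive stand-in for NER: two or more consecutive capitalized words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
PRONOUNS = {"he", "she"}


def resolve_person(sentences, index, person):
    """If the extracted person is a pronoun, return the closest name-like
    mention in the preceding sentences; otherwise return it unchanged."""
    if person.lower() not in PRONOUNS:
        return person
    for sentence in reversed(sentences[:index]):
        names = NAME_RE.findall(sentence)
        if names:
            return names[-1]  # last mention in the nearest earlier sentence
    return None  # nothing to resolve against


page = [
    "Oren Etzioni is a professor.",  # invented context sentence
    "He received his Ph.D from CMU in...",
]
print(resolve_person(page, 1, "He"))
```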
Interim Conclusions
• Issue III: compound sentences
Q: "Oren Etzioni" "Ph.D" "CMU"
But what if the actual relevant mention includes:
A: "Oren Etzioni received his MS from <UNIVERSITY>, and his Ph.D from CMU"
• Possible avenues:
• Extensions to the pattern extraction technique
• May require dependency parsing
Software / Resources • A generic search framework which allows asynchronous processing of search tasks, as well as "filter" tasks (processing of resulting URLs) • A URL caching implementation of Java 1.5's java.net.ResponseCache using Hibernate, supporting centralized caching and remote access
Generic Search Framework
[Diagram labels: Search Tasks; Filter Tasks; SearchProcessor; Search; Result; Filter; URL; Extraction]
Test run: 1 Search, 50 URLs, 169 Extractions, 15 seconds
Search Framework System Flow
[Diagram labels: Relation; Validate Relation; SearchProcessor; Pattern; Validate Pattern]
Extensions • Dictionaries - next slide • Simple pronoun resolution • Extraction validation metrics • URL of professor’s personal home page • Clustering of people / universities, or normalization of names • Identify biography section of personal home pages • Links incoming and outgoing from personal home page
Additional information • Dictionary of institution names • Tiny dictionary of degrees • E.g. Ph.D., B.S., B. Tech., etc • Map of domain names to institution names • E.g. cmu.edu : Carnegie Mellon University • This could be learned but we will leave that for another group!
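A lookup against the domain map could work along these lines. This is a sketch with assumed names (`DOMAIN_TO_INSTITUTION`, `institution_for_url`); the only entry shown is the one given on the slide, and the URLs in the example are illustrative.

```python
from urllib.parse import urlparse

# Map of domain names to institution names (single entry from the slide).
DOMAIN_TO_INSTITUTION = {
    "cmu.edu": "Carnegie Mellon University",
}


def institution_for_url(url):
    """Return the institution for a page URL, or None if the domain is unknown."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Try progressively shorter suffixes: www.cs.cmu.edu, cs.cmu.edu, cmu.edu.
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in DOMAIN_TO_INSTITUTION:
            return DOMAIN_TO_INSTITUTION[candidate]
    return None


print(institution_for_url("http://www.cs.cmu.edu/index.html"))
# prints Carnegie Mellon University
```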
Example extracted relations