Fact Extraction Wikipedia Knowledge Extraction
Overview • Pronoun Resolution module • Infobox extraction • SRL parsing • Improved refinement • Clustering • Hadoop compatibility
Pronoun Resolution Module • “His mother wanted him to get a good education so she sent him to live with his grandparents in Honolulu, HI” (Barack Obama) • Current solution: replace every pronoun with the article title (very primitive; here “she” would wrongly become Barack Obama) • Target solution: • General pronoun resolution is still an open research problem • Use an existing system that is usually correct? • Simple rules for common patterns?
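The title-substitution approach described above can be sketched in a few lines. This is only an illustration under assumptions: the pronoun table, the possessive handling and the function name `replace_pronouns` are hypothetical and are not taken from the actual system.

```python
import re

# Hypothetical pronoun table: pronoun -> True if it should become a possessive.
# "her" is ambiguous (object or possessive); this sketch treats it as an object.
PRONOUNS = {
    "he": False, "him": False, "his": True,
    "she": False, "her": False, "hers": True,
    "it": False, "its": True,
    "they": False, "them": False, "their": True,
}

# Word boundaries keep the pattern from matching inside words such as "mother".
PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def replace_pronouns(sentence: str, title: str) -> str:
    """Replace every pronoun with the article title, with no real resolution."""
    def substitute(match: re.Match) -> str:
        possessive = PRONOUNS[match.group(0).lower()]
        return title + "'s" if possessive else title

    return PRONOUN_RE.sub(substitute, sentence)


sentence = ("His mother wanted him to get a good education so she sent him "
            "to live with his grandparents in Honolulu, HI")
resolved = replace_pronouns(sentence, "Barack Obama")
print(resolved)
```

Because every pronoun is mapped to the title, “she” (the mother) is resolved incorrectly, which is the weakness the slide points out.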
Infobox extraction • Convert information into simple sentences: • Joe Biden is Barack Obama’s Vice President • Barack Obama is preceded by George W. Bush • Use type of phrase (Noun Phrase, Verb Phrase) to determine sentence to form. • Read papers from Turing Center (University of Washington)
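As a rough sketch of the phrase-type idea above, the function below picks a sentence pattern depending on whether an infobox field label is a noun phrase or a verb phrase. The phrase types are supplied by hand here, and the function name, the "NP"/"VP" codes and the two patterns are assumptions made for illustration only.

```python
def field_to_sentence(subject: str, label: str, phrase_type: str, value: str) -> str:
    """Turn one infobox field into a simple sentence.

    phrase_type is assumed to be "NP" (noun phrase) or "VP" (verb phrase);
    in a real system it would come from a parser or chunker.
    """
    if phrase_type == "NP":
        # Noun-phrase label: the value holds that role for the subject.
        return f"{value} is {subject}'s {label}"
    if phrase_type == "VP":
        # Verb-phrase label: the label links the subject to the value.
        return f"{subject} is {label} {value}"
    raise ValueError(f"unsupported phrase type: {phrase_type}")


# Hypothetical infobox fields for the article "Barack Obama".
fields = [
    ("Vice President", "NP", "Joe Biden"),
    ("preceded by", "VP", "George W. Bush"),
]
sentences = [field_to_sentence("Barack Obama", label, ptype, value)
             for label, ptype, value in fields]
for s in sentences:
    print(s)
```

The generated sentences can then go through the same extraction pipeline as ordinary article text.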
SRL Parsing • Semantic role labeling (SRL) performs a deeper analysis of each sentence: it identifies each predicate and labels its arguments (A0, A1, A2, ...). • E.g. “Yoshi has a long tongue which he uses to grab enemies and eat them.” • has (A0: Yoshi, A1: long tongue) • use (A0: Yoshi, A1: long tongue, A2: grab enemies and eat them) • Use SRL parsing to improve quality and representation of knowledge. • Problem: SRL parsing is slow and adds complexity
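One possible way to represent the frames above and reduce them to simple facts is sketched below. The frames are written by hand from the slide example rather than produced by a parser, and the `Frame` class and `frame_to_fact` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    predicate: str  # the verb the frame is built around
    args: dict      # role label (e.g. "A0") -> text span


def frame_to_fact(frame: Frame) -> Optional[tuple]:
    """Return (A0, predicate, A1), or None if either argument is missing."""
    a0 = frame.args.get("A0")
    a1 = frame.args.get("A1")
    if a0 is None or a1 is None:
        return None
    return (a0, frame.predicate, a1)


# Frames copied from the slide example (hand-written, not parser output).
frames = [
    Frame("has", {"A0": "Yoshi", "A1": "long tongue"}),
    Frame("use", {"A0": "Yoshi", "A1": "long tongue",
                  "A2": "grab enemies and eat them"}),
]
facts = [frame_to_fact(f) for f in frames]
print(facts)
```

Note that the pronoun “he” is already shown resolved to Yoshi in the slide's frames; the sketch does not perform that step.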
Improved refinement • Current system has Subject, Object, Verb tuples • Problem: hard to define which words to include in each phrase • E.g. “'The dog ( Canis lupus familiaris )' 'is' 'a mammal from the family Canidae'” • The dog? dog? The dog ( Canis lupus familiaris )? • a mammal? a mammal from the family Canidae? • Possible solutions: • Different levels of information? • Simple rules based on part-of-speech tags?
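The two proposed solutions can be combined in a small sketch: a rule over part-of-speech tags yields several levels of the same phrase. The tokens below are tagged by hand with Penn Treebank-style tags, and the rule itself (keep the leading run of determiners, adjectives and nouns) is an assumption for illustration, not the system's actual refinement step.

```python
# Tags allowed in the "core" of a phrase, and the subset that counts as nouns.
CORE_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def phrase_levels(tagged):
    """Return the head noun, the core phrase and the full phrase.

    tagged is a list of (word, tag) pairs. The core is the leading run of
    determiner/adjective/noun tokens; the head is the last noun in that run.
    """
    core = []
    for word, tag in tagged:
        if tag not in CORE_TAGS:
            break  # stop at the first parenthesis, preposition, etc.
        core.append((word, tag))
    nouns = [word for word, tag in core if tag in NOUN_TAGS]
    return {
        "head": nouns[-1] if nouns else None,
        "core": " ".join(word for word, _ in core),
        "full": " ".join(word for word, _ in tagged),
    }


subject = [("The", "DT"), ("dog", "NN"), ("(", "-LRB-"), ("Canis", "NNP"),
           ("lupus", "NNP"), ("familiaris", "NNP"), (")", "-RRB-")]
obj = [("a", "DT"), ("mammal", "NN"), ("from", "IN"), ("the", "DT"),
       ("family", "NN"), ("Canidae", "NNP")]
print(phrase_levels(subject))
print(phrase_levels(obj))
```

Storing all three levels avoids having to commit to a single answer to “The dog? dog?”.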
Clustering • Idea: Determine whether two separate mentions point to the same concept • ‘The dog’, ‘a dog’, ‘dogs’ • ‘Cats’, ‘C.A.T.S’, ‘CAT Scan’ • ‘President Obama’, ‘President Barack Obama’ • Possible solutions: • Feature-based classification • Self-organizing map • Associated terms
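As a baseline for the problem above, mentions can be grouped by a normalized key. This is a deliberately crude stand-in for the feature-based classifier the slide proposes: the normalization rules (drop a leading determiner, lowercase, strip a plural “s”) and the function names are assumptions for illustration.

```python
from collections import defaultdict

DETERMINERS = {"the", "a", "an"}


def mention_key(mention: str) -> str:
    """Normalize a mention: lowercase, drop a leading determiner, strip plural 's'."""
    tokens = mention.lower().split()
    if tokens and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    normalized = []
    for token in tokens:
        # Only strip 's' from plain alphabetic words, so "c.a.t.s" is untouched.
        if (token.isalpha() and len(token) > 3
                and token.endswith("s") and not token.endswith("ss")):
            token = token[:-1]
        normalized.append(token)
    return " ".join(normalized)


def cluster_mentions(mentions):
    """Group mentions that share the same normalized key."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention_key(mention)].append(mention)
    return dict(clusters)


mentions = ["The dog", "a dog", "dogs", "Cats", "C.A.T.S", "CAT Scan"]
clusters = cluster_mentions(mentions)
print(clusters)
```

The baseline merges the three dog mentions and keeps the three “cat” mentions apart, but it would not merge ‘President Obama’ with ‘President Barack Obama’, which is where richer features would be needed.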
Hadoop Compatibility • Need to ensure the system can scale for the move to regular Wikipedia • Hadoop is an open-source implementation of the MapReduce programming model • MapReduce parallelizes a job by splitting the input across several machines: map tasks process the pieces independently and emit key-value pairs, and reduce tasks combine the values for each key
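The map and reduce steps can be illustrated with a single-process simulation in plain Python. It does not use Hadoop or any Hadoop API; the job (counting extracted facts per subject) and all function names are assumptions chosen to fit the rest of the deck.

```python
from collections import defaultdict


def map_fact(fact):
    """Map step: emit (subject, 1) for each extracted fact."""
    subject, _verb, _obj = fact
    yield (subject, 1)


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one key."""
    return (key, sum(values))


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequentially simulate map, shuffle (group by key) and reduce.

    On a cluster, the map calls would run on different machines over
    different input splits; here everything runs in one process.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


facts = [
    ("Yoshi", "has", "long tongue"),
    ("Yoshi", "use", "long tongue"),
    ("The dog", "is", "a mammal"),
]
counts = run_mapreduce(facts, map_fact, reduce_counts)
print(counts)
```

Because each map call looks at a single record and each reduce call at a single key, both steps can be distributed without changing the two functions.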