Information Extraction MAS.S60 Catherine Havasi Rob Speer
Wikipedia as a corpus • 3.9 million English articles, 284 languages • 2 billion words • For comparison, the Brown corpus has 1 million words • DBpedia and Freebase
Text reveals relations • “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.” • “These were performed in town halls and other large buildings...” • “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
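The three sentences above share the pattern "X and other Y", which suggests that X is a kind of Y. A minimal sketch of matching it with a regular expression follows. The function name `extract_and_other`, the short list of stop verbs that ends the Y phrase, and the "last word of each comma-separated item" heuristic for X are all assumptions made for illustration; they are crude stand-ins for a real noun-phrase chunker.

```python
import re

# "<left context>[,] and other <Y>", where Y runs until an auxiliary verb,
# punctuation, or end of string (a rough substitute for NP chunking).
AND_OTHER = re.compile(
    r"(?P<left>[^.;:]*?),? and other "
    r"(?P<hyper>\w+(?: \w+)*?)"
    r"(?= (?:have|has|had|is|are|was|were)\b|[.,;:]|$)"
)

def extract_and_other(sentence):
    """Return (X, Y) pairs suggested by 'X and other Y' in one sentence."""
    match = AND_OTHER.search(sentence)
    if match is None:
        return []
    hypernym = match.group("hyper")
    pairs = []
    # Each comma-separated item contributes its last word as a candidate X.
    for item in match.group("left").split(","):
        words = item.split()
        if words:
            pairs.append((words[-1], hypernym))
    return pairs

sentence = ("Various explanations of the overabundance of carbon, oxygen, "
            "nitrogen, and other elements have been proposed.")
for pair in extract_and_other(sentence):
    print(pair)
```

The last-word heuristic truncates multi-word names: "town halls" becomes "halls" and "Angkor Wat" becomes "Wat". Recovering the full phrase is where a tagger or chunker would come in.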
NACLO puzzle Would it be plausible to describe something as “danty but sloshful”?
Possible patterns • both X and Y • X but not Y • use NP to VP • [Un]fortunately, VP
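The first two patterns can be tried without any linguistic tools if X and Y are restricted to single words. The sketch below does that with regular expressions; the names `PATTERNS` and `match_patterns` are hypothetical, and the single-word restriction is an assumption. The NP and VP patterns would need a tagger or chunker to find phrase boundaries.

```python
import re

# Single-word versions of two of the listed patterns.
PATTERNS = {
    "both_and": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "but_not": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
}

def match_patterns(text):
    """Return (pattern_name, X, Y) for every match of each pattern in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for x, y in pattern.findall(text):
            hits.append((name, x, y))
    return hits

print(match_patterns("The soup was both cheap and tasty, filling but not heavy."))
```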
TextRunner • Starts out with some seed patterns • Label: Uses those to label possible extractions in a sentence • Learn: Trains a graphical model on the labeled extractions • Extract: Uses the learned model to extract relations from each sentence • Problem: 200,000 – 300,000 labeled training points needed
ReVerb • Syntactic constraint: Requires extraction to match syntactic patterns • Lexical constraint: Phrases must have many different arguments in the corpus
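The lexical constraint can be sketched as a filter over candidate extractions: keep a relation phrase only if it appears with enough distinct argument pairs. This is an illustration of the idea, not ReVerb's own code; the function name `lexical_filter`, the triple format, and the threshold `min_distinct_pairs` are assumptions.

```python
from collections import defaultdict

def lexical_filter(triples, min_distinct_pairs=2):
    """Keep (arg1, relation, arg2) triples whose relation phrase occurs
    with at least min_distinct_pairs different (arg1, arg2) pairs."""
    pairs_by_relation = defaultdict(set)
    for arg1, relation, arg2 in triples:
        pairs_by_relation[relation].add((arg1, arg2))
    return [
        triple for triple in triples
        if len(pairs_by_relation[triple[1]]) >= min_distinct_pairs
    ]

candidates = [
    ("Neubronner", "invented", "pigeon photography"),
    ("Edison", "invented", "the phonograph"),
    ("The treaty", "is offered in the second round of", "talks"),
]
print(lexical_filter(candidates))
```

Here the overly specific phrase "is offered in the second round of" occurs with only one argument pair and is dropped, while "invented" survives. On a real corpus the threshold would be much higher.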
Accuracy of IE • Incoherent extractions make up 15-30% of extracted knowledge bits • Uninformative extractions 3-7%
Tom Mitchell (NELL) • Unsupervised learning machine
Named entities on Wikipedia? [[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
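Wiki links written as `[[...]]` mark candidate entities directly in the markup, so they can be pulled out without a named entity recognizer. A minimal sketch, assuming plain wikitext input and handling only the `[[target]]` and `[[target|label]]` forms (the name `extract_links` is hypothetical):

```python
import re

# Matches [[target]] and [[target|label]]; nested or malformed links are ignored.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    """Return (target, label) pairs; label falls back to target when absent."""
    links = []
    for target, label in WIKILINK.findall(wikitext):
        links.append((target.strip(), (label or target).strip()))
    return links

text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]")
for target, label in extract_links(text):
    print(target)
```

Not every link target is a named entity ("aerial photography" is a common noun phrase), so a filtering step would still be needed.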
Downloading Wikipedia and other Wikimedia projects • A 2200-article sample is available on the class web site
Lab • Find an information pattern besides the ones we’ve listed • Run it over the Wikipedia front page corpus • Does it need a tagger? A named entity extractor?
Assignment • Choose and refine an information extractor • Hand-tag some examples • Add a classifier for good vs. bad matches • You are allowed to work in groups • Sharing code is fine, but one writeup per person