Natural Language Processing for the Web • Prof. Kathleen McKeown, 722 CEPSR, 939-7118, Office Hours: Wed 1-2; Tues 4-5 • TA: Yves Petinot, 719 CEPSR, 939-7116, Office Hours: Thurs 12-1, 8-9
Logistics • Class evaluation • Please do • If there were topics you particularly liked, please say so • If there were topics you particularly disliked, please say so • Anything you particularly liked or disliked about the class format • Project presentations • Need eight people to go first, April 29th • Not necessary to have all results • 2nd date: May 13, 7:10pm UNLESS… • Sign up by end of class or I will sign you up: http://www.cs.columbia.edu/~kathy/NLPWeb/finalpresentations.htm
Machine Reading • Goal: read all texts on the web, extract all knowledge, and represent it in DB/KB format • DARPA program on machine reading
Issues • Background theory and text facts may be inconsistent • -> need a probabilistic representation • Beliefs may only be implicit • -> need inference • Supervised learning is not an option due to the variety of relations on the web • -> traditional IE is not a valid solution • May require many steps of entailment • -> need a more general approach than textual entailment
Initial Approaches • Systems that learn relations from examples (supervised) • Systems that learn how to learn patterns from a seed set: Snowball (semi-supervised) • Systems that label their own training examples using domain-independent patterns: KnowItAll (self-supervised)
KnowItAll • Requires no hand-tagged data • A generic pattern • <Class> such as <Mem> • Learns Seattle, New York City, and London as examples of cities • Learns new patterns, e.g., “Headquartered in <city>”, to find more cities (see the sketch below) • Problem: patterns are relation-specific, requiring bootstrapping for each relation
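The generic-pattern idea can be illustrated with a short sketch. The snippet below is only an illustration, not KnowItAll's actual code: the function name extract_members, the regular expression, and the capitalization heuristic are made up for this example. It pulls candidate class members out of raw text using the “<Class> such as <Mem>” pattern; in a full bootstrapping loop, the instances found this way would in turn be used to induce new, relation-specific patterns such as “headquartered in <city>”.

```python
import re

def extract_members(text, class_name):
    """Find candidate members of `class_name` via the '<Class> such as <Mem>' pattern."""
    members = []
    # Capture the list that follows "<class> such as", up to a clause boundary
    for match in re.finditer(rf"\b{re.escape(class_name)}\s+such\s+as\s+([^.;]+)", text):
        for chunk in match.group(1).split(","):
            chunk = re.sub(r"^\s*and\s+", "", chunk.strip())
            # Keep the leading run of capitalized tokens as the candidate instance
            tokens = []
            for tok in chunk.split():
                if tok[:1].isupper():
                    tokens.append(tok)
                else:
                    break
            if tokens:
                members.append(" ".join(tokens))
    return members

print(extract_members(
    "He visited cities such as Seattle, New York City, and London last year.",
    "cities"))
# -> ['Seattle', 'New York City', 'London']
```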
TextRunner • “The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web.” • Self-supervised learner • Given a small corpus as an example • Uses the Stanford parser • Retains a tuple if: • All entities are found in the parse • The dependency path between the two entities is shorter than a certain length • The path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause) • Neither e1 nor e2 is a pronoun • Learns a classifier that tags tuples as “trustworthy” • Each tuple is converted to a feature vector • Features: POS sequence of r, number of stop words in r, number of tokens in r • The learned classifier contains no relation-specific or lexical features • Single-pass extractor • No parsing, but POS tagging and a lightweight NP chunker • Entities = NP chunks • Relation = the words in between, heuristically eliminating words such as prepositions • Generates one or more candidate tuples per sentence and retains those the classifier determines to be trustworthy (see the sketch below) • Redundancy-based assessor • Assigns a probability to each tuple based on a probabilistic model of redundancy
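The single-pass extraction step and the feature vector described above can be sketched as follows, assuming the sentence has already been POS-tagged with Penn Treebank tags. The NP-chunking rule, stop-word list, preposition heuristic, and feature names below are simplified illustrations, not TextRunner's implementation; in the real system a classifier trained on the parser-labeled tuples would then score each feature vector as trustworthy or not.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "or"}
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
PREP_TAGS = {"IN", "TO"}  # heuristically dropped from the relation string

def np_chunks(tagged):
    """Return (start, end) spans of maximal runs of NP-like tags."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans

def candidate_tuples(tagged):
    """Yield (e1, relation, e2, features) for each pair of adjacent NP chunks."""
    spans = np_chunks(tagged)
    for (start1, end1), (start2, end2) in zip(spans, spans[1:]):
        between = tagged[end1:start2]
        rel = [(w, t) for (w, t) in between if t not in PREP_TAGS]
        features = {
            "pos_sequence": " ".join(t for _, t in rel),  # POS sequence of r
            "num_stopwords": sum(w.lower() in STOPWORDS for w, _ in rel),
            "num_tokens": len(rel),
        }
        yield (" ".join(w for w, _ in tagged[start1:end1]),
               " ".join(w for w, _ in rel),
               " ".join(w for w, _ in tagged[start2:end2]),
               features)

# Hypothetical pre-tagged sentence: "Seattle is located in Washington"
tagged = [("Seattle", "NNP"), ("is", "VBZ"), ("located", "VBN"),
          ("in", "IN"), ("Washington", "NNP")]
for tup in candidate_tuples(tagged):
    print(tup)
# ('Seattle', 'is located', 'Washington',
#  {'pos_sequence': 'VBZ VBN', 'num_stopwords': 0, 'num_tokens': 2})
```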
TextRunner Capabilities • Tuple outputs are placed in a graph • TextRunner operates at large scale, processing 90 million web pages and producing 1 billion tuples, with an estimated 70% accuracy • Problems: inconsistencies, polysemy, synonymy, entity duplication
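A rough sketch of the redundancy idea: identical (normalized) tuples extracted from many distinct sentences receive higher confidence. The noisy-or score and the assumed per-extraction precision p below are stand-ins for TextRunner's actual probabilistic redundancy model, and the normalize/assess helpers are hypothetical; the scored tuples could then be indexed in the graph keyed by their entities.

```python
from collections import Counter

def normalize(tup):
    """Crude normalization: lowercase and strip leading articles."""
    def clean(s):
        return " ".join(w for w in s.lower().split() if w not in {"a", "an", "the"})
    e1, r, e2 = tup
    return (clean(e1), clean(r), clean(e2))

def assess(extractions, p=0.7):
    """Map each normalized tuple to a confidence based on its support count."""
    counts = Counter(normalize(t) for t in extractions)
    # noisy-or: chance that at least one of k independent extractions is correct
    return {t: 1 - (1 - p) ** k for t, k in counts.items()}

extractions = [
    ("Seattle", "is located in", "Washington"),
    ("seattle", "is located in", "Washington"),
    ("Edison", "invented", "the phonograph"),
]
for tup, score in assess(extractions).items():
    print(tup, round(score, 2))
# ('seattle', 'is located in', 'washington') 0.91
# ('edison', 'invented', 'phonograph') 0.7
```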