KnowItAll April 5 2007 William Cohen
Announcements • Reminder: project presentations (or progress report) • Sign up for a 30min presentation (or else) • First pair of slots is April 17 • Last pair of slots is May 10 • William is out of town April 6-April 9 • So, no office hours Friday. • Next week: no critiques assigned • But I will lecture
Bootstrapping [lineage diagram of prior work, rooted at Hearst '92] Papers shown: Brin '98, BM '98, Riloff & Jones '99, Collins & Singer '99, Cucerzan & Yarowsky '99, Etzioni et al. 2005, Stevenson & Greenwood 2005, Rosenfeld & Feldman 2006. Branch labels: deeper linguistic features, free text…; learning, semi-supervised learning, dual feature spaces…; scalability, surface patterns, use of web crawlers… Annotations: "de-emphasize duality, focus on distance between patterns"; "clever idea for learning relation patterns & strong experimental results".
Architecture [system diagram] • Input: a set of (disjoint?) predicates to consider, plus two names for each • Context: keywords from the user to filter out non-domain pages • … ? ~= [H92]
Bootstrapping - 1 [diagram: a template is instantiated into an extraction rule and a search query for the class "city"]
Bootstrapping - 2 • Each discriminator U is a function: fU(x) = hits("city x") / hits("x") • e.g., fU("Pittsburgh") = hits("city Pittsburgh") / hits("Pittsburgh") • These ratios are then thresholded to create binary features: fU(x) > θ and fU(x) < θ
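For concreteness, a minimal sketch of such a discriminator in Python, assuming a hypothetical hits() helper that returns the search engine's hit count for a phrase query (KnowItAll obtains these counts through a web search API):

```python
def hits(phrase):
    """Hypothetical helper: number of search-engine hits for the phrase query
    `phrase` (KnowItAll obtains these counts through a web search API)."""
    raise NotImplementedError

def make_discriminator(keyword):
    """Build f_U(x) = hits("keyword x") / hits("x") for the discriminator phrase U."""
    def f_U(x):
        denom = hits(x)
        return hits(f"{keyword} {x}") / denom if denom else 0.0
    return f_U

f_city = make_discriminator("city")
# f_city("Pittsburgh") == hits("city Pittsburgh") / hits("Pittsburgh");
# thresholded tests like f_city(x) > theta become the binary features used below.
```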
Bootstrapping - 3 • Submit the queries & apply the rules to produce initial seeds. • Evaluate each seed with each discriminator U: e.g., compute PMI-like stats such as |hits("city Boston")| / |hits("Boston")|. • Take the top seeds from each class and call them POSITIVE, then use the disjointness of the classes to find NEGATIVE seeds. • Train a Naive Bayes classifier using the thresholded discriminators as features (see the sketch below).
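A rough sketch of the seed-selection and training step, reusing the make_discriminator helper above and using scikit-learn's BernoulliNB as a stand-in for the Naive Bayes learner; the thresholds, seed counts, and scoring details here are illustrative, not the system's actual settings:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB   # stand-in for the slides' Naive Bayes learner

def threshold_features(x, discriminators, thetas):
    """One binary feature per discriminator: does f_U(x) exceed its threshold?"""
    return [int(f(x) > theta) for f, theta in zip(discriminators, thetas)]

def train_seed_classifier(seeds_by_class, target, discriminators, thetas, n_pos=20):
    """Top-scoring seeds of the target class are POSITIVE; seeds of the other
    (assumed disjoint) classes serve as NEGATIVE examples."""
    ranked = sorted(seeds_by_class[target],
                    key=lambda x: sum(f(x) for f in discriminators), reverse=True)
    pos = ranked[:n_pos]
    neg = [x for c, xs in seeds_by_class.items() if c != target for x in xs]
    X = np.array([threshold_features(x, discriminators, thetas) for x in pos + neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    return BernoulliNB().fit(X, y)
```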
Bootstrapping - 4 • Assess new extractions using the classifier built from the previously trained discriminators. • Some ad hoc stopping conditions… (a "signal-to-noise" ratio).
Extensions to KnowItAll • Problem: Unsupervised learning finds clusters. What if the text doesn't support the clustering we want? • E.g., the target is "scientist", but the natural clusters are "biologist", "physicist", "chemist". • Solution: subclass extraction • Modify the template/rule system to extract subclasses of the target class (e.g., scientist → chemist, biologist, …) • Check extracted subclasses with WordNet and/or a PMI-like method (as for instances) • Extract from each subclass recursively
Extensions to KnowItAll • Problem: The set of rules is limited: it is derived from a fixed set of "templates" (general patterns, roughly from H92). • Solution 1: Pattern learning: augment the initial set of rules derivable from templates (a sketch follows below). • Search for instances I on the web • Generate patterns: some substring of I in context: "b1 … b4 I a1 … a4". Examples: "headquartered in <city>", "<city> hotels", … • Assume classes are disjoint and estimate the recall/precision of each pattern P • Exclude patterns that cover only one seed (very low recall) • Take the top 200 remaining patterns and • Evaluate them as extractors "using PMI" (?) • Evaluate them as discriminators (in the usual way?)
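A minimal sketch of the pattern-generation step, assuming simple whitespace tokenization and a window of up to four tokens on each side of a seed occurrence; the real system's tokenization, pattern filtering, and evaluation details differ:

```python
def context_patterns(sentence_tokens, instance_tokens, window=4):
    """Yield candidate patterns: up to `window` tokens before and after an
    occurrence of the instance, with the instance replaced by a <city> slot."""
    n, m = len(sentence_tokens), len(instance_tokens)
    for i in range(n - m + 1):
        if sentence_tokens[i:i + m] == instance_tokens:
            before = sentence_tokens[max(0, i - window):i]
            after = sentence_tokens[i + m:i + m + window]
            # emit every contiguous left/right context combination as a candidate
            for b in range(len(before) + 1):
                for a in range(len(after) + 1):
                    if b or a:
                        yield tuple(before[len(before) - b:]) + ("<city>",) + tuple(after[:a])

# e.g. list(context_patterns("it is headquartered in Boston".split(), ["Boston"]))
# includes ('headquartered', 'in', '<city>') among the candidates.
```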
Extensions to KnowItAll • Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page (see the sketch after the T1-T4 examples below) • Search for pages containing small sets of instances (e.g., "London Paris Rome Pittsburgh") • For each page P: • Find subtrees T of the DOM tree that contain >k seeds • Find the longest common prefix/suffix of the seeds in T • [Some heuristics are added to generalize this further] • Find all other strings inside T with the same prefix/suffix • Heuristically select the "best" wrapper for a page • Wrapper = (P, T, prefix, suffix)
T1 w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
T2 w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
T3 w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator w3 Italy, Japan, France, Israel, Spain, Brazil
T4 w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator w3 Italy, Japan, France, Israel, Spain, Brazil w4 Italy, Japan
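A toy sketch of the prefix/suffix wrapper idea on a flat string; the actual system works over DOM subtrees, adds generalization heuristics, and scores competing wrappers, none of which is shown here:

```python
import os
import re

def learn_list_wrapper(text, seeds, window=30):
    """Toy list extraction: learn the longest common left/right context of the
    known seeds in `text`, then extract every other string in that context."""
    lefts, rights = [], []
    for s in seeds:
        for m in re.finditer(re.escape(s), text):
            lefts.append(text[max(0, m.start() - window):m.start()])
            rights.append(text[m.end():m.end() + window])
    if not lefts:
        return None
    prefix = os.path.commonprefix([l[::-1] for l in lefts])[::-1]  # common *suffix* of left contexts
    suffix = os.path.commonprefix(rights)                          # common *prefix* of right contexts
    if not prefix or not suffix:
        return None
    pattern = re.escape(prefix) + r"(.+?)(?=" + re.escape(suffix) + r")"
    return [m.group(1) for m in re.finditer(pattern, text)]

# e.g. learn_list_wrapper("Italy, Japan, France, Israel, Spain, Brazil", ["Japan", "Israel"])
# learns the context (", ", ", ") and also returns "France" and "Spain".
```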
Observations • The corpus is accessed indirectly through the Google API • Only the top k discriminators are used • Run extractors via query keywords & extract • Limited by network access time • Lots of moving parts to engineer: • Rule templates • Signal-to-noise stopping conditions • List-extraction (LE) wrapper evaluation details • Parameters: number of discriminators, number of seeds to keep, number of names per concept, …
KnowItNow: Son of KnowItAll • Goal: faster results, not better results • Difference 1: • Store documents locally • Build a local index (the Bindings Engine) optimized for finding instances of KnowItAll rules and patterns • Based on an inverted index: term → (doc, position, contextInfo)
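A toy sketch of that index layout, with in-memory Python dictionaries standing in for the Bindings Engine's actual data structures and query operators:

```python
from collections import defaultdict

class LocalIndex:
    """Toy version of the idea: term -> [(doc_id, position, context window), ...],
    so extraction rules can be matched locally instead of re-querying the web."""
    def __init__(self, context=4):
        self.context = context
        self.docs = {}
        self.postings = defaultdict(list)

    def add(self, doc_id, tokens):
        self.docs[doc_id] = tokens
        for pos, term in enumerate(tokens):
            window = tokens[max(0, pos - self.context):pos + self.context + 1]
            self.postings[term.lower()].append((doc_id, pos, window))

    def fill_slot(self, left_context):
        """Return tokens that immediately follow an occurrence of `left_context`,
        e.g. fill_slot(["cities", "such", "as"]) -> candidate city names."""
        matches = []
        want = [t.lower() for t in left_context]
        for doc_id, pos, _ in self.postings[want[0]]:
            doc = self.docs[doc_id]
            span = [t.lower() for t in doc[pos:pos + len(want)]]
            if span == want and pos + len(want) < len(doc):
                matches.append(doc[pos + len(want)])
        return matches

# idx = LocalIndex(); idx.add("d1", "famous cities such as Pittsburgh and Boston".split())
# idx.fill_slot(["cities", "such", "as"])  ->  ["Pittsburgh"]
```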
KnowItNow: Son of KnowItAll • Difference 2: • A new model (the URNS model) to merge information from multiple extraction rules • Intuition: the instances generated by each extractor are assumed to be a mixture of two distributions: • random noise from a large instance pool • stuff with known structure (e.g., uniform, Zipf's law, …) • Using EM you can estimate the mixture probabilities and the parameters of the non-noisy data, giving Prob(x is noise | x was extracted)
KnowItNow: Son of KnowItAll [example fits] • Uniform model: Prob(noise) = 0.59; non-noisy data uniform over 137 instances (137 colors = 41% of mass; 15,346 colors = 59% of mass) … • Zipf model: Prob(noise) = 0.59; non-noisy data Zipf-distributed over >N instances (41% of mass fits the power law, 59% of mass doesn't) …
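A toy EM sketch of the mixture intuition: each extraction is either drawn from a fixed "signal" distribution (here uniform over an assumed support) or from a uniform "noise" pool, and only the mixing weight is fitted; the published URNS model also estimates the signal distribution's parameters and support size, which this sketch does not:

```python
import numpy as np

def fit_noise_weight(extractions, signal_support, pool_size, iters=50):
    """Toy EM for the two-component mixture: each extraction is either 'signal'
    (uniform over `signal_support`) or 'noise' (uniform over a pool of
    `pool_size` strings); only the mixing weight is estimated here."""
    p_signal = np.array([1.0 / len(signal_support) if x in signal_support else 0.0
                         for x in extractions])
    p_noise = np.full(len(extractions), 1.0 / pool_size)
    w = 0.5                                                       # initial P(signal)
    for _ in range(iters):
        resp = w * p_signal / (w * p_signal + (1 - w) * p_noise)  # E-step: P(signal | x)
        w = resp.mean()                                           # M-step: update mixing weight
    return 1 - w, 1 - resp  # overall P(noise) and per-extraction P(x is noise | x extracted)

# e.g. with 100 extractions drawn from a known small set of color names plus 40 junk
# strings, and a pool of ~15,000 candidate strings, the returned P(noise) comes out
# close to the fraction of junk extractions.
```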