CS246 Extracting Structured Information from the Web
A Story of Nightmare • Spam Inc • Task from your boss • 10M Web pages • Find all [person name, email] • Big salary cut unless you collect 100,000 “quality records” in a week Junghoo "John" Cho (UCLA Computer Science)
How? • Any idea? • Why such a task? Information is already there… • To use it for other programs: use the addresses to send emails • For now, let us ignore the techniques in the papers and see how we can approach the problem
Solution 1 • Manual approach • 10 sec/record • 8,640 records/day • 60,480 records/week • Okay if 5 sec/record
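The throughput figures for the manual approach can be checked with a few lines of arithmetic. This is a minimal sketch that assumes records are entered around the clock, 24 hours a day and 7 days a week, which is what the slide's 8,640 records/day implies. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: nonstop work, no breaks

def records_per_week(seconds_per_record: float) -> int:
    """Records one person can enter in a week at a fixed pace."""
    per_day = int(SECONDS_PER_DAY / seconds_per_record)
    return per_day * 7

print(records_per_week(10))  # 60480, short of the 100,000 target
print(records_per_week(5))   # 120960, enough to meet the target
```

With a realistic 8-hour working day the numbers shrink by a factor of three, which is one more reason to look for an automated approach.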
Solution 2 • Write an “extraction rule” • Regular expression • Email: [A-Za-z]+@([A-Za-z]+\.)+[A-Za-z]+ • Name: [A-Z][a-z]* [A-Z][a-z]* • Find all matches using the rule • Maybe “filter out” manually
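The hand-written rules of Solution 2 can be tried out directly. The sketch below uses Python's `re` module with the dot escaped and a non-capturing group, so that `findall` returns whole matches. The sample sentence and the function name are made up for illustration.

```python
import re

# Extraction rules from the slide (dot escaped, group made non-capturing).
EMAIL_RE = re.compile(r"[A-Za-z]+@(?:[A-Za-z]+\.)+[A-Za-z]+")
NAME_RE = re.compile(r"[A-Z][a-z]* [A-Z][a-z]*")

def extract_candidates(text):
    """Return (names, emails) found by the two hand-written rules."""
    return NAME_RE.findall(text), EMAIL_RE.findall(text)

sample = ("Contact John Smith at jsmith@cs.example.edu "
          "or Mary Jones at mjones@example.org.")
names, emails = extract_candidates(sample)
print(names)   # ['Contact John', 'Mary Jones']
print(emails)  # ['jsmith@cs.example.edu', 'mjones@example.org']
```

The name rule picks up “Contact John” instead of “John Smith”, which shows why a manual filtering pass may still be needed after rule-based extraction.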
Question • Do we have to construct an “extraction rule” for every task? • Can we automate “rule construction”?
General Problem • Web pages or plain text → extraction rule or pattern → structured data: (John, john@cs), (Eric, eric@cs), (James, james@cs) • How to generate the rule or pattern?
Basic Idea • Users provide small “examples” or a “training set” • Tag some [name, email] pairs from the data
Tagging • Name • Email
Basic Idea • Users provide small “examples” or a “training set” • Tag some [name, email] pairs from the data • System “generalizes” the examples & derives a “rule” or “patterns” • Find common patterns among the tagged pairs
Pattern Generation • <TR><TD>Chu</TD><TD>chu@cs</TD>… • <TR><TD>Cong</TD><TD>cong@cs</TD>… • <TR><TD>Cho</TD><TD>cho@cs</TD>… • Derived pattern: <TR><TD>#Name</TD><TD>#Email</TD>
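One simple way to generalize tagged rows like these is to replace each tagged value with a placeholder and keep the template that most examples share. This is a sketch under that assumption, not a method the slides prescribe. The helper names are hypothetical.

```python
from collections import Counter

def to_template(html, name, email):
    """Replace the tagged values with placeholders (email first, then name)."""
    return html.replace(email, "#Email", 1).replace(name, "#Name", 1)

def common_template(tagged):
    """Return the template shared by the largest number of tagged examples."""
    counts = Counter(to_template(h, n, e) for h, n, e in tagged)
    template, _ = counts.most_common(1)[0]
    return template

tagged = [
    ("<TR><TD>Chu</TD><TD>chu@cs</TD>", "Chu", "chu@cs"),
    ("<TR><TD>Cong</TD><TD>cong@cs</TD>", "Cong", "cong@cs"),
    ("<TR><TD>Cho</TD><TD>cho@cs</TD>", "Cho", "cho@cs"),
]
print(common_template(tagged))  # <TR><TD>#Name</TD><TD>#Email</TD>
```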
Basic Idea • Users provide small “examples” or a “training set” • Tag some [name, email] pairs from the data • System “generalizes” the examples & derives a “rule” or “patterns” • Find common patterns among the tagged pairs • Use the rule to extract other instances
Fundamental Questions • How to generalize? • Examples → patterns: how? • Pattern construction algorithm • How to express “patterns” or “rules”? • Regular expression? Context-free grammar? • Pattern language • How to select the right pattern? • Many possible patterns. Which one to choose? • Evaluation function
Dual Questions • What kind of sources? • Unstructured vs. Regular • Plain text vs. Table • Noisy vs. Clean • What kind of data to extract? • Difficult to identify vs. Easy to describe • Name vs. Email • Single occurrences vs. Multiple occurrences • Email vs. Song title
Questions?
Book and Author paper • How many people understood it? • What is the problem? • What is the basic idea? • How many people got it? • How many people liked it? • What did you like/hate about the paper?
Basic Algorithm (1) • Start with a small set of examples: (Isaac Asimov, The Robots of Dawn), (David Brin, Startide Rising) • Find all matches from Web pages (with surrounding text) • …<BR><LI><B>Startide Rising</B> by David Brin (2nd… • …book <LI><B>The Robots of Dawn</B> by Isaac Asimov (19… • Derive common patterns among matches: <LI><B>#Book</B> by #Author (
Basic Algorithm (2) • Find more examples using the pattern <LI><B>#Book</B> by #Author ( • …<LI><B>The Time Machine</B> by H.G. Wells (… • …<LI><B>The Lurker at the Threshold</B> by H.P. Lovecraft (… • New examples: (H.G. Wells, The Time Machine), (H.P. Lovecraft, The Lurker at the Threshold)
Basic Algorithm (3) • Find more occurrences of the new examples • …book <I>The Time Machine</I> by H.G. Wells (… • …<LI><I>The Lurker at the Threshold</I> by H.P. Lovecraft (… • Derive more rules based on the matches • <I>#Book</I> by #Author • Repeat the process
Basic Algorithm (Summary) • Examples → matching strings → patterns → more examples • Examples: (Asimov, Dawn), (Brin, Star) • Matching string: <I>Dawn</I> by Asimov ( • Pattern: <I>#Book</I> by #Author (
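The loop summarized above (examples → matching strings → patterns → more examples) can be sketched on a toy corpus. This is a simplified illustration, not the paper's implementation. It takes a fixed-width context around each occurrence as the pattern instead of generalizing across occurrences. The page contents, the context widths, and the function names are all assumptions.

```python
import re

def find_patterns(pages, examples, before=3, after=2):
    """Locate 'title ... author' occurrences and keep their surrounding text."""
    patterns = set()
    for author, title in examples:
        # Allow a short, arbitrary mid-string between the title and the author.
        rx = re.compile(re.escape(title) + r"(.{1,10}?)" + re.escape(author))
        for page in pages:
            for m in rx.finditer(page):
                prefix = page[max(0, m.start() - before):m.start()]
                suffix = page[m.end():m.end() + after]
                if prefix and suffix:  # skip occurrences without context
                    patterns.add((prefix, m.group(1), suffix))
    return patterns

def apply_patterns(pages, patterns):
    """Use each (prefix, middle, suffix) pattern to extract (author, title)."""
    pairs = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                        + r"(.+?)" + re.escape(suffix))
        for page in pages:
            for m in rx.finditer(page):
                pairs.add((m.group(2), m.group(1)))
    return pairs

def bootstrap(pages, seeds, iterations=2):
    """Alternate between deriving patterns and extracting new examples."""
    examples = set(seeds)
    for _ in range(iterations):
        examples |= apply_patterns(pages, find_patterns(pages, examples))
    return examples

pages = [
    "<LI><B>Startide Rising</B> by David Brin (1983)",
    "<LI><B>The Robots of Dawn</B> by Isaac Asimov (1983)",
    "<LI><B>The Time Machine</B> by H.G. Wells (1895)",
    "book <I>The Time Machine</I> by H.G. Wells (1895)",
    "<P><I>Foundation</I> by Isaac Asimov (1951)",
]
seeds = [("Isaac Asimov", "The Robots of Dawn"), ("David Brin", "Startide Rising")]
found = bootstrap(pages, seeds)
print(sorted(found))
```

The first iteration learns the `<B>…</B> by` pattern and picks up the H.G. Wells pair. The second iteration finds that pair inside `<I>` tags, learns a new pattern, and extracts (Isaac Asimov, Foundation).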
Result • 23M Web pages • 5 examples • 5 iterations • 1 manual filtering • Outcome: 15,257 pairs with few errors
What’s New? • No tagging. Simple examples • (Pattern, Relation) duality • Conceptually elegant • Feedback loop • Why don’t we use learned examples? • Small initial sample • Promising results
Problems of Feedback Loop • What if there are erroneous examples? • Expand to meaningless data?
What Did the Author Do? • Manual filtering in 4th iteration • Stopped after 5 iterations • Specificity factor • |middle| × |prefix| × |suffix| × |urlprefix| • Adopt a pattern if it has a long prefix, suffix and/or mid-string • Limit rules to a very specific URL space • Rule includes URL prefix
Divergence? • Another experiment • Initial examples: Baseball team names • Data: Newspaper articles • Results: All sports team names • Given a set of examples, where would it converge?
How to Control Divergence? • Example → Pattern • More than k examples • Pattern → Example • More than k patterns
Matrix Interpretation • Rows: Examples (Items) • We assume a hypothetical set of all examples occurring in the data • Columns: Patterns • We assume a hypothetical set of all patterns that can be derived • Cell[i, j] = 1 iff the jth pattern matches the ith example • E.g., Row[i] = (Book of worm, Asimov) • Column[j] = <I>#Book</I> by #Author • Cell[i, j] = 1 if “<I>Book of worm</I> by Asimov” exists in the data
Matrix Example • Columns (patterns): <I>.</I>, <B>.</B>, <U>.</U>, … • Rows (items): (A, B), (C, D), (C, A), (D, E), (S, L), (N, U), …
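The matrix view can be made concrete with a tiny example. The sketch below builds the 0/1 item-by-pattern matrix for a made-up corpus, since the cell values of the original figure are not recoverable. It then applies the “more than k” support thresholds to rows and columns. The corpus, the value of k, and the pattern strings are assumptions for illustration.

```python
def build_matrix(items, patterns, corpus):
    """cell[i][j] = 1 iff pattern j, filled in with item i, occurs in the corpus."""
    return [[1 if p.format(book=b, author=a) in corpus else 0 for p in patterns]
            for (b, a) in items]

items = [("A", "B"), ("C", "D"), ("C", "A")]  # (book, author) pairs
patterns = ["<I>{book}</I> by {author}",
            "<B>{book}</B> by {author}",
            "<U>{book}</U> by {author}"]
corpus = "<I>A</I> by B. <B>A</B> by B. <I>C</I> by D. <U>C</U> by A."

matrix = build_matrix(items, patterns, corpus)
for item, row in zip(items, matrix):
    print(item, row)

k = 1  # hypothetical support threshold
pattern_support = [sum(col) for col in zip(*matrix)]  # examples per pattern
item_support = [sum(row) for row in matrix]           # patterns per example
kept_patterns = [p for p, s in zip(patterns, pattern_support) if s > k]
kept_items = [i for i, s in zip(items, item_support) if s > k]
print(kept_patterns)  # patterns supported by more than k examples
print(kept_items)     # examples matched by more than k patterns
```

Column sums give the number of examples behind each pattern and row sums give the number of patterns behind each example, so both thresholds are simple filters on the same matrix.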
How to Control Divergence? • Example → Pattern • More than k examples • Pattern → Example • More than k patterns • Fix the matrix!
How to Change Matrix? • Change Row? • Filter out noise from data • Use only the pages mentioning “books” • Classify pages based on word frequency • Identify only “relevant” part of pages • Identify only “structured” part of pages • List? • Tables?
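Filtering pages by word frequency, one of the row-changing options above, could look like the following. This is a minimal sketch of a keyword-frequency filter, not a trained classifier. The topic vocabulary and the threshold are hypothetical values chosen for the example.

```python
import re
from collections import Counter

TOPIC_WORDS = {"book", "books", "author", "novel", "isbn"}  # hypothetical vocabulary

def topic_score(text):
    """Fraction of the words on a page that belong to the topic vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in TOPIC_WORDS) / len(words)

def relevant_pages(pages, threshold=0.05):
    """Keep only the pages whose topic score reaches the threshold."""
    return [p for p in pages if topic_score(p) >= threshold]

pages = [
    "New books by this author: a novel about robots",
    "Baseball scores and team standings for this week",
]
print(relevant_pages(pages))  # only the first page is kept
```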
How to Change Matrix? • Change Column? • Use different pattern language • E.g., the author used “url prefix” • Context-free grammar? • What will be a good pattern space?
Fundamental Questions • How to express “patterns” or “rules”? • Pattern language • How to go from examples to patterns? • Pattern construction algorithm • How to select the right one? • Evaluation function
Pattern Language? • Very limited regular expression • With URL filter • URL filter seems to be important to minimize noise • [prefix] #book [midstring] #author [suffix]
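A pattern of the form `[prefix] #book [midstring] #author [suffix]` with a URL filter maps naturally onto a small data structure. The sketch below is one possible encoding. The class, its field names, and the example URLs are assumptions, and the non-greedy `(.+?)` wildcard stands in for whatever restricted wildcard the original system used.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    url_prefix: str
    prefix: str
    middle: str
    suffix: str

    def regex(self):
        """Compile '[prefix] #book [midstring] #author [suffix]' to a regex."""
        return re.compile(re.escape(self.prefix) + r"(.+?)"
                          + re.escape(self.middle) + r"(.+?)"
                          + re.escape(self.suffix))

    def extract(self, url, text):
        """Return (book, author) pairs, but only for pages passing the URL filter."""
        if not url.startswith(self.url_prefix):
            return []
        return [(m.group(1), m.group(2)) for m in self.regex().finditer(text)]

p = Pattern("http://books.example.com/", "<LI><B>", "</B> by ", " (")
text = "<LI><B>The Time Machine</B> by H.G. Wells (1895)"
print(p.extract("http://books.example.com/sf.html", text))
print(p.extract("http://news.example.com/today.html", text))  # filtered out: []
```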
Pattern Construction Algorithm? • Group matching strings based on “mid-string” • Find longest prefix, suffix and URL-prefix • If the pattern is long enough, adopt it
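The three construction steps can be sketched as follows. Each occurrence is assumed to be a tuple (url, text before the book, mid-string, text after the author). The “long enough” test is simplified here to “prefix, suffix and URL prefix are all non-empty”, and `min_support` is a hypothetical parameter that drops mid-strings seen only once.

```python
import os
from collections import defaultdict

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def build_patterns(occurrences, min_support=2):
    """Group occurrences by mid-string, then generalize each group."""
    groups = defaultdict(list)
    for occ in occurrences:
        groups[occ[2]].append(occ)
    patterns = []
    for middle, group in groups.items():
        if len(group) < min_support:
            continue
        url_prefix = os.path.commonprefix([g[0] for g in group])
        prefix = common_suffix([g[1] for g in group])  # text right before the book
        suffix = os.path.commonprefix([g[3] for g in group])
        if url_prefix and prefix and suffix:  # simplified "long enough" test
            patterns.append((url_prefix, prefix, middle, suffix))
    return patterns

occurrences = [
    ("http://books.example.com/sf/a.html", "...<BR><LI><B>", "</B> by ", " (2nd ed."),
    ("http://books.example.com/sf/b.html", "book <LI><B>", "</B> by ", " (19"),
    ("http://other.example.org/x.html", "<P><I>", "</I>, ", "."),
]
print(build_patterns(occurrences))
```

Note that the shared context before the book is a common suffix of the preceding text, which is why the helper reverses the strings before calling `os.path.commonprefix`.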
Evaluation Function? • The longer, the better. • Specificity factor • |middle| × |prefix| × |suffix| × |urlprefix| • To minimize noise
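The specificity factor is just a product of string lengths, so it is easy to compute and compare. In this sketch the two example patterns and the cut-off value are made up. The slide gives the formula but no concrete threshold.

```python
def specificity(url_prefix, prefix, middle, suffix):
    """|middle| x |prefix| x |suffix| x |urlprefix| from the slide."""
    return len(middle) * len(prefix) * len(suffix) * len(url_prefix)

specific = ("http://books.example.com/sf/", "<LI><B>", "</B> by ", " (")
loose = ("http://", "<B>", " ", "")  # empty suffix, so the product is zero

THRESHOLD = 100  # hypothetical cut-off for adopting a pattern
for pattern in (specific, loose):
    score = specificity(*pattern)
    print(pattern, score, "adopt" if score > THRESHOLD else "reject")
```

Because the factors are multiplied, a pattern with any empty component scores zero no matter how long its other parts are.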
Dual Question • Regular vs. Unstructured source • Relatively regular source required • Noisy vs. Clean source • General noise okay • Single vs. Multiple occurrences • Multiple occurrence
Would It Work? • [name, phone number]
Would It Work? • [name, phone number]? • No: • [mid-string] not fixed • More expressive pattern language • HTML parse-tree based?
Any Other Questions?
RoadRunner • What is the problem? • What is the main idea?
Key Observation • Many Web pages generated from structured database • These pages are based on “templates”, thus follow extremely regular structure • We can extract data by identifying “different parts”
Key Idea • Compare two pages • Extract different parts
Simplest Case • Page 1: <HTML> Books of: <B> John Smith </B> Title: <I> DB Primer </I> </HTML> • Page 2: <HTML> Books of: <B> Paul Jones </B> Title: <I> XML at Work </I> </HTML> • Mismatch! (John Smith vs. Paul Jones) • Mismatch! (DB Primer vs. XML at Work)
Simplest Case: Template • Both pages reduce to the same template: <HTML> Books of: <B> </B> Title: <I> </I> </HTML>
Simplest Case: Data • (John Smith, DB Primer) • (Paul Jones, XML at Work)
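The simplest case can be reproduced in a few lines: tokenize both pages into tags and text, walk them in lockstep, and treat every mismatch as a data field. This sketch covers only that case, where both pages have the same number of tokens. It is not the full RoadRunner algorithm, and the function names and the `#DATA` marker are assumptions.

```python
import re

def tokenize(page):
    """Split a page into tags and non-empty text chunks."""
    tokens = re.findall(r"<[^>]+>|[^<]+", page)
    return [t.strip() for t in tokens if t.strip()]

def align(page_a, page_b):
    """Compare two same-shaped pages; mismatching tokens become data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("this sketch handles only pages with identical structure")
    template, data_a, data_b = [], [], []
    for x, y in zip(a, b):
        if x == y:
            template.append(x)
        else:
            template.append("#DATA")
            data_a.append(x)
            data_b.append(y)
    return template, data_a, data_b

page1 = "<HTML> Books of: <B> John Smith </B> Title: <I> DB Primer </I> </HTML>"
page2 = "<HTML> Books of: <B> Paul Jones </B> Title: <I> XML at Work </I> </HTML>"
template, data1, data2 = align(page1, page2)
print(" ".join(template))
print(data1)  # ['John Smith', 'DB Primer']
print(data2)  # ['Paul Jones', 'XML at Work']
```

Pages with repeated, missing, or varying items break the lockstep assumption, so they need a more flexible alignment than the `zip` used here.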
What Other Cases?
Repeated Items (from Amazon)
Missing Items (from Amazon) • No image!
Varying Items (from Amazon) • Item varies!