Information Extraction Research @ Yahoo! Labs Bangalore
Rajeev Rastogi, Yahoo! Labs Bangalore
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
  • News, finance, sports
  • Answers, Flickr, del.icio.us
  • Mail, messaging
  • Search
Unparalleled scale
• 25 terabytes of data collected each day
• Over 4 billion clicks every day
• Over 4 billion emails per day
• Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable
No other company on the planet processes as much data as we do!
Yahoo! Labs Bangalore
• Focus is on basic and applied research
  • Search
  • Advertising
  • Cloud computing
• University relations
  • Faculty research grants
  • Summer internships
  • Sharing data/computing infrastructure
  • Conference sponsorships
  • PhD co-op program
Search results of the future: Structured abstracts
[Figure: example results from yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd]
Search results of the future: Intelligent ranking
• Rank by price
A key technology for enabling search transformation: Information extraction (IE)
Information extraction (IE)
• Goal: Extract structured records from Web pages
[Figure: example record with fields Name, Category, Address, Map, Phone, Price, Reviews]
Multiple verticals
• Business, social networking, video, …
One schema per vertical
[Figure: example schemas; attribute labels shown include Name, Title, Posted by, Date, Price, Education, Category, Address, Connections, Phone, Rating, Views]
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts
[Figure: example page with a region labeled Noise]
Web page types
• Hand-crafted
• Template-based
Template-based pages
• Pages within a Web site are generated using scripts and have very similar structure
  • This similarity can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
  • E.g. search query: “Chinese Mirch New York”
  • 9 template-based pages in the top 10 results
Wrapper Induction
• Enables extraction from template-based pages
[Figure: two-phase pipeline. Learn: Website pages → Sample pages → Annotate → Annotations → Learn → Wrappers (XPath rules). Extract: Website pages → Apply wrappers → Records]
Example
• XPath: /html/body/div/div/div/div/div/div/span
• Generalize: /html/body//div//span
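The generalization step in this example can be sketched as follows. This is a minimal illustration, not the wrapper learner described in the talk: it assumes that generalizing means collapsing each run of repeated steps into a single descendant (`//`) step, and it handles only plain tag steps without predicates. The function names `generalize` and `matches` are hypothetical.

```python
import re
from itertools import groupby

def generalize(xpath):
    """Collapse each run of repeated steps into one descendant ('//') step."""
    parts, after_run = [], False
    for tag, group in groupby(step for step in xpath.split("/") if step):
        is_run = len(list(group)) > 1
        # Use '//' for a collapsed run and for the step right after it.
        parts.append(("//" if is_run or after_run else "/") + tag)
        after_run = is_run
    return "".join(parts)

def matches(generalized, concrete_path):
    """Check whether a concrete tag path satisfies the generalized expression."""
    pattern = ""
    for sep, tag in re.findall(r"(//|/)([^/]+)", generalized):
        # '//' allows any number of intermediate elements; '/' allows none.
        pattern += ("/(?:[^/]+/)*" if sep == "//" else "/") + re.escape(tag)
    return re.fullmatch(pattern, concrete_path) is not None

rule = generalize("/html/body/div/div/div/div/div/div/span")
print(rule)
# A page of the same site with shallower nesting still matches the rule.
print(matches(rule, "/html/body/div/div/span"))
```

The generalized rule tolerates small differences in nesting depth between pages of one site, which is also why it can match several candidate nodes and needs the filters on the next slide.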
Filters
• Apply filters to prune the multiple candidates that match an XPath expression
• XPath: /html/body//div//span
• Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
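A regex filter of this kind can be sketched in a few lines. The candidate strings below are made up for illustration, and the pattern is an assumption: it accepts either a hyphen or a space before the last four digits, because the phone numbers in the later example slides are written with spaces.

```python
import re

# Assumed phone pattern: "(ddd) ddd-dddd" or "(ddd) ddd dddd".
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}[- ][0-9]{4}")

def filter_candidates(candidates, pattern=PHONE):
    """Keep only the candidate node texts that fully match the pattern."""
    return [text for text in candidates if pattern.fullmatch(text.strip())]

# Hypothetical texts of the nodes matched by /html/body//div//span.
candidates = ["Chinese Mirch", "120 Lexington Avenue", "(212) 532 3663", "10016"]
print(filter_candidates(candidates))
```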
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  • Need to learn a separate wrapper for each site
  • Annotating example pages from thousands of sites can be time-consuming & expensive
Research challenge
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
  • Only annotate pages from a few sites initially as training data
Conditional Random Fields (CRFs)
• Model the conditional probability distribution of a label sequence y = y1,…,yn given an input sequence x = x1,…,xn
• fk: features, λk: weights
• Choose λk to maximize the log-likelihood of the training data
• Use the Viterbi algorithm to compute the label sequence y with the highest probability
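The slide does not reproduce the model formula. Assuming the standard linear-chain form, the distribution is p(y|x) = (1/Z(x)) exp( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x, t) ), where Z(x) normalizes over all label sequences. Under that assumption, Viterbi decoding can be sketched as below. The `score` function stands for the inner weighted feature sum at position t; the toy features and weights are invented for illustration and are not from the talk.

```python
def viterbi(n, labels, score):
    """Return the highest-scoring label sequence of length n (n >= 1).

    score(prev_label, label, t) is the weighted feature sum at position t;
    prev_label is None at t = 0.
    """
    best = {y: score(None, y, 0) for y in labels}
    backpointers = []
    for t in range(1, n):
        new_best, pointer = {}, {}
        for y in labels:
            # Best previous label for ending in y at position t.
            prev = max(labels, key=lambda p: best[p] + score(p, y, t))
            new_best[y] = best[prev] + score(prev, y, t)
            pointer[y] = prev
        best = new_best
        backpointers.append(pointer)
    last = max(labels, key=best.get)
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, best[last]

# Toy example with two hand-set features (hypothetical weights).
x = ["Jewel", "India", "212", "5544"]
labels = ["Name", "Phone"]

def score(prev, y, t):
    emit = 1.0 if (x[t].isdigit() and y == "Phone") or (x[t].isalpha() and y == "Name") else 0.0
    trans = 0.5 if prev == y else 0.0  # reward staying in the same label
    return emit + trans

print(viterbi(len(x), labels, score))
```

Because Z(x) is the same for every label sequence, maximizing the unnormalized score is enough to find the most probable sequence.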
CRF-based IE
• Web pages can be viewed as labeled sequences
• Train a CRF using pages from a few Web sites
• Then use the trained CRF to extract from the remaining sites
[Figure: page fields labeled Name, Noise, Category, Address, Phone]
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work too well across Web sites that
  • contain long pages with lots of noise
  • have very different structure
An alternate approach that exploits site knowledge
• Build attribute classifiers for each attribute
  • Use pages from a few initial Web sites
• For each page from a new Web site
  • Segment page into sequence of fields (using static repeating text)
  • Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels
  • Uniqueness: an attribute occurs at most once in a page
  • Proximity: attribute values appear close together in a page
  • Structural: relative positions of attributes are identical across pages of a Web site
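As a rough illustration of the uniqueness constraint, the sketch below assigns at most one field to each attribute by greedily taking the most confident classifier outputs first. The greedy strategy, the threshold, and the confidence scores are all assumptions made for this example; the talk does not specify how the constraints are enforced.

```python
def assign_unique(scores, threshold=0.5):
    """Label each field so that every attribute is used at most once.

    scores: one dict per field, mapping attribute -> classifier confidence.
    Fields left without an attribute are labeled "Noise".
    """
    ranked = sorted(
        ((conf, i, attr) for i, field in enumerate(scores) for attr, conf in field.items()),
        key=lambda item: -item[0],
    )
    labels = ["Noise"] * len(scores)
    used = set()
    for conf, i, attr in ranked:
        if conf < threshold:
            break  # remaining outputs are too weak to trust
        if labels[i] == "Noise" and attr not in used:
            labels[i] = attr
            used.add(attr)
    return labels

# Hypothetical confidences for four fields of one page.
scores = [
    {"Name": 0.9, "Category": 0.4},
    {"Name": 0.75, "Category": 0.7},  # the best label alone would be a second Name
    {"Address": 0.8, "Name": 0.55},
    {"Phone": 0.95},
]
print(assign_unique(scores))
```

Taking each field's top label independently would produce Name twice here; the uniqueness constraint pushes the second field to its next-best label.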
Attribute classifiers + constraints example
Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
  Labels: Name | Category | Address | Phone
Page 2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544
  Labels: Name | Category | Address | Phone
Page 3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
  Candidate labels from the classifiers (ambiguous): Phone; Category, Name; Name, Noise; Address
• Uniqueness constraint: Name
• Precedence constraint: Name < Category
Page 3 after applying the constraints: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
  Labels: Name | Category | Address | Phone
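One way to picture how the precedence constraint resolves Page 3 is to learn the attribute orderings that hold on the unambiguous pages and then discard every candidate labeling that breaks them. The sketch below does this by brute-force enumeration. The candidate label sets for Page 3 are hypothetical, since the slide text does not show which candidates belong to which field, and the enumeration is only an illustration, not the inference method used in the talk.

```python
from itertools import combinations, product

def learn_precedence(pages):
    """Return pairs (a, b) where a appears before b and never after it."""
    seen, contradicted = set(), set()
    for labels in pages:
        for first, second in combinations(labels, 2):
            seen.add((first, second))
            contradicted.add((second, first))
    return seen - contradicted

def consistent(labeling, precedence):
    """Check uniqueness of attributes and agreement with learned orderings."""
    attrs = [label for label in labeling if label != "Noise"]
    if len(attrs) != len(set(attrs)):
        return False  # uniqueness violated
    return all((later, earlier) not in precedence
               for earlier, later in combinations(labeling, 2))

# Pages 1 and 2 are labeled without ambiguity.
known = [["Name", "Category", "Address", "Phone"]] * 2
precedence = learn_precedence(known)

# Hypothetical candidate labels per field for Page 3.
candidates = [["Name", "Category"], ["Category", "Name"], ["Address"], ["Phone"]]
valid = [lab for lab in product(*candidates) if consistent(lab, precedence)]
print(valid)
```

With these candidates, uniqueness removes the labelings that repeat Name or Category, and Name < Category removes the one that puts Category first, leaving a single labeling.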
Performance evaluation: Datasets
• 100 pages from 5 restaurant Web sites with very different structure
  • www.citysearch.com
  • www.fromers.com
  • www.nymag.com
  • www.superpages.com
  • www.yelp.com
• Extract attributes: Name, Address, Phone number, Hours of operation, Description
Methods considered
• CRFs, attribute classifiers + constraints
• Features
  • Lexicon: Words in the training Web pages
  • Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay, …
  • Attribute-level: Number of words, Overlap with title, …
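The regex-style features named on this slide can be sketched as simple token tests. The exact definitions used in the experiments are not given, so the ones below (including the list of day abbreviations) are assumptions.

```python
import re

# Assumed set of day names and common abbreviations.
DAYS = {"mon", "tue", "tues", "wed", "thu", "thur", "thurs", "fri", "sat", "sun",
        "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}

def regex_features(token):
    """Return boolean features for a single token."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": re.fullmatch(r"[0-9]+", token) is not None,
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }

for token in ["Lexington", "NY", "10016", "Mon."]:
    active = [name for name, value in regex_features(token).items() if value]
    print(token, active)
```

Features like is5DigitNum and isDay are useful across sites because they describe the value itself (a ZIP code, opening hours) rather than the layout of any one page.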
Evaluation methodology
• Metrics
  • Precision, recall, F1 for attributes
• Test on one site, use pages from remaining 4 sites as training data
• Average measures over all 5 sites
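The leave-one-site-out protocol and the metrics can be sketched as below. The data layout (sets of `(page, attribute, value)` triples) and the `train` and `extract` callables are hypothetical placeholders for whichever method is being evaluated.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over sets of (page, attribute, value) triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def leave_one_site_out(sites, train, extract):
    """sites: dict mapping site name -> (pages, gold triples).

    For each site, train on the other sites, test on the held-out one,
    then average the three measures over all sites.
    """
    per_site = {}
    for held_out in sites:
        training = {name: data for name, data in sites.items() if name != held_out}
        model = train(training)
        pages, gold = sites[held_out]
        per_site[held_out] = prf1(extract(model, pages), gold)
    averages = tuple(
        sum(result[i] for result in per_site.values()) / len(per_site) for i in range(3)
    )
    return per_site, averages

# Small check of the metric on made-up triples.
gold = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200"),
        ("p1", "Address", "21 W 52nd St"), ("p1", "Hours", "Mon-Sat")}
predicted = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200"),
             ("p1", "Address", "American")}
print(prf1(predicted, gold))
```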
Experimental results
[Figure: precision and recall charts]
Other IE scenarios: Browse page extraction
[Figure: browse page with similar-structured records]
IE big picture/taxonomy
• Things to extract from
  • Template-based, browse, hand-crafted pages, text
• Things to extract
  • Records, tables, lists, named entities
• Techniques used
  • Structure-based (HTML tags, DOM tree paths), e.g. wrappers
  • Content-based (attribute values/models), e.g. dictionaries
  • Structure + Content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs
• Level of automation
  • Manual, supervised, unsupervised