Information Extraction Research @ Yahoo! Labs Bangalore

Rajeev Rastogi, Yahoo! Labs Bangalore

The most visited site on the internet • 600 million+ users per month • Super popular properties • News, finance, sports • Answers, flickr, del.icio.us • Mail, messaging • Search

Unparalleled scale • 25 terabytes of data collected each day • Over 4 billion clicks every day • Over 4 billion emails per day • Over 6 billion instant messages per day • Over 20 billion web documents indexed • Over 4 billion images searchable • No other company on the planet processes as much data as we do!

Yahoo! Labs Bangalore • Focus is on basic and applied research • Search • Advertising • Cloud computing • University relations • Faculty research grants • Summer internships • Sharing data/computing infrastructure • Conference sponsorships • PhD co-op program

What does search look like today?

Search results of the future: Structured abstracts • Example sources: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd

Search results of the future: Intelligent ranking • Rank by price

A key technology for enabling search transformation: Information extraction (IE)

Information extraction (IE) • Goal: Extract structured records from Web pages • Example record attributes: Name, Category, Address, Map, Phone, Price, Reviews

Multiple verticals • Business, social networking, video, …

One schema per vertical • Attribute labels shown: Name, Title, Posted by, Date, Price, Title, Education, Category, Address, Connections, Phone, Price, Rating, Views

IE on the Web is a hard problem • Web pages are noisy • Pages belonging to different Web sites have different layouts

Web page types • Hand-crafted • Template-based

Template-based pages • Pages within a Web site generated using scripts, have very similar structure • Can be leveraged for extraction • ~30% of crawled Web pages • Information rich, frequently appear in the top results of search queries • E.g. search query: “Chinese Mirch New York” • 9 template-based pages in the top 10 results

Wrapper Induction • Enables extraction from template-based pages • Learn: Website pages → Sample pages → Annotate pages → Learn wrappers (XPath rules) • Extract: Website pages → Apply wrappers → Records

Example • XPath: /html/body/div/div/div/div/div/div/span • Generalized: /html/body//div//span
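A minimal sketch of why the generalized expression is useful, assuming two pages from the same site that nest the target span at different depths. The helper names (build_page, match_texts) and the sample values are illustrative only, and the paths are written relative to the html root because Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def build_page(depth, text):
    """Build <html><body><div>...<span>text</span>...</div></body></html> with `depth` nested divs."""
    html = ET.Element("html")
    node = ET.SubElement(html, "body")
    for _ in range(depth):
        node = ET.SubElement(node, "div")
    ET.SubElement(node, "span").text = text
    return html


def match_texts(root, path):
    """Return the text of each distinct element matched by `path` (relative to <html>)."""
    seen, texts = set(), []
    for elem in root.findall(path):
        # With nested <div>s, a descendant step can reach the same <span> more than once.
        if id(elem) not in seen:
            seen.add(id(elem))
            texts.append(elem.text)
    return texts


SPECIFIC = "body/div/div/div/div/div/div/span"  # /html/body/div/div/div/div/div/div/span
GENERAL = "body//div//span"                     # /html/body//div//span

page_a = build_page(6, "(212) 532 3663")  # the annotated sample page
page_b = build_page(4, "(212) 869 5544")  # same site, shallower nesting

print(match_texts(page_a, SPECIFIC))  # matches the annotated page
print(match_texts(page_b, SPECIFIC))  # misses the second page
print(match_texts(page_b, GENERAL))   # the generalized path still matches
```

The generalized path trades precision for coverage: it keeps matching when the nesting depth changes, but it can also match more candidates, which is where filters come in.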

Filters • Apply filters to prune the multiple candidates that match an XPath expression • XPath: /html/body//div//span • Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
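The filtering step can be sketched as follows, assuming the phone filter is meant to match US-style numbers such as (212) 532-3663. The candidate strings and the function name filter_candidates are hypothetical.

```python
import re

# Assumed reading of the slide's phone filter: "(ddd) ddd-dddd".
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def filter_candidates(candidates, pattern=PHONE):
    """Keep only the candidate strings that match the attribute's regex in full."""
    return [c for c in candidates if pattern.fullmatch(c.strip())]


# Text of several <span> nodes matched by a generalized XPath (made-up values).
candidates = ["Chinese Mirch", "120 Lexington Avenue", "(212) 532-3663"]
print(filter_candidates(candidates))
```

Using a full match rather than a substring search keeps longer noisy strings that merely contain a phone number from passing the filter.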

Limitations of wrappers • Won’t work across Web sites due to different page layouts • Scaling to thousands of sites can be a challenge • Need to learn a separate wrapper for each site • Annotating example pages from thousands of sites can be time-consuming & expensive

Research challenge • Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site • Only annotate pages from a few sites initially as training data

Conditional Random Fields (CRFs) • Model the conditional probability distribution of a label sequence y = y1,…,yn given an input sequence x = x1,…,xn • fk: features, λk: weights • Choose λk to maximize the log-likelihood of the training data • Use the Viterbi algorithm to compute the label sequence y with the highest probability
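For reference, the standard linear-chain CRF that these bullets appear to describe can be written as below, with feature functions f_k and weights \lambda_k. The exact form used in the talk is not given in the transcript, so this is the textbook formulation rather than a quotation.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \Big)
```

Training chooses the weights to maximize \sum_j \log p(y^{(j)} \mid x^{(j)}) over the labeled training sequences, and decoding picks \arg\max_y p(y \mid x).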

CRF-based IE • Web pages can be viewed as labeled sequences • Labels in the example: Name, Noise, Category, Address, Phone • Train a CRF using pages from a few Web sites • Then use the trained CRF to extract from the remaining sites
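Applying a trained CRF to a new page amounts to Viterbi decoding over the page's fields. The sketch below assumes the per-field label scores and label-to-label transition scores have already been computed from the learned weights; all numbers and field texts are made up for illustration.

```python
def viterbi(emit, trans, labels):
    """Return the highest-scoring label sequence.

    emit[t][y]    : score of label y for field t (hypothetical values)
    trans[(a, b)] : score of label a being followed by label b (0.0 if absent)
    """
    best = [{y: emit[0][y] for y in labels}]
    back = []
    for t in range(1, len(emit)):
        scores, ptr = {}, {}
        for y in labels:
            # Best previous label for ending at y in position t.
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit[t][y]
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


labels = ["Name", "Phone", "Noise"]
emit = [
    {"Name": 2.0, "Phone": -2.0, "Noise": 0.5},   # "Chinese Mirch"
    {"Name": 0.5, "Phone": -2.0, "Noise": 1.5},   # "Home | Reviews"
    {"Name": -2.0, "Phone": 3.0, "Noise": 0.0},   # "(212) 532-3663"
]
trans = {("Name", "Name"): -3.0}  # discourage two consecutive Name fields

print(viterbi(emit, trans, labels))
```

Scores are additive here, which corresponds to working with the log of the unnormalized CRF probability; the normalizer does not affect which sequence is best.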

Drawbacks of CRFs • Require too many training examples • Have been used previously to segment short strings with similar structure • However, may not work too well across Web sites that • contain long pages with lots of noise • have very different structure

An alternate approach that exploits site knowledge • Build attribute classifiers for each attribute • Use pages from a few initial Web sites • For each page from a new Web site • Segment page into sequence of fields (using static repeating text) • Use attribute classifiers to assign attribute labels to fields • Use constraints to disambiguate labels • Uniqueness: an attribute occurs at most once in a page • Proximity: attribute values appear close together in a page • Structural: relative positions of attributes are identical across pages of a Web site

Attribute classifiers + constraints example • Page 1: Chinese Mirch (Name) • Chinese, Indian (Category) • 120 Lexington Avenue, New York, NY 10016 (Address) • (212) 532 3663 (Phone) • Page 2: Jewel of India (Name) • Indian (Category) • 15 W 44th St, New York, NY 10016 (Address) • (212) 869 5544 (Phone) • Page 3, classifier output: 21 Club • American • 21 W 52nd St, New York, NY 10019 • (212) 582 7200, with candidate labels Phone; Category, Name; Name, Noise; Address • Uniqueness constraint: Name • Precedence constraint: Name < Category • Page 3, after applying constraints: 21 Club (Name) • American (Category) • 21 W 52nd St, New York, NY 10019 (Address) • (212) 582 7200 (Phone)
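One way to picture the disambiguation step is as a search for the best-scoring labeling that satisfies the constraints. The sketch below is a brute-force illustration, not the method from the talk: the classifier scores are invented, and it assumes "21 Club" is the field with candidates Name/Noise and "American" the field with candidates Category/Name.

```python
from itertools import product


def best_labeling(field_scores, unique, precedence):
    """Return the highest-scoring label per field that satisfies the constraints.

    field_scores : one dict per field, mapping candidate label -> classifier score
    unique       : attributes that may occur at most once in a page
    precedence   : (a, b) pairs meaning a must appear before b when both occur
    """
    best, best_score = None, float("-inf")
    for labels in product(*[list(s) for s in field_scores]):
        if any(labels.count(a) > 1 for a in unique):
            continue  # violates uniqueness
        if any(a in labels and b in labels and labels.index(a) > labels.index(b)
               for a, b in precedence):
            continue  # violates precedence
        score = sum(s[l] for s, l in zip(field_scores, labels))
        if score > best_score:
            best, best_score = list(labels), score
    return best


# Page 3 fields in page order, with hypothetical classifier scores.
page3 = [
    {"Name": 0.60, "Noise": 0.40},      # "21 Club"
    {"Name": 0.55, "Category": 0.45},   # "American"
    {"Address": 0.90, "Noise": 0.10},   # "21 W 52nd St, New York, NY 10019"
    {"Phone": 0.95, "Noise": 0.05},     # "(212) 582 7200"
]

print(best_labeling(page3, unique={"Name", "Category", "Address", "Phone"},
                    precedence=[("Name", "Category")]))
```

Taking the top label per field independently would label both "21 Club" and "American" as Name; the uniqueness constraint rules that out and the next-best consistent labeling is chosen. Exhaustive search is only practical for a handful of fields.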

Performance evaluation: Datasets • 100 pages from 5 restaurant Web sites with very different structure • www.citysearch.com • www.fromers.com • www.nymag.com • www.superpages.com • www.yelp.com • Extract attributes: Name, Address, Phone num, Hours of operation, Description

Methods considered • CRFs, attribute classifiers + constraints • Features • Lexicon: Words in the training Web pages • Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay,… • Attribute-level: Num of words, Overlap with title,…
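The regex-style features listed above could be computed per token roughly as follows. The exact definitions used in the experiments are not given in the slides, so each test below (and the DAYS set) is an assumed, simplified reading of the feature name.

```python
import re

# Assumed set of day names and abbreviations for the isDay feature.
DAYS = {"mon", "tue", "wed", "thu", "fri", "sat", "sun",
        "monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}


def token_features(token):
    """Map a token to boolean features named after those on the slide."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": token.isdigit(),
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }


for tok in ["NY", "10016", "Mon"]:
    print(tok, token_features(tok))
```

Features like is5DigitNum hint at ZIP codes in addresses, and isDay hints at hours of operation, which is why such generic tests can transfer across sites better than site-specific layout cues.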

Evaluation methodology • Metrics • Precision, recall, F1 for attributes • Test on one site, use pages from remaining 4 sites as training data • Average measures over all 5 sites
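The evaluation protocol can be sketched as below: hold out one site, train on the other four, and score extracted attribute values against the gold annotations. The tuple representation of an extraction, (page, attribute, value), and the function names are assumptions made for the example.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 for sets of (page, attribute, value) tuples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def leave_one_site_out(sites):
    """Yield (training_sites, test_site) pairs, holding out each site once."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


sites = ["citysearch", "fromers", "nymag", "superpages", "yelp"]
splits = list(leave_one_site_out(sites))

# Made-up extractions for one test page.
gold = {("p1", "Name", "21 Club"), ("p1", "Address", "21 W 52nd St")}
pred = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200")}
print(prf1(pred, gold))
```

The reported numbers would then be the average of each metric over the five held-out sites.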

Experimental results • Charts: Precision, Recall

Other IE scenarios: Browse page extraction • Similar-structured records

IE big picture/taxonomy • Things to extract from • Template-based, browse, hand-crafted pages, text • Things to extract • Records, tables, lists, named entities • Techniques used • Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers • Content-based (attribute values/models) – e.g. dictionaries • Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs • Level of automation • Manual, supervised, unsupervised
