Towards Combining Web Classification and Web Information Extraction: a Case Study

Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
*Hewlett-Packard Labs China
^Institute of Computing Technology, CAS

Web Content Analysis for Vertical Search
• Web Information Extraction
  • Extract the metadata in the target pages (e.g., product name, model number, price …; course title, ID, time, teacher …)
• Web Classification
  • Identify the target pages (e.g., product pages, course homepages) among the Web pages obtained after crawling

OfCourse
• Search engine for online courses
• More than 60,000 courses from the top 50 universities in the US

Web Classification and Web Information Extraction
• WC vs. WIE
  • Two sequential and separate phases
  • Error accumulation
[Figure: Web Content Analysis for Vertical Search, with Web Classification followed by Web Information Extraction]

Contributions
• Combine Web Classification and Web Information Extraction in a probabilistic model to achieve mutual enhancement
[Figure: Web Content Analysis for Vertical Search, comprising Web Classification and Web Information Extraction]

Motivating Examples (1)
[Figure labels: WIE Oracle, "No Course Title"]
• Lots of course-related terms on this page
• WIE helps to improve the precision of WC

Motivating Examples (2)
[Figure labels: WIE Oracle, "No Course Title", "With Course Title"]
• Few course-related terms on this page
• WIE helps to improve the recall of WC

Problem Formulation (1)
• Notation
  • x: a given Web page
  • y: the class label of this page (the type of the Web page, for WC)
  • xi (i = 1…k): a text DOM leaf node in the page x
  • yi (i = 1…k): the class label of xi (the type of the text node, for WIE)
  • k: the number of text DOM leaf nodes in this page
• Label assignment problem for both x and x1 … xk

Problem Formulation (2)
• Given a Web page x with k text DOM nodes x1 … xk
• Let y, y1 … yk be one possible label assignment for x, x1 … xk
• The principle of Maximum A Posteriori (MAP) for the label assignment problem
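In the notation above, the MAP principle selects the joint label assignment with the highest conditional probability. The following is a reconstruction from the slide's definitions and may differ from the authors' exact typesetting:

```latex
(y^{*}, y_1^{*}, \ldots, y_k^{*})
  = \arg\max_{y,\, y_1, \ldots, y_k}
    P(y, y_1, \ldots, y_k \mid x, x_1, \ldots, x_k)
```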

The Graphical Model
• Undirected graphical model for combining WC and WIE
  • A maximal clique on x and y
  • A maximal clique on each xi and yi (k such maximal cliques)
  • A maximal clique on all label variables y, y1 … yk
January, 2009

Expressing the Conditional Probability
• Adopting the form of conditional random fields (CRFs)
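A CRF-style conditional probability over the three kinds of maximal cliques in the graphical model could be sketched as below. The feature functions f_j, g_j, h_j and the weights λ_j, μ_j, ν_j are hypothetical names introduced here for illustration; the slides do not give the authors' exact parameterization.

```latex
P(y, y_1, \ldots, y_k \mid x, x_1, \ldots, x_k)
  = \frac{1}{Z(x, x_1, \ldots, x_k)}
    \exp\Big(
      \sum_j \lambda_j f_j(x, y)
      + \sum_{i=1}^{k} \sum_j \mu_j g_j(x_i, y_i)
      + \sum_j \nu_j h_j(y, y_1, \ldots, y_k)
    \Big)

Z(x, x_1, \ldots, x_k)
  = \sum_{y', y'_1, \ldots, y'_k}
    \exp\Big(
      \sum_j \lambda_j f_j(x, y')
      + \sum_{i=1}^{k} \sum_j \mu_j g_j(x_i, y'_i)
      + \sum_j \nu_j h_j(y', y'_1, \ldots, y'_k)
    \Big)
```

The first sum corresponds to the clique on x and y, the second to the k cliques on each xi and yi, and the third to the clique on all label variables.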

Parameter Learning

Model Inference with Constrained Output (1)
• The challenge: the normalization factor in the conditional probability
  • Exact computation when the structure of the elements in the vector y is simple
  • Approximate computation otherwise (fully connected y, y1 … yk in our model)
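To see why the normalization factor is the bottleneck, the sketch below computes it by brute force for a toy model. The label sets, the score tables, and the function names are assumptions made for illustration, not the authors' implementation. The point is that the number of terms grows exponentially with k.

```python
import itertools
import math

PAGE_LABELS = ("course", "other")   # hypothetical label set for y
NODE_LABELS = ("title", "none")     # hypothetical label set for each yi


def log_potential(y, ys, page_score, node_scores, joint_score):
    """Unnormalized log-score: page clique + k node cliques + label clique."""
    total = page_score[y]
    total += sum(node_scores[i][yi] for i, yi in enumerate(ys))
    total += joint_score(y, ys)
    return total


def brute_force_z(page_score, node_scores, joint_score):
    """Sum exp(log-potential) over every joint assignment (exponential in k)."""
    k = len(node_scores)
    z, n_assignments = 0.0, 0
    for y in PAGE_LABELS:
        for ys in itertools.product(NODE_LABELS, repeat=k):
            z += math.exp(log_potential(y, ys, page_score, node_scores, joint_score))
            n_assignments += 1
    return z, n_assignments


# Toy scores, all zero, so every assignment contributes exp(0) = 1.
page_score = {"course": 0.0, "other": 0.0}
node_scores = [{"title": 0.0, "none": 0.0} for _ in range(3)]
z, n = brute_force_z(page_score, node_scores, lambda y, ys: 0.0)
print(z, n)  # 2 page labels * 2**3 node labelings = 16 terms
```

With two page labels and two node labels, a page with k text nodes needs 2 · 2^k terms, which is infeasible for real pages with hundreds of text nodes.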

Model Inference with Constrained Output (2)
• Use domain knowledge to constrain the output label space
  • A course homepage contains one and only one course title
  • A page that is not a course homepage does not contain a course title
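Under these two constraints, and assuming each text node is labeled either "title" or "none" (a simplification made here), only k + 1 joint assignments remain, so the normalization factor and the MAP assignment can be computed exactly by enumeration. A minimal sketch with hypothetical score tables and function names; soft features on the label clique are omitted:

```python
import math


def constrained_assignments(k):
    """Yield the (y, ys) pairs allowed by the two domain constraints."""
    # A course homepage contains exactly one course title.
    for t in range(k):
        yield "course", tuple("title" if i == t else "none" for i in range(k))
    # A page that is not a course homepage contains no course title.
    yield "other", tuple("none" for _ in range(k))


def map_inference(page_score, node_scores):
    """Return the best allowed assignment and its normalized probability."""
    k = len(node_scores)
    scored = []
    for y, ys in constrained_assignments(k):
        s = page_score[y] + sum(node_scores[i][yi] for i, yi in enumerate(ys))
        scored.append((s, y, ys))
    z = sum(math.exp(s) for s, _, _ in scored)  # only k + 1 terms
    best_s, best_y, best_ys = max(scored, key=lambda item: item[0])
    return best_y, best_ys, math.exp(best_s) / z


# Toy page: few course-related terms, but one node looks like a course title.
page_score = {"course": -0.5, "other": 0.5}
node_scores = [
    {"title": 2.0, "none": 0.0},
    {"title": -1.0, "none": 0.0},
    {"title": -1.0, "none": 0.0},
]
best_y, best_ys, prob = map_inference(page_score, node_scores)
print(best_y, best_ys)
```

In this toy case the strong title evidence outweighs the weak page-level evidence, so the page is labeled as a course homepage, in the spirit of the second motivating example.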

Baseline Methods
• Local training and separate inference
  • Train the two classifiers for WC and WIE separately
  • Use these two classifiers sequentially when predicting
• Local training and joint inference
  • Train the two classifiers for WC and WIE separately
  • Use these two classifiers jointly when predicting
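The difference between the two baselines lies only in prediction. The sketch below contrasts them on toy scores from two locally trained classifiers. How the joint baseline combines the scores is not stated on the slides; summing log-scores is an assumption made here, and all names are hypothetical.

```python
def separate_inference(page_score, title_scores):
    """Baseline 1: classify the page first, then extract only if it is a course page."""
    y = max(page_score, key=page_score.get)
    if y != "course":
        return y, None  # WIE never runs, so a WC error cannot be corrected
    best_title = max(range(len(title_scores)), key=lambda i: title_scores[i])
    return y, best_title


def joint_inference(page_score, title_scores):
    """Baseline 2: same local scores, but page label and title are chosen together."""
    best_title = max(range(len(title_scores)), key=lambda i: title_scores[i])
    course_total = page_score["course"] + title_scores[best_title]
    if course_total > page_score["other"]:
        return "course", best_title
    return "other", None


# Toy scores: WC slightly prefers "other", WIE is confident node 0 is a title.
page_score = {"course": -0.5, "other": 0.5}
title_scores = [2.0, -1.0, -1.0]  # per-node score for being the course title

print(separate_inference(page_score, title_scores))  # ('other', None)
print(joint_inference(page_score, title_scores))     # ('course', 0)
```

The sequential pipeline commits to the WC decision and loses the page, while joint prediction lets the extraction evidence overturn it; neither baseline shares parameters during training.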

Experimental Results

Conclusions and Discussion
• Tasks that are inherently joint, such as WC and WIE, should be addressed using a single model
• However, this increases the complexity of the statistical model
• This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumption

OfCourse
• Open search engine
• Supports interactively adding course data

Experimental Data
• Positive data: 530 course homepages
• Negative data: 1200 other web pages
