Explore combining web classification and web information extraction techniques using a probabilistic model to improve metadata extraction. Study the benefits and challenges of this approach through practical examples and graphical models. Learn about parameter learning, model inference, and experimental results. Enhance traditional methods for vertical search and data analysis.
Towards Combining Web Classification and Web Information Extraction: a Case Study • Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^ • *Hewlett-Packard Labs China • ^Institute of Computing Technology, CAS
Web Content Analysis for Vertical Search • Web Information Extraction • Extract the metadata in the target pages • Web Classification • Identify the target pages [Figure: Web pages after crawling; product pages yield product name, model number, price …; course homepages yield course title, ID, time, teacher …]
OfCourse • Search engine for online courses • More than 60,000 courses from the top 50 universities in the US
Web Classification and Web Information Extraction • WC vs. WIE • Two sequential and separate phases • Error accumulation [Figure: Web Content Analysis for Vertical Search, split into Web Classification and Web Information Extraction]
Contributions • Combine Web Classification and Web Information Extraction through a probabilistic model to achieve mutual enhancement [Figure: Web Content Analysis for Vertical Search, with Web Classification and Web Information Extraction linked]
Motivating Examples (1) • WIE oracle: no course title • Lots of course-related terms on this page • WIE helps to improve the precision of WC
Motivating Examples (2) • WIE oracle: with course title • Few course-related terms on this page • WIE helps to improve the recall of WC
Problem Formulation (1) • Notation • x, a given Web page • y, the class label of this page (indicating the type of the Web page for WC) • x_i (i = 1…k), a text DOM leaf node in the page x • y_i (i = 1…k), the class label of x_i (indicating the type of the text node for WIE) • k, the number of text DOM leaf nodes in this page • Label assignment problem for both x and x_1 … x_k
Problem Formulation (2) • Given a Web page x with k text DOM nodes x_1 … x_k • Let y, y_1 … y_k be one possible label assignment for x, x_1 … x_k • Apply the Maximum A Posteriori (MAP) principle to the label assignment problem
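The equation from this slide did not survive extraction. Using only the notation defined above, the MAP principle for the joint label assignment can be written as follows; this is the standard form of the principle, not necessarily the exact typesetting used on the original slide.

```latex
\[
(y^{*}, y_{1}^{*}, \dots, y_{k}^{*})
  \;=\; \operatorname*{arg\,max}_{y,\, y_{1}, \dots, y_{k}}
  P\bigl(y, y_{1}, \dots, y_{k} \,\big|\, x, x_{1}, \dots, x_{k}\bigr)
\]
```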
The Graphical Model • Undirected graphical model for combining WC and WIE • One maximal clique on x and y • One maximal clique on each pair x_i and y_i (k such maximal cliques) • One maximal clique on all label variables y, y_1 … y_k January, 2009
Expressing the Conditional Probability • Adopting the form of CRFs
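The formula on this slide is missing from the transcript. A plausible sketch, assuming one set of feature functions per clique type from the graphical model, is given below. The feature functions f_j, g_j, h_j and the weights λ_j, μ_j, ν_j are hypothetical names introduced here for illustration; the source does not state them.

```latex
\[
P\bigl(y, y_{1}, \dots, y_{k} \,\big|\, x, x_{1}, \dots, x_{k}\bigr)
  = \frac{1}{Z}
    \exp\Bigl(
      \sum_{j} \lambda_{j}\, f_{j}(x, y)
      + \sum_{i=1}^{k} \sum_{j} \mu_{j}\, g_{j}(x_{i}, y_{i})
      + \sum_{j} \nu_{j}\, h_{j}(y, y_{1}, \dots, y_{k})
    \Bigr)
\]
\[
Z = \sum_{y',\, y'_{1}, \dots, y'_{k}}
    \exp\Bigl(
      \sum_{j} \lambda_{j}\, f_{j}(x, y')
      + \sum_{i=1}^{k} \sum_{j} \mu_{j}\, g_{j}(x_{i}, y'_{i})
      + \sum_{j} \nu_{j}\, h_{j}(y', y'_{1}, \dots, y'_{k})
    \Bigr)
\]
```

The three sums correspond to the clique on x and y, the k cliques on each x_i and y_i, and the clique on all label variables. The normalization factor Z ranges over every label assignment, which is what makes inference costly when the label variables are fully connected.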
Model Inference with Constrained Output (1) • The challenge: computing the normalization factor in the conditional probability • Exact computation is feasible when the structure among the elements of the label vector is simple • Approximate computation is needed otherwise (y, y_1 … y_k are fully connected in our model)
Model Inference with Constrained Output (2) • Use the domain knowledge to constrain the output label space • A course homepage contains one and only one course title • A non-course homepage does not contain a course title
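To see why the two constraints make exact inference tractable, consider the simplified case where each text node is labeled either "title" or "other". The constraints then leave only k + 1 valid assignments (one for a non-course page, plus one per candidate title node), so the normalization factor can be summed directly. The sketch below assumes binary node labels and additive log-potentials; the function names and score layout are hypothetical, and the clique over all labels is represented only through the hard constraints.

```python
import math


def constrained_assignments(k):
    """Yield every (page_label, title_index) pair allowed by the constraints."""
    yield (0, None)          # non-course page: no node is a course title
    for i in range(k):
        yield (1, i)         # course page: exactly node i is the course title


def map_inference(page_score, node_scores):
    """Exact MAP inference over the constrained label space.

    page_score[y]     : log-potential of page label y (0 = other, 1 = course)
    node_scores[i][t] : log-potential of node i with label t (0 = other, 1 = title)
    Returns the best (page_label, title_index) and its posterior probability.
    """
    k = len(node_scores)
    # Score of labeling every node as "other"; reused by each assignment.
    all_other = sum(s[0] for s in node_scores)

    scored = []
    for y, title in constrained_assignments(k):
        total = page_score[y] + all_other
        if title is not None:
            # Swap node `title` from "other" to "title".
            total += node_scores[title][1] - node_scores[title][0]
        scored.append(((y, title), total))

    # Normalization factor over the k + 1 valid assignments (log-sum-exp).
    top = max(score for _, score in scored)
    log_z = top + math.log(sum(math.exp(score - top) for _, score in scored))

    best, best_score = max(scored, key=lambda pair: pair[1])
    return best, math.exp(best_score - log_z)
```

For example, `map_inference([0.0, 1.0], [[0.0, -1.0], [0.0, 2.0], [0.0, -0.5]])` evaluates four assignments in total and selects the page as a course homepage with the second node as its title.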
Baseline Methods • Local training and separate inference • Train the two classifiers for WC and WIE separately • Use these two classifiers sequentially when predicting • Local training and joint inference • Train the two classifiers for WC and WIE separately • Use these two classifiers jointly when predicting
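The slides do not spell out how the two baselines combine the locally trained classifiers. The sketch below is one possible reading, assuming each classifier outputs a probability: separate inference runs WC first and WIE only on accepted pages, while joint inference picks the highest-scoring assignment that satisfies the course-title constraints. All function names, the 0.5 threshold, and the independence assumption behind the product of probabilities are illustrative choices, not the authors' stated method.

```python
import math


def separate_inference(p_course, p_title, threshold=0.5):
    """Sequential baseline: classify the page, then extract only if accepted.

    p_course : probability from the WC classifier that the page is a course page
    p_title  : per-node probabilities from the WIE classifier of being the title
    """
    if p_course < threshold or not p_title:
        return (0, None)
    best_node = max(range(len(p_title)), key=lambda i: p_title[i])
    return (1, best_node)


def joint_inference(p_course, p_title):
    """Joint baseline: maximize the combined log-probability under the constraints."""
    eps = 1e-12

    def safe_log(p):
        # Clamp to avoid log(0) for overconfident classifiers.
        return math.log(min(max(p, eps), 1.0 - eps))

    # Log-probability that no node is a title.
    no_title = sum(safe_log(1.0 - p) for p in p_title)

    best, best_score = (0, None), safe_log(1.0 - p_course) + no_title
    for i, p in enumerate(p_title):
        # Course page with node i as the single title.
        score = safe_log(p_course) + no_title - safe_log(1.0 - p) + safe_log(p)
        if score > best_score:
            best, best_score = (1, i), score
    return best
```

With `p_course = 0.4` and `p_title = [0.05, 0.95, 0.1]`, the sequential baseline rejects the page before extraction runs, whereas the joint baseline lets the confident title prediction recover it, mirroring the second motivating example.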
Conclusions and Discussion • Tasks that are inherently joint, such as WC and WIE, should be addressed using only one model • However, this definitely increases the complexity of the statistical model • This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumption
OfCourse • Open search engine • Supports interactive addition of course data
Experimental Data • Positive data • 530 course homepages • Negative data • 1200 other web pages