
Towards Combining Web Classification and Web Information Extraction: a Case Study

Explore combining web classification and web information extraction techniques using a probabilistic model to improve metadata extraction. Study the benefits and challenges of this approach through practical examples and graphical models. Learn about parameter learning, model inference, and experimental results. Enhance traditional methods for vertical search and data analysis.


Presentation Transcript


  1. Towards Combining Web Classification and Web Information Extraction: a Case Study Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^ *Hewlett-Packard Labs China ^Institute of Computing Technology, CAS

  2. Web Content Analysis for Vertical Search • Web Classification • Identify the target pages among the Web pages after crawling (e.g., product pages, course homepages) • Web Information Extraction • Extract the metadata in the target pages (e.g., product name, model number, price … for product pages; course title, ID, time, teacher … for course homepages)

  3. OfCourse • Search engine for online courses • More than 60,000 courses from the top 50 universities in the US

  4. Web Classification and Web Information Extraction • WC vs. WIE • Two sequential and separate phases: Web Classification first, then Web Information Extraction • Error accumulation

  5. Contributions • Combine Web Classification and Web Information Extraction by a probabilistic model to achieve mutual enhancement

  6. Motivating Examples (1) • No Course Title WIE Oracle • Lots of course-related terms on this page • WIE helps to improve the precision of WC

  7. Motivating Examples (2) • No Course Title WIE Oracle • With Course Title • Few course-related terms on this page • WIE helps to improve the recall of WC

  8. Problem Formulation (1) • Notation • x, a given Web page • y, the class label of this page (indicating the type of the Web page for WC) • xi (i = 1…k), a text DOM leaf node in the page x • yi (i = 1…k), the class label of xi (indicating the type of the text node for WIE) • k, the number of text DOM leaf nodes in this page • Label assignment problem for both x and x1 … xk

  9. Problem Formulation (2) • Given a Web page x with k text DOM nodes x1 … xk • Let y,y1…yk be one possible label assignment for x,x1…xk • The principle of Maximum A Posteriori for the label assignment problem
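The slide's formula did not survive in this transcript. As a hedged reconstruction, the standard Maximum A Posteriori rule written in the notation of slides 8 and 9 would read as follows; the starred symbols are introduced here only to denote the chosen assignment.

```latex
\[
(y^{*}, y_{1}^{*}, \dots, y_{k}^{*})
  \;=\; \arg\max_{y,\,y_{1},\dots,y_{k}}
  P\left(y, y_{1}, \dots, y_{k} \,\middle|\, x, x_{1}, \dots, x_{k}\right)
\]
```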

  10. The Graphical Model • Undirected graphical model for combining WC and WIE

  11. The Graphical Model • Undirected graphical model for combining WC and WIE maximal clique on x and y

  12. The Graphical Model • Undirected graphical model for combining WC and WIE maximal clique on each xi and yi, k such maximal cliques

  13. The Graphical Model • Undirected graphical model for combining WC and WIE maximal clique on all label variables y,y1…yk January, 2009

  14. Expressing the Conditional Probability • Adopting the form of CRFs
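The formula itself is missing from this transcript. A sketch of what a CRF-style conditional probability could look like for the three kinds of maximal cliques named on slides 11 to 13 is given below. The feature functions f, g, h and the weights λ, μ, ν are assumed names for illustration and are not taken from the slides.

```latex
\[
P\left(y, y_{1}, \dots, y_{k} \,\middle|\, x, x_{1}, \dots, x_{k}\right)
  = \frac{1}{Z}\,
    \exp\Bigl(
      \sum_{j} \lambda_{j}\, f_{j}(x, y)
      + \sum_{i=1}^{k} \sum_{j} \mu_{j}\, g_{j}(x_{i}, y_{i})
      + \sum_{j} \nu_{j}\, h_{j}(y, y_{1}, \dots, y_{k})
    \Bigr)
\]
% Z sums the same exponential over every label assignment (y', y'_1, ..., y'_k).
\[
Z = \sum_{y',\,y'_{1},\dots,y'_{k}}
    \exp\Bigl(
      \sum_{j} \lambda_{j}\, f_{j}(x, y')
      + \sum_{i=1}^{k} \sum_{j} \mu_{j}\, g_{j}(x_{i}, y'_{i})
      + \sum_{j} \nu_{j}\, h_{j}(y', y'_{1}, \dots, y'_{k})
    \Bigr)
\]
```

The first sum corresponds to the clique on x and y, the second to the k cliques on each xi and yi, and the third to the clique on all label variables y, y1…yk.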

  15. Parameter Learning

  16. Model Inference with Constrained Output (1) • The challenge: the normalization factor in the conditional probability • Exact computation when the structure of the elements in the vector y is simple • Approximate computation otherwise (fully connected y,y1…yk in our model)

  17. Model Inference with Constrained Output (2) • Use the domain knowledge to constrain the output label space • A course homepage contains one and only one course title • A non-course page does not contain a course title
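To illustrate why the two constraints make inference tractable, here is a minimal Python sketch. It assumes binary node labels (1 = course title, 0 = other), which is a simplification not stated on the slides, and the score arguments are hypothetical log-potentials. Under these assumptions only k + 1 assignments remain, so the normalization factor can be summed exactly.

```python
import math

def constrained_assignments(k):
    """Yield every (page_label, node_labels) pair allowed by the two constraints."""
    # Non-course page (label 0): no node may be a course title.
    yield 0, tuple([0] * k)
    # Course homepage (label 1): exactly one node is the course title.
    for i in range(k):
        yield 1, tuple(1 if j == i else 0 for j in range(k))

def map_inference(page_score, node_scores, joint_score):
    """Return the best allowed assignment and its exact probability.

    page_score[y], node_scores[i][y_i] and joint_score(y, labels) are
    hypothetical log-potentials standing in for the model's clique scores.
    """
    k = len(node_scores)
    scored = []
    for y, labels in constrained_assignments(k):
        s = page_score[y] + joint_score(y, labels)
        s += sum(node_scores[i][labels[i]] for i in range(k))
        scored.append((s, y, labels))
    # Exact log-normalizer over the k + 1 allowed assignments (log-sum-exp).
    top = max(s for s, _, _ in scored)
    log_z = top + math.log(sum(math.exp(s - top) for s, _, _ in scored))
    best_s, best_y, best_labels = max(scored, key=lambda t: t[0])
    return best_y, best_labels, math.exp(best_s - log_z)

# Made-up scores: the second text node looks strongly like a course title.
y, labels, prob = map_inference(
    page_score=[0.0, 0.5],
    node_scores=[[0.0, -1.0], [0.0, 2.0], [0.0, -0.5]],
    joint_score=lambda y, labels: 0.0,
)
```

Without the constraints, the same enumeration would cover 2^(k+1) assignments under the binary-label assumption, which is what makes the unconstrained normalization factor expensive.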

  18. Baseline Methods • Local training and separate inference • Train the two classifiers for WC and WIE respectively • Use these two classifiers sequentially when predicting • Local training and joint inference • Train the two classifiers for WC and WIE respectively • Use these two classifiers jointly when predicting
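The difference between the two baselines can be sketched in Python as follows. This is only one plausible reading: the slides do not specify how the two locally trained classifiers are combined, so the log-odds inputs, the zero threshold, and the additive joint score are all assumptions made for illustration.

```python
def separate_inference(page_logit, title_logits):
    """Sequential baseline: decide the page class first, then extract.

    page_logit and title_logits are hypothetical log-odds produced by the
    locally trained WC and WIE classifiers; at least one text node is assumed.
    """
    if page_logit <= 0.0:
        # WC rejects the page, so WIE never runs and cannot correct the miss.
        return 0, None
    best = max(range(len(title_logits)), key=lambda i: title_logits[i])
    return 1, best

def joint_inference(page_logit, title_logits):
    """Joint baseline (assumed form): score page label and title together."""
    best = max(range(len(title_logits)), key=lambda i: title_logits[i])
    # Log-odds of "course homepage with title at node best" relative to
    # "non-course page with no title", whose score is taken as 0.
    score_course = page_logit + title_logits[best]
    return (1, best) if score_course > 0.0 else (0, None)

# A page with few course-related terms but one clear course title.
sequential = separate_inference(-0.3, [-2.0, 3.0])
joint = joint_inference(-0.3, [-2.0, 3.0])
```

In this made-up case the sequential pipeline rejects the page while the joint decision accepts it, which mirrors the second motivating example, where WIE evidence improves the recall of WC.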

  19. Experimental Results

  20. Conclusions and Discussion • Tasks that are inherently joint should be addressed using only one model • WC and WIE • However, this increases the complexity of the statistical model • This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumption

  21. OfCourse • Open search engine • Supports interactive addition of course data

  22. Experimental Data • Positive data • 530 course homepages • Negative data • 1200 other web pages
