An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites

Tak-Lam Wong, Dept. of Computer Science and Engineering, The Chinese University of Hong Kong; Wai Lam and Tik-Shun Wong, Dept. of Systems Engineering and Engineering Management


Presentation Transcript


  1. An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites Tak-Lam Wong Dept. of Computer Science and Engineering The Chinese University of Hong Kong Wai Lam, Tik-Shun Wong Dept. of Systems Engineering and Engineering Management The Chinese University of Hong Kong @ SIGIR 2008 – Singapore

  2. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  3. Motivation (Source: http://www.crayeon3.com) (Source: http://www.superwarehouse.com)

  4. Information Extraction • To extract product attributes: • Prior knowledge about content • Effective sensor resolution • Layout format • White balance, shutter speed • Mutual influence • Light sensitivity

  5. Attribute Normalization • Samples of extracted text fragments from a page: • Cloudy, daylight, etc. • What do they refer to? • A text fragment extracted from another page: • white balance auto, daylight, cloudy, tungsten, … • Attribute normalization: • To cluster text fragments that refer to the same attribute into the same group • Better indexing for product search • Easier understanding and interpretation

  6. Existing Works • Supervised wrapper induction • It needs training examples. • The wrapper learned from one Web site cannot be applied to other sites. • Template-independent extraction (Zhu et al., 2007) • It cannot handle previously unseen attributes. • Unsupervised wrapper learning (Crescenzi et al., 2001) • Extracted data are not normalized.

  7. Contributions • An unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites. • Our framework considers page-independent content information and page-dependent layout information. • It can extract an unlimited number of product attributes (Dirichlet process). • It can visualize the semantic meaning of each product attribute.

  8. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  9. Problem Definition (1) • A product domain • E.g., the digital camera domain • A set of reference attributes • E.g., “resolution”, “white balance”, etc. • A special element representing “not-an-attribute” • A collection of Web pages from any Web sites, each of which contains a single product • A text fragment: any fragment of text from a Web page

  10. Problem Definition (2) Line separator <TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR> <TR> Line separator
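A minimal sketch of how a table row like the one above could be split into text fragments, each paired with the tag path that encloses it. This is an illustration only, not the authors' preprocessing: the class `FragmentParser`, the function `extract_fragments`, and the use of the tag path as a stand-in for layout information are all assumptions.

```python
from html.parser import HTMLParser


class FragmentParser(HTMLParser):
    """Collect text fragments together with the tag path that encloses them."""

    def __init__(self):
        super().__init__()
        self.path = []        # stack of currently open tags (lowercased by HTMLParser)
        self.fragments = []   # list of (text, tag_path) pairs

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.fragments.append((text, "/".join(self.path)))


def extract_fragments(html):
    """Return (text, tag_path) pairs for every non-empty text node."""
    parser = FragmentParser()
    parser.feed(html)
    parser.close()
    return parser.fragments


# Hypothetical input modelled on the slide's example row.
ROW = ("<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> "
       "<TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, "
       "fluorescent H, custom </SPAN> </P> </TD> </TR>")

for text, path in extract_fragments(ROW):
    print(path, "->", text)
```

Both fragments share the same tag path here, which is the kind of page-dependent regularity that layout information is meant to capture.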

  11. Problem Definition (3) • Information extraction: • Attribute normalization: • Joint attribute extraction and normalization: Attribute information Target information Layout information Content information
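One way to write the three tasks as conditional probabilities is sketched below. The notation is an assumption: X for a text fragment, C(X) and L(X) for its content and layout information, and T(X) and A(X) for its target and attribute information, matching the T and A values used in the example of Problem Definition (4). The exact form used by the authors may differ.

```latex
% Assumed notation (not taken from the slides):
%   X    : a text fragment
%   C(X) : content information,  L(X) : layout information
%   T(X) : target information (1 = should be extracted, 0 = otherwise)
%   A(X) : attribute information (a reference attribute or "not-an-attribute")
\begin{align*}
\text{Information extraction:} \quad
  & P\big(T(X) \mid C(X), L(X)\big) \\
\text{Attribute normalization:} \quad
  & P\big(A(X) \mid C(X), L(X)\big) \\
\text{Joint extraction and normalization:} \quad
  & P\big(T(X), A(X) \mid C(X), L(X)\big)
\end{align*}
```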

  12. Problem Definition (4) • White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom • T=1 • A=“white balance” • “Cloudy, daylight” • T=1 • A=“white balance” • View larger image • T=0 • A=“not-an-attribute”

  13. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  14. Our Model • Dirichlet process prior (infinite mixture model) • [Graphical model figure; labels: S different Web sites, N text fragments, k-th component proportion, content information generation, target information generation]

  15. Generation Process

  16. Generation Process • The joint probability for generating a particular text fragment given the model parameters • Inference: • Intractable

  17. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  18. Variational Method (1) • Finding the exact posterior distribution is intractable • Our goal: design a tractable distribution that is as close to the true posterior as possible. • KL divergence:
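For reference, the standard definition of the KL divergence used in variational inference, and its relation to the log-likelihood, is given below. The symbols x (observed text fragments), z (all latent variables), P, and Q are generic placeholders assumed here; they are not necessarily the symbols used in the original slides.

```latex
% Assumed generic notation:
%   x : observed data,  z : latent variables
%   P(z | x) : true (intractable) posterior
%   Q(z)     : tractable approximating distribution
\begin{align*}
\mathrm{KL}\big(Q \,\|\, P\big)
  &= \mathbb{E}_{Q}\!\left[\log \frac{Q(z)}{P(z \mid x)}\right] \\
\log P(x)
  &= \underbrace{\mathbb{E}_{Q}\big[\log P(x, z)\big]
     - \mathbb{E}_{Q}\big[\log Q(z)\big]}_{\text{lower bound}}
     + \mathrm{KL}\big(Q \,\|\, P\big)
\end{align*}
```

Because log P(x) does not depend on Q, minimizing the KL divergence over Q is equivalent to maximizing the lower bound.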

  19. Variational Method (2) • Truncated stick-breaking process (Ishwaran and James, 2001) • Replace infinity with a truncation level K • Max:
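A minimal sketch of the truncated stick-breaking construction named above, assuming the usual Dirichlet process setup in which each break fraction is drawn from Beta(1, alpha). The function name `truncated_stick_breaking` and the concentration parameter `alpha` are illustrative choices, not values from the slides.

```python
import random


def truncated_stick_breaking(alpha, K, rng=None):
    """Draw K mixture proportions from a stick-breaking process truncated at K.

    Each break fraction v_k is drawn from Beta(1, alpha); the last fraction
    is fixed to 1 so that the K proportions sum to one.
    """
    rng = rng or random.Random()
    remaining = 1.0   # length of the stick not yet assigned
    weights = []
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


# Example draw with a hypothetical concentration parameter and truncation level.
proportions = truncated_stick_breaking(alpha=1.0, K=10, rng=random.Random(0))
print([round(p, 3) for p in proportions])
```

Fixing the last break fraction to 1 is what turns the infinite mixture into a finite one with K components while keeping the proportions a valid distribution.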

  20. Variational Method (3) • One important variational parameter: • How likely is it that a text fragment comes from the k-th component? • Attribute normalization!

  21. Variational Method (4) • Another important variational parameter: • How likely is it that a text fragment should be extracted? • Attribute extraction!

  22. Unsupervised Approach • What should be extracted? • Make use of prior knowledge about a domain. • Only a few terms about the product attributes are needed • E.g., resolution, light sensitivity, shutter speed, etc. • They can be easily obtained, for example, by just highlighting the attributes on a Web page • Initialization

  23. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  24. Experiments • We have conducted experiments on four different domains: • Digital camera: 85 Web pages from 41 different sites • MP3 player: 96 Web pages from 62 different sites • Camcorder: 111 Web pages from 61 different sites • Restaurant: 29 Web pages from LA-Weekly Restaurant Guide • In each domain, we conducted 10 runs of experiments. • In each run, we randomly selected a Web page and used the attributes inside it as prior knowledge.

  25. Evaluation on Attribute Normalization • Baseline approach: • Agglomerative clustering • Edit distance between text fragments • Evaluation metrics: • Pairwise recall (R) • Pairwise precision (P) • Pairwise F1-measure (F)
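The pairwise metrics listed above can be sketched as follows, assuming the common definition in which every pair of text fragments is counted as correct when it is placed in the same cluster by both the system and the gold standard. The slide does not spell out the definition, so this is an assumption; the function name `pairwise_prf` and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, and F1 for a clustering.

    predicted, gold: dicts mapping each item to a cluster label,
    defined over the same set of items.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(gold), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1      # pair correctly grouped together
        elif same_pred:
            fp += 1      # grouped together but should be apart
        elif same_gold:
            fn += 1      # kept apart but should be together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with four hypothetical text fragments.
gold = {"f1": "white balance", "f2": "white balance",
        "f3": "resolution", "f4": "resolution"}
predicted = {"f1": 0, "f2": 0, "f3": 0, "f4": 1}
print(pairwise_prf(predicted, gold))
```

In the toy example one pair is correctly grouped, two pairs are wrongly grouped, and one pair is wrongly split, giving precision 1/3, recall 1/2, and F1 0.4.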

  26. Results of Attribute Normalization

  27. Visualize the Normalized Attributes • The top five weighted terms in the ten largest normalized attributes in the digital camera domain:

  28. Evaluation on Attribute Extraction • Surprisingly, in the restaurant domain, our framework achieves an F1-measure of 0.95, which is comparable to the supervised method (Muslea et al., 2001)

  29. Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions

  30. Conclusions (1) • We aim at simultaneously extracting and normalizing product attributes from Web pages collected from different sites. • Our method considers page-independent content information and page-dependent layout information. • We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages.

  31. Conclusions (2) • An unsupervised inference algorithm based on the variational method is designed. • We formally show that content and layout information can collaborate and improve both extraction and normalization performance under our model. • Experiments on four different domains have been conducted to show the robustness and effectiveness of our approach.

  32. Questions and Answers
