An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites Tak-Lam Wong Dept. of Computer Science and Engineering The Chinese University of Hong Kong Wai Lam, Tik-Shun Wong Dept. of Systems Engineering and Engineering Management The Chinese University of Hong Kong @ SIGIR 2008 – Singapore
Presentation Outline • Introduction • Problem Definition • Our Model • Inference Method • Experimental Results • Conclusions
Motivation (Source: http://www.crayeon3.com) (Source: http://www.superwarehouse.com)
Information Extraction • To extract product attributes: • Prior knowledge about content, e.g., effective sensor resolution • Layout format, e.g., white balance, shutter speed • Mutual influence, e.g., light sensitivity
Attribute Normalization • Samples of extracted text fragments from a page: • “Cloudy, daylight, etc.” • What do they refer to? • A text fragment extracted from another page: • “white balance auto, daylight, cloudy, tungsten, …” • Attribute normalization: • To cluster text fragments that refer to the same attribute into the same group • Better indexing for product search • Easier understanding and interpretation
Existing Works • Supervised wrapper induction • Needs training examples. • A wrapper learned from one Web site cannot be applied to other sites. • Template-independent extraction (Zhu et al., 2007) • Cannot handle previously unseen attributes. • Unsupervised wrapper learning (Crescenzi et al., 2001) • Extracted data are not normalized.
Contributions • An unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites. • The framework considers both page-independent content information and page-dependent layout information. • It can extract an unlimited number of product attributes (via a Dirichlet process). • It can visualize the semantic meaning of each product attribute.
Problem Definition (1) • A product domain • E.g., the digital camera domain • A set of reference attributes • E.g., “resolution”, “white balance”, etc. • A special element representing “not-an-attribute” • A collection of Web pages from any Web sites, each of which contains a single product • We consider any text fragment from a Web page
Problem Definition (2) Line separator <TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR> <TR> Line separator
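The slide suggests that row tags such as <TR> act as line separators between text fragments. A minimal sketch of that idea, assuming Python's standard `html.parser`; the class name `FragmentSplitter`, the function `split_fragments`, and the choice of `tr` as the only separator are hypothetical and not taken from the presentation:

```python
from html.parser import HTMLParser

# Assumption: only <TR> is treated as a line separator, as on the slide.
LINE_SEPARATORS = {"tr"}


class FragmentSplitter(HTMLParser):
    """Collects the text found between line-separator tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []  # finished fragments, each a list of text pieces
        self._current = []   # text pieces seen since the last separator

    def _flush(self):
        # Close the fragment being built, if it contains any text.
        if self._current:
            self.fragments.append(self._current)
            self._current = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self._current.append(text)


def split_fragments(html):
    """Return the text fragments of `html`, split at line separators."""
    parser = FragmentSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last separator
    return parser.fragments
```

Each fragment keeps its text pieces in document order, so an attribute name cell and its value cell from the same table row stay together.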
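Problem Definition (3) • Information extraction: infer the target information of a text fragment from its layout information and content information • Attribute normalization: infer the attribute information of a text fragment from its layout information and content information • Joint attribute extraction and normalization: infer the attribute information and the target information together from the layout information and content information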
Problem Definition (4) • “White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom” • T=1 • A=“white balance” • “Cloudy, daylight” • T=1 • A=“white balance” • “View larger image” • T=0 • A=“not-an-attribute”
Our Model • Graphical model (figure): a Dirichlet process prior (infinite mixture model), with plates for S different Web sites and N text fragments; labeled parts include the k-th component proportion, content information generation, and target information generation
Generation Process • The model defines the joint probability of generating a particular text fragment given the model parameters • Inference: • Exact inference is intractable
Variational Method (1) • Exact inference is intractable • Our goal: design a tractable distribution that is as close as possible to the intractable one • Closeness is measured by the KL divergence
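The formulas on this slide did not survive, so the following is only the standard textbook form of the variational objective, not necessarily the notation used in the presentation. The symbols are assumptions: X stands for the observed text fragments, Z for all hidden variables, p(Z | X) for the intractable posterior, and q(Z) for the tractable distribution.

```latex
\mathrm{KL}\bigl(q \,\|\, p\bigr)
  = \mathbb{E}_{q(Z)}\!\left[\log \frac{q(Z)}{p(Z \mid X)}\right]
  = \log p(X) - \mathcal{L}(q),
\qquad
\mathcal{L}(q) = \mathbb{E}_{q(Z)}\bigl[\log p(X, Z)\bigr]
               - \mathbb{E}_{q(Z)}\bigl[\log q(Z)\bigr]
```

Because log p(X) does not depend on q, minimizing the KL divergence is equivalent to maximizing the lower bound L(q), which only involves expectations under the tractable distribution.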
Variational Method (2) • Truncated stick-breaking process (Ishwaran and James, 2001) • Replace infinity with a truncation level K • Maximize the variational objective under this truncation
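To illustrate what truncation at level K means, here is a minimal sketch of drawing mixture proportions from a truncated stick-breaking process. The function name `truncated_stick_breaking` and the concentration parameter name `alpha` are assumptions; the slide names only the truncation level K.

```python
import random


def truncated_stick_breaking(alpha, K, rng=None):
    """Draw K mixture proportions from a stick-breaking process truncated at K."""
    rng = rng or random.Random()
    proportions = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(K):
        # The last break takes the whole remainder; this is what truncating
        # at level K means, and it makes the proportions sum to one.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        proportions.append(v * remaining)
        remaining *= 1.0 - v
    return proportions
```

A small `alpha` tends to put most of the mass on the first few components, while a larger `alpha` spreads it over more of the K components.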
Variational Method (3) • One important variational parameter: • How likely is it that a text fragment comes from the k-th component? • Attribute normalization!
Variational Method (4) • Another important variational parameter: • How likely is it that a text fragment should be extracted? • Attribute extraction!
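The two variational parameters described above could be turned into decisions roughly as follows. This is a sketch under stated assumptions, not the authors' procedure: the names `phi` (per-fragment component probabilities) and `omega` (per-fragment extraction probabilities), the function `read_out`, and the 0.5 threshold are all hypothetical.

```python
def read_out(phi, omega, threshold=0.5):
    """Map variational parameters to per-fragment decisions.

    phi[n][k]  : assumed probability that fragment n comes from component k
    omega[n]   : assumed probability that fragment n should be extracted
    Returns, for each fragment, the index of its most likely component,
    or None when the fragment is judged not worth extracting.
    """
    decisions = []
    for responsibilities, p_extract in zip(phi, omega):
        if p_extract < threshold:
            decisions.append(None)  # treated as "not-an-attribute"
        else:
            best = max(range(len(responsibilities)),
                       key=responsibilities.__getitem__)
            decisions.append(best)  # normalized attribute = component index
    return decisions
```

Fragments that share a component index end up in the same normalized attribute group.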
Unsupervised Approach • What should be extracted? • Make use of prior knowledge about a domain. • Only a few terms about the product attributes are needed • E.g., resolution, light sensitivity, shutter speed, etc. • They can be easily obtained, for example, by just highlighting the attributes on a Web page • Initialization
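The slide does not say how the few known attribute terms enter the initialization. One simple possibility, shown purely as an assumption, is to mark a fragment as an initial extraction candidate when it mentions any of the terms; the function name `initial_target_guess` is hypothetical.

```python
def initial_target_guess(fragments, seed_terms):
    """Return 1 for fragments that mention a seed term, else 0.

    This is only an illustrative starting point for the target information;
    the matching rule (case-insensitive substring) is an assumption.
    """
    seeds = [term.lower() for term in seed_terms]
    return [
        1 if any(seed in fragment.lower() for seed in seeds) else 0
        for fragment in fragments
    ]
```

For example, with the seed terms "white balance" and "shutter speed", a fragment such as "View larger image" would start with a guess of 0.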
Experiments • We conducted experiments on four different domains: • Digital camera: 85 Web pages from 41 different sites • MP3 player: 96 Web pages from 62 different sites • Camcorder: 111 Web pages from 61 different sites • Restaurant: 29 Web pages from LA-Weekly Restaurant Guide • In each domain, we conducted 10 runs of experiments. • In each run, we randomly selected a Web page and used the attributes inside it as prior knowledge.
Evaluation on Attribute Normalization • Baseline approach: • Agglomerative clustering • Edit distance between text fragments • Evaluation metrics: • Pairwise recall (R) • Pairwise precision (P) • Pairwise F1-measure (F)
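The slide names the pairwise metrics without defining them. The sketch below uses the common definition over pairs of items (a pair counts as correct when both clusterings put its two items together); whether the authors used exactly this definition is an assumption, and the function name `pairwise_scores` is hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 between two clusterings.

    `predicted` and `gold` map each item to a cluster label and must
    cover the same items.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(gold), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1  # pair grouped together in both clusterings
        elif same_pred:
            fp += 1  # grouped together only by the prediction
        elif same_gold:
            fn += 1  # grouped together only in the gold standard
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because the scores depend only on which items share a cluster, the cluster labels themselves do not need to match between the prediction and the gold standard.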
Results of Attribute Normalization
Visualize the Normalized Attributes • The top five weighted terms in each of the ten largest normalized attributes in the digital camera domain
Evaluation on Attribute Extraction • Surprisingly, in the restaurant domain, our framework achieves an F1-measure of 0.95, which is comparable to the supervised method (Muslea et al., 2001).
Conclusions (1) • We aim at simultaneously extracting and normalizing product attributes from Web pages collected from different sites. • Our method considers both page-independent content information and page-dependent layout information. • We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages.
Conclusions (2) • We designed an unsupervised inference algorithm based on the variational method. • We formally show that content and layout information can collaborate and improve both extraction and normalization performance under our model. • Experiments on four different domains show the robustness and effectiveness of our approach.