An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites

Tak-Lam Wong
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong
Wai Lam, Tik-Shun Wong
Dept. of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
@ SIGIR 2008 - Singapore

Presentation Outline
• Introduction
• Problem Definition
• Our Model
• Inference Method
• Experimental Results
• Conclusions

Motivation
(Source: http://www.crayeon3.com)
(Source: http://www.superwarehouse.com)

Information Extraction
• To extract product attributes:
• Prior knowledge about content
• Effective sensor resolution
• Layout format
• White balance, shutter speed
• Mutual influence
• Light sensitivity

Attribute Normalization
• Samples of extracted text fragments from a page:
  • Cloudy, daylight, etc.
  • What do they refer to?
• A text fragment extracted from another page:
  • white balance auto, daylight, cloudy, tungsten, …
• Attribute normalization:
  • To cluster text fragments that refer to the same attribute into the same group
  • Better indexing for product search
  • Easier understanding and interpretation

Existing Work
• Supervised wrapper induction
  • Requires training examples.
  • A wrapper learned from one Web site cannot be applied to other sites.
• Template-independent extraction (Zhu et al., 2007)
  • Cannot handle previously unseen attributes.
• Unsupervised wrapper learning (Crescenzi et al., 2001)
  • Extracted data are not normalized.

Contributions
• An unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites.
• Our framework considers page-independent content information and page-dependent layout information.
• Can extract an unlimited number of product attributes (Dirichlet process)
• Can visualize the semantic meaning of each product attribute

Problem Definition (1)
• A product domain
  • E.g., the digital camera domain
• A set of reference attributes
  • E.g., “resolution”, “white balance”, etc.
• A special element representing “not-an-attribute”
• A collection of Web pages from any Web sites, each of which contains a single product
• Any text fragment from a Web page

Problem Definition (2)
<TR> (line separator)
<TD> White balance </TD>
<TD> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </TD>
</TR>
<TR> (line separator)
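The line-separator idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the set of separator tags, the `FragmentSplitter` class, and the `split_fragments` helper are all assumptions (the slide itself only shows `<TR>`).

```python
from html.parser import HTMLParser

# Assumption: these tags act as line separators; the slide only shows <TR>.
LINE_SEPARATORS = {"tr", "br", "p", "div", "li"}


class FragmentSplitter(HTMLParser):
    """Collects the text found between consecutive line-separator tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._buffer = []

    def _flush(self):
        # Collapse whitespace and keep only non-empty fragments.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.fragments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def split_fragments(html):
    """Hypothetical helper: return the text fragments of an HTML snippet."""
    parser = FragmentSplitter()
    parser.feed(html)
    parser.close()
    return parser.fragments


html = """
<TR> <TD> White balance </TD>
<TD> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </TD> </TR>
<TR> <TD> View larger image </TD> </TR>
"""
fragments = split_fragments(html)
for fragment in fragments:
    print(fragment)
```

Tags such as `<TD>` are deliberately not treated as separators here, so the attribute name and its values stay in one fragment, as in the slide's example.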

Problem Definition (3)
• Information extraction:
• Attribute normalization:
• Joint attribute extraction and normalization:
(Formula annotations: attribute information, target information, layout information, content information)

Problem Definition (4)
• “White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom”
  • T=1
  • A=“white balance”
• “Cloudy, daylight”
  • T=1
  • A=“white balance”
• “View larger image”
  • T=0
  • A=“not-an-attribute”

Our Model
[Graphical model. Labels: Dirichlet process prior (infinite mixture model); S: different Web sites; N: text fragments; k-th component proportion; content info. generation; target info. generation]

Generation Process

Generation Process
• The joint probability for generating a particular text fragment given the model parameters:
• Inference:
  • Intractable

Variational Method (1)
• Computing the exact posterior is intractable.
• Our goal: design a tractable distribution that is as close to the true posterior as possible.
• KL divergence:
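For orientation, the standard identity behind this step is given below in assumed notation: $\mathbf{x}$ for the observed text fragments, $\mathbf{z}$ for all hidden variables, $p$ for the model, and $q$ for the tractable distribution. These symbols are illustrative and are not taken from the slides.

```latex
\mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\bigr)
  = \mathbb{E}_{q}\bigl[\log q(\mathbf{z})\bigr]
  - \mathbb{E}_{q}\bigl[\log p(\mathbf{x}, \mathbf{z})\bigr]
  + \log p(\mathbf{x})
```

Because $\log p(\mathbf{x})$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the lower bound $\mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$ on $\log p(\mathbf{x})$.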

Variational Method (2)
• Truncated stick-breaking process (Ishwaran and James, 2001)
  • Replace infinity with a truncation level K
• Max:
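A truncated stick-breaking draw can be sketched as below. This is a generic illustration of the construction, not the authors' code: the function name, the concentration parameter `alpha`, and the choice `K=10` are assumptions made for the example.

```python
import random


def truncated_stick_breaking(alpha, K, rng):
    """Draw K mixture proportions from a stick-breaking process truncated at K."""
    remaining = 1.0  # length of the stick that has not been assigned yet
    proportions = []
    for k in range(K):
        # At the truncation level the last break takes the whole remaining
        # stick, so the K proportions sum to 1.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        proportions.append(v * remaining)
        remaining *= 1.0 - v
    return proportions


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pi = truncated_stick_breaking(alpha=1.0, K=10, rng=rng)
print(pi, sum(pi))
```

Setting the final break to 1 is what replaces the infinite sequence of components with a finite one.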

Variational Method (3)
• One important variational parameter:
  • How likely is it that a text fragment comes from the k-th component?
  • Attribute normalization!

Variational Method (4)
• Another important variational parameter:
  • How likely is it that a text fragment should be extracted?
  • Attribute extraction!
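How the two variational parameters from the Variational Method (3) and (4) slides could be read off after inference is sketched below. The names `phi` and `target_prob`, the 0.5 threshold, and all numbers are made up for illustration; the slides do not state how the final decisions are taken.

```python
def normalize_and_extract(phi, target_prob, threshold=0.5):
    """Return (component index, extract?) for each text fragment.

    phi[n][k]      -- assumed probability that fragment n comes from component k
    target_prob[n] -- assumed probability that fragment n should be extracted
    """
    decisions = []
    for row, p in zip(phi, target_prob):
        # Normalization: assign the fragment to its most likely component.
        component = max(range(len(row)), key=row.__getitem__)
        # Extraction: keep the fragment if it is likely to be an attribute.
        decisions.append((component, p >= threshold))
    return decisions


# Hypothetical values for three fragments and a truncation level of 3.
phi = [
    [0.85, 0.10, 0.05],  # "White balance Auto, daylight, cloudy, ..."
    [0.70, 0.20, 0.10],  # "Cloudy, daylight"
    [0.10, 0.60, 0.30],  # "View larger image"
]
target_prob = [0.95, 0.80, 0.10]

decisions = normalize_and_extract(phi, target_prob)
print(decisions)
```

In this toy input the first two fragments land in the same component, which is what attribute normalization is meant to achieve, and the third is not extracted.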

Unsupervised Approach
• What should be extracted?
• Make use of prior knowledge about a domain.
  • Only a few terms about the product attributes
  • E.g., resolution, light sensitivity, shutter speed, etc.
  • Can be easily obtained, for example, by just highlighting the attributes on a Web page
• Initialization
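One way such prior-knowledge terms might seed the initialization is sketched below. This is an assumption for illustration only: the slides do not describe the initialization procedure, and the `initialize_targets` helper and the substring-matching rule are hypothetical.

```python
def initialize_targets(fragments, seed_terms):
    """Hypothetical initialization: flag fragments that mention a seed term."""
    seeds = [term.lower() for term in seed_terms]
    return [
        1 if any(seed in fragment.lower() for seed in seeds) else 0
        for fragment in fragments
    ]


# Seed terms taken from the slide; the fragments are made-up examples.
seed_terms = ["resolution", "light sensitivity", "shutter speed"]
fragments = [
    "Effective sensor resolution 10.1 megapixels",
    "View larger image",
    "Shutter speed 1/2000 s",
]

initial_targets = initialize_targets(fragments, seed_terms)
print(initial_targets)
```

Fragments flagged with 1 would start out as likely attributes, and inference would then revise these initial guesses.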

Experiments
• We conducted experiments on four different domains:
  • Digital camera: 85 Web pages from 41 different sites
  • MP3 player: 96 Web pages from 62 different sites
  • Camcorder: 111 Web pages from 61 different sites
  • Restaurant: 29 Web pages from LA-Weekly Restaurant Guide
• In each domain, we conducted 10 runs of experiments.
• In each run, we randomly selected a Web page and used the attributes inside it as prior knowledge.

Evaluation on Attribute Normalization
• Baseline approach:
  • Agglomerative clustering
  • Edit distance between text fragments
• Evaluation metrics:
  • Pairwise recall (R)
  • Pairwise precision (P)
  • Pairwise F1-measure (F)
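The pairwise metrics can be sketched as follows, assuming the usual definition in which every pair of text fragments counts as correct when both the predicted and the gold clustering put the pair in the same group. The slides do not spell out the definition, and the labels in the example are made up.

```python
from itertools import combinations


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 between two clusterings of the same items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1  # pair correctly grouped together
        elif same_pred:
            fp += 1  # pair grouped together but should be apart
        elif same_gold:
            fn += 1  # pair split apart but should be together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical gold attributes and predicted cluster ids for five fragments.
gold = ["wb", "wb", "wb", "res", "res"]
predicted = [0, 0, 1, 1, 1]

p, r, f = pairwise_scores(predicted, gold)
print(p, r, f)  # 0.5 0.5 0.5
```

Pair counting makes the scores independent of how the cluster ids are named, which matters because an unsupervised method has no fixed label for each cluster.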

Results of Attribute Normalization

Visualize the Normalized Attributes
• The top five weighted terms in the ten largest normalized attributes in the digital camera domain:

Evaluation on Attribute Extraction
• Surprisingly, in the restaurant domain, our framework achieves an F1-measure of 0.95, which is comparable to that of the supervised method (Muslea et al., 2001).

Conclusions (1)
• We aim to simultaneously extract and normalize product attributes from Web pages collected from different sites.
• Our method considers page-independent content information and page-dependent layout information.
• We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages.

Conclusions (2)
• We have designed an unsupervised inference algorithm based on the variational method.
• We formally show that content and layout information can collaborate and improve both extraction and normalization performance under our model.
• Experiments on four different domains have been conducted to show the robustness and effectiveness of our approach.

Questions and Answers
