An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites

Tak-Lam Wong

Dept. of Computer Science and Engineering

The Chinese University of Hong Kong

Wai Lam, Tik-Shun Wong

Dept. of Systems Engineering and Engineering Management

The Chinese University of Hong Kong

@ SIGIR 2008 – Singapore


Presentation Outline

  • Introduction

  • Problem Definition

  • Our Model

  • Inference Method

  • Experimental Results

  • Conclusions


Motivation

(Source: http://www.crayeon3.com)

(Source: http://www.superwarehouse.com)


Information Extraction

  • To extract product attributes:

  • Prior knowledge about content

    • Effective sensor resolution

  • Layout format

    • White balance, shutter speed

  • Mutual influence

    • Light sensitivity


Attribute Normalization

  • Samples of extracted text fragments from a page:

    • Cloudy, daylight, etc…

    • What do they refer to?

  • A text fragment extracted from another page:

    • white balance auto, daylight, cloudy, tungsten, … …

  • Attribute normalization:

    • To cluster text fragments that refer to the same attribute into the same group

    • Better indexing for product search

    • Easier understanding and interpretation


Existing Works

  • Supervised wrapper induction

    • They need training examples.

    • The wrapper learned from a Web site cannot be applied to other sites.

  • Template-independent extraction (Zhu et al., 2007)

    • They cannot handle previously unseen attributes.

  • Unsupervised wrapper learning (Crescenzi et al., 2001)

    • Extracted data are not normalized.


Contributions

  • Unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites.

  • Our framework considers page-independent content information and page-dependent layout information.

  • Can extract an unlimited number of product attributes (Dirichlet process)

  • Can visualize the semantic meaning of each product attribute


Problem Definition (1)

  • A product domain

    • E.g., the digital camera domain

  • A set of reference attributes for the domain

    • E.g., “resolution”, “white balance”, etc.

    • A special element representing “not-an-attribute”

  • A collection of Web pages from any Web sites, each of which contains a single product

  • Consider any text fragment from a Web page


Problem Definition (2)

Line separator

<TR>
  <TD><P><SPAN>White balance</SPAN></P></TD>
  <TD><P><SPAN>Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom</SPAN></P></TD>
</TR>
<TR>

Line separator
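
The way line separators cut a page into text fragments can be sketched roughly as follows. This is only an illustration: the set of tags treated as separators (`SEPARATOR_TAGS`) and the names `FragmentSplitter` and `split_fragments` are assumptions, not taken from the slides.

```python
from html.parser import HTMLParser

# Assumption: these tags act as "line separators"; the slides do not list them.
SEPARATOR_TAGS = {"tr", "td", "p", "br", "div", "li"}


class FragmentSplitter(HTMLParser):
    """Collects the text found between consecutive separator tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty fragments.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.fragments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_fragments(html):
    """Return the list of text fragments delimited by separator tags."""
    parser = FragmentSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return parser.fragments
```

Applied to the table row above, `split_fragments` would return two fragments: the attribute name and the list of white balance settings.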


Problem Definition (3)

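  • Information extraction: determine the target information T of a text fragment, i.e., whether it should be extracted

  • Attribute normalization: determine the attribute information A of a text fragment, i.e., which reference attribute it refers to

  • Joint attribute extraction and normalization: determine T and A together, using both the layout information and the content information of the text fragment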


Problem Definition (4)

  • White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom

    • T=1

    • A=“white balance”

  • “Cloudy, daylight”

    • T=1

    • A=“white balance”

  • View larger image

    • T=0

    • A=“not-an-attribute”


Our Model

[Figure: graphical model with a Dirichlet process prior (infinite mixture model). Labels: S different Web sites; N text fragments; k-th component proportion; content information generation; target information generation.]


Generation Process

  • The joint probability for generating a particular text fragment given the model parameters:

  • Inference:

  • Intractable


Variational Method (1)

  • Finding the exact posterior is intractable

  • Our goal: design a tractable distribution that is as close to the true posterior as possible.

  • Closeness is measured by the KL divergence
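
For reference, the standard identity behind this goal can be written as below. The symbols are assumptions rather than the slide's own notation: x stands for the observed text fragments, z for the hidden variables, P(z | x) for the intractable posterior, and Q(z) for the tractable distribution.

```latex
\begin{align}
\mathrm{KL}\bigl(Q(z)\,\|\,P(z \mid x)\bigr)
  &= \mathbb{E}_{Q}\bigl[\log Q(z)\bigr] - \mathbb{E}_{Q}\bigl[\log P(z \mid x)\bigr], \\
\log P(x)
  &= \underbrace{\mathbb{E}_{Q}\bigl[\log P(x, z)\bigr] - \mathbb{E}_{Q}\bigl[\log Q(z)\bigr]}_{\text{lower bound}}
   + \mathrm{KL}\bigl(Q(z)\,\|\,P(z \mid x)\bigr).
\end{align}
```

Because log P(x) does not depend on Q, making the KL divergence smaller is the same as making the lower bound larger, which is why variational methods are usually stated as a maximization problem.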


Variational Method (2)

  • Truncated stick-breaking process (Ishwaran and James, 2001)

    • Replace infinity with a truncation level K

  • Max:
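
A minimal sketch of what truncation at level K means for the component proportions, assuming the usual stick-breaking construction with Beta(1, alpha) breaks. The function name `truncated_stick_breaking` and the parameter name `alpha` are illustrative choices, not names from the slides, and the slides do not say how the authors set these values.

```python
import random


def truncated_stick_breaking(alpha, K, rng=None):
    """Draw K component proportions from a truncated stick-breaking process.

    Each break v_k is drawn from Beta(1, alpha), except the last one, which is
    fixed to 1 so that the K proportions sum to one.
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights
```

For example, `truncated_stick_breaking(1.0, 10)` returns ten non-negative proportions that sum to one; a smaller `alpha` tends to concentrate the mass on the first few components.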


Variational Method (3)

  • One important variational parameter:

    • How likely is it that a text fragment comes from the k-th component?

    • Attribute normalization!


Variational Method (4)

  • Another important variational parameter:

    • How likely is it that a text fragment should be extracted?

    • Attribute extraction!
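
One way such parameters could be read off after inference is sketched below. Everything here is hypothetical: the names `phi` (per-fragment component probabilities) and `extract_prob` (per-fragment extraction probabilities), the 0.5 threshold, and the decision rule itself are assumptions for illustration, not the authors' procedure.

```python
def normalize_and_extract(phi, extract_prob, threshold=0.5):
    """Turn per-fragment probabilities into decisions.

    phi[n][k]       : assumed probability that fragment n comes from component k
    extract_prob[n] : assumed probability that fragment n should be extracted
    Returns a list of (most likely component, extract?) pairs.
    """
    decisions = []
    for responsibilities, p in zip(phi, extract_prob):
        # Attribute normalization: pick the component with the largest probability.
        component = max(range(len(responsibilities)),
                        key=responsibilities.__getitem__)
        # Attribute extraction: keep the fragment if its probability is high enough.
        decisions.append((component, p >= threshold))
    return decisions
```

Fragments assigned to the same component would then be treated as referring to the same normalized attribute.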


Unsupervised Approach

  • What should be extracted?

  • Make use of the prior knowledge about a domain.

    • Only a few terms about the product attributes

    • E.g., resolution, light sensitivity, shutter speed, etc.

    • Can be easily obtained, for example, by highlighting the attributes on a Web page

    • Used for initialization
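
As a rough illustration of how a few known terms might seed the model, the sketch below gives a high starting score to fragments that mention a seed term. The function name, the substring matching, and the 0.9 / 0.1 values are all assumptions; the slides only say that the prior knowledge is used for initialization.

```python
def initial_extraction_scores(fragments, seed_terms, high=0.9, low=0.1):
    """Assign a starting extraction score to each text fragment.

    Assumption: a fragment containing any seed term (case-insensitive
    substring match) starts with a high score, all others with a low one.
    """
    seeds = [term.lower() for term in seed_terms]
    scores = []
    for fragment in fragments:
        text = fragment.lower()
        scores.append(high if any(seed in text for seed in seeds) else low)
    return scores
```

With the seed terms from the slide, a fragment such as "Shutter speed 1/2000 sec" would start high, while "View larger image" would start low.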


Experiments

  • We have conducted experiments on four different domains:

    • Digital camera: 85 Web pages from 41 different sites

    • MP3 player: 96 Web pages from 62 different sites

    • Camcorder: 111 Web pages from 61 different sites

    • Restaurant: 29 Web pages from LA-Weekly Restaurant Guide

  • In each domain, we conducted 10 runs of experiments.

  • In each run, we randomly selected a Web page and used the attributes inside as prior knowledge.


Evaluation on Attribute Normalization

  • Baseline approach:

    • Agglomerative clustering

    • Edit distance between text fragments

  • Evaluation metrics:

    • Pairwise recall (R)

    • Pairwise precision (P)

    • Pairwise F1-measure (F)
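
The baseline and the metrics above can be sketched as follows. This is an illustrative reading, not the authors' code: the slides do not state the linkage criterion or stopping rule, so single-link merging with a distance `threshold` is an assumption, as are all function names.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def agglomerative(fragments, threshold):
    """Single-link agglomerative clustering (assumed linkage).

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than `threshold`. Returns one cluster label per fragment.
    """
    clusters = [[i] for i in range(len(fragments))]

    def link(c1, c2):
        return min(edit_distance(fragments[i], fragments[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[x], clusters[y]) > threshold:
            break
        clusters[x].extend(clusters[y])
        del clusters[y]  # safe because x < y

    labels = [0] * len(fragments)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def pairwise_prf(gold, predicted):
    """Pairwise precision, recall and F1 over all pairs of fragments."""
    pairs = list(combinations(range(len(gold)), 2))
    same_gold = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    same_pred = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    correct = len(same_gold & same_pred)
    precision = correct / len(same_pred) if same_pred else 0.0
    recall = correct / len(same_gold) if same_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A pair of fragments counts as correct when both the gold labels and the predicted clusters put the two fragments in the same group, so the metrics do not depend on how clusters are numbered.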


Results of Attribute Normalization


Visualize the Normalized Attributes

  • The top five weighted terms in the ten largest normalized attributes in the digital camera domain:


Evaluation on Attribute Extraction

  • Surprisingly, in the restaurant domain, our framework achieves a performance (0.95 F1-measure) that is comparable to the supervised method (Muslea et al., 2001).


Conclusions (1)

  • We aim at simultaneously extracting and normalizing product attributes from Web pages collected from different sites.

  • Our method considers page-independent content information and page-dependent layout information.

  • We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages.


Conclusions (2)

  • An unsupervised inference algorithm based on a variational method is designed.

  • We formally show that content and layout information can collaborate and improve both extraction and normalization performance under our model.

  • Experiments on four different domains have been conducted to show the robustness and effectiveness of our approach.


Questions and Answers