Organizing Structured Web Sources by Query Schemas: A Clustering Approach
Bin He
Joint work with: Tao Tao, Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign
Background: MetaQuerier – Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
MetaQuerier: System architecture
• Front-end: Query Execution
  • Result Compilation
  • Query Translation
  • Source Selection
• Deep Web Repository
  • Query Interfaces
  • Query Capabilities
  • Subject Domains
  • Unified Interfaces
• Back-end: Semantics Discovery
  • Database Crawler
  • Interface Extraction
  • Source Organization
  • Schema Matching
[Figure labels: The Deep Web, Query Web databases, Find Web databases]
In MetaQuerier, source organization means clustering query interfaces into implicit domains
[Figure: example domains Airfares, Automobiles, Books]
What are the representative features of query interfaces? Is the query schema the feature we are looking for?
[Figure: Interface Extraction [SIGMOD 2004] maps a Query Interface to a Query Schema, for example:]
  [Author; {contain}; text]
  [Title; {contain}; text]
  … …
  [Format; {=}; {hardcopy, paperback, …}]
  … …
Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: number of observations versus attribute index, one panel each for Airfares, Movies, and Hotels]
• Each domain contains a dominant range of attributes, distinctive from other domains
• Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies
• Source organization becomes the clustering of query schemas
Query schemas can be viewed as categorical data
• Query schemas as transactions:
  S1: {author, title, subject, ISBN}
  S2: {author, title, category, publisher}
  S3: {make, model, price, zip code}
  S4: {manufacturer, model, price}
  S5: {from, to, departure date, return date, number of passengers}
  S6: {departure city, arrival city, number of adults, number of children}
  ……
• Thus, we can apply algorithms for clustering categorical data
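The transaction view above can be sketched in a few lines of Python. This is only an illustration of the representation, not code from the talk: the helper name `to_binary_vector` and the choice of a sorted vocabulary are assumptions made here.

```python
# Query schemas from the slide, each treated as a transaction (a set of attributes).
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date", "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults", "number of children"},
}

# The vocabulary is the union of all attributes, kept in a fixed (sorted) order.
vocabulary = sorted(set().union(*schemas.values()))

def to_binary_vector(schema, vocabulary):
    """Encode a schema as a 0/1 vector over the vocabulary (hypothetical helper)."""
    return [1 if attr in schema else 0 for attr in vocabulary]

vectors = {name: to_binary_vector(s, vocabulary) for name, s in schemas.items()}
```

Each schema becomes one categorical record, which is the input form that clustering algorithms for categorical data expect.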
Clustering categorical data: Objective function
• Clustering needs an objective function to evaluate the quality of clusters
• Existing objective functions
  • Likelihood [1998] (Model-based clustering)
  • Context Linkage [ROCK 2000]
  • Entropy [COOLCAT 2002]
• In this paper, we propose a new objective function
  • Model-Differentiation
Model-Differentiation: A new objective function for model-based clustering
• Assumption of model-based clustering: each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
• What is a good clustering result? (our observation)
  • Data in different clusters are very dissimilar
  • So the models of different clusters are very dissimilar
  • Hence a new objective function: maximize the dissimilarity of models
• To realize it, we need to answer three questions:
  • How to model the data?
  • How to estimate the model, given data?
  • How to measure the dissimilarity of models?
Modeling: Multinomial distribution
• Each attribute is an independent event
• A schema is generated by a series of samplings from model M
• Vocabulary: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
• A schema: {title, author, ISBN}
• Probability: P1*P3*P4
[Figure: model M emits author, title, and ISBN with probabilities P1, P3, and P4]
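The probability computation on this slide can be written as a short Python sketch. The numeric probabilities below are made up for illustration (the slide gives only the symbols P1 to P7), and treating an attribute outside the model as probability 0 is an assumption of this sketch.

```python
from math import prod

def schema_probability(schema, model):
    """Probability of a schema under model M: the product of its attribute probabilities.

    An attribute missing from the model gets probability 0 (an assumption of this sketch).
    """
    return prod(model.get(attr, 0.0) for attr in schema)

# Illustrative values only; the slide names P1..P7 without giving numbers.
model = {"author": 0.3, "publisher": 0.1, "title": 0.25, "ISBN": 0.15,
         "city": 0.05, "price": 0.1, "model": 0.05}

# Corresponds to P1 * P3 * P4 for the schema {title, author, ISBN}.
p = schema_probability({"title", "author", "ISBN"}, model)
```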
Model estimation: Given a set of data, how to estimate its model?
• Maximum likelihood estimation
• Example:
  S1 = {title, author, ISBN}, S2 = {author, ISBN, publisher}
  S3 = {author, title, price}, S4 = {author, title, price}
  Vocabulary: author, title, ISBN, price, publisher
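The worked estimate for this example did not survive in the transcript. A minimal sketch, assuming the usual maximum likelihood estimate for a multinomial (each attribute's share of all attribute occurrences in the cluster), looks like this; `estimate_model` is a name introduced here for illustration.

```python
from collections import Counter

def estimate_model(schemas):
    """MLE of a multinomial model: relative frequency of each attribute occurrence."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}

schemas = [
    {"title", "author", "ISBN"},      # S1
    {"author", "ISBN", "publisher"},  # S2
    {"author", "title", "price"},     # S3
    {"author", "title", "price"},     # S4
]
model = estimate_model(schemas)
# author occurs 4 times among 12 attribute occurrences, so its estimate is 4/12.
```

Under this assumption the five vocabulary entries receive 4/12 (author), 3/12 (title), 2/12 (ISBN), 2/12 (price), and 1/12 (publisher).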
Measuring the dissimilarity of models: Statistical hypothesis testing
• Multinomial distributions can be tested directly with the χ² test
• Example: S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
  1. Combining S1 and S2: compare M<1,2> with M3
  2. Combining S1 and S3: compare M<1,3> with M2
  3. Combining S2 and S3: compare M<2,3> with M1
  [Figure: for each case, probability versus attributes for the two models]
• This inspires a hierarchical agglomerative clustering (HAC) algorithm
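The three candidate merges can be compared numerically. The slides do not spell out the exact test statistic or how it is turned into a similarity measure, so the sketch below uses the standard Pearson χ² statistic on the 2 × k table of attribute counts as a stand-in; the function names are introduced here, and both clusters are assumed to be non-empty.

```python
from collections import Counter

def attribute_counts(schemas):
    """Count attribute occurrences across all schemas of one cluster."""
    return Counter(attr for schema in schemas for attr in schema)

def chi_square(cluster_a, cluster_b):
    """Pearson chi-square statistic for the 2 x k table of attribute counts.

    Assumes both clusters are non-empty; larger values mean more dissimilar models.
    """
    counts = [attribute_counts(cluster_a), attribute_counts(cluster_b)]
    row_totals = [sum(c.values()) for c in counts]
    grand_total = sum(row_totals)
    statistic = 0.0
    for attr in set(counts[0]) | set(counts[1]):
        column_total = counts[0][attr] + counts[1][attr]
        for row in (0, 1):
            expected = row_totals[row] * column_total / grand_total
            statistic += (counts[row][attr] - expected) ** 2 / expected
    return statistic

S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "price"}
S3 = {"make", "model", "price"}

merge_12 = chi_square([S1, S2], [S3])  # models M<1,2> and M3
merge_13 = chi_square([S1, S3], [S2])  # models M<1,3> and M2
merge_23 = chi_square([S2, S3], [S1])  # models M<2,3> and M1
```

On this toy input the statistic is 6.75 for combining S1 and S2, 4.5 for combining S2 and S3, and 2.25 for combining S1 and S3, so merging the two book schemas leaves the most dissimilar pair of models. Counts this small are far too few for the χ² approximation to be reliable, so the numbers only illustrate the ranking.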
Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
• Distinguishable S1: with anchor attributes
• S1 and S2 should be in the same domain and thus pre-clustered
• How to decide whether an S is “distinguishable”?
[Figure labels: S1, S2, Sup(S1); any Si, Sj in Sup(S1)]
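Post-classification: Handling “loners”
• Loners: too small for the χ² test after pre-clustering
• Loners are separated out after pre-clustering and, once model clustering is done, assigned to clusters by a Naïve Bayesian classifier
[Figure labels: Pre-clustering, Separate, Model clustering, Naïve Bayesian]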
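Assigning a loner to an existing cluster can be sketched with a small multinomial naive Bayes classifier. The slide names only the technique, so the details here are assumptions: add-alpha smoothing, priors proportional to cluster size, ignoring attributes outside the training vocabulary, and the example clusters themselves.

```python
from collections import Counter
from math import log

def train(clusters, vocabulary, alpha=1.0):
    """Estimate per-cluster attribute log-probabilities with add-alpha smoothing."""
    total_schemas = sum(len(schemas) for schemas in clusters.values())
    models = {}
    for name, schemas in clusters.items():
        counts = Counter(attr for schema in schemas for attr in schema)
        denom = sum(counts.values()) + alpha * len(vocabulary)
        models[name] = {
            "log_prior": log(len(schemas) / total_schemas),
            "log_prob": {a: log((counts[a] + alpha) / denom) for a in vocabulary},
        }
    return models

def classify(loner, models):
    """Return the cluster with the highest posterior score for the loner schema."""
    def score(name):
        m = models[name]
        # Attributes never seen in training are skipped (an assumption of this sketch).
        return m["log_prior"] + sum(m["log_prob"][a] for a in loner if a in m["log_prob"])
    return max(models, key=score)

# Hypothetical clusters produced by the earlier steps.
clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"}, {"manufacturer", "model", "price"}],
}
vocabulary = sorted(set().union(*(s for group in clusters.values() for s in group)))
models = train(clusters, vocabulary)

book_loner = classify({"title", "ISBN", "category"}, models)
car_loner = classify({"model", "zip code"}, models)
```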
Experiments
• Data
• Questions to answer:
  • Can schema clustering effectively organize Web databases?
  • Can it build a domain hierarchy correctly?
We also try existing objective functions
• Three existing objective functions
  • Likelihood: maximize likelihood
  • Entropy: minimize entropy
  • Context Linkage: minimize cross links
• To be fair, we keep pre-clustering and post-classification and change only the measure used in the clustering step
Effectiveness of Clustering
• 8 domains, 8 clusters
• Most Web databases are clustered correctly
• Quantitative analysis: Conditional Entropy (the smaller, the better)
  Model-Differentiation: 0.32; Likelihood: 0.42; Entropy: 0.38; Context Linkage: 0.61
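For readers unfamiliar with the metric, conditional entropy of the true domain given the assigned cluster can be computed as sketched below. The slide does not state the logarithm base or any normalization, so base 2 is an assumption here, and the toy labels will not reproduce the reported figures.

```python
from collections import Counter, defaultdict
from math import log2

def conditional_entropy(cluster_labels, domain_labels):
    """H(domain | cluster) in bits; 0 means every cluster contains a single domain."""
    by_cluster = defaultdict(Counter)
    for cluster, domain in zip(cluster_labels, domain_labels):
        by_cluster[cluster][domain] += 1
    n = len(cluster_labels)
    entropy = 0.0
    for domain_counts in by_cluster.values():
        size = sum(domain_counts.values())
        for count in domain_counts.values():
            p = count / size
            # Weight each cluster's entropy by the fraction of sources it holds.
            entropy -= (size / n) * p * log2(p)
    return entropy

# Toy labels: a perfect clustering and a fully mixed one.
pure = conditional_entropy([0, 0, 1, 1], ["Books", "Books", "Movies", "Movies"])
mixed = conditional_entropy([0, 0, 0, 0], ["Books", "Books", "Movies", "Movies"])
```

A perfect clustering scores 0, and a single cluster mixing two domains evenly scores 1 bit, which is why smaller values indicate better clustering.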
To build a domain hierarchy
• After reaching 8 clusters, continue running the HAC algorithm to merge them further
• The result is consistent with common sense: close concepts are merged first
Conclusions
• Cluster Web databases using their query schemas
  • First work on clustering Web databases, not pages
  • Query schemas are good representatives
  • Essentially a problem of clustering categorical data
• A new objective function: Model-Differentiation
  • Realized by statistical hypothesis testing
  • Derives a different similarity measure for HAC