Catalog Integration
R. Agrawal, R. Srikant: WWW-10
Catalog Integration Problem
• Integrate products from the new catalog into the master catalog.
The Problem (cont.)
• After integration: each product from the new catalog appears under the appropriate category of the master catalog.
Desired Solution
• Automatically integrate products:
  • little or no effort on the part of the user
  • domain independent
• Problem size:
  • millions of products
  • thousands of categories
Model
• Product descriptions consist of words.
• Products live in the leaf-level categories.
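To make the model concrete, here is a minimal sketch of a product representation; the `Product` class and its field names are illustrative assumptions, not from the paper.

```python
# Illustrative representation of the model (class and field names are
# assumptions, not from the paper).
from dataclasses import dataclass

@dataclass
class Product:
    description: list[str]  # the words of the product description
    category: str           # the leaf-level category the product lives in

p = Product(description=["ceramic", "capacitor", "100nF"],
            category="Passive/Capacitors/Ceramic")
print(p.category)
```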
Basic Algorithm
• Build a classification model using product descriptions in the master catalog.
• Use the classification model to predict categories for products in the new catalog.
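A minimal sketch of this two-step pipeline, using scikit-learn's MultinomialNB as a stand-in for the paper's own Naive-Bayes implementation; the toy catalogs are invented for illustration.

```python
# Two-step basic algorithm: train on the master catalog M, predict for the
# new catalog N. scikit-learn is a stand-in; the paper rolls its own NB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: build the classification model from the master catalog M.
master_descriptions = ["ceramic capacitor 100nF", "npn transistor to-92",
                       "carbon film resistor 10k"]
master_categories   = ["capacitors", "transistors", "resistors"]

vectorizer = CountVectorizer()
X_master = vectorizer.fit_transform(master_descriptions)
model = MultinomialNB().fit(X_master, master_categories)

# Step 2: predict master-catalog categories for products in the new catalog N.
new_descriptions = ["tantalum capacitor 10uF", "metal film resistor 1k"]
X_new = vectorizer.transform(new_descriptions)
print(model.predict(X_new))  # e.g. ['capacitors' 'resistors']
```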
Accuracy on Pangea Data
• B2B portal for electronic components:
  • 1200 categories, 40K training documents.
  • 500 categories with < 5 documents.
• Accuracy:
  • 72% for top choice.
  • 99.7% for top 5 choices.
Enhanced Algorithm: Intuition
• Use the affinity information in the catalog to be integrated (the new catalog):
  • Products in the same category are similar.
• Bias the classifier to incorporate this information.
• The accuracy boost depends on the quality of the new catalog:
  • Use a tuning set to determine the amount of bias.
Algorithm
• Extension of the Naive-Bayes classifier to incorporate affinity information.
Naive Bayes Classifier
• Pr(Ci|d) = Pr(Ci) Pr(d|Ci) / Pr(d)  // Bayes' rule
• Pr(d): same for all categories (ignore).
• Pr(Ci) = #docs in Ci / #total docs
• Pr(d|Ci) = ∏_{w∈d} Pr(w|Ci)
  • Words occur independently (unigram model).
• Pr(w|Ci) = (n(Ci,w) + α) / (n(Ci) + α|V|)
  • Maximum-likelihood estimate smoothed with Lidstone's law of succession; α is the smoothing parameter, |V| the vocabulary size, n(Ci,w) the count of word w in Ci, and n(Ci) the total word count of Ci.
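A from-scratch sketch of this scoring, computed in log space to avoid underflow; the value of α and the toy documents are illustrative assumptions.

```python
# Naive-Bayes scoring with Lidstone smoothing, in log space.
import math
from collections import Counter, defaultdict

docs = [("ceramic capacitor 100nF", "capacitors"),
        ("tantalum capacitor smd", "capacitors"),
        ("carbon film resistor 10k", "resistors")]
alpha = 0.1                          # Lidstone smoothing parameter (assumed)

word_counts = defaultdict(Counter)   # n(Ci, w): count of word w in category Ci
class_docs = Counter()               # #docs in Ci
vocab = set()                        # V
for text, c in docs:
    words = text.split()
    word_counts[c].update(words)
    class_docs[c] += 1
    vocab.update(words)

def log_posterior(words, c):
    """log Pr(Ci) + sum_w log Pr(w|Ci); Pr(d) dropped (same for all classes)."""
    n_c = sum(word_counts[c].values())        # n(Ci): total words in Ci
    lp = math.log(class_docs[c] / len(docs))  # Pr(Ci) = #docs in Ci / #docs
    for w in words:
        lp += math.log((word_counts[c][w] + alpha) / (n_c + alpha * len(vocab)))
    return lp

d = "smd capacitor 10uF".split()
print(max(class_docs, key=lambda c: log_posterior(d, c)))  # -> capacitors
```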
Enhanced Algorithm
• Pr(Ci|d,S), where S is the category d occupied in the new catalog:
  Pr(Ci|d,S)
  = Pr(Ci,d,S) / Pr(d,S)
  = Pr(Ci) Pr(S,d|Ci) / Pr(d,S)
  = Pr(Ci) Pr(S|Ci) Pr(d|Ci) / Pr(S,d)  // assuming d, S independent given Ci
  = Pr(S) Pr(Ci|S) Pr(d|Ci) / Pr(S,d)   // Pr(S|Ci) Pr(Ci) = Pr(Ci|S) Pr(S)
  = Pr(Ci|S) Pr(d|Ci) / Pr(d|S)         // Pr(S,d) = Pr(S) Pr(d|S)
• Same as NB except that Pr(Ci|S) replaces Pr(Ci).
• Ignore Pr(d|S) as it is the same for all classes.
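The net effect of the derivation is a one-line change to the scorer: the prior Pr(Ci) is swapped for Pr(Ci|S). A sketch with invented numbers (the Pr(w|Ci) table and both priors are assumptions; computing Pr(Ci|S) is the next slide):

```python
# Basic vs. enhanced NB: only the prior term differs.
import math

classes = ["capacitors", "resistors"]
prior         = {"capacitors": 0.5, "resistors": 0.5}  # Pr(Ci), basic
prior_given_s = {"capacitors": 0.9, "resistors": 0.1}  # Pr(Ci|S), enhanced
log_pw = {("chip", "capacitors"): math.log(0.05),      # log Pr(w|Ci), invented
          ("chip", "resistors"):  math.log(0.20)}

def score(words, c, class_prior):
    # log prior + sum_w log Pr(w|Ci); Pr(d) / Pr(d|S) dropped (class-constant)
    return math.log(class_prior[c]) + sum(log_pw[(w, c)] for w in words)

d = ["chip"]
print(max(classes, key=lambda c: score(d, c, prior)))          # -> resistors
print(max(classes, key=lambda c: score(d, c, prior_given_s)))  # -> capacitors
```

The second call shows the bias in action: the affinity prior from S overrides an ambiguous word likelihood.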
Computing Pr(Ci|S)
• Pr(Ci|S) = |Ci| · (#docs in S predicted to be in Ci)^w / Σ_{j∈[1,n]} |Cj| · (#docs in S predicted to be in Cj)^w
• |Ci| = #docs in Ci in the master catalog.
• w determines the weight of the new catalog.
• Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known.
• Choose one weight for the entire new catalog, or different weights for different sections.
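A direct transcription of the formula; the counts and the value of w are invented. In practice w would be chosen by scoring a few candidate values on the tune set and keeping the most accurate one.

```python
# Pr(Ci|S) = |Ci| * k_i^w / sum_j |Cj| * k_j^w, where k_i is the number of
# docs in S that the basic classifier predicted to be in Ci (toy counts).
size      = {"capacitors": 400, "resistors": 600}  # |Ci| in the master catalog
predicted = {"capacitors": 8,   "resistors": 2}    # predictions for docs in S
w = 5.0                                            # weight; tuned on tune set

def pr_c_given_s(c):
    denom = sum(size[j] * predicted[j] ** w for j in size)
    return size[c] * predicted[c] ** w / denom

for c in size:
    print(c, round(pr_c_given_s(c), 4))
# w = 0 recovers the basic algorithm's prior (proportional to |Ci|);
# larger w places more trust in the new catalog's grouping.
```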
Superiority of the Enhanced Algorithm
• Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
• Catch: The optimum value of the weight for which the enhanced algorithm achieves its highest accuracy is data dependent.
• The tune-set method attempts to select a good value for the weight, but there is no guarantee of success.
Empirical Evaluation
• Start with a real catalog M.
• Remove n products from M to form the new catalog N.
• In the new catalog N:
  • Assign f·n products to the same category as in M.
  • Assign the rest to other categories as per some distribution (but remember their true category).
• Accuracy: fraction of products in N assigned to their true categories.
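A sketch of this protocol; the uniform perturbation over the other categories is just one of the distributions one might use, and the placeholder classifier at the end is an assumption standing in for the basic or enhanced algorithm.

```python
# Build the new catalog N from a real catalog M, then measure accuracy.
import random

random.seed(0)
categories = ["capacitors", "resistors", "transistors"]

# A stand-in for the real catalog M: (description, true_category) pairs.
M = [(f"product {i}", random.choice(categories)) for i in range(1000)]

n, f = 200, 0.7
N = []
for desc, true_cat in random.sample(M, n):
    if random.random() < f:
        new_cat = true_cat                  # f*n products keep their category
    else:                                   # the rest are perturbed
        new_cat = random.choice([c for c in categories if c != true_cat])
    N.append((desc, new_cat, true_cat))     # remember the true category

# Run a classifier over N (placeholder: trust the new catalog's category),
# then measure the fraction of products assigned to their true category.
predictions = [new_cat for _, new_cat, _ in N]
accuracy = sum(p == t for p, (_, _, t) in zip(predictions, N)) / n
print(f"accuracy = {accuracy:.2f}")         # ~f for this placeholder
```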
Summary
• Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.
• Open question: how to apply these ideas to other types of classifiers?