Mapping Between Taxonomies. Elena Eneva 11 Dec 2001 Advanced IR Seminar. Mapping Between Taxonomies. Formal systems of orderly classification of knowledge, which are designed for a specific purpose
Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar
Mapping Between Taxonomies
• Taxonomies: formal systems for the orderly classification of knowledge, each designed for a specific purpose
• Companies organize the same information in various ways (e.g. one taxonomy for marketing, another for product development)
Approach
[Figure, built up over ten slides: the same documents (shown as "abc") organized under two taxonomies, "By country" (German, French) and "By industry" (Textile, Automobile). The build-up illustrates mapping documents from one taxonomy to the other.]
Datasets
Two classification schemes:
• Reuters 2001 (807900 docs)
  • Topics (127)
  • Industry categories (871)
  • Regions (376)
• Hoovers-255 and Hoovers-28 (4286 docs)
  • Industry categories (28)
  • Industry categories (255)
Learning
• Two separate methods of learning for the documents:
  • Old doc category -> new doc category
  • Doc contents -> new doc category
• Combined method:
  • Weighted average based on confidence
  • Final result determined by a decision tree
• One combined learner: uses both the old category and the contents as features
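As a minimal sketch of the last bullet, a single learner can see both sources of evidence if the old category is added to the bag of words as one extra feature. The function name and the "cat=" / "w=" prefixes are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def combined_features(old_category, text):
    """Build one feature bag holding both the old label and the words.

    The "cat=" / "w=" prefixes are an illustrative convention that keeps a
    category name from colliding with an identical word.
    """
    features = Counter(f"w={token}" for token in text.lower().split())
    features[f"cat={old_category}"] += 1
    return features

# Toy document; the category and text are invented for illustration.
doc = combined_features("Textile", "German wool exports rise as wool prices fall")
print(doc["cat=Textile"], doc["w=wool"])
```

Any classifier that accepts sparse count features could then be trained on these bags in place of plain word counts.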
Simple Learners
• Simple Decision Tree (C4.5): learns probabilities of new categories based on one kind of feature:
  • Old categories (doesn't know about documents/words)
  • Word-based classification (doesn't know about old categories)
• Naïve Bayes (rainbow)
  • Old categories (doesn't know about documents/words)
  • Word-based classification (doesn't know about old categories)
• Support Vector Machine (SVM-Light)
  • Word-based classification (doesn't know about old categories), linear kernel [results will be reported in the final paper]
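To make the two feature views concrete, here is a small multinomial Naive Bayes with add-one smoothing, trained once on old-category labels only and once on words only. It is a sketch written from scratch, not the rainbow implementation used in the talk, and the toy categories and texts are invented.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(token_lists, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, tokens):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[token] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space so the probabilities sum to 1.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}

# Toy data: (old category, text, new category); all values are invented.
train = [
    ("Cotton Mills", "cotton fabric mill", "Textile"),
    ("Silk Weaving", "silk fabric loom", "Textile"),
    ("Car Parts", "engine brake parts", "Automobile"),
    ("Car Makers", "car engine factory", "Automobile"),
]
new_labels = [new for _, _, new in train]

# View 1: the only feature is the old category (no access to the words).
label_nb = TinyNaiveBayes().fit([[old] for old, _, _ in train], new_labels)
# View 2: the only features are the words (no access to the old category).
word_nb = TinyNaiveBayes().fit([text.split() for _, text, _ in train], new_labels)

print(label_nb.predict_proba(["Car Parts"]))
print(word_nb.predict_proba("fabric mill".split()))
```

Both models return a probability for every new category, which is the form the combined learners need as input.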
Learning
• Using the document content
• Using the document labels
[Figure: in each case the documents (shown as "abc") are passed to a DT, NB, or SVM classifier.]
Combined Learners
• Weighted Average
  • Voting scheme
• Combination Decision Tree
  • Takes the outputs and confidences of two of the simple learners and predicts the new category
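The two combination schemes can be sketched as follows. The slides do not say how confidence was measured, so this sketch assumes it is each learner's highest class probability; the function names and the example probabilities are likewise hypothetical.

```python
def weighted_average(dist_a, dist_b):
    """Blend two class distributions, weighting each by its own confidence.

    Assumption: confidence is the learner's highest class probability.
    """
    conf_a, conf_b = max(dist_a.values()), max(dist_b.values())
    classes = set(dist_a) | set(dist_b)
    return {
        c: (conf_a * dist_a.get(c, 0.0) + conf_b * dist_b.get(c, 0.0))
        / (conf_a + conf_b)
        for c in classes
    }

def combiner_features(dist_a, dist_b):
    """Build the input row for a combination classifier (e.g. a decision tree):
    each simple learner's predicted class and the confidence of that prediction.
    """
    pred_a = max(dist_a, key=dist_a.get)
    pred_b = max(dist_b, key=dist_b.get)
    return (pred_a, dist_a[pred_a], pred_b, dist_b[pred_b])

# Invented outputs of a label-based and a word-based learner for one document.
from_labels = {"Textile": 0.9, "Automobile": 0.1}
from_words = {"Textile": 0.4, "Automobile": 0.6}

blended = weighted_average(from_labels, from_words)
print(max(blended, key=blended.get))
print(combiner_features(from_labels, from_words))
```

With the weighted average, the more confident learner dominates when the two disagree; with the combination classifier, a third model learns from training data when to trust which learner.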
Learning
• Using both the content and the label
• Combining the two outputs
[Figure: diagram of documents (shown as "abc") flowing through classifiers; surviving labels: DT; DT, NB, SVM; DT, NB, SVM; voting; 3rd classifier.]
Results: Words Only
• 5-fold cross-validation
Results: Categories Only
• 5-fold cross-validation
Results: Combination
• 5-fold cross-validation
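The 5-fold cross-validation behind all three result sets can be sketched as below. The helper names are hypothetical, the sketch assumes at least `k` examples, and `train_fn` / `predict_fn` stand in for whichever learner is being evaluated.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

def cross_validate(examples, labels, train_fn, predict_fn, k=5):
    """Average accuracy over k folds; train_fn and predict_fn are caller-supplied."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Each document is used for testing exactly once and for training in the other four folds, so every reported number is measured on documents the learner did not see during training.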
Remarks
• Hierarchy (old classes) is usually ignored
• Shown here that it helps
• Learners are not the issue
• Better way of understanding
• Old label (or hierarchy path) is metadata
Remaining Work
• SVM results (running even as we speak)
• Repeat experiments on Reuters-2001
• Internal hierarchies
• Missing labels
• Less correlated types of classes
• Results in standard evaluation format
Future Work
• Try with a web dataset (Google and Yahoo! hierarchies)
• Hierarchies with more levels
• Metadata (for non-text sources)
Related Literature
• Y. Yang, S. Slattery, and R. Ghani. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).
• A. Doan, P. Domingos, and A. Levy. Learning Mappings between Data Schemas. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.
Questions and Suggestions The end.
Taxonomies
• Formal systems of orderly classification of knowledge, which are designed for a specific purpose
• Change of purpose, change of taxonomies
• Businesses often need and keep the information in several structures
• Important to be able to automatically map between taxonomies
Useful Mappings
• Companies organizing information in various ways (e.g. one taxonomy for marketing, another for product development)
• Personal online bookmark classification
• Search engines (e.g. Google <-> Yahoo)
• EU Committee for Standardization: “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”