Karol Furdík, Ján Paralič, Gabriel Tutoky
Karol.Furdik@intersoft.sk, {Jan.Paralic, Gabriel.Tutoky}@tuke.sk
Technical University of Košice, Slovakia
Meta-learning for automatic selection of algorithms for text classification
September 24-26, 2008, University of Zagreb, Varaždin, Croatia
Introduction
• Text classification
  • A method for extracting knowledge from textual documents
  • Originally designed as a semi-automatic procedure in which users were responsible for selecting the proper classification settings
  • Most applications (e.g. the KP-Lab project, http://www.kp-lab.org) require fully automated text classification
• Meta-learning
  • Automates the text classification process by automatically selecting suitable algorithms
K. Furdik, J. Paralič, G. Tutoky: Meta-learning for automatic selection of algorithms for text classification. CECIIS 2008, University of Zagreb, Varaždin, Croatia, September 24-26, 2008
Text classification: a two-step process
• Creation of the classifier: training set of documents → preprocessing of documents → learning of the classifier → classifier
• Usage of the classifier: document of unknown category → preprocessing of the current document → classifier application → categorized document
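The two steps can be sketched with a deliberately tiny example. This is only an illustration under stated assumptions: the term-count "profile" classifier, the function names, and the four training documents are hypothetical and are not the classifiers used in the presentation.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def learn_classifier(training_set):
    """Step 1: build one term-count profile per category from labeled documents."""
    profiles = {}
    for text, category in training_set:
        profiles.setdefault(category, Counter()).update(preprocess(text))
    return profiles

def apply_classifier(profiles, text):
    """Step 2: pick the category whose (length-normalized) profile overlaps most."""
    tokens = Counter(preprocess(text))

    def score(category):
        profile = profiles[category]
        total = sum(profile.values())
        return sum(n * profile[term] for term, n in tokens.items()) / total

    return max(profiles, key=score)

# Hypothetical toy training set (assumption, not data from the talk).
training_set = [
    ("wheat and corn harvest prices", "grain"),
    ("corn exports rise", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bank loans", "finance"),
]
profiles = learn_classifier(training_set)
label = apply_classifier(profiles, "corn harvest report")
print(label)
```

The same `preprocess` function is applied in both steps, which mirrors the diagram: a document of unknown category must pass through the same preprocessing as the training documents before the classifier is applied.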
Meta-learning, MUDOF algorithm
• MUDOF: Meta-learning Using Document Feature Characteristics
  • Introduced in 2002 by Wai Lam and Kwok-Yin Lai
• Meta-learning targets:
  • Selection of algorithms for classifiers
  • Algorithms are selected at the category level (a different algorithm can be selected for each category)
  • Automate and optimize the classifier creation process
Meta-learning: scheme (1/4)
[Diagram, build 1 of 4. Construction of the meta-model: training set for creation of the meta-model (TM); testing set of documents (TE); values of effectiveness; meta-model. Usage of the meta-model: classifier.]
Values of effectiveness
• The algorithms A1, ..., An are applied one by one to the categories C1, ..., Cm of the training set
• This creates n × m binary classifiers
• The binary classifiers are evaluated on the testing data collection
• The result is the effectiveness of each algorithm on each category
• This is the most computationally expensive step in meta-learning
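A minimal sketch of this step follows, assuming F1 as the effectiveness measure (as in the experiments). The two learners, their names, and the toy documents are hypothetical stand-ins for the algorithms A1, ..., An, which the slides do not name.

```python
def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def always_positive(train_docs, train_labels):
    """Hypothetical baseline learner: says yes to every document."""
    return lambda doc: True

def keyword_learner(train_docs, train_labels):
    """Hypothetical learner: keeps terms seen only in positive documents."""
    pos_terms, neg_terms = set(), set()
    for doc, label in zip(train_docs, train_labels):
        (pos_terms if label else neg_terms).update(doc.split())
    keywords = pos_terms - neg_terms
    return lambda doc: any(term in keywords for term in doc.split())

def effectiveness_table(algorithms, categories, train, test):
    """Train n x m binary classifiers and score each one on the test set."""
    table = {}
    for name, learn in algorithms.items():
        for category in categories:
            train_labels = [category in cats for _, cats in train]
            classify = learn([doc for doc, _ in train], train_labels)
            predicted = [classify(doc) for doc, _ in test]
            actual = [category in cats for _, cats in test]
            table[(name, category)] = f1_score(predicted, actual)
    return table

# Toy multi-label data (assumption): each document carries a set of categories.
train = [("wheat corn", {"grain"}), ("corn exports", {"grain", "trade"}),
         ("bank rates", {"finance"}), ("trade deficit", {"trade"})]
test = [("corn prices", {"grain"}), ("bank loans", {"finance"}),
        ("exports deficit", {"trade"})]
algorithms = {"always_positive": always_positive, "keyword": keyword_learner}
table = effectiveness_table(algorithms, ["grain", "finance", "trade"], train, test)
```

With n algorithms and m categories the nested loop trains and evaluates n × m classifiers, which is why this step dominates the cost of meta-learning.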
Meta-learning: scheme (2/4)
[Diagram, build 2 of 4. Construction of the meta-model: training set for creation of the meta-model (TM); testing set of documents (TE); values of effectiveness; feature characteristics of particular categories. Usage of the meta-model.]
Feature characteristics
• Categories are characterized statistically
• Examples of characteristics:
  • PosTr: ratio of positive to negative instances
  • AvgDocLen: average document length
  • AvgTermVal: average term weight
  • AvgTopInfoGain: average information gain of the best m terms
  • NumInfoGainThres: number of terms whose information gain exceeds a threshold value
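As a rough illustration, four of these characteristics can be computed for one category as below. The definitions follow the wording of the slide; the binary term-presence form of information gain, the parameter names `top_m` and `threshold`, and the toy documents are assumptions. AvgTermVal is omitted because it depends on a term-weighting scheme that the slides do not specify.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a positive/negative split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(docs, labels, term):
    """Information gain of term presence with respect to the category label."""
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(docs)

    def h(group):
        return entropy(sum(group), len(group) - sum(group))

    return (h(labels)
            - len(with_term) / n * h(with_term)
            - len(without) / n * h(without))

def category_features(docs, labels, top_m=2, threshold=0.5):
    """Feature characteristics of one category (docs are token lists)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    vocabulary = sorted({term for doc in docs for term in doc})
    gains = sorted((info_gain(docs, labels, t) for t in vocabulary), reverse=True)
    return {
        "PosTr": positives / negatives if negatives else float("inf"),
        "AvgDocLen": sum(len(doc) for doc in docs) / len(docs),
        "AvgTopInfoGain": sum(gains[:top_m]) / top_m,
        "NumInfoGainThres": sum(g > threshold for g in gains),
    }

# Toy tokenized documents and labels for one category (assumption).
docs = [["wheat", "corn"], ["corn", "exports"],
        ["bank", "rates", "rise"], ["trade", "deficit"]]
labels = [True, True, False, False]
features = category_features(docs, labels)
```

Each category yields one such feature vector; these vectors are the inputs from which the meta-model is later learned.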
Meta-learning: scheme (3/4)
[Diagram, build 3 of 4. Construction of the meta-model: training set for creation of the meta-model (TM); testing set of documents (TE); values of effectiveness; feature characteristics of particular categories; meta-model. Usage of the meta-model.]
Meta-model
• Models the relations between the feature characteristics of categories and the effectiveness of algorithms
• The meta-model can be:
  • Prediction (MUDOF_R): linear regression
  • Classification (MUDOF_K): k-NN
• Meta-model advantages:
  • Serves as an "engine" for selecting suitable algorithms
  • Can be used for more than one collection of documents
  • In the ideal case, the meta-model is learned only once and then reused for the selection of algorithms
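The classification variant can be sketched as a plain k-NN vote over category feature vectors. This is a hedged illustration of the idea, not the MUDOF_K implementation: the function name, the two-feature vectors, the value k = 3, and the labels "A1" and "A2" are assumptions.

```python
import math
from collections import Counter

def knn_select(meta_examples, features, k=3):
    """Vote among the k meta-training categories closest to `features`.

    Each meta-example pairs a category's feature vector with the algorithm
    that achieved the best effectiveness on that category.
    """
    nearest = sorted(meta_examples,
                     key=lambda example: math.dist(example[0], features))[:k]
    votes = Counter(best_algorithm for _, best_algorithm in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical meta-training data: (PosTr, AvgDocLen) -> best algorithm.
meta_examples = [
    ((0.1, 20.0), "A1"),
    ((0.2, 25.0), "A1"),
    ((1.0, 120.0), "A2"),
    ((0.9, 110.0), "A2"),
    ((0.8, 100.0), "A2"),
]
choice = knn_select(meta_examples, (0.15, 22.0))
print(choice)
```

The features are left unscaled here for brevity, so AvgDocLen dominates the distance; in practice some normalization of the feature characteristics would probably be needed before measuring distances.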
Meta-learning: scheme (4/4)
[Diagram, build 4 of 4. Construction of the meta-model: training set for creation of the meta-model (TM); testing set of documents (TE); values of effectiveness; feature characteristics of particular categories; meta-model. Usage of the meta-model: training set for creation of the classifier (TC); feature characteristics of particular categories; meta-model; selection of algorithms for particular categories; learning of classifiers; classifier.]
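The usage phase can be sketched with the prediction variant: a linear model per algorithm estimates its effectiveness from a category's feature characteristics, and the algorithm with the highest estimate is selected for that category. The regression weights, feature values, and names below are hypothetical; in the scheme above they would come from the construction phase and from the TC set.

```python
def predict_effectiveness(weights, features):
    """Linear meta-model: intercept plus weighted sum of category features."""
    intercept, *coefficients = weights
    return intercept + sum(c * x for c, x in zip(coefficients, features))

def select_algorithms(meta_model, category_features):
    """For each category, pick the algorithm with the highest predicted value."""
    selection = {}
    for category, features in category_features.items():
        selection[category] = max(
            meta_model,
            key=lambda alg: predict_effectiveness(meta_model[alg], features),
        )
    return selection

# Hypothetical weights (intercept, PosTr, AvgDocLen), one model per algorithm.
meta_model = {
    "A1": (0.50, 0.40, 0.000),
    "A2": (0.70, -0.10, 0.001),
}
# Hypothetical feature characteristics of two categories from the TC set.
category_features = {"C1": (0.05, 80.0), "C2": (0.90, 60.0)}
selection = select_algorithms(meta_model, category_features)
print(selection)
```

Once the selection is made, each category's binary classifier is trained on TC with its chosen algorithm; no further evaluation of all n × m combinations is required at this stage.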
Data description
• Reuters-21578
  • 10 788 documents; 90 categories
  • TM (3815); TC (3961); TE (3019)
  • Unbalanced data
• 20 Newsgroups
  • 19 997 documents; 20 categories
  • TC (10 025); TE (9972)
  • Well-balanced data
Experiment 1 (1/3)
• Tests the meta-learning approach on a single data set (the Reuters text collection)
• Assumption: the training set is divided into:
  • Training set for creation of the meta-model (TM)
  • Training set for creation of the classifier (TC)
• Target:
  • Increase the effectiveness of the final classifier in comparison with the base classifiers
Experiment 1 (2/3)
• Classifier effectiveness: F1 optimized measure
Experiment 1 (3/3)
• Selection of algorithms: over AVERAGE
Experiment 2 (1/3)
• Tests the usability of the meta-learning approach on two different sets of documents (Reuters and 20 Newsgroups)
• Assumptions:
  • The training set of one data collection is used for creation of the meta-model
  • The training set of the other data collection is used for creation of the classifier
• Targets:
  • Fully automatic selection of algorithms without re-learning the meta-model (a meta-model learned on the other data collection is used)
  • Better effectiveness of the classifier
Experiment 2 (2/3)
• Classifier effectiveness: F1 optimized measure
Experiment 2 (3/3)
• Selection of algorithms: over AVERAGE
Conclusion
• Advantages of meta-learning
  • Fully automated text categorization: the selection of algorithms is automatic
  • Increased effectiveness of the final classifier (on one data collection)
  • One meta-model is usable for various data collections
• Disadvantages of meta-learning
  • Requires substantial computing capacity and time