Classification tools. COMP 4332 Tutorial 4, March 4. TAN, Ben (btan@cse.ust.hk). Fact Sheets: Classifier. KDDCUP 2009. CLASSIFIER (overall usage=93%). Decision tree. Linear classifier. Non-linear kernel. About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss.
Classification tools • COMP 4332 Tutorial 4 • March 4 • TAN, Ben • btan@cse.ust.hk
Fact Sheets: Classifier • KDD Cup 2009 • Classifier (overall usage = 93%) • [Bar chart; axis: percent of participants (0 to 60); bars: Decision tree..., Linear classifier, Non-linear kernel, Other Classif, Neural Network, Naïve Bayes, Nearest neighbors, Bayesian Network, Bayesian Neural Network] • About 30% used logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss. • Fewer than 50% used regularization (20% 2-norm, 10% 1-norm). • Only 13% used unlabeled data.
Top 10 data mining algorithms • http://www.cs.uvm.edu/~icdm/algorithms/index.shtml • C4.5 • K-means • SVM • Apriori • EM • PageRank • AdaBoost • kNN • Naïve Bayes • CART (Classification and Regression Trees) Six are classification!!!
My view of classification • 1. Tree-based models • Trees and tree ensembles • 2. Linear family and its kernel extension • Least Squares Regression • Logistic Regression • SVM • 3. Others: Naïve Bayes, KNN
Tree-based models • Two most famous trees: C4.5 and CART • C4.5 has two solid implementations: • Ross Quinlan’s original C program, and the later C5.0. Source code available. • Weka’s J48 classifier. • CART: rpart package in R. • Tree ensembles: • AdaBoost • Bagging • RandomForest
Recommended tools for trees • For dense datasets, e.g. # of attributes < 1000, use Weka. • For sparse datasets, try FEST.
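Linear family and its kernel extension • Learning samples: $(x_i, y_i)$, $i = 1, \dots, n$ • Minimize classification loss: $\min_{w} \sum_{i=1}^{n} L\big(y_i, w^\top \phi(x_i)\big)$ • $\phi$: maps $x$ to its kernel space; for linear classification, $\phi(x) = x$ • $L$: loss function • Classification label: $\hat{y} = \operatorname{sign}\big(w^\top \phi(x)\big)$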
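Loss functions (label $y \in \{-1, +1\}$, real-valued classifier score $f(x)$) • Square loss: $L(y, f(x)) = (y - f(x))^2$ • Logistic loss: $L(y, f(x)) = \log\big(1 + e^{-y f(x)}\big)$ • Hinge loss: $L(y, f(x)) = \max\big(0, 1 - y f(x)\big)$ • More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf • First three lecture notes of Stanford CS 229, http://cs229.stanford.edu/materials.html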
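To see how the three losses on this slide differ, the sketch below evaluates each one for a label y in {-1, +1} and a real-valued score f. The function names are invented for this illustration and do not come from any of the tools mentioned in the tutorial.

```python
import math

def square_loss(y, f):
    """Square loss (y - f)^2. It also penalizes confident correct scores (y*f > 1)."""
    return (y - f) ** 2

def logistic_loss(y, f):
    """Logistic loss log(1 + exp(-y*f)), written to avoid overflow."""
    m = y * f
    if m < 0:
        # log(1 + e^{-m}) = -m + log(1 + e^{m}); e^{m} is small here
        return -m + math.log1p(math.exp(m))
    return math.log1p(math.exp(-m))

def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f). It is exactly zero once the margin y*f reaches 1."""
    return max(0.0, 1.0 - y * f)

if __name__ == "__main__":
    # Compare the losses for a positive example at several scores.
    for f in (-2.0, 0.0, 0.5, 2.0):
        print(f, square_loss(1, f), logistic_loss(1, f), hinge_loss(1, f))
```

All three are small when the score agrees with the label, but only the square loss grows again when the score is far on the correct side.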
Recommended tools for linear family • LibSVM and SVM Light for Kernel SVM • LibLinear for linear classifiers with different loss functions. • Online linear classification (very fast!): Stochastic Gradient Descent (Léon Bottou’s sgd and John Langford’s Vowpal Wabbit)
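The idea behind online linear classification can be sketched in a few lines. This is a toy stochastic gradient descent for logistic loss on sparse feature dictionaries, not the code of Bottou's sgd or Vowpal Wabbit. The function names, the fixed learning rate, and the absence of regularization are simplifying assumptions.

```python
import math
import random

def score(w, x):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(w.get(j, 0.0) * v for j, v in x.items())

def sgd_logistic(samples, epochs=20, lr=0.1, seed=0):
    """samples: list of (label in {-1, +1}, {feature_index: value}). Returns weights."""
    rng = random.Random(seed)
    w = {}
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in random order each epoch
        for i in order:
            y, x = samples[i]
            margin = max(-30.0, min(30.0, y * score(w, x)))  # clamp to avoid overflow
            # d/ds log(1 + exp(-y*s)) = -y / (1 + exp(y*s))
            g = -y / (1.0 + math.exp(margin))
            for j, v in x.items():  # only touch the non-zero features
                w[j] = w.get(j, 0.0) - lr * g * v
    return w

def predict(w, x):
    return 1 if score(w, x) >= 0 else -1

if __name__ == "__main__":
    data = [(1, {1: 1.0}), (-1, {2: 1.0}), (1, {1: 2.0}), (-1, {2: 0.5})]
    print(sgd_logistic(data))
```

Each update costs time proportional to the number of non-zero features in one sample, which is why this style of training scales to large sparse datasets.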
Naïve Bayes and KNN • For dense datasets, use Weka. • For sparse datasets, implement them yourself. Very easy to implement. • Weka is able to load sparse datasets, but not sure about its speed and memory usage.
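As a rough indication of how little code a sparse implementation needs, here is a minimal KNN sketch over feature dictionaries. Cosine similarity and majority voting are choices made for this example, and the brute-force scan over all training samples is only suitable for small data.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(v * b.get(j, 0.0) for j, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(train, x, k=3):
    """train: list of (label, sparse_vector). Returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda s: cosine(s[1], x), reverse=True)[:k]
    votes = Counter(label for label, _ in neighbors)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    train = [(1, {1: 1.0, 2: 1.0}), (1, {1: 2.0}), (1, {1: 1.0, 3: 0.2}),
             (-1, {5: 1.0}), (-1, {5: 2.0, 6: 1.0})]
    print(knn_predict(train, {1: 1.0}, k=3))
```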
Empirical comparison of classifiers • Caruana et al.: An empirical evaluation of supervised learning in high dimensions, ICML’08. • Caruana et al.: An empirical comparison of supervised learning algorithms, ICML’06.
Learning a classification tool • Data Format • Parameters in the tool • Have to learn the mathematics behind the classifier. At least intuitively, if not rigorously. • Read its manual and sometimes source code!
DEMO: LibLinear • Step 1: Download the source code and binaries. Compile the source code if necessary. • Step 2: Read its README/tutorial, and usually the tutorial has a running example. Follow the example. • Step 3: Study its documentation and know about its data format and its parameters. • Step 4: Use it on your data set.
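Step 4 usually comes down to two command-line calls. The sketch below builds them for LibLinear's `train` and `predict` programs, assuming the compiled binaries sit in a directory called `liblinear`. The directory and all file names are placeholders for this example. The `-s` and `-c` options select the solver and the cost parameter, and `-b 1` asks `predict` for probability estimates, which LibLinear provides only for its logistic regression solvers.

```python
import os
import subprocess

def liblinear_commands(train_file, test_file, model_file, output_file,
                       solver=0, cost=1.0, bin_dir="liblinear"):
    """Build argument lists for LibLinear's train and predict programs.

    solver=0 is L2-regularized logistic regression, so '-b 1' is valid.
    bin_dir and the file names are placeholders for this sketch.
    """
    train_cmd = [os.path.join(bin_dir, "train"),
                 "-s", str(solver), "-c", str(cost),
                 train_file, model_file]
    predict_cmd = [os.path.join(bin_dir, "predict"),
                   "-b", "1", test_file, model_file, output_file]
    return train_cmd, predict_cmd

def run_commands(commands):
    """Run each command in order and stop if one of them fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Only print the commands; call run_commands(...) once the binaries exist.
    for cmd in liblinear_commands("train.txt", "test.txt", "model.txt", "pred.txt"):
        print(" ".join(cmd))
```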
Dataset format • ARFF: Weka • Libsvm sparse: SvmLight, LibSvm, LibLinear, Sgd, etc. • Dense vs Sparse
ARFF: Attribute-Relation File Format • Documentation: • http://www.cs.waikato.ac.nz/ml/weka/arff.html • ARFF also supports sparse format
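A small writer makes the layout of an ARFF file concrete: a header with `@relation` and `@attribute` lines, then `@data` followed by one comma-separated row per instance. The sketch assumes numeric features plus one nominal class attribute named `class`. The helper names are invented here. The sparse variant lists only non-zero entries as `{index value, ...}` with 0-based attribute indices.

```python
def to_arff(relation, attribute_names, rows, class_values):
    """rows: list of (feature_list, class_label). Returns dense ARFF text."""
    lines = ["@relation " + relation, ""]
    for name in attribute_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for features, label in rows:
        lines.append(",".join([str(v) for v in features] + [label]))
    return "\n".join(lines) + "\n"

def sparse_row(features, label):
    """One row in sparse ARFF form; the class is the last attribute (0-based index)."""
    pairs = [str(i) + " " + str(v) for i, v in enumerate(features) if v != 0]
    pairs.append(str(len(features)) + " " + label)  # always write the class explicitly
    return "{" + ", ".join(pairs) + "}"

if __name__ == "__main__":
    rows = [([1.0, 0.0], "pos"), ([0.0, 2.5], "neg")]
    print(to_arff("toy", ["f1", "f2"], rows, ["pos", "neg"]))
    print(sparse_row([0.0, 2.5], "neg"))
```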
LibSvm Sparse Format • Line format: Label followed by Index:Value pairs • Label: +1/-1 for binary classification; 1/2/3/4/etc. for multi-class.
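The line format is simple enough to read and write by hand. The sketch below converts a dense feature list to one line and parses a line back, assuming 1-based feature indices in ascending order with zero values omitted and integer class labels. The function names are made up for this example.

```python
def to_libsvm_line(label, features):
    """Dense feature list -> 'label index:value ...' with 1-based indices, zeros skipped."""
    pairs = ["%d:%g" % (i, v) for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

def parse_libsvm_line(line):
    """'label index:value ...' -> (int label, {index: value})."""
    parts = line.split()
    label = int(parts[0])  # also accepts '+1'
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

if __name__ == "__main__":
    line = to_libsvm_line(1, [0.0, 2.5, 0.0, 1.0])
    print(line)
    print(parse_libsvm_line(line))
```

Because only non-zero entries are stored, a sample with a handful of active features out of many thousands takes a single short line.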
Predicting a score (not a label) • Many classifiers support probability output: • Nearly all classifiers in Weka support probability output. • LibSVM/LibLinear support probability output. • SvmLight outputs a real value from –Inf to +Inf. • More details in: • Caruana et al.: An empirical comparison of supervised learning algorithms, ICML’06.
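When LibSVM or LibLinear is run with probability output enabled, the prediction file starts with a `labels ...` header line giving the column order, and each following line holds the predicted label and then one probability per class. The sketch below pulls out the positive-class probability as the score. It assumes the positive class is written as `1` in the header, and the function name is invented here.

```python
def positive_scores(output_text, positive_label="1"):
    """Extract the positive-class probability from a probability output file.

    Expected layout (header gives the column order of the classes):
        labels 1 -1
        1 0.9 0.1
        -1 0.2 0.8
    """
    lines = output_text.strip().splitlines()
    header = lines[0].split()
    if not header or header[0] != "labels":
        raise ValueError("missing 'labels' header; was probability output enabled?")
    # Column 0 of each data line is the predicted label, so shift by one.
    column = header[1:].index(positive_label) + 1
    return [float(line.split()[column]) for line in lines[1:]]

if __name__ == "__main__":
    example = "labels 1 -1\n1 0.9 0.1\n-1 0.2 0.8\n"
    print(positive_scores(example))
```

Reading the column order from the header matters because the classes are not always listed with the positive one first.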
DEMO: experiment.py • From raw data to a successful submission • Read the raw data and do preprocessing. • Transform the data to the input format of a classification tool (liblinear in our example). • Perform training and testing using the tool. • Wrap up the results and submit online.