
Classification tools COMP 4332 Tutorial 4 Feb 23 Yin Zhu yinz@cse.ust.hk

Fact Sheets: Classifier (KDD Cup 2009) • Classifier overall usage = 93%. • [Bar chart: percent of participants (axis 0 to 60) using each classifier type: decision tree..., linear classifier, non-linear kernel, other classifiers, neural network, naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network.] • About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss. • Less than 50% regularization (20% 2-norm, 10% 1-norm). • Only 13% unlabeled data.

Top 10 data mining algorithms • http://www.cs.uvm.edu/~icdm/algorithms/index.shtml • C4.5 • K-means • SVM • Apriori • EM • PageRank • AdaBoost • kNN • Naïve Bayes • CART (Classification and Regression Trees) Six are classification!!!

My view of classification • 1. Tree-based models • Trees and tree ensembles • 2. Linear family and its kernel extension • Least Squares Regression • Logistic Regression • SVM • 3. Others: Naïve Bayes, KNN

Tree-based models • Two most famous trees: C4.5 and CART • C4.5 has two solid implementations: • Ross Quinlan’s original C program, and its later successor C5.0. Source code available. • Weka’s J48 classifier. • CART: rpart package in R. • Tree ensembles: • AdaBoost • Bagging • RandomForest

Recommended tools for trees • For dense datasets, e.g. # of attributes < 1000, use Weka. • For sparse datasets, try FEST.

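Linear family and its kernel extension • Learning samples: $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \{-1, +1\}$ • Minimize classification loss: $\min_{w} \sum_{i=1}^{n} L(y_i, w^\top \phi(x_i))$ • $\phi(x)$: maps $x$ to its kernel space; for linear classification, $\phi(x) = x$ • $L$: loss function • Classification label: $\hat{y} = \mathrm{sign}(w^\top \phi(x))$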

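Loss functions (label $y \in \{-1, +1\}$, real-valued classifier score $f$) • Square loss: $L(y, f) = (y - f)^2$ • Logistic loss: $L(y, f) = \log(1 + e^{-y f})$ • Hinge loss: $L(y, f) = \max(0, 1 - y f)$ • More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf • First three lecture notes of Stanford CS 229, http://cs229.stanford.edu/materials.html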
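
The three losses named on this slide can be written down in a few lines. The sketch below is only illustrative: it assumes labels y in {-1, +1} and a real-valued score f, and the function names are made up for this example rather than taken from any of the tools in the tutorial.

```python
import math

def square_loss(y, f):
    """Squared difference between the label y (+1 or -1) and the score f."""
    return (y - f) ** 2

def logistic_loss(y, f):
    """log(1 + exp(-y*f)), arranged so exp() never sees a large positive argument."""
    m = y * f
    if m > 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

def hinge_loss(y, f):
    """Zero once the margin y*f reaches 1, linear penalty below that."""
    return max(0.0, 1.0 - y * f)
```

All three penalize a score whose sign disagrees with the label; they differ in how they treat scores that are already on the correct side.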

Recommended tools for linear family • LibSVM and SVM Light for kernel SVM • LibLinear for linear classifiers with different loss functions. • Online linear classification (very fast!): Stochastic Gradient Descent (Léon Bottou’s sgd and John Langford’s Vowpal Wabbit)
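
To give a feel for why online linear classification is fast, here is a minimal sketch of stochastic gradient descent on the logistic loss over sparse samples. It is not how sgd or Vowpal Wabbit are implemented; the function names, the fixed learning rate, the lack of regularization, and the dict-based sparse vectors are all assumptions chosen for brevity.

```python
import math
import random

def sgd_logistic(samples, epochs=20, lr=0.1, seed=0):
    """Fit weights w by SGD on log(1 + exp(-y * w.x)).

    samples: list of (x, y), x a dict {feature_index: value}, y in {-1, +1}.
    """
    rng = random.Random(seed)
    data = list(samples)
    w = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            score = sum(w.get(j, 0.0) * v for j, v in x.items())
            # Clamp the margin so exp() cannot overflow.
            margin = max(-500.0, min(500.0, y * score))
            # Derivative of the logistic loss with respect to the score.
            g = -y / (1.0 + math.exp(margin))
            # Only the non-zero features of this sample are touched.
            for j, v in x.items():
                w[j] = w.get(j, 0.0) - lr * g * v
    return w

def predict(w, x):
    """Return +1 or -1 from the sign of w.x."""
    score = sum(w.get(j, 0.0) * v for j, v in x.items())
    return 1 if score >= 0 else -1
```

Each update costs time proportional to the number of non-zero features in one sample, which is the main reason this family scales to large sparse datasets.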

Naïve Bayes and KNN • For dense datasets, use Weka. • For sparse datasets, implement them yourself; both are very easy to implement. • Weka can load sparse datasets, but its speed and memory usage on them are uncertain.
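
As an illustration of how little code a sparse implementation needs, the following is a rough kNN sketch over dict-based sparse vectors. Cosine similarity and majority voting are assumptions made for this example; the slide does not say which distance or voting rule to use, and the brute-force search is only suitable for small training sets.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(v * b.get(j, 0.0) for j, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, x, k=3):
    """Majority label among the k training samples most similar to x.

    train: list of (sparse_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda s: cosine(s[0], x), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```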

Empirical comparison of classifiers • Caruana et al.: An empirical evaluation of supervised learning in high dimensions. ICML’08. • Caruana et al.: An empirical comparison of supervised learning algorithms. ICML’06.

Learning a classification tool • Data Format • Parameters in the tool • Have to learn the mathematics behind the classifier. At least intuitively, if not rigorously. • Read its manual and sometimes source code!

DEMO: LibLinear • Step 1: Download the source code and binaries. Compile the source code if necessary. • Step 2: Read its README/tutorial, and usually the tutorial has a running example. Follow the example. • Step 3: Study its documentation and know about its data format and its parameters. • Step 4: Use it on your data set.

Dataset format • ARFF: Weka • LibSVM sparse format: SVM Light, LibSVM, LibLinear, sgd, etc. • Dense vs. sparse

ARFF: Attribute-Relation File Format • Documentation: • http://www.cs.waikato.ac.nz/ml/weka/arff.html • ARFF also supports sparse format
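
A small sketch of what an ARFF file looks like when generated from Python may help here. It assumes numeric attributes plus one nominal class attribute named `class`; the helper names are hypothetical, and the sparse row follows the `{index value, ...}` convention with 0-based attribute indices described in the Weka documentation linked above.

```python
def to_arff(relation, attribute_names, class_values, rows):
    """Return dense ARFF text. rows is a list of (values, label) pairs."""
    lines = ["@relation " + relation, ""]
    for name in attribute_names:
        lines.append("@attribute %s numeric" % name)
    lines.append("@attribute class {%s}" % ",".join(class_values))
    lines += ["", "@data"]
    for values, label in rows:
        lines.append(",".join([str(v) for v in values] + [label]))
    return "\n".join(lines) + "\n"

def sparse_row(values, label):
    """One sparse @data row: only non-zero attributes, then the class."""
    items = ["%d %s" % (i, v) for i, v in enumerate(values) if v != 0]
    items.append("%d %s" % (len(values), label))
    return "{" + ", ".join(items) + "}"
```

For example, `sparse_row([0, 2.5, 0], "pos")` yields `{1 2.5, 3 pos}`, where index 3 is the class attribute declared last in the header.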

LibSVM sparse format • Line format: label followed by index:value pairs • Label: +1/-1 for binary classification; 1, 2, 3, 4, etc. for multi-class.
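
The line format is simple enough to read and write by hand. The sketch below assumes integer class labels and 1-based feature indices written in ascending order with zero values omitted; `format_line` and `parse_line` are names invented for this example.

```python
def format_line(label, features):
    """Build one line: the label, then index:value pairs in ascending index order.

    features: dict {index: value}; entries with value 0 are left out.
    """
    pairs = ["%d:%g" % (j, v) for j, v in sorted(features.items()) if v != 0]
    return " ".join([str(label)] + pairs)

def parse_line(line):
    """Parse one line back into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])  # accepts "1", "+1", "-1", "3", ...
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features
```

For example, `format_line(1, {3: 0.5, 1: 2.0})` gives `1 1:2 3:0.5`, and `parse_line` turns that string back into the label and the feature dict.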

Predicting a score (not a label) • Many classifiers support probability output: • Nearly all classifiers in Weka support probability output. • LibSVM and LibLinear support probability output (option -b 1; in LibLinear, for logistic regression only). • SVM Light outputs a real value from -Inf to +Inf. • More details in: • Caruana et al.: An empirical comparison of supervised learning algorithms. ICML’06.
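
When a tool only returns an unbounded real value, one common way to turn it into a score in (0, 1) is to pass it through a sigmoid, in the spirit of Platt scaling. The sketch below is an assumption-laden illustration: the parameters `a` and `b` are placeholders that would normally be fitted on held-out data, and the defaults simply give the plain logistic function.

```python
import math

def score_to_probability(score, a=-1.0, b=0.0):
    """Map a real-valued score to (0, 1) as 1 / (1 + exp(a*score + b)).

    a and b are placeholder parameters, not values produced by any tool.
    """
    z = a * score + b
    if z >= 0:
        e = math.exp(-z)  # avoids overflow for large positive z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

Because the mapping is monotone for a fixed sign of `a`, it does not change how examples are ranked; it only rescales the scores.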

DEMO: experiment.py • From raw data to a successful submission • Read the raw data and do preprocessing • Transform the data to the input format of a classification tool (liblinear in our example). • Perform training and testing using the tool. • Wrap up the results and submit online.
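
The training and testing step could be driven from Python roughly as follows. This is not the tutorial's experiment.py, which is not reproduced here; it is a sketch that assumes the LibLinear `train` and `predict` binaries sit in the working directory, and every file name and helper name is hypothetical. It uses `-s 0` (L2-regularized logistic regression) so that `predict -b 1` can output probabilities.

```python
import subprocess

def liblinear_commands(train_file, test_file, model_file, output_file,
                       train_bin="./train", predict_bin="./predict", cost=1.0):
    """Build the two command lines; the binary paths are assumptions."""
    train_cmd = [train_bin, "-s", "0", "-c", str(cost), train_file, model_file]
    predict_cmd = [predict_bin, "-b", "1", test_file, model_file, output_file]
    return train_cmd, predict_cmd

def run_pipeline(train_file, test_file, model_file, output_file):
    """Train, then predict; raises CalledProcessError if either step fails."""
    for cmd in liblinear_commands(train_file, test_file, model_file, output_file):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print the commands and check them against the README before running anything on the full dataset.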
