
Presentation Transcript


  1. Classification tools COMP 4332 Tutorial 4 March 4 TAN, Ben btan@cse.ust.hk

  2. Fact Sheets: Classifier (KDDCUP 2009) • Classifier usage: 93% overall. [Chart: percent of participants (0–60) using each classifier — decision tree, linear classifier, non-linear kernel, other classifiers, neural network, Naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network.] • About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss. • Less than 50% regularization (20% 2-norm, 10% 1-norm). • Only 13% unlabeled data.

  3. Top 10 data mining algorithms • http://www.cs.uvm.edu/~icdm/algorithms/index.shtml • C4.5 • K-means • SVM • Apriori • EM • PageRank • AdaBoost • kNN • Naïve Bayes • CART (Classification and Regression Trees) Six of the ten are classification algorithms!

  4. My view of classification • 1. Tree-based models • Trees and tree ensembles • 2. Linear family and its kernel extension • Least Squares Regression • Logistic Regression • SVM • 3. Others: Naïve Bayes, KNN

  5. Tree-based models • The two most famous trees: C4.5 and CART • C4.5 has two solid implementations: • Ross Quinlan's original program (and its later successor, C5.0); source code available. • Weka's J48 classifier. • CART: the rpart package in R. • Tree ensembles: • AdaBoost • Bagging • RandomForest

  6. Recommended tools for trees • For dense datasets, e.g. # of attributes < 1000, use Weka. • For sparse datasets, • Try FEST.
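
A minimal sketch of driving Weka's J48 (its C4.5 implementation) from the command line via Python; weka.jar, train.arff, and test.arff are placeholder paths, not files from the tutorial:

    # Sketch: run Weka's J48 (C4.5) from Python via the command line.
    # Assumes weka.jar and the ARFF files are in the working directory.
    import subprocess

    cmd = [
        "java", "-cp", "weka.jar",
        "weka.classifiers.trees.J48",  # Weka's C4.5 implementation
        "-t", "train.arff",            # training set (ARFF format)
        "-T", "test.arff",             # test set
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)               # Weka prints the tree and error rates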

  7. Linear family and its kernel extension • Learning samples: $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \{-1, +1\}$ • Minimize classification loss: $\min_w \sum_{i=1}^{n} \ell\left(y_i \, w^\top \phi(x_i)\right)$ • $\phi$: maps $x$ to its kernel space; for linear classification, $\phi(x) = x$ • $\ell$: loss function • Classification label: $\hat{y} = \operatorname{sign}\left(w^\top \phi(x)\right)$

  8. Loss functions • Square loss: $\ell(y, f) = (1 - y f)^2$ • Logistic loss: $\ell(y, f) = \log(1 + e^{-y f})$ • Hinge loss: $\ell(y, f) = \max(0, 1 - y f)$, where $f = w^\top \phi(x)$ and $y \in \{-1, +1\}$ • More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf • First three lecture notes of Stanford CS229, http://cs229.stanford.edu/materials.html
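
For concreteness, a small sketch (assuming numpy) that evaluates the three losses above as functions of the margin m = y·f:

    # Sketch: the three loss functions from this slide, written in terms
    # of the margin m = y * f(x), per the formulas above.
    import numpy as np

    def square_loss(m):
        return (1.0 - m) ** 2

    def logistic_loss(m):
        return np.log(1.0 + np.exp(-m))

    def hinge_loss(m):
        return np.maximum(0.0, 1.0 - m)

    m = np.linspace(-2, 2, 5)       # a few sample margins
    for name, loss in [("square", square_loss),
                       ("logistic", logistic_loss),
                       ("hinge", hinge_loss)]:
        print(name, loss(m))        # all three penalize small/negative margins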

  9. Recommended tools for linear family • LibSVM and SVM Light for kernel SVM • LibLinear for linear classifiers with different loss functions • Online linear classification (very fast!): Stochastic Gradient Descent (Léon Bottou's sgd and John Langford's Vowpal Wabbit); a minimal sketch of the idea follows.
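
A minimal sketch of stochastic gradient descent on the logistic loss, not the actual code of sgd or Vowpal Wabbit; X and y are assumed to be a dense numpy matrix and a ±1 label vector:

    # Sketch of SGD for logistic-loss linear classification (the idea behind
    # tools like sgd and Vowpal Wabbit, not their actual code).
    # X: (n, d) numpy array; y: labels in {-1, +1}.
    import numpy as np

    def sgd_logistic(X, y, epochs=5, lr=0.1, lam=1e-4):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in np.random.permutation(n):
                margin = y[i] * X[i].dot(w)
                # gradient of log(1 + exp(-margin)) plus L2 regularization
                g = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
                w -= lr * g
        return w

    # Usage: w = sgd_logistic(X, y); predictions = np.sign(X_test.dot(w))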

  10. Naïve Bayes and KNN • For dense datasets, use Weka. • For sparse datasets, implement them yourself; both are very easy to implement (see the sketch below). • Weka can load sparse datasets, but its speed and memory usage on them are unclear.
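
As an illustration of how little code sparse KNN needs, a sketch where each sample is a dict {feature_index: value} (the libsvm-style sparse view); the plain dot-product similarity is an assumption, and a real run might use cosine similarity instead:

    # Sketch: k-nearest neighbors on sparse data, one dict per sample.
    from collections import Counter

    def dot(a, b):
        # iterate over the smaller dict for speed
        if len(a) > len(b):
            a, b = b, a
        return sum(v * b.get(i, 0.0) for i, v in a.items())

    def knn_predict(train_x, train_y, query, k=5):
        # rank training samples by similarity to the query
        ranked = sorted(range(len(train_x)),
                        key=lambda j: dot(train_x[j], query),
                        reverse=True)
        votes = Counter(train_y[j] for j in ranked[:k])
        return votes.most_common(1)[0][0]   # majority label among top k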

  11. Empirical comparison of classifiers • Caruana et al.: An empirical evaluation of supervised learning in high dimensions. ICML '08. (slides) • Caruana et al.: An empirical comparison of supervised learning algorithms. ICML '06. (slides, video)

  12. Learning a classification tool • Data format • Parameters in the tool • You have to learn the mathematics behind the classifier, at least intuitively if not rigorously. • Read its manual, and sometimes its source code!

  13. DEMO: LibLinear • Step 1: Download the source code and binaries. Compile the source code if necessary. • Step 2: Read its README/tutorial, and usually the tutorial has a running example. Follow the example. • Step 3: Study its documentation and know about its data format and its parameters. • Step 4: Use it on your data set.
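
Steps 2–4 for LibLinear might look like the following sketch, assuming the compiled train and predict binaries sit in the working directory and train.txt/test.txt are placeholder data files already in libsvm format:

    # Sketch of the LibLinear demo steps from Python.
    import subprocess

    # train: -s 0 selects L2-regularized logistic regression,
    # -c sets the cost parameter; the model is written to data.model
    subprocess.run(["./train", "-s", "0", "-c", "1",
                    "train.txt", "data.model"], check=True)

    # predict the test set; accuracy is printed, labels go to out.txt
    subprocess.run(["./predict", "test.txt", "data.model", "out.txt"],
                   check=True)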

  14. Dataset format • ARFF: Weka • Libsvm sparse: SvmLight, LibSvm, LibLinear, Sgd, etc. • Dense vs Sparse

  15. ARFF: Attribute-Relation File Format • Documentation: • http://www.cs.waikato.ac.nz/ml/weka/arff.html • ARFF also supports sparse format
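
As an illustration (a made-up dataset, not one from the tutorial), a minimal ARFF file, with the sparse notation shown in a comment:

    % Minimal illustrative ARFF file (hypothetical dataset)
    @relation toy
    @attribute width  numeric
    @attribute height numeric
    @attribute class  {pos, neg}
    @data
    1.2, 3.4, pos
    0.0, 2.0, neg
    % the first instance in sparse ARFF notation (zero-based index/value
    % pairs; omitted attributes default to 0):
    % {0 1.2, 1 3.4, 2 pos}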

  16. LibSvm Sparse Format • Line format: label index1:value1 index2:value2 ... • Label: +1/-1 for binary classification, 1/2/3/4/etc. for multi-class. • Indices are 1-based and given in ascending order; features with value 0 are omitted.
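
Two illustrative lines (made-up values; the second line shows a multi-class label of 2):

    1 1:0.5 3:1.2 10:0.8
    2 4:1 7:2.5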

  17. Predicting a score (not a label) • Many classifiers support probability output: • Nearly all classifiers in Weka support probability output. • LibSVM/LibLinear support probability output. • SvmLight outputs a real value from –Inf to +Inf. • More details in: Caruana et al.: An empirical comparison of supervised learning algorithms. ICML '06.
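
With LibLinear, probability output is requested with predict's -b 1 option (valid for logistic-regression models, e.g. those trained with -s 0); a sketch with placeholder file names:

    # Sketch: probability scores instead of hard labels from LibLinear.
    import subprocess

    subprocess.run(["./predict", "-b", "1",   # -b 1 => probability estimates
                    "test.txt", "data.model", "scores.txt"], check=True)
    # scores.txt lists, per line, the predicted label followed by the
    # per-class probabilities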

  18. DEMO: experiment.py, from raw data to a successful submission • Read the raw data and do preprocessing. • Transform the data into the input format of a classification tool (liblinear in our example). • Perform training and testing using the tool. • Wrap up the results and submit online. (A pipeline sketch follows.)
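
A condensed sketch of such a pipeline, not the actual experiment.py; the file names, the CSV layout, and the absence of parameter tuning are all assumptions:

    # Sketch of the four pipeline steps (placeholder file names throughout).
    import subprocess

    def load(path):
        # 1. read raw data and do (trivial) preprocessing;
        #    assumes CSV lines of the form "label,feature1,feature2,..."
        with open(path) as f:
            rows = [line.strip().split(",") for line in f if line.strip()]
        return [(int(r[0]), [float(v) for v in r[1:]]) for r in rows]

    def write_libsvm(samples, path):
        # 2. transform into liblinear's sparse input format
        with open(path, "w") as f:
            for label, feats in samples:
                pairs = ["%d:%g" % (i + 1, v)
                         for i, v in enumerate(feats) if v != 0]
                f.write("%d %s\n" % (label, " ".join(pairs)))

    write_libsvm(load("raw_train.csv"), "train.txt")
    write_libsvm(load("raw_test.csv"), "test.txt")

    # 3. train and test with liblinear's command-line tools
    subprocess.run(["./train", "train.txt", "data.model"], check=True)
    subprocess.run(["./predict", "test.txt", "data.model", "pred.txt"],
                   check=True)

    # 4. wrap pred.txt into the required submission format and upload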
