Classification I Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn
Overview • K-Nearest Neighbor Algorithm • Naïve Bayes Classifier [Portrait: Thomas Bayes]
Definition • Classification is one of the fundamental skills for survival: food vs. predator. • It is a kind of supervised learning: techniques for inferring a function from <Input, Output> data. • Input: a vector of features • Output: a Boolean value (binary classification) or an integer (multiclass) • "Supervised" means a teacher or oracle is needed to label each data sample. • We will talk about unsupervised learning later.
Classifiers [Figure: children (Peter, Sam, Jack, Jane, Tom, Lisa, Helen, Mary) plotted by Height and Weight; a classifier Z = f(x, y) maps each (height, weight) point to a label in {boy, girl}]
Training a Classifier [Figure: labeled training data pass through a learning process to produce the classifier]
Lazy Learners [Figure: sample images labeled Truck and Car]
K-Nearest Neighbor Algorithm • The algorithm procedure: • Given a set of n training samples of the form <x, y>. • Given an unknown sample x′. • Calculate the distance d(x′, xi) for i = 1 … n. • Select the K samples with the shortest distances. • Assign x′ the label that dominates the K samples. • It is the simplest classifier you will ever meet (I mean it!). • No training (literally): a memory of the training data is maintained, and all computation is deferred until classification. • Produces satisfactory results in many cases; you should give it a go whenever possible.
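A minimal sketch of this procedure in Python (a brute-force version with Euclidean distance and majority voting; all names are illustrative, not from the lecture):

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        # train: list of (feature_vector, label) pairs; query: unlabeled vector
        # Sort training samples by Euclidean distance to the query
        nearest = sorted(train, key=lambda xy: math.dist(xy[0], query))
        # Majority vote among the labels of the K closest samples
        top_k = [label for _, label in nearest[:k]]
        return Counter(top_k).most_common(1)[0][0]

    train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
    print(knn_classify(train, (2, 1), k=3))  # -> "A"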
Properties of KNN • Instance-based learning • No explicit description of the target function • Can handle complicated situations
Properties of KNN [Figure: the same query point assigned different labels by a K=1 neighborhood and a K=7 neighborhood] • Results are dependent on the data distribution. • Can make mistakes at boundaries.
Challenges of KNN • The value of K • Non-monotonic impact on accuracy [Plot: accuracy as a non-monotonic function of K] • Too big vs. too small • Rules of thumb • Weights • Different features may have different impact … • Distance • There are many different ways to measure the distance: Euclidean, Manhattan, … • Complexity • Need to calculate the distance between x′ and all training data. • Proportional to the size of the training data.
Distance Metrics The shortest path between two points …
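For reference, the standard definitions of the two metrics named on the previous slide are:

    d_Euclidean(x, y) = √( Σi (xi − yi)² )
    d_Manhattan(x, y) = Σi |xi − yi|

Both are special cases of the Minkowski metric d_p(x, y) = ( Σi |xi − yi|^p )^(1/p), with p = 2 and p = 1 respectively.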
Mahalanobis Distance • Distance from a point x to a point set with mean μ and covariance matrix S: d(x) = √( (x − μ)ᵀ S⁻¹ (x − μ) )
Mahalanobis Distance • For identity matrix S: it reduces to the Euclidean distance, d(x, y) = √( Σi (xi − yi)² ). • For diagonal matrix S: it reduces to the normalized Euclidean distance, d(x, y) = √( Σi (xi − yi)² / si² ), where si² is the variance along dimension i.
Structured Data [Figure: plot with axes scaled from 0 to 1]
KD-Tree Point Set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}
KD-Tree

    from collections import namedtuple

    # A kd-tree node: the median point plus left/right subtrees
    Node = namedtuple("Node", ["location", "left_child", "right_child"])

    def kdtree(point_list, depth=0):
        if not point_list:
            return None
        # Select axis based on depth so that axis cycles through all valid values
        k = len(point_list[0])
        axis = depth % k
        # Sort point list and choose median as pivot element
        point_list = sorted(point_list, key=lambda p: p[axis])
        median = len(point_list) // 2
        # Create node and construct subtrees
        return Node(location=point_list[median],
                    left_child=kdtree(point_list[:median], depth + 1),
                    right_child=kdtree(point_list[median + 1:], depth + 1))

    tree = kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
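Not shown on the slide: a sketch of the matching nearest-neighbor query, using the standard recursion that searches the near subtree first and prunes the far subtree whenever the splitting plane lies farther away than the best match found so far:

    import math

    def nn_search(node, query, depth=0, best=None):
        if node is None:
            return best
        # Keep the closer of the current best and this node's point
        if best is None or math.dist(node.location, query) < math.dist(best, query):
            best = node.location
        axis = depth % len(query)
        diff = query[axis] - node.location[axis]
        # Descend into the subtree on the query's side of the splitting plane first
        near, far = ((node.left_child, node.right_child) if diff < 0
                     else (node.right_child, node.left_child))
        best = nn_search(near, query, depth + 1, best)
        # The far subtree can only help if the plane is closer than the best match
        if abs(diff) < math.dist(best, query):
            best = nn_search(far, query, depth + 1, best)
        return best

    print(nn_search(tree, (9, 2)))  # -> (8, 1)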
Evaluation • Accuracy • Recall what we have learned in the first lecture … • Confusion Matrix • ROC Curve • Training Set vs. Test Set • N-fold Cross Validation
LOOCV • Leave-One-Out Cross Validation • An extreme case of N-fold cross validation: N = number of available samples • Usually very time-consuming, but okay for KNN • Now, let's try KNN+LOOCV … • Each student in this class is assigned a label: • Gender: Male vs. Female • Major: CS vs. EE vs. Automation
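A minimal LOOCV loop in the same vein, reusing the knn_classify sketch from above:

    def loocv_accuracy(data, k=3):
        # Hold out each sample in turn and classify it with the rest as training set
        correct = 0
        for i, (x, y) in enumerate(data):
            rest = data[:i] + data[i + 1:]
            correct += (knn_classify(rest, x, k) == y)
        return correct / len(data)  # fraction of held-out samples labeled correctly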
Bayes Theorem [Figure: Venn diagram of overlapping events A and B] P(A∩B) = P(A|B)P(B) = P(B|A)P(A), which gives Bayes Theorem: P(A|B) = P(B|A)P(A) / P(B)
Fish Example • Salmon vs. Tuna • If P(ω1) = P(ω2): the priors give no preference between the classes. • If P(ω1) > P(ω2): with nothing else to go on, decide the class with the larger prior. • Additional information (e.g., measured features) refines the decision via the likelihoods.
Shooting Example • Probability of a kill: • P(A): 0.6 • P(B): 0.5 • A and B each fire one shot, and the target is killed. • What is the probability that it was shot down by A? • C: the event that the target is killed.
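A worked solution (assuming the two shots hit independently): P(C) = P(A∪B) = P(A) + P(B) − P(A)P(B) = 0.6 + 0.5 − 0.3 = 0.8. Since a hit by A kills the target, A ⊆ C, so P(A|C) = P(A∩C)/P(C) = P(A)/P(C) = 0.6/0.8 = 0.75.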
Cancer Example • ω1: Cancer; ω2: Normal • P(ω1)=0.008; P(ω2)=0.992 • Lab test outcomes: + vs. − • P(+|ω1)=0.98; P(−|ω1)=0.02 • P(+|ω2)=0.03; P(−|ω2)=0.97 • Now someone has a positive test result… • Is he/she doomed?
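Applying Bayes theorem: P(ω1|+) = P(+|ω1)P(ω1) / [P(+|ω1)P(ω1) + P(+|ω2)P(ω2)] = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992) = 0.00784 / 0.0376 ≈ 0.21. Despite the positive test, the probability of cancer is only about 21%: the small prior P(ω1) dominates the strong likelihood.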
Headache & Flu Example • H=“Having a headache” • F=“Coming down with flu” • P(H)=1/10; P(F)=1/40; P(H|F)=1/2 • What does this mean? • One day you wake up with a headache … • Since 50% of flus are associated with headaches … • I must have a 50-50 chance of coming down with flu!
Headache & Flu Example The truth is … [Figure: Venn diagram in which Flu is a small region overlapping part of Headache]
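The correct calculation: P(F|H) = P(H|F)P(F) / P(H) = (1/2 × 1/40) / (1/10) = 1/8. Given a headache, the chance of flu is 12.5%, not 50%: the likelihood P(H|F) must be weighted by the prior P(F).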
Naïve Bayes Classifier • MAP: Maximum A Posteriori classification, ωMAP = argmax over ωi of P(ωi | a1, …, an) = argmax over ωi of P(a1, …, an | ωi) P(ωi) • Naïve assumption: the attributes are conditionally independent given the class, P(a1, …, an | ωi) = Πj P(aj | ωi) • Hence ωNB = argmax over ωi of P(ωi) Πj P(aj | ωi)
Independence • X and Y are independent: P(X, Y) = P(X)P(Y) • X and Y are conditionally independent given Z: P(X, Y | Z) = P(X | Z)P(Y | Z), equivalently P(X | Y, Z) = P(X | Z)
Independent ≠ Uncorrelated • Example: X uniform on {−1, 0, 1} and Y = X². Then Cov(X, Y) = E[XY] − E[X]E[Y] = E[X³] − 0 = 0, so X and Y are uncorrelated. • However, Y is completely determined by X: uncorrelated does not imply independent.
Estimating P(aj|ωi) • Laplace smoothing: P(aj|ωi) = (nc + 1) / (n + |values of aj|), where n is the number of training samples of class ωi and nc the number of those with attribute value aj; the added counts keep rare values from receiving zero probability. • How about continuous variables? A common choice is to fit a Gaussian to P(aj|ωi) for each class.
Text Classification Example Interesting? Boring? Politics? Entertainment? Sports?
Text Representation We need to estimate probabilities such as P(ai = wk | ωj): the probability that word position i holds word wk, given class ωj. However, there are 2 × n × |Vocabulary| terms in total. For n = 100 word positions and a vocabulary of 50,000 distinct words, it adds up to 10 million terms!
Text Representation • By only considering the probability of encountering a specific word, instead of the specific word position, we can reduce the number of probabilities to be estimated. • We only count the frequency of each word: now 2 × 50,000 = 100,000 terms need to be estimated. • n: the total number of word positions in all training samples whose target value is ωi. • nk: the number of times word Vk is found among these n positions. • Laplace-smoothed estimate: P(Vk | ωi) = (nk + 1) / (n + |Vocabulary|), as sketched below.
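A minimal sketch of the resulting bag-of-words classifier (illustrative names, not from the lecture; Laplace-smoothed estimates as above, with unseen words simply skipped at classification time):

    import math
    from collections import Counter

    def train_nb(docs, labels):
        # docs: list of token lists; labels: one class label per document
        vocab = {w for doc in docs for w in doc}
        priors, likelihoods = {}, {}
        for c in set(labels):
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            priors[c] = len(class_docs) / len(docs)
            counts = Counter(w for d in class_docs for w in d)
            n = sum(counts.values())  # total word positions in class c
            # Laplace-smoothed estimate: (n_k + 1) / (n + |Vocabulary|)
            likelihoods[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
        return priors, likelihoods

    def classify_nb(doc, priors, likelihoods):
        # MAP class: argmax over classes of log P(c) + Σ log P(w | c)
        def score(c):
            p = likelihoods[c]
            return math.log(priors[c]) + sum(math.log(p[w]) for w in doc if w in p)
        return max(priors, key=score)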
Case Study: Newsgroups • Classification • Joachims, 1996 • 20 newsgroups, 20,000 documents • Random guess: 5% accuracy; Naïve Bayes: 89% • Recommendation • Lang, 1995 • NewsWeeder: user-rated articles, interesting vs. uninteresting • Among the top 10% selected articles, 59% were rated interesting, vs. 16% overall.
Reading Materials • C. C. Aggarwal, A. Hinneburg and D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," Proceedings of the 8th International Conference on Database Theory, LNCS 1973, pp. 420-434, London, UK, 2001. • J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, 3(3):209-226, 1977. • S. M. Omohundro, "Bumptrees for Efficient Function, Constraint, and Classification Learning," Advances in Neural Information Processing Systems 3, pp. 693-699, Morgan Kaufmann, 1991. • Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill. • Additional reading about the Naïve Bayes Classifier: http://www-2.cs.cmu.edu/~tom/NewChapters.html • Software for text classification using the Naïve Bayes Classifier: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Review • What is classification? • What is supervised learning? • What does KNN stand for? • What are the major challenges of KNN? • How to accelerate KNN? • What is N-fold cross validation? • What does LOOCV stand for? • What is Bayes Theorem? • What is the key assumption in Naïve Bayes Classifiers?
Next Week’s Class Talk • Volunteers are required for next week’s class talk. • Topic 1: Efficient KNN Implementations • Hints: • Ball Trees • Metric Trees • R Trees • Topic 2: Bayesian Belief Networks • Length: 20 minutes plus question time