110 likes | 179 Views
CS 590M Fall 2001: Security Issues in Data Mining. Lecture 3: Classification. What is Classification?. Problem: assign items to pre-defined classes Sample Y = Y 1 … Y n Set of classes X Given Y, choose C that contains Y How do we know how to do this?
E N D
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification
What is Classification? • Problem: assign items to pre-defined classes • Sample Y = Y1 … Yn • Set of classes X • Given Y, choose C that contains Y • How do we know how to do this? • Training data: set of items for which proper Xi is known.
Issues • Classification accuracy • False positives, False negatives • No clear “best” metric • Computation cost • Training • Classification
Approaches: • Naïve Bayes • K-Nearest Neighbor • Decision rules/Decision trees • Neural Networks
Naïve Bayes: History Bayes classifier: From Probability Theory • Idea: A-posteriori probability of class given all inputs is best possible classifier • Problem: doesn’t generalize. • Solution: Bayesian Belief network Y2 Y1 Y4 Y3 P(Xi|Y) = P(Y4|Y2,Y3)P(Y2|Y1)P(Y3|Y1)P(Y1)
Problems with Bayesian Belief Network • What should the network structure be? • Some work in how to learn the structure • Getting it wrong results in over-specificity • What are the probabilities? • Learning techniques exist here • Computational cost to learn network
Naïve Bayes • Two-layer Bayes network • No need to learn structure • Assumes inputs independent • Learn the probabilities that work best on training data Y1 Y2 Y3 P(X|Y1...Yn) = P(X)*Πi P(Yi|X) X
K-Nearest Neighbor • Idea: Choose “closest” training item • Class of test is same as class of closest training item • Need to define distance • What if this is a bad match? • Find K closest items • Use most common class in those K
KNN: Advantages • As training set → ∞, K → ∞, result approaches optimal • View as “best probability over all samples”: this is Bayes theorem • Training simple • Just put training set into a data structure
KNN: Problems • With small K, only captures convex classes • High dimensionality: may be “nearest” in irrelevant attributes • Query time: Search all training data • Algorithms to make this faster • But good enough to be “standard” for comparison
Classification and Security • Ideas on how to use classifiers to improve security • Intrusion Detection • ? • Potential risks • Identifying private information based on similarity with training data