
CS 590M Fall 2001: Security Issues in Data Mining


Presentation Transcript


  1. CS 590M Fall 2001: Security Issues in Data Mining
  Lecture 3: Classification

  2. What is Classification?
  • Problem: assign items to pre-defined classes
  • Sample Y = Y1 … Yn (a vector of attribute values)
  • Set of classes X = {X1, X2, …}
  • Given Y, choose the class Xi that contains Y
  • How do we know how to do this?
    • Training data: a set of items for which the proper class Xi is known
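
To make the notation concrete, here is a minimal Python sketch of the setup; the attribute values and class names are made up for illustration, and the classifiers on the later slides all fit this interface.

    # Each training item pairs a sample Y = (Y1, ..., Yn) of attribute
    # values with its known class Xi (values and labels are illustrative).
    training_data = [
        ((5.1, 3.5), "class_a"),
        ((4.9, 3.0), "class_a"),
        ((6.7, 3.1), "class_b"),
        ((6.3, 2.9), "class_b"),
    ]

    def classify(y, training_data):
        """Assign sample y to one of the pre-defined classes."""
        raise NotImplementedError  # concrete methods appear on later slides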

  3. Issues
  • Classification accuracy
    • False positives, false negatives
    • No clear “best” metric
  • Computation cost
    • Training
    • Classification
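
As a sketch of how these accuracy metrics are computed, treating one class as the “positive” class (the function and variable names here are mine, not from the lecture):

    def evaluate(predicted, actual, positive):
        """Accuracy plus false positive/negative counts for one class.

        False positive: predicted the positive class, true class differs.
        False negative: true class is positive, prediction differs.
        """
        pairs = list(zip(predicted, actual))
        accuracy = sum(1 for p, a in pairs if p == a) / len(pairs)
        false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
        false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
        return accuracy, false_pos, false_neg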

  4. Approaches
  • Naïve Bayes
  • K-Nearest Neighbor
  • Decision rules / decision trees
  • Neural networks

  5. Naïve Bayes: History
  • Bayes classifier: from probability theory
  • Idea: the a-posteriori probability of the class given all inputs is the best possible classifier
  • Problem: doesn’t generalize
  • Solution: Bayesian belief network
  • Example network with edges Y1 → Y2, Y1 → Y3, (Y2, Y3) → Y4, whose joint distribution factorizes as
    P(Y1, Y2, Y3, Y4) = P(Y4|Y2,Y3) P(Y2|Y1) P(Y3|Y1) P(Y1)
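
A minimal sketch of evaluating that factorization for the example network; the conditional probability tables below are made up purely for illustration:

    # Made-up conditional probability tables (CPTs); variables are boolean.
    P_Y1 = {True: 0.3, False: 0.7}                   # P(Y1)
    P_Y2 = {True: 0.9, False: 0.2}                   # P(Y2=True | Y1)
    P_Y3 = {True: 0.6, False: 0.1}                   # P(Y3=True | Y1)
    P_Y4 = {(True, True): 0.95, (True, False): 0.5,  # P(Y4=True | Y2, Y3)
            (False, True): 0.4, (False, False): 0.05}

    def joint(y1, y2, y3, y4):
        """P(Y1,Y2,Y3,Y4) = P(Y4|Y2,Y3) * P(Y2|Y1) * P(Y3|Y1) * P(Y1)."""
        def bern(p_true, value):  # P(var = value) given P(var = True)
            return p_true if value else 1.0 - p_true
        return (bern(P_Y4[(y2, y3)], y4) * bern(P_Y2[y1], y2)
                * bern(P_Y3[y1], y3) * P_Y1[y1])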

  6. Problems with Bayesian Belief Networks
  • What should the network structure be?
    • Some work exists on how to learn the structure
    • Getting it wrong results in over-specificity
  • What are the probabilities?
    • Learning techniques exist here
  • Computational cost to learn the network

  7. Naïve Bayes
  • Two-layer Bayes network: a class node X with one child per input Yi
  • No need to learn structure
  • Assumes inputs are independent given the class
  • Learn the probabilities that work best on the training data
  • P(X|Y1…Yn) ∝ P(X) · Πi P(Yi|X)
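
A minimal Naïve Bayes sketch over discrete attributes, estimating P(X) and P(Yi|X) by counting on the training data; smoothing and log-probabilities, which a real implementation would want, are omitted for brevity:

    from collections import Counter

    def train_naive_bayes(training_data):
        """Estimate P(X) and P(Yi|X) by counting over the training set."""
        class_counts = Counter(x for _, x in training_data)
        attr_counts = Counter()  # (i, value, x) -> # of class-x items with Yi = value
        for y, x in training_data:
            for i, value in enumerate(y):
                attr_counts[(i, value, x)] += 1
        return class_counts, attr_counts

    def classify_nb(y, class_counts, attr_counts):
        """Choose the class X maximizing P(X) * prod_i P(Yi|X)."""
        total = sum(class_counts.values())
        def score(x):
            s = class_counts[x] / total                            # P(X)
            for i, value in enumerate(y):
                s *= attr_counts[(i, value, x)] / class_counts[x]  # P(Yi|X)
            return s
        return max(class_counts, key=score)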

  8. K-Nearest Neighbor
  • Idea: choose the “closest” training item
    • Class of the test item is the class of the closest training item
    • Need to define a distance
  • What if this is a bad match?
    • Find the K closest items
    • Use the most common class among those K
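
A minimal sketch of the vote, using Euclidean distance over numeric attributes; any distance suited to the data could be substituted:

    import math
    from collections import Counter

    def knn_classify(y, training_data, k=3):
        """Classify y by majority vote among the k closest training items."""
        def distance(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        nearest = sorted(training_data, key=lambda item: distance(y, item[0]))[:k]
        votes = Counter(x for _, x in nearest)
        return votes.most_common(1)[0][0]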

  9. KNN: Advantages
  • As the training set size and K both grow to ∞ (with K growing more slowly than the training set), the result approaches optimal
    • View as estimating the “best probability over all samples”: this is the Bayes classifier from slide 5
  • Training is simple
    • Just put the training set into a data structure

  10. KNN: Problems
  • With small K, only captures convex classes
  • High dimensionality: may be “nearest” in irrelevant attributes
  • Query time: must search all training data
    • Algorithms exist to make this faster (see the sketch below)
  • But good enough to be the “standard” for comparison
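
One standard speed-up is a spatial index such as a k-d tree; the lecture doesn’t name a specific structure, so the sketch below, using SciPy’s cKDTree over the training_data from the earlier sketch, is just one possible choice. Note that k-d trees themselves degrade as dimensionality grows, which ties back to the high-dimensionality problem above.

    from collections import Counter
    import numpy as np
    from scipy.spatial import cKDTree

    points = np.array([y for y, _ in training_data])
    labels = [x for _, x in training_data]
    tree = cKDTree(points)  # built once, at training time

    def knn_classify_fast(y, k=3):
        """Same majority vote, but the k nearest neighbors come from the
        k-d tree instead of a linear scan of all training data."""
        _, idx = tree.query(y, k=k)
        votes = Counter(labels[i] for i in np.atleast_1d(idx))
        return votes.most_common(1)[0][0]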

  11. Classification and Security
  • Ideas on how to use classifiers to improve security
    • Intrusion detection
    • ?
  • Potential risks
    • Identifying private information based on similarity with training data
