Classification in Data Mining: Techniques and Applications

CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification

What is Classification? • Problem: assign items to pre-defined classes • Sample Y = Y1 … Yn • Set of classes X • Given Y, choose C that contains Y • How do we know how to do this? • Training data: set of items for which proper Xi is known.

Issues • Classification accuracy • False positives, False negatives • No clear “best” metric • Computation cost • Training • Classification

Approaches: • Naïve Bayes • K-Nearest Neighbor • Decision rules/Decision trees • Neural Networks

Naïve Bayes: History Bayes classifier: From Probability Theory • Idea: A-posteriori probability of class given all inputs is best possible classifier • Problem: doesn’t generalize. • Solution: Bayesian Belief network Y2 Y1 Y4 Y3 P(Xi|Y) = P(Y4|Y2,Y3)P(Y2|Y1)P(Y3|Y1)P(Y1)

Problems with Bayesian Belief Network • What should the network structure be? • Some work in how to learn the structure • Getting it wrong results in over-specificity • What are the probabilities? • Learning techniques exist here • Computational cost to learn network

Naïve Bayes • Two-layer Bayes network • No need to learn structure • Assumes inputs independent • Learn the probabilities that work best on training data Y1 Y2 Y3 P(X|Y1...Yn) = P(X)*Πi P(Yi|X) X

K-Nearest Neighbor • Idea: Choose “closest” training item • Class of test is same as class of closest training item • Need to define distance • What if this is a bad match? • Find K closest items • Use most common class in those K

KNN: Advantages • As training set → ∞, K → ∞, result approaches optimal • View as “best probability over all samples”: this is Bayes theorem • Training simple • Just put training set into a data structure

KNN: Problems • With small K, only captures convex classes • High dimensionality: may be “nearest” in irrelevant attributes • Query time: Search all training data • Algorithms to make this faster • But good enough to be “standard” for comparison

Classification and Security • Ideas on how to use classifiers to improve security • Intrusion Detection • ? • Potential risks • Identifying private information based on similarity with training data

Classification in Data Mining: Techniques and Applications

Classification in Data Mining: Techniques and Applications

Presentation Transcript

CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Data Mining: Preprocessing Techniques

Chapter 3: Data Mining and Data Visualization

Mining data with PolyAnalyst

Data Mining on Streams

DATA MINING LECTURE 4

Web Mining

CSE 538 Web Search and Mining Web Crawling

CS490D: Introduction to Data Mining Prof. Walid Aref

What we have covered?

MMDSS 2007 Data stream management and mining

Networks and Security

CSE 331: Introduction to Networks and Security

15-826: Multimedia Databases and Data Mining

Data Mining with Big Data

Data Mining: Classification and Prediction

Spatial Data Mining

Data Mining: Concepts and Techniques