I : Introduction to Data Mining

I: Introduction to Data Mining • Short Preview • Initial Definition of Data Mining • Motivation for Data Mining • Examples of Data Mining Tasks • More detailed Survey on Data Mining • Course Information

Teaching Plan for the Next 5 Weeks • Introduction to Data Mining and Course Information • Preprocessing (Han Chapter 3) • Concept Characterization (Han Chapter 5) • Classification Techniques (multiple soursce)

Knowledge Discovery in Data [and Data Mining] (KDD) • Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad) • Frequently, the term data mining is used to refer to KDD. • Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html) • Field is more dominated by industry than by research institutions Let us find something interesting!

Why Mine Data? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data, e-commerce • purchases at department/grocery stores • Bank/Credit Card transactions • Computers have become cheaper and more powerful ( machine learning techniques become applicable) • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • telescopes scanning the skies • microarrays generating gene expression data • scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data • Data mining may help scientists • in classifying and segmenting data • in Hypothesis Formation

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all The Data Gap Total new disk (TB) since 1995 Number of analysts

Data Mining Tasks • Prediction Methods • Use some variables to predict unknown or future values of other variables. • Description Methods • Find human-interpretable patterns that describe the data.

Test Set Model Classification Example categorical categorical continuous class Learn Classifier Training Set

Classifying Galaxies Courtesy: http://aps.umn.edu • Attributes: • Image features, • Characteristics of light waves received, etc. Early • Class: • Stages of Formation Intermediate Late • Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB

What is Clustering? • Given a set of objects, each having a set of attributes, and a similarity measure among them, find clusters such that • Objects in one cluster are more similar to one another. • Objects in separate clusters are less similar to one another. • Similarity Measures: • Euclidean Distance if attributes are continuous. • Other Problem-specific Measures.

Clustering of S&P 500 Stock Data • Observe Stock Movements every day. • Clustering points: Stock-{UP/DOWN} • Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day. • We used association rules to quantify a similarity measure.

Association Rule Discovery: Definition • Given a set of records each of which contain some number of items from a given collection; • Produce dependency rules which will predict occurrence of an item based on occurrences of other items. Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}

(A B) (C) (D E) (A B) (C) (D E) <= xg >ng <= ws <= ms Sequential Pattern Discovery: Definition • Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. • Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

I : Introduction to Data Mining

I : Introduction to Data Mining

Presentation Transcript

Data Mining: Introduction

Data Mining

Data Mining

Data Mining

Data Mining: An Introduction

Data Mining and Medical Informatics

Data Mining

CHAPTER 17: DATA MINING BASICS

Introduction to data mining

CHAPTER 17: DATA MINING BASICS

Data Mining Introduction

Data Mining with DB

Spatial and Temporal Data Mining

Data Mining: Extracting Knowledge from Past Data

Data Mining Course

Data Mining: Introduction

Chapter 1 Introduction to Data Mining

CIT 858: Data Mining and Data Warehousing

CS499/699-10 Data Mining