1. Data Mining (or KDD)
Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
Let us find something interesting!
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of data
  • GIS
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  • in classifying and segmenting data
  • in hypothesis formation
2.1 Supervised Clustering (Ch. Eick)
[Figure: three scatter plots over Attribute1 and Attribute2 showing class 1, class 2, and unclassified objects, with clusters labeled A through L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]
Applications of Supervised Clustering Include:
• Learning Subclasses
• Region Discovery in Spatial Datasets
• Distance Function Learning
• Data Set Compression (reduce the size of a dataset by using cluster representatives)
• Adaptive Supervised Clustering
Example: Finding Subclasses
[Figure: scatter plot over Attribute1 and Attribute2 of examples from the classes Ford and GMC, with subclass clusters labeled Ford Trucks, GMC Trucks, Ford Vans, GMC Van, Ford SUV, and GMC SUV]
SC Algorithms Investigated
• Representative-based Clustering Algorithms
  • Supervised Partitioning Around Medoids (SPAM)
  • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
  • Supervised Clustering using Evolutionary Computing (SCEC)
• Agglomerative Hierarchical Supervised Clustering (AHSC)
• Grid-Based Supervised Clustering (GRIDSC)
  • Naïve approach
  • Hierarchical Grid-based Clustering relying on data cubes
  • Grid-based Clustering relying on density estimation techniques
2.2 Spatial Data Mining (SPDM)
• SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.
• Spatial patterns
  • Spatial outliers, discontinuities
    • bad traffic sensors on highways
  • Location prediction models
    • model to identify the habitat of endangered species
  • Spatial clusters
    • crime hot-spots, poverty clusters
  • Co-location patterns
    • identify arsenic risk zones in Texas and determine whether the arsenic concentrations of the major Texas aquifers correlate with cultural factors such as population and farm density, the geology of the aquifers, etc.
Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
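The idea of swapping the fitness function can be sketched in a few lines. This is a minimal illustration, not one of the algorithms named on these slides: `hill_climb` is a simple greedy search that toggles one representative at a time, and both fitness functions (`purity_fitness`, `hotspot_fitness`), the penalty weight `beta`, and the `threshold` parameter are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def clusters_for(points, rep_indices):
    """Group point indices by their nearest representative (ties go to the lower index)."""
    reps = sorted(rep_indices)
    groups = {r: [] for r in reps}
    for idx, p in enumerate(points):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        groups[nearest].append(idx)
    return [g for g in groups.values() if g]


def purity_fitness(labels, beta=0.1):
    """Assumed fitness for supervised clustering: class purity minus a per-cluster penalty."""
    n = len(labels)

    def fitness(clusters):
        majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters)
        return majority / n - beta * len(clusters)

    return fitness


def hotspot_fitness(values, threshold, beta=0.1):
    """Assumed interestingness measure: reward clusters whose mean value exceeds a threshold."""
    n = len(values)

    def fitness(clusters):
        reward = 0.0
        for c in clusters:
            mean = sum(values[i] for i in c) / len(c)
            reward += len(c) * max(0.0, mean - threshold)
        return reward / n - beta * len(clusters)

    return fitness


def hill_climb(points, fitness, start):
    """Greedy search: add or remove one representative whenever that raises the fitness."""
    current = set(start)
    best = fitness(clusters_for(points, current))
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            candidate = current ^ {i}  # toggle point i as a representative
            if not candidate:
                continue
            score = fitness(clusters_for(points, candidate))
            if score > best:
                current, best, improved = candidate, score, True
    return sorted(current), best


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
# Same search, two different notions of what makes a clustering good.
print(hill_climb(points, purity_fitness(["a", "a", "b", "b"]), start=[0]))
print(hill_climb(points, hotspot_fitness([1.0, 1.0, 9.0, 9.0], threshold=5.0), start=[0]))
```

The search code never inspects class labels or measured values directly; it only sees the score returned by `fitness`, which is what makes the reuse possible.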
Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets
2.3 Distance Function Learning
Example: How to Find Similar Patients?
Task: Construct a distance function that measures patient similarity
Motivation: Finding a “good” distance function is important for:
• Case-based reasoning
• Clustering
• Instance-based classification (e.g. nearest neighbor classifiers)
Our Approach: Learn distance functions based on training examples and user feedback
Motivating Example: How To Find Similar Patients?
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age,…)
• Attribute Domains
  • ssn: 9 digits
  • weight: between 30 and 650; μ_weight=158, σ_weight=24.20
  • height: between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2
  • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor
  • eye-color: {brown, blue, green, grey}
  • age: between 3 and 100; μ_age=45, σ_age=13.2
Task: Define Patient Similarity
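One possible way to define patient similarity for this relation is a weighted sum of per-attribute distances. The sketch below is an assumption made for illustration, not the definition used in the slides: numeric and ordinal attributes are scaled by the attribute ranges listed above, eye-color uses a 0/1 mismatch, ssn is ignored because it is an identifier, and the names `RANGES`, `attribute_distance`, and `patient_distance` are hypothetical.

```python
# Attribute ranges taken from the attribute domains of the Patient relation.
RANGES = {
    "weight": (30.0, 650.0),
    "height": (0.30, 2.20),
    "age": (3.0, 100.0),
    "cancer_sev": (1.0, 4.0),  # ordinal: 1=minor ... 4=serious
}


def attribute_distance(name, a, b):
    """Distance in [0, 1] for one attribute (assumed normalization)."""
    if name == "eye_color":
        return 0.0 if a == b else 1.0  # nominal attribute: match or mismatch
    lo, hi = RANGES[name]
    return abs(a - b) / (hi - lo)  # numeric or ordinal: scale by the range


def patient_distance(p, q, weights):
    """Weighted average of attribute distances; the weights are what would be learned."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one attribute weight must be positive")
    return sum(w * attribute_distance(n, p[n], q[n]) for n, w in weights.items()) / total


p = {"weight": 150.0, "height": 1.70, "cancer_sev": 4, "eye_color": "brown", "age": 40}
q = dict(p, eye_color="blue")
uniform = {name: 1.0 for name in ("weight", "height", "cancer_sev", "eye_color", "age")}
print(patient_distance(p, q, uniform))
```

With uniform weights, eye color counts as much as cancer severity, which is rarely what a domain expert wants. Setting the eye-color weight to zero removes its influence, and choosing such weights from training examples is the task of distance function learning.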
Idea: Coevolving Clusters and Distance Functions
[Figure: cycle diagram with the labels Distance Function Q, Cluster, Clustering X, Clustering Evaluation q(X), Goodness of the Distance Function Q, and Weight Updating Scheme / Search Strategy. Insets contrast a “bad” distance function Q1 with a “good” distance function Q2 on examples marked o and x]
Distance Function Learning Framework
[Figure: framework diagram with two components, Distance Function Evaluation and Weight-Updating Scheme / Search Strategy. Labels shown: K-Means, Supervised Clustering, NN-Classifier, Adaptive Clustering, Inside/Outside Weight Updating, Randomized Hill Climbing, Work By Karypis, …. Current Research: [CHEN05], [ERBV04]. Other Research: [BECV05], …]
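To make the framework concrete, the sketch below pairs one evaluation option with one search strategy from the labels above: a nearest-neighbor classifier scores a candidate distance function, and randomized hill climbing proposes new attribute weights. The details are assumptions for illustration only: leave-one-out accuracy as the score, a weighted Manhattan distance, and the parameters `steps`, `step_size`, and `seed` are not taken from the slides.

```python
import random


def weighted_distance(a, b, weights):
    """Weighted Manhattan distance between two numeric tuples."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def loo_nn_accuracy(points, labels, weights):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier under the given weights."""
    correct = 0
    for i, p in enumerate(points):
        others = (k for k in range(len(points)) if k != i)
        nearest = min(others, key=lambda k: weighted_distance(p, points[k], weights))
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def learn_weights(points, labels, steps=200, step_size=0.2, seed=0):
    """Randomized hill climbing: perturb one weight, keep the change if accuracy does not drop."""
    rng = random.Random(seed)
    weights = [1.0] * len(points[0])
    best = loo_nn_accuracy(points, labels, weights)
    for _ in range(steps):
        candidate = list(weights)
        i = rng.randrange(len(candidate))
        candidate[i] = max(0.0, candidate[i] + rng.uniform(-step_size, step_size))
        score = loo_nn_accuracy(points, labels, candidate)
        if score >= best:  # ties are accepted so the search can cross plateaus
            weights, best = candidate, score
    return weights, best


# The first attribute separates the classes; the second is misleading at equal weights.
points = [(0.0, 5.0), (0.1, 0.0), (1.0, 5.1), (1.1, 0.1)]
labels = ["a", "a", "b", "b"]
print(loo_nn_accuracy(points, labels, [1.0, 1.0]))
print(learn_weights(points, labels))
```

Swapping `loo_nn_accuracy` for a clustering-based evaluation would change how a distance function is judged without touching the search loop.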
2.4 Adaptive Data Mining
[Figure: system diagram with the labels Inputs, Clustering Algorithm, Clustering, Summary, Evaluation System, Quality, Fitness Functions (Predefined), q(X), …, Adaptation System, Changes, Feedback, Domain Expert, and Past Experience]
2.5 Signatures of Data Sets
Input: a set of classified examples
Output: Signatures in the dataset that characterize
• how the examples of a class distribute (in relationship to the examples of the other classes) in the dataset
• how many regions dominated by a single class exist in the dataset
• which regions dominated by one class border regions dominated by another class
• where the regions identified in steps 2 and 3 are located
• what the density attractors (maxima of the density function) of the classes in the dataset are
Why are we creating those signatures?
• As a preprocessing step to develop smarter classifiers
• To understand why a particular data mining technique works well or poorly for a particular dataset (meta learning)
Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs),…
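As one illustration of how a proximity graph can expose class borders, the sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints carry different class labels. It is a minimal sketch under stated assumptions, not the method used in this research: two points are joined when no third point lies strictly inside the circle whose diameter is the segment between them, and the function names are hypothetical. The brute-force test costs O(n³), so it only suits small datasets.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gabriel_edges(points):
    """Edges (i, j) such that no other point lies strictly inside the circle with diameter i-j."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # A point k is strictly inside that circle exactly when d(i,k)^2 + d(j,k)^2 < d(i,j)^2.
        if all(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) >= d_ij
            for k in range(len(points))
            if k not in (i, j)
        ):
            edges.append((i, j))
    return edges


def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]


points = [(0, 0), (1, 0), (2, 0), (3, 0)]
labels = ["a", "a", "b", "b"]
print(gabriel_edges(points))
print(border_edges(points, labels))
```

Counting border edges per pair of classes gives a rough indication of which class regions are adjacent, which relates to the third signature listed above.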
Example: Signatures of Data Sets
[Figure: three scatter plots over Attribute1 and Attribute2 showing class 1, class 2, and unclassified objects, with clusters labeled A through L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]
Applications of Creating Signatures:
• Class Decomposition (see also [VAE03])
[Figure: three scatter plots over Attribute 1 and Attribute 2]
2.6 Research Christoph F. Eick 2005-2007
• Clustering for Classification
• Creating Signatures For Datasets
• Editing / Data Set Compression
• Supervised Clustering
• Distance Function Learning
• Spatial Data Mining
• Adaptive Clustering
• Mining Data Streams
• Online Data Mining
• Mining Sensor Data
• Measures of Interestingness
• Evolutionary Computing
• Mining Semi-Structured Data
• Web Annotation
• File Prediction
3. UH Data Mining and Machine Learning Group (UH-DMML)
Co-Directors: Christoph F. Eick and Ricardo Vilalta
Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.
Topics investigated:
• Meta Learning
• Classification and Learning from Examples
• Clustering
• Distance Function Learning
• Using Reinforcement Learning for Data Mining
• Spatial Data Mining
• Knowledge Discovery