Data Mining: How to make islands of knowledge emerge out of oceans of data. Hugues Bersini, IRIDIA - ULB. PLAN: rapid intro to data warehouse; data mining: understand and predict; two powerful but non-comprehensible data mining techniques, including Lazy for time series prediction.
Data Mining: How to make islands of knowledge emerge out of oceans of data • Hugues Bersini, IRIDIA - ULB
PLAN • Rapid intro to data warehouse • Data mining: understand and predict • Two powerful but non-comprehensible data mining techniques: • Lazy for time series prediction • Bagfs for classification
The Data Miner Steps • Data Warehousing • Data Preparation • Cleaning + Homogenisation • Transformation - Composition • Reduction • For time series: time adjustment • Data Modelling: what researchers are mainly interested in.
Re-organization of data • Subject-oriented • Integrated • Cross-functional • With history • Non-volatile • From production data ---> to decision-based data
Data Mining • Understand and predict
Modelling the data: only if there are structure and regularities in the data • Data mining IS NOT OLAP • WHY? • To predict new data • To understand the data
The main techniques of data-mining • Clustering • Outlier detection • Association analysis • Forecasting • Classification
Data Mining: to understand and/or to predict • Discovering structure in data • Discovering I/O relationship in data
Nothing new under the sun • New methods extending old ones in the non-linear (NN) and symbolic (decision tree) domains • Exponential explosion of data • Extracting from huge databases: more sensitive than ever
[Figure: data stores exploited for decisions] • Data volume doubles every 18 months world-wide • Problem: how to extract relevant knowledge for our decisions from such amounts of data? • Solutions: • Throw it away before using it (most popular) • Query it (Query and OLAP tools) • Summarize it: extract the essence from the bulk according to the targeted decision (Data Mining) • CEDITI, September 2, 1998
Discovering structure in data • When in a space with a metric • Hierarchical clustering • K-Means • NN clustering - Kohonen’s map • In space without any metric but a cost function: • Grouping Genetic Algorithms ....
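To make the K-Means entry concrete, here is a minimal sketch in plain Python. It assumes 2-D points, the Euclidean metric and caller-supplied starting centroids; the function name `kmeans` and the sample points are invented for illustration and are not taken from the talk.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; `centroids` holds the starting guesses."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, clusters


# Illustrative data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The metric is what makes the update step meaningful: without a distance, "nearest centroid" and "mean of a cluster" are undefined, which is why the slide lists cost-function methods separately.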
Market Basket Analysis: Association analysis Quantity bought
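Association analysis is usually summarised by two numbers per rule, support and confidence. The sketch below computes both on a handful of invented baskets; the item names and the helper names `support` and `confidence` are assumptions made for the example.

```python
# Illustrative baskets: each set is one customer's purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
```

Here three of the five baskets contain both bread and butter, and three of the four bread baskets also contain butter.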
Discovering I/O relationship in data • Classification: I = (x, y), O = the class • Time series prediction: I = x(t), O = x(t+1) • Understanding the I/O relationship • Predicting which O for a new I
IRIDIA's track record in data mining • Detection of glass defects at Glaverbel • Prediction of stock market fluctuations with MasterFood and dieteren • Incident recognition and electrical load prediction with Tractebel • Analysis of flight delays with Eurocontrol • Industrial process modelling with Honeywell, FAFER and Siemens • User-friendly Internet search engine with the Region Wallonne • Pixel classification for satellite images
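The time series case, I = x(t) and O = x(t+1), can be sketched by turning a series into input/output pairs and predicting the output for a new input. The nearest-neighbour predictor used here is only an illustrative choice, not a method named on this slide, and the series is invented.

```python
def make_io_pairs(series):
    """Pair each value x(t) (input I) with its successor x(t+1) (output O)."""
    return [(series[t], series[t + 1]) for t in range(len(series) - 1)]


def predict_next(pairs, x_now):
    """Return the stored output whose input is closest to x_now."""
    _, nearest_output = min(pairs, key=lambda pair: abs(pair[0] - x_now))
    return nearest_output


series = [1.0, 2.0, 4.0, 8.0, 16.0]
pairs = make_io_pairs(series)
forecast = predict_next(pairs, 4.2)  # closest stored input is 4.0
```

Classification fits the same frame: the input becomes the point (x, y) and the output its class.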
Financial prediction Task: predict the future trends of the financial series. Goal: automatic trading system to anticipate the fluctuations of the market.
Economic variables • Task: predict how many cars will be registered next year. • Goal: support the marketing campaign of a car dealer.
Modeling of industrial plants Rolling steel mill Task: predict the flow stress of the steel plate as a function of the chemical and physical properties of the material. Goal: cope with different types of metals, reduce the production time and improve final quality.
Control Waste water treatment plant Task: model the dynamics of the plant on the basis of accessible information. Goal: control the level of water pollutants.
Environmental problems • Algae summer blooming • Task: predict the biological state (e.g. density of algae communities) as a function of chemicals. • Goal: automate the analysis of the state of the river by monitoring chemical concentrations.
In the medical domain • Automatic diagnosis of cancer • Detection of respiratory problems • Electrocardiogram analysis • Assistance for paraplegics
APPLICATION OF DATA MINING TO CANCER: decision support for diagnosis and prognosis in tumour pathology. In collaboration with the Laboratoire d'Histopathologie (R. Kiss), Faculté de Médecine, U.L.B.
Histological criteria: • loss of differentiation • invasion • Cytological criteria: • size of the nuclei • mitoses • areas of hyperchromatism • [Figure: patient, tumour, surgery, clinical assessment, DIAGNOSIS (pathologists), adjuvant treatment: low, moderate, high] • Improved diagnosis • Better-matched treatment • Increased survival
“Objectification” of diagnostic elements • Quantification of criteria (cytological and histological) • Computer-assisted microscopy • Data processing • Extraction of reliable and reproducible diagnostic and/or prognostic information
500 to 1000 nuclei per tumour • 30 tumour variables: • mean • standard deviation
On the Internet • The Hyperprisme project • Text Mining • Automatic profiling of users • Key words: positive, negative,… • Automatic grouping of users on the basis of their profiles • See Web
Different approaches • [Figure labels: Data, Model, Local, Global, Comprehensible, Non comprehensible (non readable), Accuracy of prediction, SVM]
Understanding and Predicting: Building Models • A model needs data to exist but, once it exists, it can exist without the data. • Model: a structure, plus parameters to fit the data • Linear, NN, Fuzzy, ID3, Wavelet, Fourier, Polynomials, ...
From data to prediction • RAW DATA • PREPROCESSING • TRAINING DATA • MODEL LEARNING • PREDICTION
Supervised learning • [Figure: input, PHENOMENON, output (OBSERVATIONS); input, MODEL, prediction; error between output and prediction] • Finite amount of noisy observations. • No a priori knowledge of the phenomenon.
Model learning • MODEL GENERATION • PARAMETRIC IDENTIFICATION • MODEL VALIDATION • STRUCTURAL IDENTIFICATION • MODEL SELECTION
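One possible reading of these steps, as a sketch: two candidate structures (a constant and a straight line) are generated, their parameters are identified on training data, each is validated by leave-one-out error, and the structure with the lowest error is selected. The candidate structures, the leave-one-out criterion, the data and all function names are assumptions chosen for illustration.

```python
def fit_constant(xs, ys):
    """Parametric identification for the 'constant' structure: the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Parametric identification for the 'line' structure: least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def leave_one_out_error(fit, xs, ys):
    """Model validation: mean squared error on each point held out in turn."""
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)


# Illustrative noisy observations of a roughly linear phenomenon.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

candidates = {"constant": fit_constant, "line": fit_line}
errors = {name: leave_one_out_error(fit, xs, ys) for name, fit in candidates.items()}
selected = min(errors, key=errors.get)  # model selection: lowest validation error
```

Validating on held-out points rather than on the training points is what keeps the richer structure from winning automatically.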
The Practice of Modelling • THE MODEL: accurate, simple, robust, understandable, good for decision • Sources: Data + Optimisation Methods • Physical Knowledge • Engineering Models • Rules of Thumb • Linguistic Rules
Comprehensible models • Decision trees • Qualitative attributes • Force the attributes to be treated separately • classification surfaces parallel to the axes • good for comprehension because they select and separate the variables
Decision trees • Widely used in practice: one of the favourite data mining methods • Work with noisy data (statistical approaches) and can learn a logical model, expressed as and/or rules, out of the data • ID3, C4.5 ---> Quinlan • Favouring small trees --> simple models
At every stage, the most discriminant attribute • The tree is constructed top-down, adding a new attribute at each level • The choice of the attribute is based on a statistical criterion called “the information gain” • $\mathrm{Entropy} = -p_{\text{yes}} \log_2 p_{\text{yes}} - p_{\text{no}} \log_2 p_{\text{no}}$ • Entropy = 0 if $p_{\text{yes}} = 1$ or $p_{\text{no}} = 1$ • Entropy = 1 if $p_{\text{yes}} = p_{\text{no}} = 1/2$
Information gain • S = set of instances, A = an attribute, v = a value of attribute A, $S_v$ = the instances of S for which A takes the value v • $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$ • The best A is the one that maximises the Gain • The algorithm runs recursively: the same mechanism is reapplied at each level
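The entropy and information gain criteria can be checked on a tiny table. This is a sketch only: the attribute names (`owner`, `student`, `good`) and the four rows are invented, and the functions cover just the attribute-scoring step, not a full ID3 implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each subset S_v."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical instances: 'owner' separates the classes, 'student' does not.
rows = [
    {"owner": "yes", "student": "yes", "good": "yes"},
    {"owner": "yes", "student": "no", "good": "yes"},
    {"owner": "no", "student": "yes", "good": "no"},
    {"owner": "no", "student": "no", "good": "no"},
]

gains = {a: information_gain(rows, a, "good") for a in ("owner", "student")}
best_attribute = max(gains, key=gains.get)
```

A tree builder would split on `best_attribute`, then call the same scoring on each resulting subset with the remaining attributes.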
But!!! • Is a good client if (x - y) > 30000 • [Figure axes: monthly salary (30000 marked), loan repayment]
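The difficulty with this rule is that its boundary is oblique, while a decision tree tests one attribute at a time. The sketch below assumes x is the monthly salary and y the loan repayment, uses six invented clients, and searches for the best single threshold on each attribute alone.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy reachable by one test of the form v > t or v <= t."""
    best = 0.0
    for t in sorted(set(values)):
        for above_is_good in (True, False):
            predictions = [(v > t) == above_is_good for v in values]
            hits = sum(p == label for p, label in zip(predictions, labels))
            best = max(best, hits / len(labels))
    return best


# Hypothetical clients as (monthly salary x, loan repayment y).
clients = [
    (40000, 5000), (80000, 45000), (35000, 10000),
    (90000, 70000), (50000, 10000), (60000, 35000),
]
labels = [x - y > 30000 for x, y in clients]  # the rule from the slide

salary_only = best_threshold_accuracy([x for x, _ in clients], labels)
repayment_only = best_threshold_accuracy([y for _, y in clients], labels)
oblique = best_threshold_accuracy([x - y for x, y in clients], labels)
```

On this sample neither attribute alone separates the clients, while a single test on the difference x - y does. A deeper tree can approximate the oblique line with a staircase of axis-parallel cuts, at the cost of size and readability.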
Other comprehensible models • Fuzzy logic • Implement an I/O mapping with linguistic rules • If I eat “a lot” then I gain weight “a lot”
Trivial example • Linear, optimal, automatic, simple • [Figure: Y versus X]
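A linguistic rule becomes an I/O mapping once “a lot” is given a membership function. The sketch below uses a zero-order Sugeno-style weighted average over two rules; the calorie thresholds, the output values and the second rule ("eat little, gain little") are all invented for illustration.

```python
def ramp_up(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def weight_gain(calories):
    """Blend two linguistic rules into one numeric output (kg, hypothetical)."""
    eats_a_lot = ramp_up(calories, 2000.0, 4000.0)  # degree of "a lot"
    eats_little = 1.0 - eats_a_lot
    rules = [
        (eats_little, 0.0),  # if I eat little then I gain little
        (eats_a_lot, 2.0),   # if I eat a lot then I gain a lot
    ]
    total = sum(degree for degree, _ in rules)
    return sum(degree * output for degree, output in rules) / total


moderate = weight_gain(3000.0)  # halfway between the two rules
```

The model stays readable because each rule can be inspected and changed on its own.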
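A straight line fitted by least squares is one example that matches all four adjectives: the structure is linear, the parameters are optimal in the squared-error sense, and they come from a closed formula with nothing to tune. The data points and the name `fit_line` below are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # predicted Y for a new X
```

Once the slope and intercept are known, the data can be discarded and the model still predicts.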