1 / 19

COMP 3503 Data Preparation and Meta Data

COMP 3503 Data Preparation and Meta Data. with Daniel L. Silver. The KDD Process. Interpretation and Evaluation. Data Mining. Knowledge. Selection and Pre-processing. p(x)=0.02. Data Consolidation. Patterns & Models. Prepared Data. Warehouse. Consolidated Data. Data Sources.

cecilf
Download Presentation

COMP 3503 Data Preparation and Meta Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COMP 3503Data Preparation and Meta Data with Daniel L. Silver

  2. The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Pre-processing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources

  3. Selection and Pre-processing Core Problems & Approaches • Problems: • identification of relevant data • representation of data • search for valid pattern or model • Approaches: • top-down verification by expert • interactive visualization of data/models • * bottom-up inductionfrom data * Probability of sale Age Income OLAP Data Mining

  4. Selection and Pre-processing • As much effort is expended preparing data as applying a data mining tool • Iterative approach: prepare develop data model • Data Mining phase will benefit from any insight that leads to improved set of attributes • Representation can facilitate or frustrate the Search for the most accurate model (hypothesis) Spreadsheet, OLAP and visualization tools are very helpful

  5. Selection and Pre-processing Data and Variable Characteristics • Three basic variable data types: • Nominal (catagorical) qualitative values marital status = single, married, divorced, widowed • Ordinal (ranked) values have rank order grade = A,B,C • Interval values have order plus a metric scale for comparisons and arithmetic operations temperature = 2, 10, 20 or 10.5, 15.2, 19.3 date = 12Aug99, 13Feb02

  6. Selection and Pre-processing Data and Variable Characteristics • Variables can be either discrete or continuous • Only interval numeric values are continuous • In addition data can be of various formats: • Text, numeric, logical, binary, date, money, … • Data mining software will vary in its ability to accept these types and formats

  7. Selection and Pre-processing Data Selection and Sampling • Select response (dependent) variable • determine prior probability of categories • deal with volume bias issues • Select predictor (independent) attributes • Generate a set of examples • choose sampling method (random, stratified) • consider sample complexity: How many examples do I need to develop a reliable model? • Handle outliers (obvious exceptions) • Remove the row or replace with imputed value

  8. Selection and Pre-processing Data Reduction • The curse of dimensionality • number of attributes / number of values • Reduce number of attributes • remove redundant and correlating attributes • combine attributes (logically, arithmetically, statistically (Principal Components Analysis) • Reduce attribute value ranges • group symbolic discrete values • quantize continuous numeric values

  9. pos r neg r ? Y X X X Selection and Pre-processing Preliminary Statistical Analysis • Coefficient of correlation , r, measures the linear dependence of two variables X and Y, -1 < r < +1; r shows magnitude of r. • Select attributes which correlate strongly with the response variable and pertain to problem 2

  10. Selection and Pre-processing Preliminary Statistical Analysis • Remove or combine attributes that correlate with each other or try to de-correlate through transformation • Factor Analysis - ANOVA can be used to compare relative contribution of each attribute to outcomes • Principal Component Analysis - generates variates - linear combinations of original attributes Tools such as Minitab, SAS, SPSS can be used

  11. Selection and Pre-processing • Transform data • de-correlate and normalize values • map time-series data to static representation • Encode data • representation must be appropriate for the Data Mining tool which will be used • continue to reduce attribute dimensionality where possible without loss of information Use spreadsheet functions or transformation and encoding software within DM tool

  12. Selection and Pre-processing Transformation and Encoding Discrete variable values • If necessary transform to discrete numeric values • Example, encode the value 4 as follows: • Nominal: one-of-N code (0 1 0 0 0) - five inputs • Ordinal: thermometer code ( 1 1 1 1 0) - five inputs • Interval: real value (0.4)* - one input • Consider relationship between values • (single, married, divorce) vs. (youth, adult, senior)

  13. Selection and Pre-processing Transformation and Encoding Continuous numeric values • De-correlate via normalization of values: • Min-max: x’ = [(newmax – newmin) (x – min) / (max – min)] + newmin • Euclidean: x’ = x / sqrt(sum of all x^2) • Percentage: x’ = x/(sum of all x) • Variance based: x’ = (x - (mean of all x))/variance • Scale values using a linear transform if data is uniformly distributed or use non-linear (log, power) if skewed distribution

  14. Selection and Pre-processing Transformation and Encoding Other encodings for continuous numeric values Example: 1.6 meters could be encoded as: • Single real-valued number (0.16)* • OK! But what if data is skewed • Bits of a binary number (010000) • BAD! Modeling system must now learning binary encoding • One-of-N quantized intervals (0 1 0 0 0) • NOT GREAT! Presents discontinuties • Distributed (fuzzy) overlapping intervals ( 0.3 0.8 0.1 0.0 0.0) • BEST! Deals well with skewing but no discontinuities

  15. Selection and Pre-processing Extracting Features from a Single Variable • From dates: • Day, week, month, quarter, holiday, weekend day • From time: • Hour, minute, morning, afternoon, evening • From address: • Postal Code components mean something • Telephone number: • NPA-NNX-9999

  16. Selection and Pre-processing Time Series Data • Of great interest to business, science, medicine • Time series data has high dimensional • T1, T2, … , Tn • Approaches to summary/characterization • Current value = Ti • Moving average = MAi = ( Ti + Ti-1 + Ti-2) / 3 • Trends = Ti - MAi or = MAi - MAi-k

  17. Selection and Pre-processing Textual Data • A difficult data type • freeform, open-ended, syntax -vs- semantics • Can have very high dimensions • thousands of potential values • Approaches to summary/characterization • define a fixed set of N word classes based on frequency analysis • map word combinations to one of the N classes • automate via specialty software

  18. Data Warehousing and Preparation Access to Recent Information • www.datawarehousing.com • DWI - Data Warehouse Institute www.dw-institute.com • Wikipedia http://en.wikipedia.org/wiki/Data_warehouse • DW Information Centre http://www.dwinfocenter.org • A DW Tutorial: http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378 • Text Books: Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

  19. THE ENDdanny.silver@acadiau.ca

More Related