Data Mining: Data. Lecture 3 TIES445 Data mining Nov-Dec 2007 Sami Äyrämö
Data quality
• GIGO – Garbage In, Garbage Out
• The effectiveness of a DM exercise depends on the quality of the data
• Data quality concerns
  • individual measurements (records and fields)
  • collections of observations
• Sources of error are practically unlimited
  • Human error (e.g., keyboard error)
  • Instrumentation failure
  • Inaccurate or imprecise measurements
  • Inadequate specification of the measurement or data collection process
Quality of individual measurements
• Bias
  • the difference between the mean of the repeated measurements and the true value
• Precision
  • the variability of the repeated measurements (NOTE: precision is not the number of digits in the record)
• Accuracy
  • small bias and high precision (i.e., small variance)
  • e.g., repeated measurements of someone’s height may be precise (reliable) but inaccurate (not valid) if (s)he is wearing shoes (we are not measuring the right thing)
• True value (does it even exist?)
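The definitions of bias and precision above can be made concrete with a small Python sketch. The readings and the true height below are invented for illustration, and the function names `bias` and `spread` are hypothetical. Precision is summarised here by the sample standard deviation, which is one common choice rather than the only one.

```python
from statistics import mean, stdev

def bias(measurements, true_value):
    """Mean of the repeated measurements minus the true value."""
    return mean(measurements) - true_value

def spread(measurements):
    """Sample standard deviation; a small value means high precision."""
    return stdev(measurements)

# Hypothetical data: a person whose true height is 170.0 cm,
# measured five times while wearing shoes.
true_height_cm = 170.0
with_shoes_cm = [172.4, 172.6, 172.5, 172.5, 172.5]

height_bias = bias(with_shoes_cm, true_height_cm)
height_spread = spread(with_shoes_cm)

print(f"bias:   {height_bias:.2f} cm")
print(f"spread: {height_spread:.2f} cm")
```

The readings agree closely with each other (small spread, so precise) but sit about 2.5 cm above the true value (large bias), so they are not accurate in the sense defined above.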
Quality of collections of data: bias
• Distorted (biased) samples
  • mismatch between the sample population and the population of interest (selection bias)
    • e.g., calculating the average age of students in Jyväskylä when the sample is restricted to female students
  • a sample may be selected through a chain of selection steps
    • e.g., candidates for bank loans: 1) potential customers are contacted, 2) some reply, some do not, 3) of those who replied, some are creditworthy, some are not, 4) those who take out a loan are followed, 5) some are good customers, some are not,…
  • populations are not static (population drift)
    • e.g., customers’ shopping behaviour may change over time
• A biased sample leads to inconsistent estimates of population parameters
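The student-age example of selection bias can be sketched in a few lines of Python. The records below are made up purely for illustration and do not describe real students. The helper names `average_age` and `restrict` are hypothetical.

```python
from statistics import mean

# Hypothetical (gender, age) records standing in for the population of interest.
population = [
    ("F", 20), ("F", 21), ("F", 22),
    ("M", 23), ("M", 25), ("M", 27),
]

def average_age(records):
    """Average age over a list of (gender, age) records."""
    return mean(age for _, age in records)

def restrict(records, gender):
    """Keep only the records of one gender (the distorting selection step)."""
    return [record for record in records if record[0] == gender]

population_mean = average_age(population)
sample_mean = average_age(restrict(population, "F"))

# The gap between the two means is caused by the selection step alone.
selection_gap = sample_mean - population_mean

print(population_mean, sample_mean, selection_gap)
```

In this toy data the restricted sample underestimates the population average. Collecting more records through the same restricted selection would not close the gap, which is the sense in which the estimate is inconsistent.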
Quality of collections of data: incomplete data
• Incomplete data: missing or empty values
  • Missing value: the information exists but was not collected
    • e.g., people decline to answer a question (age, weight, position,…)
  • Empty value: the information does not exist
    • A form may have conditional parts: e.g., the expiry date of a driver’s licence cannot be filled in by children
• Determining whether a value is “empty” or “missing” requires domain knowledge
• If the discriminating information is not provided, both empty and missing values are treated as, and called, “missing”
• Fundamental question for a data mining task: “Why are the data incomplete?”
• Note: a distorted (biased) sample is actually a special case of incomplete data
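The distinction between empty and missing values can be sketched with a simple domain rule in Python. The records, the field names, and the age threshold `MIN_LICENCE_AGE` are all assumptions made for illustration. The rule itself is deliberately crude.

```python
from typing import Optional

# Assumed minimum age for holding a driver's licence (illustrative only).
MIN_LICENCE_AGE = 18

def classify_expiry(age: int, licence_expiry: Optional[str]) -> str:
    """Label an expiry-date field as 'present', 'empty' or 'missing'."""
    if licence_expiry is not None:
        return "present"
    if age < MIN_LICENCE_AGE:
        # The value cannot exist for a child: empty, not missing.
        return "empty"
    # The value could exist but was not recorded.
    return "missing"

# Hypothetical form records.
records = [
    {"age": 9, "licence_expiry": None},
    {"age": 34, "licence_expiry": None},
    {"age": 51, "licence_expiry": "2012-06-30"},
]

labels = [classify_expiry(r["age"], r["licence_expiry"]) for r in records]
print(labels)
```

The rule cannot tell an adult who has no licence (empty) from an adult who skipped the question (missing). Without a separate "has a licence" field, both end up labelled as missing, as the slide describes.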