DATA MINING
Handling Missing Attribute Values and Knowledge Discovery
Shahzeb Kamal (Amsterdam) • Olov Junker (Uppsala)
HASCO 2014
DATA MINING
The process of extracting information from a database and transforming it into an understandable structure.
Why? Because real-world data is ugly:
• Incomplete
• Contains errors
• Inconsistent
Handling Missing Attribute Values
• Techniques
• Consistency
• Various algorithms
KDD (Knowledge Discovery in Databases)
• Exploring patterns in data sets
• The core of KDD is DM
Missing Data
Data is not always available:
• Machine malfunction
• Inconsistent with other recorded data
• Data not entered due to misunderstanding
• Certain data may not have been considered important at the time of entry
• Data mistakenly changed or erased
How to Handle Missing Data
Goal: rule induction (extracting rules by observing the data)
• Sequential methods: preprocess the data, i.e. fill in the missing attribute values before the main process (e.g. rule induction)
• Parallel methods: extract rules directly from the original, incomplete data sets
Sequential Methods
Casewise deletion: remove every case that contains at least one missing attribute value, as sketched below.
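A minimal sketch in Python (the three-row table is a made-up stand-in for the slides' flu data set, which is not reproduced here):

```python
# Casewise deletion: drop every case that has at least one missing value.
cases = [
    {"Temperature": 100.2, "Headache": "yes", "Nausea": "no",  "Flu": "yes"},
    {"Temperature": None,  "Headache": "yes", "Nausea": "yes", "Flu": "yes"},
    {"Temperature": 96.4,  "Headache": "no",  "Nausea": None,  "Flu": "no"},
]

complete = [c for c in cases if None not in c.values()]
print(complete)  # only the first case survives
```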
Most Common Value of an Attribute
• Replace the missing value by the attribute's most frequent known value.
• Variant: most common value of the attribute restricted to a concept (concept: the set of all cases with the same decision value).
E.g. case 1 belongs to the concept {1, 2, 4, 8}, so Headache = yes.
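A sketch of both variants, reusing the toy `cases` table above (with its gaps still unfilled); when `decision` is set, the mode is taken only over the case's concept:

```python
from collections import Counter

def most_common(values):
    """Mode of the known (non-missing) values."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0]

def fill_most_common(cases, attribute, decision=None):
    """Fill missing values of `attribute` with the most common value;
    if `decision` is given, restrict the pool to the case's concept."""
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            case[attribute] = most_common(c[attribute] for c in pool)
    return cases

fill_most_common(cases, "Nausea")                       # global mode
fill_most_common(cases, "Temperature", decision="Flu")  # mode within the concept
```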
Assigning All Possible Values to a Missing Attribute Value
• The case is replaced by one copy for each possible value of the attribute.
• Variant: all possible values of the attribute restricted to a concept.
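Since a case with a missing value is here replaced by several cases, a sketch returns a new table, one copy per candidate value (`decision` again restricts the candidates to the concept; the toy table is assumed unfilled):

```python
def expand_all_values(cases, attribute, decision=None):
    """Replace a case with a missing `attribute` by one copy per
    possible known value, optionally restricted to its concept."""
    result = []
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            values = {c[attribute] for c in pool if c[attribute] is not None}
            result.extend({**case, attribute: v} for v in values)
        else:
            result.append(case)
    return result

expanded = expand_all_values(cases, "Nausea")  # the Nausea = ? case becomes two cases
```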
Assigning Mean Value
• Replace a missing numerical value by the mean of the attribute's known values.
• For symbolic attributes -> replace by the most common value.
Assigning Mean Value Restricted to a Concept
• The mean is taken only over the known values within the case's concept.
• For symbolic attributes -> replace by the most common value within the concept.
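A sketch covering this slide and the previous one, for numerical attributes (again on the toy table, assuming its gaps are still unfilled); `decision=None` gives the global mean, naming a decision attribute switches on the concept-restricted variant:

```python
def fill_mean(cases, attribute, decision=None):
    """Fill missing numerical values of `attribute` with the mean of
    the known values, optionally restricted to the case's concept."""
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            known = [c[attribute] for c in pool if c[attribute] is not None]
            case[attribute] = sum(known) / len(known)
    return cases

fill_mean(cases, "Temperature", decision="Flu")  # concept-restricted mean
```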
Global Closest Fit
• Replace the missing attribute value by the known value in another case that resembles the case with the missing value as closely as possible.
• We compute a distance: the smallest distance marks the closest case.

distance(x, y) = Σ_i d(x_i, y_i), summed over all attributes i, where

• d(x_i, y_i) = 0 if x_i = y_i
• d(x_i, y_i) = 1 if x_i and y_i are symbolic and x_i ≠ y_i, or x_i = ? or y_i = ?
• d(x_i, y_i) = |x_i - y_i| / r if x_i and y_i are numerical and x_i ≠ y_i, where r is the range (max - min) of the attribute's known values
For example:
distance(1, 2) = |100.2-102.6| / |102.6-96.4| + 1 + 1 = 2.39
(the Temperature term is scaled by the attribute's range; the other two attributes each contribute 1 because they differ or are missing)
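A sketch of global closest fit implementing the distance above; the attribute names come from the example, everything else (in-place filling, leaving a gap when no donor knows the value) is an illustrative choice. Note that filling values as you go can change later distances:

```python
def distance(x, y, attributes, ranges):
    """Distance between two cases, following the formula above."""
    total = 0.0
    for a in attributes:
        xi, yi = x[a], y[a]
        if xi is None or yi is None:
            total += 1                         # a missing value counts as 1
        elif xi == yi:
            total += 0
        elif isinstance(xi, (int, float)):
            total += abs(xi - yi) / ranges[a]  # scaled by the attribute range r
        else:
            total += 1                         # differing symbolic values
    return total

def global_closest_fit(cases, attributes, ranges):
    """Fill each missing value from the closest case that knows it."""
    for case in cases:
        for a in attributes:
            if case[a] is None:
                donors = [c for c in cases
                          if c is not case and c[a] is not None]
                if donors:                     # leave the gap if nobody knows the value
                    best = min(donors,
                               key=lambda c: distance(case, c, attributes, ranges))
                    case[a] = best[a]
    return cases

# r = range of each numerical attribute, computed from its known values:
temps = [c["Temperature"] for c in cases if c["Temperature"] is not None]
ranges = {"Temperature": max(temps) - min(temps)}
global_closest_fit(cases, ["Temperature", "Headache", "Nausea"], ranges)
```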
Concept Closest Fit
• First split the data set into subsets with the same concept.
• Within each subset, replace the missing attribute value by the known value in another case that resembles the case with the missing value as closely as possible.
• Merge the data subsets.
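Concept closest fit can then reuse the same routine, applied per concept and merged afterwards (a sketch building on `global_closest_fit` above; the decision attribute name is an assumption):

```python
def concept_closest_fit(cases, attributes, ranges, decision="Flu"):
    """Split by decision value, run closest fit within each subset, merge."""
    filled = []
    for d in {c[decision] for c in cases}:
        subset = [c for c in cases if c[decision] == d]
        filled.extend(global_closest_fit(subset, attributes, ranges))
    return filled
```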
Other Methods of Filling In Missing Values
• A number of methods handle missing attribute values based on the dependence between known and missing values.
• Chase algorithm: for each case with missing data, a new data subset is created; the missing value is treated as a decision value, and the data sets are then merged.
• Maximum likelihood estimation.
• Monte Carlo method: missing values are replaced by many possible values; each completed data set is analyzed and the results are combined.
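As a toy illustration of the Monte Carlo idea only (not the Chase algorithm or a full multiple-imputation scheme), the "analysis" here is just a mean: gaps are filled at random from the known values, each completed list is analyzed, and the results are combined:

```python
import random

def monte_carlo_mean(values, n_samples=1000, seed=0):
    """Estimate the mean of an attribute with missing entries by
    averaging over many randomly completed copies of the data."""
    rng = random.Random(seed)
    known = [v for v in values if v is not None]
    estimates = []
    for _ in range(n_samples):
        filled = [v if v is not None else rng.choice(known) for v in values]
        estimates.append(sum(filled) / len(filled))
    return sum(estimates) / len(estimates)

monte_carlo_mean([100.2, None, 96.4, 102.6])
```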
Parallel Methods
• Work on subsets of the incomplete data, then induce rules directly.
• Two types of missing values:
• "lost": needed but gone
• "do not care": irrelevant
Concepts
All cases with the same decision value:
C1 = {1, 2, 4, 8}
C2 = {3, 5, 6, 7}
Parallel Method, "Lost" Values
• "Lost" values do not belong to any block.
• Blocks: the sets of cases sharing the same value of a certain attribute (shown colored on the slide), e.g.
[(Temp, high)] = {1, 4, 5}
[(Nausea, yes)] = {2, 4, 5, 7}
• Characteristic sets: the intersection of the blocks containing a certain case, e.g. K(4) = {4}, K(5) = {4, 5}
• Use these to create lower and upper approximations of concepts:
Lower({1, 2, 4, 8}) = {1, 2, 4}
Upper({1, 2, 4, 8}) = {1, 2, 4, 6, 8}
• -> Rule induction
Parallel Method, "Do Not Care" Values
• "Do not care" values belong to every block of their attribute.
• Blocks (shown colored on the slide), e.g.
[(Temp, high)] = {1, 3, 4, 5, 8}
[(Nausea, yes)] = {2, 4, 5, 7, 8}
• Characteristic sets: the intersection of the blocks containing a certain case, e.g. K(4) = {4, 5, 8}, K(5) = {4, 5, 8}
• Use these to create lower and upper approximations of concepts:
Lower({1, 2, 4, 8}) = {2, 8}
Upper({1, 2, 4, 8}) = {1, 2, 3, 4, 5, 6, 8}
• -> Rule induction
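A sketch of blocks, characteristic sets, and approximations covering both interpretations, following the characteristic-set definitions the two slides above appear to use ("?" marks a lost value, "*" a do-not-care value; the data layout, a dict from case number to attribute dict, is an assumption since the slides' table is not reproduced):

```python
def block(cases, attribute, value):
    """[(attribute, value)] block: '*' (do not care) joins every block
    of its attribute, '?' (lost) joins none."""
    return {i for i, c in cases.items()
            if c[attribute] == value or c[attribute] == "*"}

def characteristic_set(cases, i, attributes):
    """Intersection of the blocks containing case i; '?' and '*'
    contribute no constraint of their own."""
    k = set(cases)
    for a in attributes:
        v = cases[i][a]
        if v not in ("?", "*"):
            k &= block(cases, a, v)
    return k

def approximations(cases, concept, attributes):
    """Lower/upper approximations of a concept from characteristic sets."""
    K = {i: characteristic_set(cases, i, attributes) for i in cases}
    lower = {i for i in concept if K[i] <= concept}
    upper = set().union(*(K[i] for i in concept))
    return lower, upper
```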
Rule Induction: the MLEM2 Algorithm
• Rules describe cases; they are induced from the decision table.
• Possible rules: from the upper approximation of a concept.
• Certain rules: from the lower approximation of a concept.
• With missing values interpreted as lost:
Possible: (Temp, normal) -> (Flu, no)
(Headache, no) -> (Flu, no)
Certain: (Temp, high) & (Nausea, no) -> (Flu, yes)
(Headache, yes) & (Nausea, yes) -> (Flu, yes)
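MLEM2 itself greedily assembles minimal sets of attribute-value pairs that cover an approximation; that search is beyond a slide-sized sketch, but checking what kind of rule a candidate set of conditions yields is simple (a simplification, not the full algorithm; `block` and `approximations` are the helpers sketched earlier):

```python
def rule_kind(condition_blocks, lower, upper):
    """Cases covered by a rule = intersection of its condition blocks.
    Inside the lower approximation -> certain rule;
    inside the upper approximation -> possible rule."""
    covered = set.intersection(*condition_blocks)
    if covered and covered <= lower:
        return "certain"
    if covered and covered <= upper:
        return "possible"
    return "neither"

# e.g. for the certain rule on the slide (hypothetical data layout):
# rule_kind([block(cases, "Temp", "high"), block(cases, "Nausea", "no")],
#           lower, upper)
```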
KDD
• An organized and automated process of exploring patterns in large data sets
• More general than data mining
• The core of KDD is DM
9 Steps
• Iterative and interactive
• There is not yet a best solution for every kind of problem at every step.
Step 1: Understand and specify the goals of the end user
Preprocessing part:
• Step 2: Select and create the data set
• Step 3: Preprocessing and cleaning => enhance reliability
• Step 4: Data transformation => better data for DM
Data mining part:
• Step 5: Choosing the appropriate DM task
• Step 6: Choosing the DM algorithm (precision vs. understandability)
• Step 7: Employing the DM algorithm
• Step 8: Evaluating and interpreting the mined patterns
Step 9: Using the discovered knowledge
• The success of the entire KDD process is determined by this step
• Challenges, e.g. losing lab conditions
END