DATA MINING
Handling Missing Attribute Values and Knowledge Discovery
Shahzeb Kamal (Amsterdam) • Olov Junker (Uppsala)
HASCO 2014
DATA MINING
The process of extracting information from a database and transforming it into an understandable structure.
Why? Because real-world data is ugly:
• Incomplete
• Contains errors
• Inconsistent
Handling Missing Attribute Values
• Techniques
• Consistency
• Various algorithms
KDD (Knowledge Discovery in Databases)
• Exploring patterns in data sets
• The core of KDD is DM
Missing Data
Data is not always available:
• Machine malfunction
• Inconsistent with other recorded data
• Data not entered due to misunderstanding
• Certain data may not have been considered important at the time of entry
• Data mistakenly changed or erased
How to Handle Missing Data
Goal: rule induction (extracting rules by observing the data)
• Sequential methods: preprocess the data, i.e. fill in the missing attribute values before the main process (e.g. rule induction)
• Parallel methods: extract rules directly from the original, incomplete data sets
Sequential Methods
Casewise deletion: remove every case that contains at least one missing attribute value, as sketched below.
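A minimal sketch in Python (the three-row table is a made-up stand-in for the slides' flu data set, which is not reproduced here):

```python
# Casewise deletion: drop every case that has at least one missing value.
cases = [
    {"Temperature": 100.2, "Headache": "yes", "Nausea": "no",  "Flu": "yes"},
    {"Temperature": None,  "Headache": "yes", "Nausea": "yes", "Flu": "yes"},
    {"Temperature": 96.4,  "Headache": "no",  "Nausea": None,  "Flu": "no"},
]

complete = [c for c in cases if None not in c.values()]
print(complete)  # only the first case survives
```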
Most Common Value of an Attribute
• Replace the missing value by the attribute's most frequent known value.
• Variant: most common value of the attribute restricted to a concept (concept: the set of all cases with the same decision value).
E.g. case 1 belongs to the concept {1, 2, 4, 8}, so Headache = yes.
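A sketch of both variants, reusing the toy `cases` table above (with its gaps still unfilled); when `decision` is set, the mode is taken only over the case's concept:

```python
from collections import Counter

def most_common(values):
    """Mode of the known (non-missing) values."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0]

def fill_most_common(cases, attribute, decision=None):
    """Fill missing values of `attribute` with the most common value;
    if `decision` is given, restrict the pool to the case's concept."""
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            case[attribute] = most_common(c[attribute] for c in pool)
    return cases

fill_most_common(cases, "Nausea")                       # global mode
fill_most_common(cases, "Temperature", decision="Flu")  # mode within the concept
```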
Assigning All Possible Values to a Missing Attribute Value
• The case is replaced by one copy for each possible value of the attribute.
• Variant: all possible values of the attribute restricted to a concept.
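Since a case with a missing value is here replaced by several cases, a sketch returns a new table, one copy per candidate value (`decision` again restricts the candidates to the concept; the toy table is assumed unfilled):

```python
def expand_all_values(cases, attribute, decision=None):
    """Replace a case with a missing `attribute` by one copy per
    possible known value, optionally restricted to its concept."""
    result = []
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            values = {c[attribute] for c in pool if c[attribute] is not None}
            result.extend({**case, attribute: v} for v in values)
        else:
            result.append(case)
    return result

expanded = expand_all_values(cases, "Nausea")  # the Nausea = ? case becomes two cases
```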
Assigning Mean Value
• Replace a missing numerical value by the mean of the attribute's known values.
• For symbolic attributes -> replace by the most common value.
Assigning Mean Value Restricted to a Concept
• The mean is taken only over the known values within the case's concept.
• For symbolic attributes -> replace by the most common value within the concept.
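A sketch covering this slide and the previous one, for numerical attributes (again on the toy table, assuming its gaps are still unfilled); `decision=None` gives the global mean, naming a decision attribute switches on the concept-restricted variant:

```python
def fill_mean(cases, attribute, decision=None):
    """Fill missing numerical values of `attribute` with the mean of
    the known values, optionally restricted to the case's concept."""
    for case in cases:
        if case[attribute] is None:
            pool = cases if decision is None else [
                c for c in cases if c[decision] == case[decision]]
            known = [c[attribute] for c in pool if c[attribute] is not None]
            case[attribute] = sum(known) / len(known)
    return cases

fill_mean(cases, "Temperature", decision="Flu")  # concept-restricted mean
```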
Global Closest Fit
• Replace the missing attribute value by the known value in another case that resembles the case with the missing value as closely as possible.
• We compute a distance: the smallest distance marks the closest case.

distance(x, y) = Σ_i d(x_i, y_i), summed over all attributes i, where

• d(x_i, y_i) = 0 if x_i = y_i
• d(x_i, y_i) = 1 if x_i and y_i are symbolic and x_i ≠ y_i, or x_i = ? or y_i = ?
• d(x_i, y_i) = |x_i - y_i| / r if x_i and y_i are numerical and x_i ≠ y_i, where r is the range (max - min) of the attribute's known values
For example:
distance(1, 2) = |100.2-102.6| / |102.6-96.4| + 1 + 1 = 2.39
(the Temperature term is scaled by the attribute's range; the other two attributes each contribute 1 because they differ or are missing)
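A sketch of global closest fit implementing the distance above; the attribute names come from the example, everything else (in-place filling, leaving a gap when no donor knows the value) is an illustrative choice. Note that filling values as you go can change later distances:

```python
def distance(x, y, attributes, ranges):
    """Distance between two cases, following the formula above."""
    total = 0.0
    for a in attributes:
        xi, yi = x[a], y[a]
        if xi is None or yi is None:
            total += 1                         # a missing value counts as 1
        elif xi == yi:
            total += 0
        elif isinstance(xi, (int, float)):
            total += abs(xi - yi) / ranges[a]  # scaled by the attribute range r
        else:
            total += 1                         # differing symbolic values
    return total

def global_closest_fit(cases, attributes, ranges):
    """Fill each missing value from the closest case that knows it."""
    for case in cases:
        for a in attributes:
            if case[a] is None:
                donors = [c for c in cases
                          if c is not case and c[a] is not None]
                if donors:                     # leave the gap if nobody knows the value
                    best = min(donors,
                               key=lambda c: distance(case, c, attributes, ranges))
                    case[a] = best[a]
    return cases

# r = range of each numerical attribute, computed from its known values:
temps = [c["Temperature"] for c in cases if c["Temperature"] is not None]
ranges = {"Temperature": max(temps) - min(temps)}
global_closest_fit(cases, ["Temperature", "Headache", "Nausea"], ranges)
```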
Concept Closest Fit
• First split the data set into subsets with the same concept.
• Within each subset, replace the missing attribute value by the known value in another case that resembles the case with the missing value as closely as possible.
• Merge the data subsets.
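Concept closest fit can then reuse the same routine, applied per concept and merged afterwards (a sketch building on `global_closest_fit` above; the decision attribute name is an assumption):

```python
def concept_closest_fit(cases, attributes, ranges, decision="Flu"):
    """Split by decision value, run closest fit within each subset, merge."""
    filled = []
    for d in {c[decision] for c in cases}:
        subset = [c for c in cases if c[decision] == d]
        filled.extend(global_closest_fit(subset, attributes, ranges))
    return filled
```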
Other Methods of Filling In Missing Values
• A number of methods handle missing attribute values based on the dependence between known and missing values.
• Chase algorithm: for each case with missing data, a new data subset is created; the missing value is treated as a decision value, and the data sets are then merged.
• Maximum likelihood estimation.
• Monte Carlo method: missing values are replaced by many possible values; each completed data set is analyzed and the results are combined.
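As a toy illustration of the Monte Carlo idea only (not the Chase algorithm or a full multiple-imputation scheme), the "analysis" here is just a mean: gaps are filled at random from the known values, each completed list is analyzed, and the results are combined:

```python
import random

def monte_carlo_mean(values, n_samples=1000, seed=0):
    """Estimate the mean of an attribute with missing entries by
    averaging over many randomly completed copies of the data."""
    rng = random.Random(seed)
    known = [v for v in values if v is not None]
    estimates = []
    for _ in range(n_samples):
        filled = [v if v is not None else rng.choice(known) for v in values]
        estimates.append(sum(filled) / len(filled))
    return sum(estimates) / len(estimates)

monte_carlo_mean([100.2, None, 96.4, 102.6])
```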
Parallel Methods
• Work on subsets of the incomplete data, then induce rules directly.
• Two types of missing values:
• "lost": needed but gone
• "do not care": irrelevant
Concepts
All cases with the same decision value:
C1 = {1, 2, 4, 8}
C2 = {3, 5, 6, 7}
Parallel Method, "Lost" Values
• "Lost" values do not belong to any block.
• Blocks: the sets of cases sharing the same value of a certain attribute (shown colored on the slide), e.g.
[(Temp, high)] = {1, 4, 5}
[(Nausea, yes)] = {2, 4, 5, 7}
• Characteristic sets: the intersection of the blocks containing a certain case, e.g. K(4) = {4}, K(5) = {4, 5}
• Use these to create lower and upper approximations of concepts:
Lower({1, 2, 4, 8}) = {1, 2, 4}
Upper({1, 2, 4, 8}) = {1, 2, 4, 6, 8}
• -> Rule induction
Parallel Method, "Do Not Care" Values
• "Do not care" values belong to every block of their attribute.
• Blocks (shown colored on the slide), e.g.
[(Temp, high)] = {1, 3, 4, 5, 8}
[(Nausea, yes)] = {2, 4, 5, 7, 8}
• Characteristic sets: the intersection of the blocks containing a certain case, e.g. K(4) = {4, 5, 8}, K(5) = {4, 5, 8}
• Use these to create lower and upper approximations of concepts:
Lower({1, 2, 4, 8}) = {2, 8}
Upper({1, 2, 4, 8}) = {1, 2, 3, 4, 5, 6, 8}
• -> Rule induction
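A sketch of blocks, characteristic sets, and approximations covering both interpretations, following the characteristic-set definitions the two slides above appear to use ("?" marks a lost value, "*" a do-not-care value; the data layout, a dict from case number to attribute dict, is an assumption since the slides' table is not reproduced):

```python
def block(cases, attribute, value):
    """[(attribute, value)] block: '*' (do not care) joins every block
    of its attribute, '?' (lost) joins none."""
    return {i for i, c in cases.items()
            if c[attribute] == value or c[attribute] == "*"}

def characteristic_set(cases, i, attributes):
    """Intersection of the blocks containing case i; '?' and '*'
    contribute no constraint of their own."""
    k = set(cases)
    for a in attributes:
        v = cases[i][a]
        if v not in ("?", "*"):
            k &= block(cases, a, v)
    return k

def approximations(cases, concept, attributes):
    """Lower/upper approximations of a concept from characteristic sets."""
    K = {i: characteristic_set(cases, i, attributes) for i in cases}
    lower = {i for i in concept if K[i] <= concept}
    upper = set().union(*(K[i] for i in concept))
    return lower, upper
```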
Rule Induction: the MLEM2 Algorithm
• Rules describe cases; they are induced from the decision table.
• Possible rules: from the upper approximation of a concept.
• Certain rules: from the lower approximation of a concept.
• With missing values interpreted as lost:
Possible: (Temp, normal) -> (Flu, no)
(Headache, no) -> (Flu, no)
Certain: (Temp, high) & (Nausea, no) -> (Flu, yes)
(Headache, yes) & (Nausea, yes) -> (Flu, yes)
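MLEM2 itself greedily assembles minimal sets of attribute-value pairs that cover an approximation; that search is beyond a slide-sized sketch, but checking what kind of rule a candidate set of conditions yields is simple (a simplification, not the full algorithm; `block` and `approximations` are the helpers sketched earlier):

```python
def rule_kind(condition_blocks, lower, upper):
    """Cases covered by a rule = intersection of its condition blocks.
    Inside the lower approximation -> certain rule;
    inside the upper approximation -> possible rule."""
    covered = set.intersection(*condition_blocks)
    if covered and covered <= lower:
        return "certain"
    if covered and covered <= upper:
        return "possible"
    return "neither"

# e.g. for the certain rule on the slide (hypothetical data layout):
# rule_kind([block(cases, "Temp", "high"), block(cases, "Nausea", "no")],
#           lower, upper)
```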
KDD
• An organized and automated process of exploring patterns in large data sets
• More general than data mining
• The core of KDD is DM
9 Steps
• Iterative and interactive
• There is not yet a best solution for every kind of problem at every step.
Step 1: Understand and specify the goals of the end user
Preprocessing part:
• Step 2: Select and create the data set
• Step 3: Preprocessing and cleaning => enhance reliability
• Step 4: Data transformation => better data for DM
Data mining part:
• Step 5: Choosing the appropriate DM task
• Step 6: Choosing the DM algorithm (precision vs. understandability)
• Step 7: Employing the DM algorithm
• Step 8: Evaluating and interpreting the mined patterns
Step 9: Using the discovered knowledge
• The success of the entire KDD process is determined by this step
• Challenges, e.g. losing lab conditions
END