Knowledge discovery process Chapter 1 Juha Vesanto Juha.Vesanto@hut.fi
Starting point! Data exploration starts with data?
The real starting point! Data exploration starts with data? Data exploration starts with identifying a need!
Customer • Problem owners • Problem holders • Useful • Profitable Participation Motivation
The process (Pyle) • Exploring the problem, exploring the solution, implementation specification: 20% work, 80% importance • Preparation, survey, data modeling: 80% work, 20% importance
The problem • Identify the right problem • Define solvable problem(s) • Transfer the problem understanding to the miner
Example “I really need a model of the Monday and Friday failure rates so we can stop them!” • What is a failure? • How is it detected/measured? • Is it a quality problem or just fluctuation of error rates? • Which problem components need to be looked at? • ...
The solution What does the solution look like? - a program used by an expert - a data set to be referred to - a model to be used for prediction - a presentation / report - ... How (and by whom) is the solution implemented?
Data mining • Prepare: both the data and the miner • Survey: understand the data; is the data adequate? • Model: refining the details; depends on the nature of the data and the solution goal
Why preparation? • GIGO (garbage in, garbage out): fix the data • Get a data set which is of maximum use, preserves the information, and is enhanced for the problem & model
PIE: Prepared Information Environment 1. prepare the training/testing data 2. transform prepared values to original 3. apply the same preparation to new data (Diagram labels: new data, PIE-in, data, model, PIE-out, report)
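The three PIE steps above can be sketched in a few lines. This is only an illustration: the class name `MinMaxPIE`, its methods, and the choice of min-max scaling are assumptions made for the example, not something the slides or Pyle prescribe.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MinMaxPIE:
    """Hypothetical one-variable PIE based on min-max scaling (an assumption)."""
    lo: float
    hi: float

    @classmethod
    def fit(cls, training_values: Sequence[float]) -> "MinMaxPIE":
        # Step 1: learn the preparation from the training/testing data only.
        return cls(lo=min(training_values), hi=max(training_values))

    def pie_in(self, values: Sequence[float]) -> List[float]:
        # Step 3: apply the same preparation to any data, old or new.
        span = (self.hi - self.lo) or 1.0  # guard against a constant column
        return [(v - self.lo) / span for v in values]

    def pie_out(self, prepared: Sequence[float]) -> List[float]:
        # Step 2: transform prepared values back to the original scale.
        span = (self.hi - self.lo) or 1.0
        return [p * span + self.lo for p in prepared]


training = [10.0, 20.0, 30.0]
pie = MinMaxPIE.fit(training)
prepared_new = pie.pie_in([15.0, 30.0])  # new data, same preparation
restored = pie.pie_out(prepared_new)     # back to original units
```

The point of keeping `pie_in` and `pie_out` on one object is that the parameters learned from the training data are reused unchanged for new data and for reporting results.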
Why survey? Get a broad idea of the data: • what is covered • what is not covered, or is covered poorly Dangerous areas: • bias in data • sparse data (in a dynamic area) Is the data adequate?
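One simple way to get a first idea of what is covered poorly is to count observations per value range. The sketch below is an assumption-laden illustration: the function names, the equal-width binning, the `min_count` threshold, and the sample values are all made up for the example.

```python
from typing import List, Sequence


def coverage_by_bin(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count how many values fall into each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi lands in the last bin.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts


def sparse_bins(counts: Sequence[int], min_count: int) -> List[int]:
    """Return indices of bins with fewer than `min_count` observations."""
    return [i for i, c in enumerate(counts) if c < min_count]


# Made-up sample: well covered at the low end, nothing between 60 and 80.
sample = [1, 3, 5, 12, 18, 22, 25, 31, 38, 45, 52, 95]
counts = coverage_by_bin(sample, lo=0, hi=100, bins=5)
flagged = sparse_bins(counts, min_count=2)
```

Bins that end up in `flagged` are candidates for the "sparse data" warning; whether they matter depends on how dynamic that area of the problem is.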
Modeling hype • Universal approximator: can be applied to any data • Data-driven: no theoretical knowledge required
Modeling definition Model: “a representation … to show the construction or serve as a copy of something” = makes information understandable or usable
Modeling in data mining Modeling is iterative: 1. Define problem 2. Select tool 3. Collect data 4. Make model 5. Apply 6. Evaluate Traditional statistical methods: first model, then data
Model types • Active or passive • Explanatory or predictive • Static or continuously learning
Ten golden rules 1. Select clear problem with tangible benefit 2. Specify required solution 3. Define how solution is implemented 4. Understand the domain 5. Let the problem drive the modeling 6. Stipulate assumptions 7. Refine the model iteratively 8. Make the model as simple as possible (but no simpler) 9. Find areas of instability 10. Find areas of uncertainty
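Rule 9 can be made concrete by probing a model on a grid of inputs and flagging intervals where a small input change produces a large output change. This is a minimal sketch under stated assumptions: `unstable_regions`, `toy_model`, the grid spacing, and the `max_jump` threshold are hypothetical choices for illustration, not a method given in the slides.

```python
from typing import Callable, List, Tuple


def unstable_regions(
    model: Callable[[float], float],
    lo: float,
    hi: float,
    steps: int,
    max_jump: float,
) -> List[Tuple[float, float]]:
    """Return grid intervals where the model output changes by more than max_jump."""
    step = (hi - lo) / steps
    xs = [lo + i * step for i in range(steps + 1)]
    ys = [model(x) for x in xs]
    return [
        (xs[i], xs[i + 1])
        for i in range(steps)
        if abs(ys[i + 1] - ys[i]) > max_jump
    ]


def toy_model(x: float) -> float:
    # Hypothetical model with a sharp threshold at x = 5.
    return 0.0 if x < 5 else 10.0


regions = unstable_regions(toy_model, lo=0.0, hi=10.0, steps=10, max_jump=1.0)
```

A one-dimensional grid is the simplest case; for models with several inputs the same idea applies by perturbing one input at a time.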
Critique • Model evaluation is missing • Iteration of planning stage • Domain expert as data miner