110 likes | 318 Views
The KDD Process for Extracting Useful Knowledge from Volumes of Data. Fayyad, Piatetsky -Shapiro, and Smyth Ian Kim SWHIG Seminar. Overview. What can we gain from data? Business and marketing applications Public p olicy decision-making Scientific research
E N D
The KDD Process for Extracting Useful Knowledge from Volumes of Data Fayyad, Piatetsky-Shapiro, and Smyth Ian Kim SWHIG Seminar
Overview • What can we gain from data? • Business and marketing applications • Public policy decision-making • Scientific research • Why do we need the KDD process? • Increasing use of data analytics • Size of databases involved • Being able to access raw data isn’t enough
Part 1:Selection • Formulating the target dataset • What kinds of records to consider? • Desired fields? • Incorporates domain knowledge • Background knowledge in relevant field • Goals of the dataset
Part 2:Pre-processing • Preparing raw data for transformation • Removal of noise, outliers • Strategy for handling missing records • Missing/unknown value mappings
Part 3:Transformation • Data reduction • Grouping to reduce number of variables considered • Aggregation to higher row unit • Useful representations of data • Summary statistics
Part 4:Data Mining • Selection of data model • Summarization, classification, clustering, regression analysis • Searching for patterns in data
Part 5:Interpretation • Interpreting the model used in the previous step • Check results if they make sense • Consider different models, returning to prior steps • Utilize the obtained results
Challenges of KDD • Massive datasets • Algorithmic efficiency, approximation, parallel processing • Making interaction possible for analysts • Develop better tools that allow for human-computer interaction • Overfitting, measures of significance • Testing on randomly chosen sections • Missing or invalid data • Strategies to identify hidden variables and dependencies • Making data understandable by humans • Improved data visualization methods
Challenges of KDD • Rapidly changing data • Incrementally updating discovered patterns • Integration • Coordinating database tools (OLAP) and data mining tools • Nonstandard data (e.g. multimedia) • “Beyond the scope of current KDD technology”
Conclusion • Emerging nature of KDD & data mining fields • Human interaction still necessary • Incorporating machines to cope with scale of data • Improve tools to make better decisions using data