The KDD Process for Extracting Useful Knowledge from Volumes of Data

Fayyad, Piatetsky-Shapiro, and Smyth
Ian Kim
SWHIG Seminar

Overview
• What can we gain from data?
  • Business and marketing applications
  • Public policy decision-making
  • Scientific research
• Why do we need the KDD process?
  • Increasing use of data analytics
  • Size of databases involved
  • Being able to access raw data isn’t enough

The KDD Process

Part 1: Selection
• Formulating the target dataset
  • What kinds of records to consider?
  • Desired fields?
• Incorporates domain knowledge
  • Background knowledge in relevant field
  • Goals of the dataset
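
As a rough illustration of the selection step, the sketch below keeps only the records and fields relevant to a goal. The record layout, the field names, and the `select_target` helper are all hypothetical and are not taken from the paper; the filter stands in for domain knowledge about which records matter.

```python
# Hypothetical raw records; the field names are illustrative only.
raw_records = [
    {"id": 1, "region": "west", "age": 34, "spend": 120.0, "notes": "n/a"},
    {"id": 2, "region": "east", "age": 51, "spend": 80.5, "notes": ""},
    {"id": 3, "region": "west", "age": 27, "spend": 42.0, "notes": "call back"},
]


def select_target(records, keep_record, fields):
    """Keep records that pass keep_record, restricted to the desired fields."""
    return [{f: r[f] for f in fields} for r in records if keep_record(r)]


# Domain knowledge (assumed here): only the west region matters for this goal.
target = select_target(
    raw_records,
    keep_record=lambda r: r["region"] == "west",
    fields=["id", "age", "spend"],
)
```

The filter and the field list are passed in as parameters so that the choice of target dataset stays explicit and easy to revisit.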

Part 2: Pre-processing
• Preparing raw data for transformation
  • Removal of noise, outliers
  • Strategy for handling missing records
  • Missing/unknown value mappings
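
One possible way to carry out these pre-processing tasks is sketched below. The sentinel codes in `MISSING_CODES`, the median-absolute-deviation outlier rule, and median imputation are assumptions chosen for illustration; the slides do not prescribe a specific strategy.

```python
from statistics import median

# Assumed sentinel codes that stand for "missing" in the raw data.
MISSING_CODES = {"", "NA", "?", -999}


def map_missing(value):
    """Map known missing/unknown codes to None."""
    return None if value in MISSING_CODES else value


def drop_outliers(values, k=5.0):
    """Set values farther than k median-absolute-deviations from the median to None."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)
    return [v if v is None or abs(v - med) <= k * mad else None for v in values]


def impute_median(values):
    """Fill None entries with the median of the remaining values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


raw = [12.0, "NA", 11.5, -999, 13.0, 250.0, 12.5]
mapped = [map_missing(v) for v in raw]
cleaned = impute_median(drop_outliers(mapped))
```

In this run the two sentinel values and the reading of 250.0 are treated as missing and then filled with the median of the four remaining values.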

Part 3: Transformation
• Data reduction
  • Grouping to reduce number of variables considered
  • Aggregation to higher row unit
• Useful representations of data
  • Summary statistics
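
Aggregation to a higher row unit can be sketched as follows: individual transactions are rolled up to one row per customer and described by summary statistics. The transaction data, the field names, and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction-level rows.
transactions = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 4.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 6.0},
    {"customer": "a", "amount": 20.0},
]


def aggregate(rows, key, value):
    """Group rows by key and summarize the value field for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "total": sum(v), "mean": mean(v), "max": max(v)}
        for k, v in groups.items()
    }


# One summary row per customer instead of one row per transaction.
per_customer = aggregate(transactions, "customer", "amount")
```

Five transaction rows become two customer rows, which is the kind of reduction the slide refers to.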

Part 4: Data Mining
• Selection of data model
  • Summarization, classification, clustering, regression analysis
• Searching for patterns in data
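
To make one of the listed model families concrete, here is a minimal sketch of clustering using k-means (Lloyd's algorithm) on 2-D points. The choice of k-means, the sample points, and the fixed starting centroids are assumptions for illustration; the slides name clustering only in general terms.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids.

    Returns the final centroids and the cluster assignments from the last pass.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c
            else old  # keep an empty cluster's centroid where it was
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Illustrative data: two visually separate groups of points.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Fixed starting centroids keep the run deterministic; in practice the starting points are usually chosen at random and the algorithm is run several times.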

Part 5: Interpretation
• Interpreting the model used in the previous step
  • Check whether the results make sense
  • Consider different models, returning to prior steps
• Utilize the obtained results

Challenges of KDD
• Massive datasets
  • Algorithmic efficiency, approximation, parallel processing
• Making interaction possible for analysts
  • Develop better tools that allow for human-computer interaction
• Overfitting, measures of significance
  • Testing on randomly chosen sections
• Missing or invalid data
  • Strategies to identify hidden variables and dependencies
• Making data understandable by humans
  • Improved data visualization methods
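
The point about overfitting and testing on randomly chosen sections can be illustrated with a holdout split. In the sketch below, the synthetic data, the 30% test fraction, and the deliberately overfit "memorizer" model are all assumptions made for the example.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(model, rows):
    """Average squared difference between model(x) and the observed y."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data (illustrative): y = 2x plus Gaussian noise.
noise = random.Random(42)
data = [(x, 2 * x + noise.gauss(0, 1)) for x in range(40)]
train, test = holdout_split(data)


def memorizer(x):
    """Overfit model: return the y of the training row with the closest x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


train_error = mean_squared_error(memorizer, train)
test_error = mean_squared_error(memorizer, test)
```

The memorizer reproduces every training row exactly, so its training error is zero, while its error on the held-out rows is not. That gap is what a random holdout is meant to expose.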

Challenges of KDD (continued)
• Rapidly changing data
  • Incrementally updating discovered patterns
• Integration
  • Coordinating database tools (OLAP) and data mining tools
• Nonstandard data (e.g. multimedia)
  • “Beyond the scope of current KDD technology”
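
A very small instance of incremental updating is a running summary statistic that absorbs each new record without rescanning the old ones. The sketch below uses Welford's online algorithm for mean and variance; treating a summary statistic as the "discovered pattern" is a simplifying assumption, and the `RunningStats` class is hypothetical.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the summary in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # illustrative stream of new records
    stats.update(value)
```

Each update costs constant time and memory, which is the property that matters when data arrives faster than it can be reprocessed from scratch.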

Conclusion
• Emerging nature of KDD & data mining fields
• Human interaction still necessary
• Incorporating machines to cope with scale of data
• Improve tools to make better decisions using data
