Instance Construction via Likelihood-Based Data Squashing

Madigan, D., et al. (Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers) • Summarized by: Jinsan Yang, SNU Biointelligence Lab

Abstract • Data compression method: squashing • LDS: likelihood-based data squashing • Keywords: instance construction, data squashing

Outline • Introduction • The LDS Algorithm • Evaluation: Logistic Regression • Evaluation: Neural Networks • Iterative LDS • Discussion

Introduction • Massive data examples • Large-scale retailing • Telecommunications • Astronomy • Computational biology • Internet logging • Some computational challenges • Multiple passes over the data are needed • Data access is 10^5 to 10^6 times slower than main memory • Current solution: scaling up existing algorithms • Here: scaling down the data • Data squashing: 750000 → 8443 (DuMouchel et al., 1999) • Outperforms a random sample of size 7543 by a factor of 500 in MSE

LDS Algorithm • Motivation: Bayes' rule • Given three data points d1, d2, d3, estimate the parameter θ • Cluster the data points by likelihood profile
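The formula on this slide did not survive extraction. As a hedged reconstruction of the idea, assuming independent observations and a prior p(θ) (both are assumptions, not taken from the slide), Bayes' rule gives:

```latex
p(\theta \mid d_1, d_2, d_3) \;\propto\; p(\theta) \prod_{i=1}^{3} p(d_i \mid \theta)
```

If two points, say d1 and d2, have the same likelihood profile, i.e. p(d1 | θ) = p(d2 | θ) for every θ of interest, the product reduces to p(d1 | θ)^2 p(d3 | θ). A single pseudo point with weight 2 then carries the same information as the pair, which is why clustering by likelihood profile is a natural way to squash.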

LDS Algorithm • Details of the LDS algorithm • [Select] Choose values of θ by a central composite design • (Figure: central composite design for 3 factors)
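A central composite design for k factors consists of the 2^k factorial corners, 2k axial points, and one center point. The sketch below generates such a design; centering it on an initial estimate and scaling each factor by a spread such as a standard error are assumptions for illustration, as is the function name `central_composite_design` and the default axial distance.

```python
from itertools import product

def central_composite_design(center, scale, alpha=None):
    """Return CCD points around `center`, with per-factor half-widths `scale`."""
    k = len(center)
    if alpha is None:
        # Assumed default: axial points at the same radius as the corners.
        alpha = k ** 0.5
    # Coded design: 2^k factorial corners ...
    coded = [list(corner) for corner in product((-1.0, 1.0), repeat=k)]
    # ... plus 2k axial (star) points ...
    for j in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[j] = sign * alpha
            coded.append(point)
    # ... plus one center point.
    coded.append([0.0] * k)
    # Map coded units onto the parameter scale.
    return [[c + s * z for c, s, z in zip(center, scale, row)] for row in coded]

points = central_composite_design(center=[0.0, 0.0, 0.0], scale=[1.0, 1.0, 1.0])
print(len(points))  # 8 corners + 6 axial points + 1 center = 15
```

For 3 factors this yields 15 values of θ, so each likelihood profile in the next step would have 15 entries.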

LDS Algorithm • [Profile] Evaluate the likelihood profile of each data point at the selected values • [Cluster] Cluster the mother data in a single pass • Select n' random samples as initial cluster centers • Assign each remaining data point to a cluster • [Construct] Construct the pseudo data: • cluster center
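The Profile, Cluster, and Construct steps can be sketched as follows for logistic regression. This is a minimal illustration under stated assumptions, not the authors' implementation: nearest-center assignment uses squared Euclidean distance between profiles, each pseudo point is the cluster mean with the cluster size as its weight, and averaging the response gives a fractional y. The names `log_lik` and `squash` are hypothetical.

```python
import math
import random

def log_lik(theta, x, y):
    """Log-likelihood of one observation (x, y) under logistic regression."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # y*z - log(1 + exp(z)), with the log term computed stably.
    return y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))

def squash(data, thetas, n_clusters, seed=0):
    rng = random.Random(seed)
    # [Profile] one likelihood profile per observation.
    profiles = [[log_lik(t, x, y) for t in thetas] for x, y in data]
    # [Cluster] random initial centers, then a single pass over the rest.
    centers = rng.sample(range(len(data)), n_clusters)
    members = {c: [c] for c in centers}
    chosen = set(centers)
    for i, prof in enumerate(profiles):
        if i in chosen:
            continue
        nearest = min(
            centers,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(prof, profiles[c])),
        )
        members[nearest].append(i)
    # [Construct] pseudo point = cluster mean (assumption), weight = cluster size.
    dim = len(data[0][0])
    pseudo = []
    for idx in members.values():
        n = len(idx)
        x_mean = [sum(data[i][0][j] for i in idx) / n for j in range(dim)]
        y_mean = sum(data[i][1] for i in idx) / n
        pseudo.append((x_mean, y_mean, n))
    return pseudo

# Synthetic mother data: intercept plus one covariate (illustrative values only).
rng = random.Random(1)
data = []
for _ in range(500):
    x = [1.0, rng.gauss(0.0, 1.0)]
    p = 1.0 / (1.0 + math.exp(-(0.5 + 1.0 * x[1])))
    data.append((x, 1 if rng.random() < p else 0))

thetas = [[0.5, 1.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.5, 1.5]]
pseudo = squash(data, thetas, n_clusters=25)
print(len(pseudo), sum(w for _, _, w in pseudo))  # 25 pseudo points whose weights sum to 500
```

The weights matter downstream: any model fitted to the pseudo data would need to use them so that the squashed likelihood approximates the likelihood of the mother data.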

Evaluation: Logistic Regression • Small-scale simulations: • Initial estimate of θ • Plot: log(error ratio) • Three methods of initial parameter estimation • 100 data points / 48 squashed data points

Evaluation: Logistic Regression • Medium scale: 100000 data points; baseline: 1% simple random sample

Evaluation: Logistic Regression • Large scale: 744963 data points; baseline: 1% simple random sample

Evaluation: Neural Networks • Feed-forward network: two input nodes, one hidden layer with 3 units, single binary output • Mother data: 10000; squashed data: 1000; repetitions: 30; test data: 1000 from the same network • Comparison of P(whole) - P(reduced)

Iterative LDS • Used when the initial estimate of θ is not accurate • 1. Set θ from a simple random sample • 2. Squash by LDS • 3. Estimate θ from the squashed data • 4. Go to 2.
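The loop above can be sketched for a one-parameter logistic regression (no intercept). Everything beyond the four listed steps is an assumption made for illustration: the fixed number of rounds (the slide gives no stopping rule), the three design points θ - delta, θ, θ + delta, the weighted Newton fit, and all function names.

```python
import math
import random

def log_lik(theta, x, y):
    z = theta * x
    return y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))

def fit(points, theta=0.0, steps=25):
    """Weighted one-parameter logistic fit by Newton's method; points are (x, y, w)."""
    for _ in range(steps):
        grad = hess = 0.0
        for x, y, w in points:
            z = max(-30.0, min(30.0, theta * x))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            grad += w * (y - p) * x
            hess += w * p * (1.0 - p) * x * x
        if hess < 1e-12:
            break
        theta += grad / hess
    return theta

def squash(data, thetas, n_clusters, rng):
    """Single-pass clustering on likelihood profiles; returns (x, y, weight) triples."""
    profiles = [[log_lik(t, x, y) for t in thetas] for x, y in data]
    centers = rng.sample(range(len(data)), n_clusters)
    members = {c: [c] for c in centers}
    chosen = set(centers)
    for i, prof in enumerate(profiles):
        if i not in chosen:
            nearest = min(
                centers,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(prof, profiles[c])),
            )
            members[nearest].append(i)
    return [
        (sum(data[i][0] for i in idx) / len(idx),
         sum(data[i][1] for i in idx) / len(idx),
         len(idx))
        for idx in members.values()
    ]

def iterative_lds(data, n_clusters, rounds=3, sample_frac=0.05, delta=0.5, seed=0):
    rng = random.Random(seed)
    # 1. Initial theta from a simple random sample.
    sample = rng.sample(data, max(1, int(sample_frac * len(data))))
    theta = fit([(x, y, 1) for x, y in sample])
    pseudo = []
    for _ in range(rounds):
        # 2. Squash by LDS around the current estimate.
        pseudo = squash(data, [theta - delta, theta, theta + delta], n_clusters, rng)
        # 3. Re-estimate theta from the weighted pseudo data; 4. repeat.
        theta = fit(pseudo, theta)
    return theta, pseudo

# Synthetic mother data with a hypothetical true coefficient of 1.5.
gen = random.Random(2)
data = []
for _ in range(2000):
    x = gen.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-1.5 * x))
    data.append((x, 1 if gen.random() < p else 0))

theta_hat, pseudo = iterative_lds(data, n_clusters=50)
print(theta_hat)
```

Each round re-centers the design on the latest estimate, so the profiles are evaluated closer to where the likelihood of the mother data actually peaks.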
