Instance Construction via Likelihood-Based Data Squashing. Madigan D., et al. (Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers). Summarized by Jinsan Yang, SNU Biointelligence Lab.
Abstract
• Data compression method: squashing
• LDS: likelihood-based data squashing
• Keywords: instance construction, data squashing
Outline
• Introduction
• The LDS Algorithm
• Evaluation: Logistic Regression
• Evaluation: Neural Networks
• Iterative LDS
• Discussion
Introduction
• Massive data examples
  • Large-scale retailing
  • Telecommunications
  • Astronomy
  • Computational biology
  • Internet logging
• Some computational challenges
  • Multiple passes over the data are needed
  • Access to data outside main memory is 10^5 to 10^6 times slower than main memory
• Current solution: scaling up existing algorithms
• Here: scaling down the data
  • Data squashing: 750000 → 8443 data points (DuMouchel et al., 1999)
  • Outperforms a random sample of size 7543 by a factor of 500 in MSE
LDS Algorithm
• Motivation: Bayes' rule
  • Given three data points d1, d2, d3, estimate the parameter θ
  • Cluster the data points by likelihood profile
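A sketch of the Bayesian argument behind this slide, assuming the data points are conditionally independent given θ. The pseudo point d* and its weight of 2 are illustrative symbols, not notation taken from the slide:

```latex
\[
P(\theta \mid d_1, d_2, d_3) \;\propto\; P(\theta)\, P(d_1 \mid \theta)\, P(d_2 \mid \theta)\, P(d_3 \mid \theta)
\]
% If d_1 and d_2 have nearly the same likelihood profile,
% P(d_1 | \theta) \approx P(d_2 | \theta) \approx P(d^* | \theta) over the \theta values of interest,
% a single pseudo point d^* with weight 2 can stand in for both:
\[
P(\theta \mid d_1, d_2, d_3) \;\approx\; c \cdot P(\theta)\, P(d^* \mid \theta)^{2}\, P(d_3 \mid \theta)
\]
```

Under this reading, points whose likelihoods move together as θ varies carry nearly the same information about θ, which is why clustering is done on likelihood profiles rather than on the raw data values.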
LDS Algorithm
• Details of the LDS algorithm
  • [Select] Choose values of θ by a central composite design
  • Figure: central composite design for 3 factors
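The [Select] step can be sketched as follows. This is a minimal illustration of a standard central composite design (one center point, 2^k factorial corners, 2k axial points), not the authors' code. The names `center`, `step`, and `alpha` are assumptions, and the rotatable default for the axial distance is a common textbook choice rather than something stated on the slide.

```python
from itertools import product

def central_composite_design(center, step, alpha=None):
    """Return the points of a central composite design around `center`.

    center: list of k floats, e.g. an initial estimate of theta.
    step:   list of k floats, the spacing per factor (e.g. a standard error).
    alpha:  axial distance in coded units; defaults to (2**k) ** 0.25,
            the usual rotatable choice for a full factorial core.
    """
    k = len(center)
    if alpha is None:
        alpha = (2 ** k) ** 0.25
    coded = [tuple([0.0] * k)]                        # 1 center point
    for signs in product((-1.0, 1.0), repeat=k):      # 2^k factorial corners
        coded.append(tuple(signs))
    for i in range(k):                                # 2k axial points
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            coded.append(tuple(point))
    # Map coded units back to the parameter scale.
    return [tuple(c + s * x for c, s, x in zip(center, step, pt)) for pt in coded]

# Three factors give 1 + 8 + 6 = 15 candidate values of theta.
thetas = central_composite_design([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Each returned tuple is one value of θ at which the likelihood profile is later evaluated.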
LDS Algorithm
• [Profile] Evaluate the likelihood profile of each data point
• [Cluster] Cluster the mother data in a single pass
  • Select n’ random samples as initial cluster centers
  • Assign each remaining data point to the closest cluster center
• [Construct] Construct the pseudo data
  • Each pseudo data point is a cluster center
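The [Profile], [Cluster], and [Construct] steps can be sketched as below. This is a hedged illustration, not the authors' implementation. It assumes squared Euclidean distance between likelihood profiles, cluster centers that stay fixed after being sampled, and pseudo points formed as the cluster mean with the cluster size as weight. The function names and the logistic-regression likelihood are chosen for the example.

```python
import math
import random

def logistic_log_lik(theta, record):
    """Log-likelihood of one record (x_1, ..., x_p, y) under logistic regression."""
    *x, y = record
    eta = sum(t * xi for t, xi in zip(theta, x))
    # y * eta - log(1 + exp(eta)), with the log term computed stably
    return y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))

def squash(records, log_lik, thetas, n_clusters, seed=0):
    """Return pseudo data as a list of (mean_record, weight) pairs."""
    rng = random.Random(seed)

    def profile(record):
        # [Profile] log-likelihood of the record at every selected theta
        return [log_lik(th, record) for th in thetas]

    # [Cluster] n' randomly chosen records act as fixed cluster centers
    centers = [profile(r) for r in rng.sample(records, n_clusters)]
    sums = [[0.0] * len(records[0]) for _ in centers]
    counts = [0] * n_clusters
    for r in records:                      # single pass over the mother data
        prof = profile(r)
        dists = [sum((a - b) ** 2 for a, b in zip(prof, c)) for c in centers]
        k = dists.index(min(dists))        # closest center in profile space
        counts[k] += 1
        sums[k] = [s + v for s, v in zip(sums[k], r)]

    # [Construct] one pseudo point per non-empty cluster
    return [(tuple(s / n for s in tot), n) for tot, n in zip(sums, counts) if n > 0]
```

Averaging a binary response within a cluster yields fractional values; clustering each response class separately is one way to avoid that, at the cost of more clusters.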
Evaluation: Logistic Regression
• Small-scale simulations
  • Initial estimate of θ
  • Plot: log(error ratio)
  • Three methods of initial parameter estimation
  • 100 data points / 48 squashed data points
Evaluation: Logistic Regression
• Medium scale: 100000 data points; baseline: 1% simple random sample
Evaluation: Logistic Regression
• Large scale: 744963 data points; baseline: 1% simple random sample
Evaluation: Neural Networks
• Feed-forward network: two input nodes, one hidden layer with 3 units, single binary output
• Mother data: 10000; squashed data: 1000; repetitions: 30; test data: 1000 from the same network
• Comparison of P(whole) - P(reduced)
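The experimental setup can be sketched as below, under several assumptions that the slide does not state: tanh hidden units, a logistic output, standard normal inputs and weights, and the mean absolute difference as the summary of P(whole) - P(reduced). All function names are illustrative.

```python
import math
import random

def forward(params, x):
    """P(y = 1 | x) for a 2-3-1 feed-forward network.

    params = (W1, b1, w2, b2): W1 has 3 rows of 2 input weights, b1 has 3
    hidden biases, w2 has 3 output weights, b2 is the output bias.
    """
    W1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def random_params(rng):
    """Draw network weights from a standard normal (an assumption)."""
    W1 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
    b1 = [rng.gauss(0, 1) for _ in range(3)]
    w2 = [rng.gauss(0, 1) for _ in range(3)]
    return W1, b1, w2, rng.gauss(0, 1)

def sample_data(params, n, rng):
    """Simulate n (x, y) pairs from the network."""
    data = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        y = 1 if rng.random() < forward(params, x) else 0
        data.append((x, y))
    return data

def mean_abs_diff(params_whole, params_reduced, test_data):
    """Average |P(whole) - P(reduced)| over the test inputs."""
    diffs = [abs(forward(params_whole, x) - forward(params_reduced, x))
             for x, _ in test_data]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
true_params = random_params(rng)
mother = sample_data(true_params, 10000, rng)   # mother data
test = sample_data(true_params, 1000, rng)      # test data from the same network
```

Here `params_whole` and `params_reduced` would be the weights of networks trained on the mother data and on the 1000 squashed points; the training step itself is left out of the sketch.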
Iterative LDS
• For when the estimate of θ is not accurate
  1. Set θ from a simple random sample
  2. Squash by LDS
  3. Estimate θ
  4. Go to 2
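The loop above can be sketched for a one-parameter logistic model, P(y = 1 | x) = sigmoid(θx). This is a minimal illustration under stated assumptions, not the authors' code: the values of θ used for profiling are a simple three-point grid around the current estimate (standing in for the central composite design), each response class is clustered separately so pseudo responses stay binary, and the loop stops after a fixed `n_iter` because the slide gives no stopping rule.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_sigmoid(z):
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def fit(points, iters=25):
    """Weighted MLE of theta by Newton's method; points are (x, y, weight)."""
    theta = 0.0
    for _ in range(iters):
        grad = hess = 0.0
        for x, y, w in points:
            p = sigmoid(theta * x)
            grad += w * (y - p) * x
            hess += w * p * (1.0 - p) * x * x
        if hess < 1e-12:
            break
        theta += grad / hess
    return theta

def squash(data, thetas, n_clusters, rng):
    """One LDS pass over (x, y) pairs, clustering each class separately."""
    pseudo = []
    for label, sign in ((0, -1.0), (1, 1.0)):
        xs = [x for x, y in data if y == label]
        if not xs:
            continue
        # Log-likelihood of (x, label) at theta is log sigmoid(sign * theta * x).
        def profile(x):
            return [log_sigmoid(sign * th * x) for th in thetas]
        centers = [profile(x) for x in rng.sample(xs, min(n_clusters, len(xs)))]
        totals = [[0.0, 0] for _ in centers]       # per cluster: sum of x, count
        for x in xs:                               # single pass
            prof = profile(x)
            dists = [sum((a - b) ** 2 for a, b in zip(prof, c)) for c in centers]
            k = dists.index(min(dists))
            totals[k][0] += x
            totals[k][1] += 1
        pseudo += [(sx / n, label, n) for sx, n in totals if n > 0]
    return pseudo

def iterative_lds(data, n_clusters=25, n_iter=3, step=0.5, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(data, 2 * n_clusters)
    theta = fit([(x, y, 1) for x, y in sample])          # 1. theta from a random sample
    for _ in range(n_iter):                              # 4. go to 2
        thetas = [theta - step, theta, theta + step]
        pseudo = squash(data, thetas, n_clusters, rng)   # 2. squash by LDS
        theta = fit(pseudo)                              # 3. estimate theta
    return theta
```

Each iteration re-centers the profiling grid on the latest estimate, so the clusters are formed where the likelihood is currently believed to matter most.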