190 likes | 202 Views
Explore a novel multivariate discretization approach, MVD, improving pattern discovery by considering interactions between variables. Efficiently merge intervals with similar distributions, uncovering hidden insights. Experiment results show high accuracy and comparable runtime.
E N D
Multivariate Discretization of Continuous Variables for Set Mining Author:Stephen D. Bay Advisor: Dr. Hsu Graduate: Kuo-wei Chen
Outline • Motivation • Objective • Introduction (1)~(2) • Multivariate Discretization Approach(1)~(5) • Experiment (1)~(6) • Conclusions • Opinion
Motivation • Most discretization method are univariate and consider only a single feature at a time.This is a sub-optimal approach for knowledge discovery as univariate discretization can destroy hidden patterns in data.
Objective • To describe why univariate is scarcely comparable to multivariate. • Present a bottom up merging algorithm that is called “MVD” • Present an experiment to prove that MVD’s execute time is more efficient than other univariate approaches.
Introduction(1) • In Knowledge Discovery , to promote predictive accuracy is not the most important thing. • The emphasis is previously unknown and insightful patterns. • The discretized intervals should not hide patterns. • The intervals should be semantically meaningful. • Multivariate discretization one considers how all the variables interact before deciding on discretized intervals.
Introduction(2) • Example
Multivariate Discretization Approach(1) • Past Discretization Approaches • Univariate • Miss interactions of several variables • Executable Time is long: O(n2) • Many Rules
Multivariate Discretization Approach(2) • STUCCO • Find large differences between two probability distributions • The mining objectives of STUCCO P(C|G1) p(C|G2) ……(1) |support(C|G1) support(C|G2)| ……(2) • Control the merging process.
Multivariate Discretization Approach(3) • Algorithm Step 1.Partition all continuous attributes into n basic intervals 2.Merging adjacent intervals X and Y where they have the minmum combined support. 3.If Fx~Fy then merge X and Y. 4.If there are no eligible intervals stop.Otherwise go to 2.
Multivariate Discretization Approach(4) • Efficiency • STUCCO runs efficientl on many datasets. • The problems STUCCO are often easier than that faced by the main mining program. • Only to find single difference between the groups • Calling STUCCO repeatedly will result in many passes over the database.
Multivariate Discretization Approach(5) • Sensitivity to hidden Patterns • Parity R+I • Eexample
Experiment(1) • Sun Ultra-5 with 128MB • Parameter settings
Experiment(2) • Discretization Time in CPU seconds
Experiment(3) • Qualitative Results • Discretization Cutpoints for Age on the Adult Census Data
Experiment(4) • Qualitative Results • Discretization Cutpoints for Capital-Loss on the Adult Census Data
Experiment(5) • Qualitative Results • Discretization Cutpoints for Parental Income on the UCI Admission Data
Experiment(6) • Qualitative Results • Discretization Cutpoints for GPA on the UCI Admission Data
Conclusions • The MVD algorithm can finely partitions continuous variables and then merges adjacent intervals continuous variables only if their instances have similar multivariate distributions. • Experimental results indicate that the MVD algorithm detect high dimensional interactions between feature and discretize the data appropriately. • The MVD algorithm run in time comparable to a popular univariate recursive approach.
Opinion • If the adjacent intervals don’t have similar distributions between them , then MVD algorithm won’t be efficient. Generally ,this condition is usually occurred.