Multivariate Discretization of Continuous Variables for Set Mining

Explore a novel multivariate discretization approach, MVD, improving pattern discovery by considering interactions between variables. Efficiently merge intervals with similar distributions, uncovering hidden insights. Experiment results show high accuracy and comparable runtime.

Presentation Transcript


  1. Multivariate Discretization of Continuous Variables for Set Mining • Author: Stephen D. Bay • Advisor: Dr. Hsu • Graduate: Kuo-wei Chen

  2. Outline • Motivation • Objective • Introduction (1)~(2) • Multivariate Discretization Approach(1)~(5) • Experiment (1)~(6) • Conclusions • Opinion

  3. Motivation • Most discretization methods are univariate and consider only a single feature at a time. This is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in the data.

  4. Objective • Explain why univariate discretization falls short of multivariate discretization. • Present a bottom-up merging algorithm called "MVD". • Present experiments comparing MVD's execution time with that of univariate approaches.

  5. Introduction(1) • In knowledge discovery, improving predictive accuracy is not the most important goal. • The emphasis is on finding previously unknown and insightful patterns. • The discretized intervals should not hide patterns. • The intervals should be semantically meaningful. • Multivariate discretization considers how all the variables interact before deciding on the discretized intervals.

  6. Introduction(2) • Example

  7. Multivariate Discretization Approach(1) • Past Discretization Approaches • Univariate • Misses interactions among several variables • Execution time is long: O(n^2) • Many rules

  8. Multivariate Discretization Approach(2) • STUCCO • Finds large differences between two probability distributions • The mining objectives of STUCCO: $P(C \mid G_1) \neq P(C \mid G_2)$ …… (1), $|\mathrm{support}(C \mid G_1) - \mathrm{support}(C \mid G_2)| \geq \delta$ …… (2) • Used to control the merging process.
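The two STUCCO objectives on this slide can be read as "the difference between the groups is statistically significant" and "the difference in support is at least some minimum δ". The sketch below checks both for a single candidate set C and two groups. It is only an illustration under stated assumptions: the function names, the default `delta`, and the fixed chi-square critical value 3.841 (α = 0.05, 1 degree of freedom) are not taken from the slides, and any multiple-testing correction is left out.

```python
def support(count, group_size):
    """Fraction of a group's rows that satisfy the candidate set C."""
    return count / group_size


def chi_square_2x2(count1, n1, count2, n2):
    """Pearson chi-square statistic for the 2x2 table (G1/G2) x (C true/false)."""
    observed = [[count1, n1 - count1], [count2, n2 - count2]]
    row_totals = [n1, n2]
    col_totals = [count1 + count2, (n1 - count1) + (n2 - count2)]
    total = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty cells to avoid dividing by zero
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def is_contrast(count1, n1, count2, n2, delta=0.05, critical=3.841):
    """True when both conditions hold for C (hypothetical helper).

    (1) significant: the chi-square statistic exceeds the critical value
    (2) large: the absolute support difference is at least delta
    """
    significant = chi_square_2x2(count1, n1, count2, n2) > critical
    large = abs(support(count1, n1) - support(count2, n2)) >= delta
    return significant and large


if __name__ == "__main__":
    # C holds for 80 of 100 rows in G1 but only 40 of 100 rows in G2.
    print(is_contrast(80, 100, 40, 100))
    # C is equally common in both groups.
    print(is_contrast(50, 100, 50, 100))
```

Keeping the two conditions separate makes it easy to see which one rejects a candidate: a tiny but significant difference fails on δ, a large difference in a small sample fails on significance.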

  9. Multivariate Discretization Approach(3) • Algorithm • Step 1. Partition all continuous attributes into n basic intervals. • Step 2. Select the adjacent intervals X and Y that have the minimum combined support. • Step 3. If F_X ~ F_Y (the two intervals have similar distributions), merge X and Y. • Step 4. If there are no eligible intervals, stop. Otherwise go to Step 2.
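The merging loop in the steps above can be sketched as follows. This is a minimal illustration, not the presented implementation: the basic intervals are passed in already built (Step 1), a pair that fails the similarity test is marked with a fixed boundary so it is no longer eligible (an assumption about what "eligible" means in Step 4), and `label_rate_similar` is a hypothetical stand-in for the STUCCO-based test of F_X ~ F_Y that only compares one binary attribute.

```python
def label_rate_similar(rows_x, rows_y, delta=0.1):
    """Stand-in similarity test: compare the rate of a binary label.

    Each row is a (value, label) tuple. The real test would compare the
    multivariate distributions of the two intervals.
    """
    def rate(rows):
        return sum(label for _, label in rows) / len(rows) if rows else 0.0

    return abs(rate(rows_x) - rate(rows_y)) < delta


def mvd_merge(basic_intervals, similar=label_rate_similar):
    """Bottom-up merging over an ordered list of basic intervals.

    basic_intervals: list of row lists, ordered along the attribute.
    Returns the merged intervals as a list of row lists.
    """
    intervals = [list(rows) for rows in basic_intervals]
    # fixed[i] is True once a cut point is placed between intervals i and i+1.
    fixed = [False] * (len(intervals) - 1)
    while True:
        eligible = [i for i, is_fixed in enumerate(fixed) if not is_fixed]
        if not eligible:  # Step 4: nothing left to consider
            break
        # Step 2: adjacent pair with the minimum combined support (row count).
        i = min(eligible, key=lambda k: len(intervals[k]) + len(intervals[k + 1]))
        if similar(intervals[i], intervals[i + 1]):
            # Step 3: merge X and Y into one interval.
            intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
            del fixed[i]
        else:
            fixed[i] = True  # keep a discretization boundary here
    return intervals


if __name__ == "__main__":
    basic = [
        [(0, 0), (1, 0)],
        [(2, 0), (3, 0)],
        [(4, 1), (5, 1)],
        [(6, 1), (7, 1)],
    ]
    for interval in mvd_merge(basic):
        print(interval[0][0], "to", interval[-1][0], "rows:", len(interval))
```

Each pass either merges a pair or fixes a boundary, so the loop ends after at most n - 1 passes over n basic intervals; the cost is dominated by how expensive the similarity test is.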

  10. Multivariate Discretization Approach(4) • Efficiency • STUCCO runs efficiently on many datasets. • The problems posed to STUCCO are often easier than those faced by the main mining program. • It only needs to find a single difference between the groups. • Calling STUCCO repeatedly results in many passes over the database.

  11. Multivariate Discretization Approach(5) • Sensitivity to Hidden Patterns • Parity R+I • Example
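The slide names the parity problem without defining it, so the sketch below uses an assumed two-variable construction to show why a parity pattern is hidden from univariate methods: the class is 1 when `x1` and `x2` have the same sign. The grid of values, the function names, and the choice of two relevant variables with no irrelevant ones are all illustrative assumptions.

```python
from itertools import product


def parity_class(x1, x2):
    """Class 1 when both variables share a sign, else class 0 (assumed setup)."""
    return 1 if (x1 > 0) == (x2 > 0) else 0


def class_rate(rows):
    """Fraction of rows labelled class 1."""
    return sum(label for _, _, label in rows) / len(rows)


# Evenly spaced values in [-0.45, 0.45]; zero is deliberately excluded.
VALUES = [k / 20 for k in range(-9, 10, 2)]
ROWS = [(x1, x2, parity_class(x1, x2)) for x1, x2 in product(VALUES, repeat=2)]

# Univariate view: splitting on x1 alone tells us nothing about the class.
rate_x1_neg = class_rate([r for r in ROWS if r[0] < 0])
rate_x1_pos = class_rate([r for r in ROWS if r[0] > 0])

# Multivariate view: conditioning on x2 as well reveals the pattern.
rate_both_pos = class_rate([r for r in ROWS if r[0] > 0 and r[1] > 0])
rate_mixed = class_rate([r for r in ROWS if r[0] > 0 and r[1] < 0])

if __name__ == "__main__":
    print("x1 < 0:", rate_x1_neg, " x1 > 0:", rate_x1_pos)
    print("x1 > 0, x2 > 0:", rate_both_pos, " x1 > 0, x2 < 0:", rate_mixed)
```

On this data a method that looks at `x1` alone sees identical class rates on both sides of zero and has no reason to keep a cut point there, while a test that also looks at `x2` sees the rates split completely.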

  12. Experiment(1) • Sun Ultra-5 with 128MB • Parameter settings

  13. Experiment(2) • Discretization Time in CPU seconds

  14. Experiment(3) • Qualitative Results • Discretization Cutpoints for Age on the Adult Census Data

  15. Experiment(4) • Qualitative Results • Discretization Cutpoints for Capital-Loss on the Adult Census Data

  16. Experiment(5) • Qualitative Results • Discretization Cutpoints for Parental Income on the UCI Admission Data

  17. Experiment(6) • Qualitative Results • Discretization Cutpoints for GPA on the UCI Admission Data

  18. Conclusions • The MVD algorithm finely partitions continuous variables and then merges adjacent intervals only if their instances have similar multivariate distributions. • Experimental results indicate that the MVD algorithm detects high-dimensional interactions between features and discretizes the data appropriately. • The MVD algorithm runs in time comparable to a popular univariate recursive approach.

  19. Opinion • If adjacent intervals do not have similar distributions, the MVD algorithm will not be efficient. In practice, this situation occurs often.
