Moon HUH (1), KwangRyeol SONG (2), YoungSuk PARK (1), KyungWook Shim
(1) Sungkyunkwan University, Seoul, Korea; (2) Kwansei Research Institute, Seoul, Korea
Data Exploration with DAVIS
Variable Selection
Purpose of DAVIS
• To visually explore the structure or pattern of data
Components of DAVIS
• Data Manipulation
• Statistical Tools
• Plots
• Graphic Controllers
Data Manipulation
• Observation/variable selection
• Focusing on or deleting a subset of the data set
• Missing value processing
• Discretization
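Two of the data manipulation steps listed above, missing value processing and discretization, can be illustrated with a minimal Python sketch. The slides do not say which policies DAVIS applies, so dropping missing entries and equal-width binning are assumptions chosen for simplicity, and the function names and the `ages` values are hypothetical.

```python
import math

def drop_missing(values):
    """Remove entries that are None or NaN (one simple missing-value policy)."""
    return [v for v in values
            if v is not None and not (isinstance(v, float) and math.isnan(v))]

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width intervals)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Illustrative values only (not taken from the slides).
ages = [22.0, None, 38.0, 4.0, float("nan"), 60.0, 80.0]
clean = drop_missing(ages)
bins = equal_width_bins(clean, 4)
print(clean, bins)
```

Equal-frequency binning or class-aware cut points are common alternatives when the values are heavily skewed.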
Plots: Univariate
• Bar Charts
• Histogram
• QQ Plot
• FEDF
• BoxPlot
• Parallel Coordinates
BoxPlot: Features
• Standardization
• Identification
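Standardization lets box plots of variables measured on different scales share one axis. The sketch below assumes z-score standardization (subtract the mean, divide by the sample standard deviation), which the slide does not confirm as the DAVIS method, and then computes the five numbers a box plot is drawn from. The `heights` values are made up for illustration.

```python
import statistics

def standardize(values):
    """Z-scores: zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Illustrative values only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights)
summary = five_number_summary(z)
print(summary)
```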
Parallel Coordinates: Features
• Direction of Plotting: Horizontal / Vertical
• Ordering of the Variables: Component / Permutation
• Jittering
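Jittering adds a small random offset to each plotted value so that tied observations, which are common on discrete axes, do not draw on top of one another. A minimal sketch, assuming uniform noise of a fixed half-width; the slides do not state the noise distribution DAVIS uses, and the `jitter` helper and its `amount` parameter are hypothetical names.

```python
import random

def jitter(values, amount, seed=0):
    """Return values shifted by uniform noise in [-amount, +amount]."""
    rng = random.Random(seed)  # seeded so that redrawing the plot is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]

# Illustrative values only: a discrete variable with many ties.
passenger_class = [1, 1, 2, 3, 3, 3]
jittered = jitter(passenger_class, amount=0.1)
print(jittered)
```

The offset is applied only to the drawn positions; the underlying data stay unchanged.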
Parallel Coordinates: Options
Plots: Multivariate
• Scatterplot
• Loess curve fitting
• Touring
• Dendrogram
• Line Mosaic Plot
• PCA plot
Scatterplot: Options
Touring: GrandTour / Tracking
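A grand tour shows a sequence of 2-D projections of p-dimensional data. The sketch below covers only one building block: drawing a random orthonormal pair of directions and projecting the data onto it. A real tour also interpolates smoothly between successive frames, which is omitted here. The function names and sample points are hypothetical and not taken from DAVIS.

```python
import math
import random

def random_frame(p, rng):
    """Return two orthonormal vectors in R^p spanning a random projection plane."""
    a = [rng.gauss(0, 1) for _ in range(p)]
    b = [rng.gauss(0, 1) for _ in range(p)]
    norm_a = math.sqrt(sum(x * x for x in a))
    a = [x / norm_a for x in a]
    # Gram-Schmidt step: remove from b its component along a, then normalise.
    dot = sum(x * y for x, y in zip(a, b))
    b = [y - dot * x for x, y in zip(a, b)]
    norm_b = math.sqrt(sum(y * y for y in b))
    b = [y / norm_b for y in b]
    return a, b

def project(points, frame):
    """Project each p-dimensional point onto the two frame vectors."""
    a, b = frame
    return [(sum(x * u for x, u in zip(pt, a)),
             sum(x * v for x, v in zip(pt, b))) for pt in points]

rng = random.Random(1)
frame = random_frame(4, rng)
# Illustrative 4-D points only.
points = [(5.1, 3.5, 1.4, 0.2), (6.2, 2.9, 4.3, 1.3), (7.7, 3.0, 6.1, 2.3)]
projected = project(points, frame)
print(projected)
```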
Dendrogram: Agglomeration / Distance options
Line Mosaic Plot: for discrete data
PCA plot
Real-time grouping with DAVIS: highlighting
• Manually group the data set into two subsets by brushing a subset of the data with the mouse
• The original data set can always be restored
Real-time grouping with DAVIS: deleting / focusing
Interactive clustering with DAVIS: linking
Clustering with DAVIS: EM with 3 groups
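For readers unfamiliar with EM clustering, the following sketch fits a three-component Gaussian mixture to one-dimensional data. It is a generic textbook EM loop, not the DAVIS implementation. The quantile-based initialisation, the variance floor, the iteration count, and the synthetic data are all assumptions made for this illustration.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k=3, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialisation: means at evenly spaced quantiles, shared overall variance.
    means = [ordered[int((2 * j + 1) * n / (2 * k))] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor avoids a collapsing component
    # Hard labels from the last E-step (most probable component per point).
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels

# Synthetic data: three well-separated groups (illustration only).
rng = random.Random(42)
data = [rng.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(60)]
weights, means, variances, labels = em_gmm_1d(data)
print(sorted(means))
```

The hard labels are what an interactive tool would typically map to colors; the soft responsibilities carry the uncertainty of each assignment.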
Coloring a subset: outlier detection
Touring with DAVIS: Tracking
• Can investigate the multidimensional structure of the data
Data exploration with Decision Trees: Titanic data
Decision Trees (2)
Variable selection with DAVIS
• Target (class) variable: discrete (nominal) type
• Candidate variables: nominal, numerical, and complex types
Variable subset selection methods
• MDI (Lee and Huh, 2003): uses the p-values of the test statistics between the two variables; -log(p-value) is suggested
• ReliefF (Kira and Rendell, 1992): Relief(X) = P[different value of X | different class] - P[different value of X | same class]
• Mutual Information (originated by Shannon, 1948, and used as a measure of dependence by Perez, 1957, Russian): Darbellay (1999, CSDA) gives a good survey of measuring statistical dependence using MI
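The Relief expression above can be estimated directly for a discrete variable by counting over pairs of observations. Published Relief algorithms sample instances and compare each with its nearest neighbours of the same and of a different class, so this all-pairs version is a simplification used only to make the formula concrete. The function name and toy data are hypothetical.

```python
from itertools import combinations

def relief_contrast(x, y):
    """Estimate P[X differs | class differs] - P[X differs | class same] over all pairs."""
    diff_class = same_class = 0
    x_diff_given_diff = x_diff_given_same = 0
    for i, j in combinations(range(len(x)), 2):
        x_differs = x[i] != x[j]
        if y[i] != y[j]:
            diff_class += 1
            x_diff_given_diff += x_differs
        else:
            same_class += 1
            x_diff_given_same += x_differs
    return x_diff_given_diff / diff_class - x_diff_given_same / same_class

# Toy data (illustration only): one informative and one uninformative variable.
y = [0, 0, 0, 1, 1, 1]
x_informative = [0, 0, 0, 1, 1, 1]
x_noise = [0, 1, 0, 1, 0, 1]
score_informative = relief_contrast(x_informative, y)
score_noise = relief_contrast(x_noise, y)
print(score_informative, score_noise)
```

A variable that separates the classes perfectly scores 1, while a variable unrelated to the class scores near zero or slightly below it.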
Subset selection with DAVIS: ranking variables
• MDI (measure of departure from independence)
• ReliefF
• MI (mutual information)
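Ranking by -log(p-value) can be sketched for the simplest case, a 2x2 table of candidate variable against class. The slides do not say which test statistic MDI uses, so the Pearson chi-square test of independence here is an assumption; with one degree of freedom its p-value has the closed form erfc(sqrt(chi2 / 2)). The counts are made up for illustration.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def neg_log_p(table):
    """-log(p-value) of the chi-square test with one degree of freedom."""
    stat = chi_square_2x2(table)
    p_value = math.erfc(math.sqrt(stat / 2))
    # Note: for extremely strong dependence p_value can underflow to 0.0.
    return -math.log(p_value)

# Made-up counts (illustration only).
independent = [[10, 10], [10, 10]]
dependent = [[18, 2], [3, 17]]
print(neg_log_p(independent), neg_log_p(dependent))
```

Larger values indicate stronger evidence against independence, so sorting candidate variables by this score in descending order gives a ranking.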
Subset selection with DAVIS: decision trees
• Discretization required
Subset selection with DAVIS: stepwise discriminant analysis
• Continuous variables only
• Good under normality
Subset selection with DAVIS: Mutual Information
• Conventional approach: discretization required
• Normal mixture approach: good for continuous variables
• Incremental algorithm: good for complex data
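The conventional approach can be sketched as a plug-in estimate: once both variables are discrete, mutual information is computed from their joint and marginal relative frequencies. This is the standard definition, I(X;Y) = sum over (x, y) of p(x,y) log[p(x,y) / (p(x) p(y))], in nats, and not DAVIS-specific code. The toy vectors are illustrative.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum((c / n) * math.log(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in joint.items())

# Toy data (illustration only).
target = [0, 0, 1, 1]
copy_of_target = [0, 0, 1, 1]   # fully dependent on the target
unrelated = [0, 1, 0, 1]        # independent of the target
print(mutual_information(copy_of_target, target),
      mutual_information(unrelated, target))
```

A continuous candidate variable would first be binned, and the estimate then depends on the number and placement of the bins.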
Variable selection with DAVIS: design layout
Variable selection: Titanic data
• Variable ranking: sex, class, age
• Subset selection: age, class
Concluding remarks
• DAVIS is a Java-based system.
• Any statistical model can be added to the system as a visual component if it follows certain rules.
• A more efficient design layout is needed for the various strategies of variable selection.
• Easier-to-understand terminology is needed for the various elements of the components.