Visual Exploratory Data Analysis: HCE
In this lecture you learn
• Analysis and communication of multi-dimensional data sets
• Gene expression data (multi-dimensional) from micro-array experiments
• HCE: Hierarchical Clustering Explorer
• GRID Principles
• Rank-by-Feature Framework
Dept. of Computing Science, University of Aberdeen
Introduction
• Multi-dimensional data sets are studied in many domains
  • Micro-array data sets from genomics
  • Aggregate data sets from census
• Several techniques have been proposed for their analysis
  • Principal Component Analysis (PCA)
  • Online Analytical Processing (OLAP)
  • Data mining algorithms, including clustering
• With new data sets, exploratory data analysis is recommended before using the above techniques
HCE
• A visual knowledge discovery tool for analysing and understanding multi-dimensional (> 3D) data
• Offers multiple views of
  • input data and clustered input data
  • where views are coordinated
• Like all modern information visualization tools, HCE
  • Is highly interactive, allowing the user
    • to control the visual displays and
    • to query data visually
  • Handles very large data sets (data from genomics)
• Many other similar tools do a patchwork of statistics and graphics
• HCE follows two fundamental statistical principles of exploratory data analysis
  • Examine each dimension first and then find relationships among dimensions
  • Try graphical displays first and then find numerical summaries
GRID Principles
• GRID: Graphics, Ranking and Interaction for Discovery
• Two principles
  • Study 1D, study 2D and find features
  • Ranking guides insight, statistics confirm
• These principles help users organize their knowledge discovery process
• Because of GRID, HCE is more than SPSS + Visualization
• GRID can be used to derive scripts that organize exploratory data analysis using SPSS (or a similar statistics package)
Rank-by-Feature Framework
• A user interface framework based on the GRID Principles
• The framework
  • Uses interactive information visualization techniques combined with statistical methods and data mining algorithms
  • Enables users to examine input data in an orderly way
• HCE implements the rank-by-feature framework
• This means HCE
  • Uses existing statistical and data mining methods to analyse input data and
  • Communicates those results using interactive information visualization techniques
Multiple Views in HCE
• Dendrogram
• Colour Mosaic
• 1D histograms
• 2D scatterplots
• And more
Micro-array Experiments
• Functional genomics is a field of study in molecular biology and genetics that connects genome sequence data to genome function
• A DNA micro-array is a glass or nylon substrate with specific DNA gene samples spotted in an array
  • Also known as gene arrays or gene chips
• Micro-array data is used in genomics for understanding the function of genes
• A flash movie on DNA micro-array methodology is available at http://www.bio.davidson.edu/courses/genomics/chip/chip.html
DNA Micro-array Data
• Gene samples from experiments are ‘hybridized’ with micro-array genes
• The experimental gene sample binds with variable strengths to different genes on the gene array
• The strength of binding is measured as gene expression data
• Several such gene expression data sets from several experiments are tabulated to form a multi-dimensional data set
Micro-array Data (2)
• Micro-array data has several thousands of rows and columns
• Rows (i) correspond to genes
• Columns (j) correspond to samples from different experiments
• An element a(i,j) holds the gene expression (strength) value of the jth sample on the ith gene on the array, where 1 ≤ i ≤ n and 1 ≤ j ≤ m
[Figure: data matrix with genes as rows, samples as columns and a(i,j) as a cell]
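The row/column convention above can be illustrated with a small sketch. The gene names, sample names and expression values are made up for illustration, and `expression` is a hypothetical helper, not part of HCE.

```python
# Hypothetical toy data: n = 3 genes (rows) by m = 2 samples (columns).
genes = ["gene_1", "gene_2", "gene_3"]
samples = ["sample_1", "sample_2"]
a = [
    [0.8, 1.9],  # gene_1
    [2.4, 0.3],  # gene_2
    [1.1, 1.0],  # gene_3
]

def expression(i, j):
    """Return a(i, j) using the slide's 1-based indices (gene i, sample j)."""
    return a[i - 1][j - 1]

n, m = len(a), len(a[0])

# One row is the profile of a single gene across all samples.
gene_profile = a[1]
# One column holds all gene values measured for a single sample.
sample_column = [row[0] for row in a]
```

Real micro-array tables have thousands of rows, so a dedicated array library would normally replace the nested lists used here.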
Hierarchical Clustering
• Researchers use clustering to discover interesting patterns in gene expression data
• Clustering is the process of grouping data with similar properties
• There are many algorithms for clustering, with different behaviours
• It is hard to know which algorithm's results agree well with the natural clusters in the input data
• Hierarchical clustering produces a hierarchical structure of clusters rather than a flat set of clusters
Hierarchical Agglomerative Clustering (HAC)
• A bottom-up clustering algorithm, very similar to the bottom-up segmentation you studied
  1. Initially, each data item is a cluster by itself
  2. Merge the pair of clusters with maximum similarity (based on a pre-selected similarity metric)
  3. Compute the similarity values between the new cluster and the others
  4. Repeat 2 and 3 until all the items are grouped into one cluster
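The four steps can be sketched in a few lines of Python. This is a minimal illustration, not HCE's implementation. The similarity metric (1 / (1 + Euclidean distance)), the average-linkage rule for comparing clusters, and the toy points are all assumptions chosen for the example.

```python
import math
from itertools import combinations

def similarity(p, q):
    # Assumed metric: similarity falls as Euclidean distance grows.
    return 1.0 / (1.0 + math.dist(p, q))

def cluster_similarity(c1, c2, items):
    # Assumed linkage: average pairwise similarity between two clusters.
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(similarity(items[i], items[j]) for i, j in pairs) / len(pairs)

def hac(items):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples."""
    # Step 1: every item starts as its own cluster (tuples of item indices).
    clusters = [(i,) for i in range(len(items))]
    merges = []
    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: pick the pair of clusters with maximum similarity.
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], items),
        )
        merges.append((c1, c2, cluster_similarity(c1, c2, items)))
        # Step 3: similarities to the new cluster are recomputed
        # on the next pass through the loop.
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return merges

# Hypothetical 2D points: two tight pairs and one distant item.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
merges = hac(points)
```

The merge history is exactly what a dendrogram draws: each entry becomes one internal node, placed according to its similarity value. Recomputing every pairwise similarity on each pass is slow, so this sketch only suits small inputs.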
Dendrogram Display
• Results of HAC are shown visually using a dendrogram
• A dendrogram is a tree
  • with data items at the terminal (leaf) nodes
  • Distance from the root node represents similarity among leaf nodes
• Two visual controls
  • Minimum similarity bar allows users to adjust the number of clusters
  • Detail cut-off bar allows users to reduce clutter
[Figure: example dendrogram with leaf nodes D, C, A, B]
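The effect of the minimum similarity bar can be sketched as follows: only merges at or above the chosen similarity are kept, and the groups that remain are the clusters. The merge history and its similarity values are invented for the four leaves A to D, and `clusters_at` is a hypothetical helper, not an HCE function.

```python
def clusters_at(merges, leaves, min_similarity):
    """Return the clusters left when merges below min_similarity are ignored."""
    parent = {leaf: leaf for leaf in leaves}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, sim in merges:
        if sim >= min_similarity:
            parent[find(right[0])] = find(left[0])

    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical merge history: (left cluster, right cluster, similarity).
merges = [
    (("A",), ("B",), 0.9),
    (("A", "B"), ("C",), 0.6),
    (("A", "B", "C"), ("D",), 0.2),
]
leaves = ["A", "B", "C", "D"]

coarse = clusters_at(merges, leaves, 0.5)
fine = clusters_at(merges, leaves, 0.95)
```

Raising the bar yields more, smaller clusters; lowering it eventually joins everything into one. The sketch assumes similarity never increases from one merge to the next, as in the example history.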
Colour Mosaic
• Input data is shown using this view
• A colour-coded visual display of tabular data
• Each cell in the table is painted in a colour that reflects the cell’s value
• Two variations
  • The layout of the mosaic is similar to the original table
  • A transpose of the original layout
• HCE uses the transposed layout because data sets usually have more rows than columns
• A colour mapping control is provided
[Figure: table, original layout and transposed layout]
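A colour mosaic can be sketched as a value-to-colour mapping applied to the transposed table. The green-black-red scale, the linear mapping and the function names are assumptions for illustration. The slides do not say which colour scale HCE uses.

```python
def value_to_rgb(value, low, high):
    """Map a value linearly onto green (low), black (middle), red (high)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp to [low, high], then rescale to the range 0..1.
    t = (min(max(value, low), high) - low) / (high - low)
    if t < 0.5:
        return (0, round(255 * (1 - 2 * t)), 0)
    return (round(255 * (2 * t - 1)), 0, 0)

def transpose(table):
    """Swap rows and columns, as in the transposed mosaic layout."""
    return [list(column) for column in zip(*table)]

def mosaic(table, low, high):
    """Return the transposed table with every cell replaced by its colour."""
    return [[value_to_rgb(v, low, high) for v in row] for row in transpose(table)]

# Hypothetical table: 3 genes (rows) by 2 samples (columns).
table = [
    [0.0, 10.0],
    [5.0, 2.5],
    [10.0, 0.0],
]
cells = mosaic(table, low=0.0, high=10.0)
```

After transposition each sample becomes a row of the mosaic, so a table with thousands of genes and a few samples turns into a wide, short image.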
1D Histogram Ordering
• This data view is part of the rank-by-feature framework
• Data belonging to one column (variable) is displayed as a histogram + box plot
  • Histogram shows the scale and skewness
  • Box plot shows the data distribution, centre and spread
• For the entire data set many such views are possible
• By studying individual variables in detail, users can select the variables for other visualizations
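Ranking 1D views by a feature can be sketched by scoring each column and sorting on the score. Here the score is the number of outliers under the usual box-plot rule (points beyond 1.5 times the interquartile range from the quartiles). That rule, the quartile method and the toy columns are assumptions, and HCE's own criteria may be computed differently.

```python
import statistics

def count_outliers(values):
    """Count values outside the box-plot fences (1.5 * IQR beyond Q1 and Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < lower or v > upper)

def rank_columns(columns, score):
    """Return column names ordered from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name]), reverse=True)

# Hypothetical columns (variables) of a small data set.
columns = {
    "s1": [1, 2, 3, 4, 5, 6, 7, 8],
    "s2": [1, 2, 3, 4, 5, 6, 7, 50],
    "s3": [-40, 2, 3, 4, 5, 6, 7, 50],
}
ranking = rank_columns(columns, count_outliers)
```

Any other 1D measure, such as skewness or the number of distinct values, can be passed as `score` to produce a different ordering of the same histograms.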
2D Scatter Plot Ordering
• This data view is again part of the rank-by-feature framework
• Three categories of 2D presentations are possible
  • Axes of the plot obtained from Principal Component Analysis
    • Linear or non-linear combinations of original variables
  • Axes of the plot obtained directly from the original variables
  • Parallel coordinates
• HCE uses the second option, plotting pairs of the original variables
• Both 1D and 2D plots can be sorted according to user-selected criteria such as the number of outliers
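Ordering 2D scatter plots follows the same pattern as the 1D case: score every pair of original variables and sort the pairs. The sketch below uses the absolute Pearson correlation coefficient as the ranking criterion. That choice, the function names and the toy columns are assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_pairs(columns):
    """Return ((name_a, name_b), r) entries, strongest correlation first."""
    scored = [
        ((a, b), pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical columns: y is a multiple of x, z is loosely opposed to x.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
}
ranked = rank_pairs(columns)
```

With m variables there are m(m - 1)/2 candidate scatter plots, which is why a ranked list helps users decide which pairs to inspect first.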
Conclusion
• HCE is a very good example of data interpretation and communication technology
  • Performs data analysis using statistical methods and clustering
  • Communicates the results of data analysis visually
• HCE has many other features that have not been described here
• GRID and the rank-by-feature framework are useful ideas and can be applied when using other data analysis tools such as SPSS
SumTime-Mousam vs HCE
SumTime-Mousam
• Text output
• Segmentation of input data using existing data mining algorithm
• Comparatively very little user interaction (except control data)
• Integration of text and data analysis
• Works well in a limited domain
HCE
• Graphical output
• Clustering of input data using existing data mining algorithm
• Highly interactive, based on well tested HCI principles
• Integration of graphics and data analysis
• A generic tool (at least it is claimed to be generic)