A computational tool for statistical research based on the notion of data depth, aimed at scientists without a computer science background. It offers fast, portable code with a simple interface for data manipulation and analysis, including depth contours and statistical modules. Future plans include extending its functionality and incorporating user feedback.
A computational tool for depth-based statistical analysis
Eynat Rafalin, Tufts University Computer Science Department
The tool
• An easy-to-use, efficient, and expandable interface for statistical research, based on the notion of data depth.
• Intended for scientists with no computer science background.
Our goal
• Present the tool to the community
  • Code/software available on request
• Run on real data
• Get feedback
  • Is such a tool needed?
  • Additions/improvements?
General
• C++-based software (no additional tools/software needed)
• Simple interface. It should allow the user to
  • enter data files, sort the data points, and filter unwanted data
  • perform calculations
  • present the results in an easy-to-understand graphical interface
  • save and output data for future use
• Fast
• Portable code
General description
[Diagram of the tool's components: txt and Excel input files, data filter, statistical modules, Geomview, contour display and selection, output]
Data filter
• Graphical user interface developed in C++
• Used to crop/manipulate a data set before it is fed into the statistical modules
• Fast and light
• Convenient, easy-to-use interface
• Portable code (UNIX, Solaris, Linux, Windows)
Statistical modules: Depth contours (2D)
• Half-space (location) depth contours
  • Optimal O(n²) time
  • Supports two approaches for defining contours
  • Includes the Tukey median and the bagplot
  • Includes the contours' parameters (size, etc.)
• Convex hull peeling depth contours
• Simplicial depth contours
• Tukey median computation (O(n log³ n))
• Locating a new point in a set of depth contours (O(log n) query time)
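As a point of reference for the half-space depth named on this slide, here is a minimal Python sketch that computes the depth of a single query point by brute force. It is neither the tool's C++ code nor the optimal contour algorithm; the function name `halfspace_depth` and the tuple-based point representation are assumptions made for illustration. The sketch relies on the fact that a minimizing closed half-plane can be obtained by slightly rotating a line through the query point and some sample point, so it only inspects those lines. It assumes exact arithmetic (integer or `fractions.Fraction` coordinates); with floating-point input the collinearity test would need a tolerance.

```python
def halfspace_depth(q, sample):
    """Half-space (Tukey) depth of q relative to a 2D sample: the minimum
    number of sample points in any closed half-plane whose boundary passes
    through q. Brute force, O(n^2) per query point; illustrative only."""
    qx, qy = q
    vecs = [(x - qx, y - qy) for x, y in sample]
    coincident = sum(1 for v in vecs if v == (0, 0))
    # If every sample point coincides with q, any half-plane contains them all.
    best = len(sample)
    for dx, dy in vecs:
        if (dx, dy) == (0, 0):
            continue
        left = right = ahead = behind = 0
        for vx, vy in vecs:
            if (vx, vy) == (0, 0):
                continue
            cross = dx * vy - dy * vx
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif dx * vx + dy * vy > 0:
                ahead += 1      # on the line, same side of q as the pivot point
            else:
                behind += 1     # on the line, opposite side of q
        # Rotating the line slightly keeps one open side plus one ray.
        best = min(best, min(left, right) + min(ahead, behind) + coincident)
    return best


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(halfspace_depth((1, 1), square))  # 2: the center of the square
print(halfspace_depth((0, 0), square))  # 1: a vertex of the convex hull
print(halfspace_depth((5, 5), square))  # 0: outside the convex hull
```

Points outside the convex hull of the sample always get depth 0, which is why the outermost depth contour coincides with the hull.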
Approaches for defining depth contours
• P. Rousseeuw et al.
  • The k-th depth contour is the boundary of the set of points in the plane with depth at least k.
• R. Liu et al. (based on order statistics)
  • The sample p-th central hull is the convex hull containing the most central fraction p of the sample points.
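The order-statistics definition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the names `convex_hull` and `central_hull` are hypothetical, the depth values are taken as precomputed input from any depth function, and ties in depth are broken by input order. The hull is built with Andrew's monotone chain algorithm.

```python
import math


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def central_hull(points, depths, p):
    """Convex hull of the most central fraction p (0 < p <= 1) of the sample,
    where 'most central' means highest depth. Ties are broken by input order."""
    n = len(points)
    k = max(1, math.ceil(p * n - 1e-9))  # small guard against float round-up
    ranked = sorted(zip(depths, points), key=lambda t: -t[0])
    return convex_hull([pt for _, pt in ranked[:k]])


pts = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
depths = [1, 1, 1, 1, 2]  # e.g. half-space depths, computed elsewhere
print(central_hull(pts, depths, 0.2))  # only the deepest point
print(central_hull(pts, depths, 1.0))  # hull of the whole sample
```

Under this definition the contours are nested by construction, whereas the first approach depends only on the depth function and not on how many sample points fall inside each region.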
Half-space (location) depth contours module
[Figure: depth contours for a sample set with 8 data points]
[Figure: depth contours for a data set describing diabetic patients]
Statistical modules (cont.): Plots
• DD (Depth vs. Depth) plots
  • O(n²) time
• Shrinkage plots
• Fan plots
DD (Depth vs. Depth) plots module
[Figure: DD plot; axes: depth according to set A, depth according to set B]
Two 2D data sets of 50 points each, drawn from normal distributions centered at (0,0) with different covariance matrices (1 and 4 times the identity).
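A DD plot pairs, for every point in the union of the two sets, its depth relative to set A with its depth relative to set B. The Python sketch below computes those coordinate pairs. It is a hedged illustration only: the names `halfspace_depth` and `dd_plot_coordinates` are hypothetical, half-space depth is assumed as the depth function, and the brute-force depth makes this far slower than the O(n²) bound quoted for the tool. Exact (integer or rational) coordinates are assumed.

```python
def halfspace_depth(q, sample):
    """Brute-force half-space depth of q relative to a 2D sample (illustrative)."""
    vecs = [(x - q[0], y - q[1]) for x, y in sample]
    coincident = sum(1 for v in vecs if v == (0, 0))
    best = len(sample)
    for dx, dy in vecs:
        if (dx, dy) == (0, 0):
            continue
        left = right = ahead = behind = 0
        for vx, vy in vecs:
            if (vx, vy) == (0, 0):
                continue
            cross = dx * vy - dy * vx
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif dx * vx + dy * vy > 0:
                ahead += 1
            else:
                behind += 1
        best = min(best, min(left, right) + min(ahead, behind) + coincident)
    return best


def dd_plot_coordinates(set_a, set_b, depth=halfspace_depth):
    """Return (depth w.r.t. A, depth w.r.t. B) for each point of A, then of B."""
    return [(depth(p, set_a), depth(p, set_b)) for p in list(set_a) + list(set_b)]


set_a = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
set_b = [(10, 10), (12, 10), (12, 12), (10, 12), (11, 11)]
for pair in dd_plot_coordinates(set_a, set_b):
    print(pair)
```

When the two samples come from the same distribution, the pairs concentrate near the diagonal; well-separated samples, as in this toy input, push the pairs toward the two axes.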
Fan plots
[Figure: fan plot; axes: relative area (CH of p% / CH), percentile of points]
50 data points, created from a random distribution with covariance matrix 4 times the identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region, the area of the convex hull (CH) of 2, 4, 6, ...% of the points is computed.
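The quantity on the relative-area axis can be sketched as follows: for each percentage p, take the p% deepest points, compute the area of their convex hull, and divide by the area of the hull of all points. This Python sketch produces a single such curve. The names `fan_curve`, `convex_hull`, and `polygon_area` are hypothetical, the depths are assumed to be precomputed, and the construction of one fan per central region described on the slide is not reproduced here.

```python
import math


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula (0 for fewer than 3 vertices)."""
    if len(vertices) < 3:
        return 0.0
    total = 0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def fan_curve(points, depths, percents):
    """For each integer percentage p, return (p, area of the hull of the p%
    deepest points divided by the area of the hull of all points)."""
    full_area = polygon_area(convex_hull(points))
    if full_area == 0:
        raise ValueError("all points are collinear; relative area is undefined")
    ranked = [pt for _, pt in sorted(zip(depths, points), key=lambda t: -t[0])]
    n = len(points)
    curve = []
    for p in percents:
        k = max(1, math.ceil(p * n / 100))
        curve.append((p, polygon_area(convex_hull(ranked[:k])) / full_area))
    return curve


pts = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
depths = [1, 1, 1, 1, 2]  # precomputed, e.g. half-space depths
print(fan_curve(pts, depths, [20, 100]))
```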
Graphical contour selection tool
• Plots depth contours and selects data ranges.
• Actions
  • Import/export
  • Select points
  • Depth slider
  • Filter
Future work
• Run the tool on existing data sets
• Distribute preliminary versions and gather user feedback
• Data filter
  • Group by row/column
  • Filter by row/column
  • Interactions between rows/columns (addition, substitution, logical operations)
• Statistical modules
  • Implement additional modules
  • Improve running times
Contributors
• Prof. Diane Souvaine
• Prof. Alva Couch
• Eynat Rafalin
• Michael Burr
• Joe Handelman
• James Hayes
• Ori Taka
• Alok Lal
• Janet Luan
• Kim Miller
• Tim Mitchell
• Nikolai Shvertner