Computing In Research Dr. S.N. Pradhan Professor, CSE Department
Agenda • Introduction • Data analysis and visualization • Interactive Data Language (IDL) • Scilab & Scicos • Symbolic computation • Mathematica / Maxima
A Data Analysis Pipeline • Raw data → [A: cleaning, filtering, transforming] → Processed data → [B: statistical analysis, pattern recognition, knowledge discovery] → Hypothesis or model → [C: validation] → Results [D]
Where can visualization come in • All stages can benefit from visualization • A: identify bad data, select subsets, help choose transforms (exploratory) • B: help choose computational techniques, set parameters, use vision to recognize, isolate, classify patterns (exploratory) • C: superimpose derived models on data (confirmatory) • D: present results (presentation)
What decides how to visualize • Characteristics of data • Types, size, structure • Semantics, completeness, accuracy • Characteristics of user • Perceptual and cognitive abilities • Knowledge of domain, data, tasks, tools • Characteristics of graphical mappings • What are possibilities • Which convey data effectively and efficiently • Characteristics of interactions • Which support the tasks best • Which are easy to learn, use, remember
Issues Regarding Data • Type may indicate which graphical mappings are appropriate • Nominal vs. ordinal • Discrete vs. continuous • Ordered vs. unordered • Univariate vs. multivariate • Scalar vs. vector vs. tensor • Static vs. dynamic • Values vs. relations • Trade-offs between size and accuracy needs • Different orders/structures can reveal different features/patterns
User perceptions • What graphical attributes do we perceive accurately? • What graphical attributes do we perceive quickly? • Which combinations of attributes are separable? • Coping with change blindness • How can visuals support the development of accurate mental models of the data? • Relative vs. absolute judgements – impact on tasks
Issues regarding mappings • Variables include shape, size, orientation, color, texture, opacity, position, motion…. • Some of these have an order, others don’t • Some use up significant screen space • Sensitivity to occlusion • Domain customs/expectations
Issues regarding Interactions • Interaction critical component • Many categories of techniques • Navigation, selection, filtering, reconfiguring, encoding, connecting, and combinations of above • Many “spaces” in which interactions can be applied • Screen/pixels, data, data structures, graphical objects, graphical attributes, visualization structures
Importance of Evaluation • Easy to design bad visualizations • Many design rules exist – many conflict, many routinely violated • 5 E’s of evaluation: effective, efficient, engaging, error tolerant, easy to learn • Many styles of evaluation (qualitative and quantitative): • Use/case studies • Usability testing • User studies • Longitudinal studies • Expert evaluation • Heuristic evaluation
Mappings • Based on data characteristics • Numbers, text, graphs, software, …. • Logical groupings of techniques (Keim) • Standard: bars, lines, pie charts, scatterplots • Geometrically transformed: landscapes, parallel coordinates • Icon-based: stick figures, faces, profiles • Dense pixels: recursive segments, pixel bar charts • Stacked: treemaps, dimensional stacking
Mappings • Based on dimension management (Ward) • Dimension subsetting: scatterplots, pixel-oriented methods • Dimension reconfiguring: glyphs, parallel coordinates • Dimension reduction: PCA, MDS (multidimensional scaling), self-organizing maps • Dimension embedding: dimensional stacking, worlds within worlds
Sensor Network • Example dataset: sensor lab at Berkeley
Pairwise link quality • [Figure: link quality plotted against distance between nodes]
Dimensional Stacking • Break each dimension range into bins • Break the screen into a grid using the number of bins for 2 dimensions • Repeat the process for 2 more dimensions within the subimages formed by first grid, recurse through all dimensions • Look for repeated patterns, outliers, trends, gaps
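The binning and grid-nesting steps above can be sketched in a few lines of Python. This is a minimal illustration for four dimensions, not a reference implementation: the function names `bin_index` and `dimensional_stack`, and the choice of dimensions 0 and 1 as the outer grid with dimensions 2 and 3 nested inside each outer cell, are assumptions made for the example.

```python
def bin_index(value, low, high, n_bins):
    """Return the bin (0 .. n_bins-1) that value falls into on [low, high]."""
    i = int((value - low) / (high - low) * n_bins)
    return min(max(i, 0), n_bins - 1)  # clamp values on or beyond the edges


def dimensional_stack(records, ranges, bins):
    """Count 4-D records on a 2-D grid built by dimensional stacking.

    records: iterable of 4-tuples
    ranges:  four (low, high) pairs, one per dimension
    bins:    four bin counts (outer_x, outer_y, inner_x, inner_y)
    """
    outer_x, outer_y, inner_x, inner_y = bins
    width = outer_x * inner_x    # screen columns
    height = outer_y * inner_y   # screen rows
    grid = [[0] * width for _ in range(height)]
    for rec in records:
        ix = [bin_index(v, lo, hi, n)
              for v, (lo, hi), n in zip(rec, ranges, bins)]
        # The outer bin picks the sub-image; the inner bin picks the cell in it.
        col = ix[0] * inner_x + ix[2]
        row = ix[1] * inner_y + ix[3]
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    data = [(0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.9, 0.9), (0.9, 0.9, 0.9, 0.9)]
    for line in dimensional_stack(data, [(0.0, 1.0)] * 4, (2, 2, 2, 2)):
        print(line)
```

Rendering the counts as colors gives the stacked image; more than four dimensions would repeat the same nesting step inside each inner cell.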
Methods to cope with scale • Many modern datasets contain a large number of records (millions or billions) and/or dimensions (hundreds or thousands) • Several strategies to handle scale problems • Sampling • Filtering • Clustering/aggregation • Techniques can be automated or user-controlled
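The three strategies listed above can each be illustrated with a few lines of Python. This is only a sketch: the record layout (dictionaries with hypothetical `node` and `reading` fields) and the helper names are assumptions made for the example, and aggregation here is a simple per-group mean rather than a clustering algorithm.

```python
import random
from collections import defaultdict


def sample(records, k, seed=0):
    """Sampling: keep k records chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def filter_records(records, predicate):
    """Filtering: keep only the records that satisfy a condition."""
    return [r for r in records if predicate(r)]


def aggregate(records, key, value):
    """Aggregation: replace each group of records by the mean of one field."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        sums[r[key]] += r[value]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


if __name__ == "__main__":
    # Hypothetical sensor readings: 9 records from 3 nodes.
    records = [{"node": i % 3, "reading": float(i)} for i in range(9)]
    print(len(sample(records, 4)))
    print(filter_records(records, lambda r: r["reading"] >= 5))
    print(aggregate(records, "node", "reading"))
```

A user-controlled variant would expose `k`, the predicate, or the grouping key through the interface; an automated variant would pick them from the data size and screen resolution.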
Visualization a powerful component of the data analysis process • Each stage of analysis can be enhanced • Visualization can help guide computational analysis, and vice versa • Multiple linked views and a rich assortment of interactions key to success
Numerical Recipes in C & C++ • Numerical Recipes in C is a collection (library) of C functions written by Press et al. • Library of mathematical functions • Useful for process or system modeling • Break the model down into known mathematical functions, then use the library routine for each one
GNU Scientific Library • Basic mathematical functions • Complex numbers • Polynomials • Special functions • Vectors and matrices • Permutations • Combinations
GNU Scientific Library (cont.) • Multisets • Sorting • Linear algebra • Eigensystems • Fast Fourier transforms • Numerical integration (based on QUADPACK) • Random number generation • Quasi-random sequences • Random number distributions • Statistics • Histograms
Interactive Data Language • Data manipulation and visualization • Commercially available package: IDL from ITT Visual Information Solutions • Consists of • Data analysis • Data visualization • Animation
Open Source IDL (GDL) • Open-source equivalent of IDL and much more • GDL is used particularly in geosciences. • GDL is dynamically typed, vectorized, and has object-oriented programming capabilities. • The library routines handle numerical calculations, data visualisation, signal/image processing, interaction with the host OS, and data input/output. • GDL supports several data formats such as netCDF, HDF4, HDF5, GRIB, PNG, TIFF, DICOM, etc. • Graphical output is handled by X11, PostScript, SVG or z-buffer terminals.
Part II • Analysis may, therefore, be categorized as • Descriptive analysis • Inferential analysis (often known as statistical analysis). • Correlation analysis • Causal analysis (regression analysis) • Multivariate analysis
Descriptive analysis • “Descriptive analysis is largely the study of distributions of one variable. This study provides us with profiles of companies, work groups, persons and other subjects on any of a multiple of characteristics such as size, composition, efficiency, preferences, etc.” This sort of analysis may be in respect of one variable (unidimensional analysis), two variables (bivariate analysis), or more than two variables (multivariate analysis). In this context we work out various measures that show the size and shape of a distribution, along with measures of the relationships between two or more variables.
Correlation analysis • Correlation analysis studies the joint variation of two or more variables to determine the amount of correlation between them. In most social and business research, interest lies in understanding and controlling relationships between variables, so correlation analysis is relatively more important.
Causal Analysis • Causal analysis (regression analysis) is concerned with the study of how one or more variables affect changes in another variable. It is a study of the functional relationships existing between two or more variables. Causal analysis is considered relatively more important in experimental research.
Multivariate analysis • Multivariate analysis is defined as “all statistical methods which simultaneously analyze more than two variables on a sample of observations”. With the availability of computer facilities, there has been a rapid development of this kind of analysis.
Multiple regression analysis: This analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables.
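A small worked example may help make the idea concrete. The sketch below fits y = b0 + b1·x1 + ... + bk·xk by ordinary least squares using the normal equations, in plain Python. The function names and the toy data are assumptions made for illustration; in practice a numerical library would normally be used instead of a hand-written solver.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(xs, y):
    """Return [b0, b1, ..., bk] minimizing the squared prediction error."""
    rows = [[1.0] + list(x) for x in xs]  # leading 1.0 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, x):
    """Predict the dependent variable for one set of independent values."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))


if __name__ == "__main__":
    # Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
    xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
    y = [1 + 2 * a + 3 * b for a, b in xs]
    coeffs = fit_multiple_regression(xs, y)
    print(coeffs, predict(coeffs, (6, 2)))
```

Because the toy data contain no noise, the fitted coefficients recover the generating values up to rounding error; with real data they are estimates.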
Multiple discriminant analysis: This analysis is appropriate when the researcher has a single dependent variable that cannot be measured, but can be classified into two or more groups on the basis of some attribute. The objective of this analysis is to predict an entity’s likelihood of belonging to a particular group based on several predictor variables.
Multivariate analysis of variance (MANOVA, or multi-ANOVA): This analysis is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.
Canonical analysis: This analysis can be used in case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables.
Analysis of variance (ANOVA) is a useful technique for research in economics, biology, education, psychology, sociology, business/industry and several other disciplines. It is used when multiple sample cases are involved. The significance of the difference between the means of two samples can be judged through either the z-test or the t-test, but difficulty arises when we need to examine the significance of the differences among more than two sample means at the same time. The ANOVA technique enables us to perform this simultaneous test and is therefore an important tool of analysis for a researcher.
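As a worked illustration of the simultaneous test, the sketch below computes the one-way ANOVA F statistic, the ratio of between-group to within-group mean squares, for three small samples. The function name and the sample values are assumptions made for the example; judging significance still requires comparing F against the F distribution with (k − 1, n − k) degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


if __name__ == "__main__":
    samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(one_way_anova_f(samples))
```

For these samples the between-group sum of squares is 26 and the within-group sum of squares is 6, so F = (26 / 2) / (6 / 6) = 13.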
Role of Simulation • Simulation is a method for mimicking a real system, mostly by computer. • Simulation is a numerical technique for conducting experiments on a computer; it involves logical and mathematical relationships that interact to describe the behavior and structure of a complex real-world system over extended periods of time.
Simulation makes it possible to study and experiment with the complex internal interactions of a given system. • It provides better understanding of the system. • Simulation can be used as a pedagogical device for teaching students. • The experience of designing a computer simulation model may be more valuable than the designing of actual model.
Simulation can be used to experiment with new situations about which little or no information is available. • To verify analytical solutions • Cheap: no need for costly equipment • Complex scenarios can be tested easily • Results can be obtained quickly • More ideas can be tested in less time
Pitfalls of simulation • It cannot provide insight for all possible scenarios (e.g., mobile networks must be tested with different mobility models) • Failure to have a well-defined set of objectives at the beginning of the simulation study • Failure to communicate with the decision-maker (or the client) on a regular basis • Lack of knowledge of simulation methodology and of probability and statistics
Pitfalls of simulation (cont.) • Real systems are too complex to model fully, leading to an inappropriate level of model detail • Failure to collect good system data • Belief that so-called “easy-to-use” simulation packages require a significantly lower level of technical competence • Blindly using simulation software without understanding its underlying assumptions • Replacing a probability distribution by its mean • Failure to perform a proper output-data analysis
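The pitfall of replacing a probability distribution by its mean can be demonstrated with a small experiment. The sketch below is an assumed scenario, not taken from the slides: jobs arrive at a single server exactly one time unit apart, and the service time averages 0.8 time units. The function name `mean_wait` and all parameter values are chosen for illustration.

```python
import random


def mean_wait(service_times, interarrival=1.0):
    """Average waiting time in a single-server FIFO queue.

    Uses the Lindley recursion: the next job waits for whatever work is
    left over after the current job's wait and service time, minus the
    gap until the next arrival.
    """
    wait, total = 0.0, 0.0
    for s in service_times:
        total += wait
        wait = max(0.0, wait + s - interarrival)
    return total / len(service_times)


rng = random.Random(42)
n = 100_000

# Same mean service time (0.8) in both runs; only the variability differs.
random_service = [rng.expovariate(1 / 0.8) for _ in range(n)]
fixed_service = [0.8] * n

if __name__ == "__main__":
    print("exponential service:", mean_wait(random_service))
    print("mean-only service:  ", mean_wait(fixed_service))
```

With the fixed service time no job ever waits, because each job finishes before the next arrives. With exponential service times of the same mean, queues build up behind the occasional long job, so the mean-only model hides the delays altogether.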
Simulation Checklist • Checks before developing simulation • Is the goal properly specified? • Is detail in model appropriate for goal? • Does team include right mix (leader, modeling, programming, background)? • Has sufficient time been planned? • Checks during simulation development • Is the random number generator random? • Is model reviewed regularly? • Is model documented?
Checklist cont… • Checks after simulation is running • Is simulation length appropriate? • Are initial transients removed? • Has model been verified? • Has model been validated? • Are there any surprising results? If yes, have they been validated?
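One of the development-time checks in the checklist asks whether the random numbers are really random. A simple first test, sketched below, is a chi-square comparison of the generator's output against a uniform distribution. The function name, bin count, seed, and sample size are assumptions made for the example, and passing this test alone does not establish that a generator is good (it says nothing about correlation between successive values, for instance).

```python
import random


def chi_square_uniform(samples, n_bins=10):
    """Chi-square statistic of samples in [0, 1) against a uniform law."""
    counts = [0] * n_bins
    for u in samples:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    expected = len(samples) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    rng = random.Random(12345)
    stat = chi_square_uniform([rng.random() for _ in range(10_000)])
    # With 10 bins there are 9 degrees of freedom; the 5% critical
    # value of the chi-square distribution is about 16.92.
    print(f"chi-square = {stat:.2f}")
```

A statistic far above the critical value suggests the output is not uniform; a value of exactly zero on a large sample would be suspicious too, since truly random data should show some scatter.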
Terminology • State variables • Variables whose values define current state of system • Saving can allow simulation to be stopped and restarted later by restoring all state variables • Event • A change in system state • Ex: Three events: arrival of job, beginning of new execution, departure of job
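The terms above can be seen together in a minimal discrete-event simulation of a single-server job queue. The sketch uses the three events from the example (arrival, start of execution, departure). The state variables are the clock, the queue length, and the server status. Exponential interarrival and service times, the parameter values, and the function name are assumptions made for illustration.

```python
import heapq
import random


def simulate_queue(n_jobs, mean_interarrival=1.0, mean_service=0.8, seed=1):
    """Run a single-server FIFO queue and return a log of every event."""
    rng = random.Random(seed)

    # State variables: together they define the current state of the system.
    clock = 0.0
    queue_length = 0
    server_busy = False

    events = []   # pending events as (time, sequence number, kind)
    seq = 0
    log = []

    def schedule(time, kind):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind))
        seq += 1

    # Pre-schedule all arrivals.
    t = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(1 / mean_interarrival)
        schedule(t, "arrival")

    # Each event changes the system state at one instant of simulated time.
    while events:
        clock, _, kind = heapq.heappop(events)
        if kind == "arrival":
            queue_length += 1
            if not server_busy:
                schedule(clock, "start")
        elif kind == "start":
            if server_busy or queue_length == 0:
                continue  # stale start request; ignore it
            queue_length -= 1
            server_busy = True
            schedule(clock + rng.expovariate(1 / mean_service), "departure")
        else:  # departure
            server_busy = False
            if queue_length > 0:
                schedule(clock, "start")
        log.append((clock, kind, queue_length, server_busy))
    return log


if __name__ == "__main__":
    for entry in simulate_queue(5):
        print(entry)
```

Saving `clock`, `queue_length`, `server_busy`, the pending event list, and the random generator's state would be enough to stop this run and resume it later.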
Continuous-time and discrete-time models • If state is defined at all times → continuous • If state is defined only at instants → discrete • Ex: a class that meets M-F 2-3 is discrete since it is not defined at other times • Continuous-state and discrete-state models • If state values are uncountably infinite → continuous • Ex: time spent by students on homework • If state values are countable → discrete • Ex: jobs in CPU queue • Note: continuous time does not necessarily imply continuous state, and vice versa • All combinations are possible
Deterministic and probabilistic models • If output can be predicted with certainty → deterministic • If output differs between repetitions → probabilistic • Ex: For proj1, dog type-1 makes the simulation deterministic but dog type-2 makes the simulation probabilistic
Static and dynamic models • If time is not a variable → static • If state changes with time → dynamic • Ex: a CPU scheduler is dynamic, while the matter-to-energy model E = mc² is static • Linear and nonlinear models • If output is a linear combination of input → linear • Otherwise → nonlinear
Open and closed models • If input is external and independent → open • A closed model has no external input • Ex: if the same jobs leave and re-enter the queue then closed, while if new jobs enter the system then open • Stable and unstable models • If model output settles down → stable • If model output always changes → unstable
Selecting Simulation Language • Four choices: simulation language, general-purpose language, extension of a general-purpose language, simulation package • Simulation language: built-in facilities for time steps, event scheduling, data collection, reporting • General-purpose language: known to developer, available on more systems, flexible • The major difference is the cost tradeoff: a simulation language requires startup time to learn, while a general-purpose language may require more time to add simulation facilities