Data Analysis
Yaji Sripada
Dept. of Computing Science, University of Aberdeen
In this lecture you learn
• The objectives of data analysis
• Fitting models to the input data according to the end-user requirements
• Data analysis tasks and methods
• Knowledge acquisition (KA) techniques to understand the required data analysis tasks and the requirements due to HCI (Human Computer Interaction)
• An iterative process for designing data analysis methods
  • With multiple KA and evaluation studies
• Issues with reusing data analysis algorithms developed in other fields, by matching
  • Requirements due to HCI
  • Features of data analysis methods
Introduction
[Architecture diagram: Input Data, Data Analysis, Information Visualization, End-User Interaction]
• High-level architecture of our systems
• Data Analysis (DA)
  • Compute patterns or models (in general, abstractions) from raw input data
• Information Visualization (InfoVis)
  • Present the relevant abstractions (patterns or models) in a form suitable to the end user
  • Support user interaction (we will see examples later)
• Integrating data analysis and InfoVis is the main focus of this course
• Two options for integration
  • Option 1 – Loose coupling
  • Option 2 – Theory driven
Introduction (2)
• Loose coupling
  • Two libraries, of data analysis and InfoVis methods, are offered to the user
  • The user is given freedom in exploiting the available methods to understand the data
  • Certain constraints may be defined for linking a specific data analysis method with a specific InfoVis method
  • Already available in many existing tools such as R and Excel
• Theory driven
  • The InfoVis module defines an HCI theory which guides the user's access to the data analysis methods and also to the visualizations
• DA works under two contexts
  • Domain context
  • HCI (Human Computer Interaction) context
  • We lump all HCI-related issues together here but study them later
  • The impact of the HCI context on DA is under investigation
Introduction (3)
• Objective of IE
  • Make sense of input data
• Making sense of data involves fitting a known model to the data
  • If the fit is successful we say we understand the data
  • Because we can derive or infer 'new' information using the model
• Example: pressure–volume data
  • Fitting a linear trend line
  • Model: linear model
  • Linear models such as this are easy to communicate (a sketch of such a fit follows below)
  • Text: 'There is an inverse relationship between the pressure and volume of an ideal gas' – Boyle's law
  • Graph: as shown on the slide
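A minimal Python sketch of such a fit, not from the lecture: since Boyle's law gives P·V = k, pressure is treated as linear in 1/V and a straight line is fitted with numpy. All readings and units are made up for illustration.

```python
import numpy as np

# Hypothetical pressure-volume readings for an ideal gas (illustrative values only).
volume = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # litres
pressure = np.array([10.1, 4.9, 3.4, 2.4, 2.0])    # arbitrary pressure units

# Boyle's law (P * V = k) means pressure is linear in 1/V,
# so a straight-line fit against 1/V captures the inverse relationship.
slope, intercept = np.polyfit(1.0 / volume, pressure, deg=1)

# The fitted model lets us infer 'new' information, e.g. predict pressure at V = 2.5 L.
predicted = slope * (1.0 / 2.5) + intercept
print(f"P = {slope:.2f}/V + {intercept:.2f}; predicted P at V = 2.5 L: {predicted:.2f}")
```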
Data Analysis
• Data analysis
  • Compute 'meaningful' abstractions from raw input data
  • May involve a strategic application of several individual analysis methods
  • Integrate elementary observations to identify high-level abstractions
• Results of data analysis are communicated to the end user using InfoVis
  • This means the data analysis module needs to be controlled by the end user
  • Several iterations of data analysis might be performed by the end user to develop insights into the whole underlying data
• Separation of data analysis tasks and data analysis methods
  • Data analysis tasks are achieved by data analysis methods
• Data analysis is defined by specifying its
  • input and
  • output
Input: Data
• Sensor data
  • Measurement of something in the real world
  • E.g. dive computer data, obtained from a pressure sensor installed on the dive computer, and BabyTalk data
  • Data collected from any form of measurement belongs to this class
  • Always involves a context in which the data can be interpreted
• Simulation data
  • Data generated by a computer simulation
  • Weather data, pollen data, etc.
  • Does not involve a context for interpretation
• The quality of the input data determines the quality of the output and also the effort required for data analysis
  • Clean data is easier to process and produces high-quality output
Output: Models and Patterns
• In general, the outputs of data analysis are representations of abstractions
• Models are abstractions that span the whole data set
  • Global abstractions
  • They do not model the data-generation process which produced the input data
• Patterns are abstractions that span portions of a data set
  • Local abstractions
• Output abstractions should be
  • Simple – such as the linear abstraction computed from the pressure–volume data (slide 5)
  • Global – such as the linear abstraction computed from the pressure–volume data (slide 5)
  • Local abstractions are acceptable in contexts where the user already has a global view of the information
• We may not always succeed in fitting global models
  • We may have to fit models piecewise
  • Fitting models to subsets (portions) of the input data set (a sketch follows below)
• Abstractions are easy to map to graphical elements in information visualizations (and sometimes also to words/phrases in text)
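When only local patterns can be found, piecewise fitting is one option. The sketch below is illustrative, not the lecture's method: it fits a separate straight line to each contiguous segment of a made-up series whose trend changes halfway through.

```python
import numpy as np

def piecewise_linear_fit(x, y, n_segments=3):
    """Fit a separate straight line to each contiguous segment of the data.

    Returns one (x_range, slope, intercept) tuple per segment - a local pattern -
    for use when a single global model does not fit the whole series.
    """
    segments = []
    for xs, ys in zip(np.array_split(x, n_segments), np.array_split(y, n_segments)):
        slope, intercept = np.polyfit(xs, ys, deg=1)
        segments.append(((xs[0], xs[-1]), slope, intercept))
    return segments

# Hypothetical series whose trend changes at x = 10.
x = np.arange(20, dtype=float)
y = np.where(x < 10, 2 * x, 30 - x)
for (lo, hi), slope, intercept in piecewise_linear_fit(x, y, n_segments=2):
    print(f"x in [{lo}, {hi}]: y = {slope:.2f}*x + {intercept:.2f}")
```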
Knowledge Acquisition (KA)
• In a conventional exploratory data analysis context, data analysis is a bottom-up, or data-driven, process
• In our case, data analysis is a top-down, or goal-driven, process
  • Knowledge acquisition (KA) studies (discussed next) identify the required data analysis tasks
• For designing the data analysis modules we need
  • Knowledge about the application domain and
  • Knowledge of the user tasks and the user's informational requirements
• KA studies to be performed before the design phase
  • With experts
  • With users
  • Case studies
  • Exploratory data analysis (EDA)
  • Prototype development
KA Techniques
• Techniques developed in the expert systems community
  • Think-aloud sessions
  • Direct interviews
  • Studying examples or case studies
• Exploratory Data Analysis (EDA)
  • To understand the data set using data analysis methods from descriptive statistics
    • Analytical methods
    • Graphical methods
  • You used EDA in practical 1 (a brief sketch follows below)
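As a reminder of the kind of EDA used in practical 1, here is a minimal sketch combining an analytical summary with a graphical one. The data frame and the column name are invented for illustration; in the practical you would load your own data set instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; in practical 1 you would read your own file here.
df = pd.DataFrame({"depth_m": [5.2, 7.8, 12.1, 15.4, 14.9, 9.3, 6.0, 4.1]})

# Analytical EDA: numerical summaries from descriptive statistics.
print(df["depth_m"].describe())   # count, mean, std, min, quartiles, max

# Graphical EDA: a quick visual impression of the distribution.
df["depth_m"].hist(bins=5)
plt.xlabel("depth_m")
plt.ylabel("frequency")
plt.show()
```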
Identification of data analysis tasks
• KA studies normally produce a list of queries the user wants to ask the system, such as
  • What is the typical value in the data set?
  • What are the outliers?
  • What are the relationships among the various data items?
  • What are the portions of the data that fit a given pattern?
  • What is the model that describes the data?
• These queries suggest the required data analysis tasks
• System responses to queries (individual or grouped) can be viewed as messages about the underlying data
  • Note that messages can be realized either in graphics or in text
Simple Example - Analysis of exam marks data
• Simple questions to be answered in this case:
  • What are the maximum and minimum marks?
  • What are the class average and standard deviation?
  • Frequency counts
    • How many failed the exam?
    • How many got a first class?
  • On which questions did students perform well or badly?
  • And so on
• The answers to the above questions are the different messages in this application
• In this case, the different data analysis tasks are:
  • Compute maximum,
  • Compute minimum,
  • Compute average
  • And other statistics (a sketch follows below)
• We can also work out the questions users ask of a system helping them to understand the world of digital cameras (example from the introduction lecture)
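A minimal sketch of these tasks on a hypothetical set of marks. The pass and first-class thresholds (40 and 70) are assumptions for illustration, not taken from the lecture; each computed entry is a 'message' that InfoVis (or text) could later present.

```python
import numpy as np

# Hypothetical exam marks out of 100.
marks = np.array([35, 42, 58, 61, 67, 70, 74, 48, 39, 81, 55, 66])

messages = {
    "maximum": marks.max(),
    "minimum": marks.min(),
    "class average": marks.mean(),
    "standard deviation": marks.std(ddof=1),
    "number failed (<40)": int((marks < 40).sum()),
    "number with first class (>=70)": int((marks >= 70).sum()),
}

# Each entry is a message about the underlying data.
for name, value in messages.items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```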
Design of Data Analysis Module
• Main steps in the design process
  • Perform KA studies
  • Identify the HCI (Human Computer Interaction) features of the user's interaction with the full system
    • Single view of the output
    • Interactive views of the output
  • Identify the required data analysis tasks from the KA studies
  • For each of the tasks design a data analysis method
  • Decide how these methods are controlled (a pipeline sketch follows below)
    • Pipeline or
    • More sophisticated architectures
      • If the user wants to interact with the system freely (loose coupling)
      • If the user wants to interact with the system according to an HCI theory (theory driven)
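The simplest control regime is a pipeline. The sketch below is illustrative only; the method names (clean, detect_patterns, summarise) are invented, not from the lecture. Each method consumes the output of the previous one, matching the pre-processing / analysis / post-processing split discussed on the next slide.

```python
from typing import Callable, Iterable

def run_pipeline(data, methods: Iterable[Callable]):
    """Apply each analysis method in turn to the output of the previous one."""
    result = data
    for method in methods:
        result = method(result)
    return result

clean = lambda xs: [x for x in xs if x is not None]        # pre-processing
detect_patterns = lambda xs: [x for x in xs if x > 10]     # a stand-in analysis step
summarise = lambda xs: {"count": len(xs), "max": max(xs)}  # post-processing for InfoVis

print(run_pipeline([3, None, 14, 27, 8], [clean, detect_patterns, summarise]))
```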
Design of Data Analysis Module (II)
• Consider the contextual effects of other tasks
  • Each method works in the context of other methods related to other tasks
  • Unknown territory – more studies are required here
• Optionally design
  • A pre-processing method for preparing the raw input data for data analysis
  • A post-processing method that organizes the results of data analysis as required by the InfoVis module
• Cycle through the above design steps many times
  • Evaluating the design at the end of every cycle
• The above procedure relies heavily on the information from KA studies and evaluation studies
  • The quality of KA and evaluations is important
  • KA and evaluations are the hardest tasks of the system-building activity
Evaluation
• Independent evaluation of the data analysis module
  • Using known metrics such as precision and recall (a sketch follows below)
• Evaluation of the data analysis module in the context of the whole system
  • New metrics are required to measure the goodness of the system as a whole
  • Metrics may vary with improvements in technology
• Task (user) based evaluations
  • Studied later during the course
• Evaluations are costly
  • Multiple cheap evaluations are often better than one expensive evaluation
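For the independent evaluation, precision and recall can be computed directly once gold-standard annotations are available. A minimal sketch, with hypothetical pattern identifiers standing in for the system output and an expert's markup:

```python
def precision_recall(detected: set, gold: set):
    """Precision and recall of detected patterns against a gold standard."""
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical pattern identifiers found by the system vs. marked by a domain expert.
system_output = {"ascent_1", "ascent_2", "spike_7"}
expert_gold = {"ascent_1", "ascent_2", "ascent_3"}
p, r = precision_recall(system_output, expert_gold)
print(f"precision={p:.2f}, recall={r:.2f}")
```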
Design of Data Analysis Methods
• For each identified data analysis task we need a data analysis method
  • The actual procedure or algorithm that achieves the task
• Data analysis methods are developed in many fields:
  • Statistics
  • Data Mining
  • Pattern Recognition
  • Machine Learning
• We reuse methods developed in the above fields
• Sometimes, we assume a library of data analysis methods
  • Such as R/MatLab
Statistics
• Time-tested techniques for primary data analysis
• Most of the data analysis tasks in our exam marks example can be achieved by statistical techniques
• Two types of techniques
  • Numerical
    • Compute statistics such as the mean and standard deviation
    • Good at computing objective and precise descriptions of data
  • Graphical
    • Create histograms, stem-and-leaf displays and box plots
    • Good at presenting (communicating) the data to humans
  • Note: statisticians exploited the power of combining numerical techniques (data analysis) with graphical techniques (InfoVis)
• Work well for analysing smaller data sets – hundreds and thousands of data items, not millions and billions
  • Hence the need for data analysis techniques that process large data sets – millions and billions of data items
• Algorithmic implementations of many statistical procedures are available in the form of libraries (for example R)
Data Mining
• Data-driven techniques for discovering unsuspected and useful patterns or models from very large data sets – megabytes and gigabytes
• Mainly used for secondary analysis of data, often without any specific goal
  • Pure statisticians might call it 'data fishing'
• Largely made up of existing statistical ideas scaled up!
• Data mining does not replace humans
  • Data mining offers tools to perform data analysis
  • As with all tools, the quality of the results of data mining depends on the skill of the user
  • It increases the productivity of the user
  • The user should be good at
    • Statistics
    • Computer science and
    • Domain knowledge
Pattern Recognition
• Techniques for solving perceptual problems
  • Image processing
  • Speech processing
• In this course we are concerned with simple patterns, such as a rapid ascent in a scuba dive profile
• We will design our own simple pattern detection methods (a sketch follows below)
• But in general, pattern recognition methods are part of our technology
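A minimal sketch of such a hand-built detector: flag a rapid ascent whenever depth decreases faster than a threshold between consecutive samples. The sampling interval and the 10 m/min threshold are illustrative assumptions, not dive-safety guidance or the lecture's actual method.

```python
def rapid_ascents(depths_m, interval_s=10, max_rate_m_per_min=10.0):
    """Return (sample index, ascent rate in m/min) for each rapid-ascent event."""
    events = []
    for i in range(1, len(depths_m)):
        # Positive rate means the diver is ascending (depth decreasing).
        rate = (depths_m[i - 1] - depths_m[i]) * 60.0 / interval_s
        if rate > max_rate_m_per_min:
            events.append((i, rate))
    return events

profile = [18.0, 17.5, 17.0, 14.0, 10.5, 10.2, 10.0]  # hypothetical dive-computer samples
print(rapid_ascents(profile))   # [(3, 18.0), (4, 21.0)]
```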
Machine Learning
• Data analysis techniques for automated learning
• Usually the output of learning is used by machines, not humans
• Not studied here (covered in CS5565 Data Mining)
Reusing data analysis methods
• Data analysis methods are normally designed in an idealized mathematical context
  • The user of these methods (such as R/MatLab methods) is expected to know how to map information from real contexts into this idealized mathematical context
• As a result, when reusing data analysis methods we need to map information from our context into the idealized context and map the results back from the idealized context into our context (a sketch follows below)
  • Note that we use data analysis in an HCI context
• This also means that designing a data analysis module involves
  • A search of a library of methods (such as R/MatLab) for a method with a good match between
    • The requirements due to the HCI context and
    • The known features of the data analysis methods
  • Adapting an existing method to suit the user requirements
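A minimal sketch of this kind of reuse: the library routine (numpy's polyfit here) works in an idealised numeric world, so a wrapper maps domain data (timestamped dive depths) into plain arrays, calls the routine, and maps the fitted slope back into a domain-level message for InfoVis or text. The function and field names are invented for illustration.

```python
import numpy as np

def descent_trend(samples):
    """Map (time_s, depth_m) samples into the idealised context, fit, and map back."""
    times = np.array([t for t, _ in samples], dtype=float)   # domain -> idealised
    depths = np.array([d for _, d in samples], dtype=float)
    slope, _ = np.polyfit(times, depths, deg=1)              # reused library method
    direction = "descending" if slope > 0 else "ascending"   # idealised -> domain
    return {"metres_per_second": slope, "direction": direction}

# Hypothetical dive-computer samples: (seconds since start, depth in metres).
print(descent_trend([(0, 2.0), (30, 8.5), (60, 14.0), (90, 19.5)]))
```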
Requirements due to HCI
• Interactivity – communication using information visualizations (studied later in the course) is essentially interactive
  • Here, communicating the system's internal context to the user is important
• Multi-modality – based on the abilities or disabilities of users
• Gaps in communication – depending on the output modality, certain abstractions might be hard to communicate
• Users' informational requirements
  • Level of expertise or prior knowledge
• Output size restrictions
  • Limited screen size, etc.
Features of Data Analysis Methods
• Configurability
  • Data analysis methods use parameters that allow users to configure their runtime behaviour
  • These parameters may not be suitable from the communication perspective
  • Users may not always be able to specify these parameters accurately
  • When an ideal fit of parameters is not available, we modify these methods with a parameterisation suitable to our contexts (a sketch follows below)
  • Or we look for approximate fits
• Level of abstraction
  • Data analysis methods abstract the raw input data, as stated earlier, into either global models or local patterns
  • The level of abstraction achieved has important consequences for the InfoVis module
    • Because the level of abstraction determines the level of detail in the final output
  • The level of abstraction should be determined by the end-user tasks and the end-user's informational requirements
• Size of the final output
  • One of the major constraints on the design of the data analysis module is the size of the output produced by the whole system
  • Users do not like
    • Large, complicated graphics or
    • Large volumes of text
  • Again, the user's size requirements should determine how much information is computed by the data analysis method
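A minimal sketch of configurability driven by the HCI context: an assumed output-size budget from the InfoVis side is mapped onto the analysis method's own parameter, here the number of outliers it reports. All names and numbers are illustrative, not from the lecture.

```python
def top_outliers(values, max_messages=3):
    """Report the values furthest from the mean, capped at max_messages items."""
    mean = sum(values) / len(values)
    ranked = sorted(values, key=lambda v: abs(v - mean), reverse=True)
    return ranked[:max_messages]   # output size is driven by the user/HCI context

screen_budget = 2   # hypothetical constraint passed in from the InfoVis module
print(top_outliers([10, 12, 11, 45, 9, -20, 13], max_messages=screen_budget))
```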
Many alternative methods
• The data mining community develops multiple methods for achieving the same data analysis task
• When several data analysis methods are available, a method that generates abstractions satisfying the user requirements should be preferred (a sketch follows below)
• When an ideal fit is not available, we choose the method that achieves the best result and make the alternatives available for exploration
  • Making the exploration of multiple methods user friendly is challenging
  • For complex methods users need a black-box view
  • For simpler methods users require a glass-box view
• We could also implement an adapted version of an existing method
  • Users could be offered features to control the adaptation process
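One way to manage alternatives is to record the known features of each candidate method and pick the one that best matches the user requirements, keeping the rest available for exploration. The sketch below is illustrative; the method and feature names are invented.

```python
# Hypothetical registry of alternative methods for one task, with declared features.
METHODS = {
    "moving_average": {"abstraction": "local", "explainable": True},
    "global_linear_fit": {"abstraction": "global", "explainable": True},
    "clustering": {"abstraction": "global", "explainable": False},
}

def choose_method(requirements):
    """Pick the method whose features match the most requirements; keep the rest."""
    def score(features):
        return sum(features.get(k) == v for k, v in requirements.items())
    best = max(METHODS, key=lambda name: score(METHODS[name]))
    return best, [m for m in METHODS if m != best]

preferred, alternatives = choose_method({"abstraction": "global", "explainable": True})
print(preferred, alternatives)   # preferred method, plus alternatives kept for exploration
```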
Data types
• Input data can be of different types:
  • Single variable
  • Multi-variable
  • Time series or
  • Spatial
  • And others
• Data analysis methods depend upon the type of the input data
• In this course we focus on the analysis of
  • Time series data
    • E.g. scuba dive profile data
  • Spatial data
    • E.g. georeferenced census data
Summary
• The data analysis module needs to be integrated with the InfoVis module
• Data analysis methods need to be controlled by the user
  • The success of user control depends upon the success of KA
• A better understanding of the overall system requirements and its operational context leads to a better design
• Designing the data analysis module involves finding the best possible match between
  • The requirements due to the HCI context and
  • The features of data analysis methods
• In this course we focus on the analysis of
  • Time series data and
  • Spatial data