Data Analysis Yaji Sripada
In this lecture you learn • The objectives of data analysis • Fitting models to the input data according to the end-user requirements • Data Analysis • Tasks and • Methods • Knowledge acquisition (KA) techniques to understand the required data analysis tasks and requirements due to HCI (Human Computer Interaction) • An iterative process for designing data analysis methods • With multiple KA and evaluation studies • Issues with the reuse of data analysis algorithms developed in other fields by matching • Requirements due to HCI • Features of data analysis methods Dept. of Computing Science, University of Aberdeen
Introduction • [Figure: architecture diagram with components Input Data, Data Analysis, Information Visualization and End-User Interaction] • High-level architecture of our systems • Data Analysis (DA) • Compute patterns or models (in general, abstractions) from raw input data • Information Visualization (infovis) • Present the relevant abstractions (patterns or models) in a form suitable to the end user • Support user interaction (examples will be shown later) • Integrating Data Analysis and InfoVis is the main focus of this course • Two options for integration • Option 1: Loose Coupling • Option 2: Theory Driven
Introduction (2) • Loose Coupling • Two libraries, one of data analysis methods and one of infovis methods, are offered to the user • The user is free to exploit the available methods to understand the data • Certain constraints may be defined in linking a specific data analysis method with a specific infovis method • Already available in many existing tools such as R and Excel • Theory Driven • The InfoVis module defines an HCI theory which guides the user's access to data analysis methods and also to the visualizations • DA works under two contexts • Domain Context • HCI (Human Computer Interaction) Context • We lump together all HCI-related issues here but study them later • The impact of the HCI context on DA is under investigation
Introduction (3) • Objective of IE • Make sense of input data • Making sense of data involves fitting a known model to the data • If the fit is successful we say we understand the data • Because we can derive or infer 'new' information using the model • Example: Pressure volume data • Fitting a linear trend line • Model: Linear model • Linear models such as this are easy to communicate • Text: 'There is an inverse relationship between pressure and volume of an ideal gas' (Boyle's law) • Graph: as shown on the side
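The pressure volume example can be sketched in code. The slide's graph is not available here, so this sketch assumes one common way of getting a linear model out of an inverse relationship: fit pressure against 1/volume. The readings and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up readings that follow P * V = 100 exactly (illustrative only).
volumes = [10.0, 20.0, 25.0, 50.0, 100.0]
pressures = [100.0 / v for v in volumes]

# Boyle's law is linear in 1/V, so fit pressure against the reciprocal volume.
inv_volumes = [1.0 / v for v in volumes]
slope, intercept = fit_line(inv_volumes, pressures)

# Use the fitted model to infer 'new' information: pressure at an unseen volume.
predicted = slope * (1.0 / 40.0) + intercept
print(f"P = {slope:.1f} * (1/V) + {intercept:.1f}; predicted P at V=40: {predicted:.2f}")
```

Once the fit succeeds, the model answers questions the raw data did not contain, such as the pressure at a volume that was never measured.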
Data Analysis • Data Analysis • Compute 'meaningful' abstractions from raw input data • May involve a strategic application of several individual analysis methods • Integrate elementary observations to identify high-level abstractions • Results of data analysis are communicated to the end-user using infovis • This means the data analysis module needs to be controlled by the end-user • Several iterations of data analysis might be performed by the end-user to develop insights into the whole underlying data • Separation of data analysis tasks and data analysis methods • Data analysis tasks are achieved by data analysis methods • Data analysis is defined by specifying its • input and • output
Input: Data • Sensor Data • Measurement of something in the real world • E.g. dive computer data, obtained from a pressure sensor installed on the dive computer, and BabyTalk data • Data collected from any form of measurement belongs to this class • Always involves a context in which the data can be interpreted • Simulation data • Data generated by a computer simulation • E.g. weather data, pollen data, etc. • Does not involve context for interpretation • Quality of input data determines the quality of the output and also determines the effort required for data analysis • Clean data is easier to process and produces high-quality output
Output: Models and Patterns • In general, outputs of data analysis involve representations of abstractions • Models are abstractions that span the whole data set • Global abstractions • Do not model the data generation process which produced the input data • Patterns are abstractions that span portions of a data set • Local abstractions • Output abstractions should be • Simple - such as the linear abstraction computed from the pressure volume data (slide 5) • Global - such as the linear abstraction computed from the pressure volume data (slide 5) • Local abstractions are acceptable in contexts where the user already has a global view of the information • We may not always succeed in fitting global models • We may have to fit models piecewise • Fitting models to subsets (portions) of the input data set • Easy to map to graphical elements in information visualizations (and also to words/phrases in the text sometimes)
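Piecewise fitting can be illustrated with a minimal sketch. It assumes fixed-length segments, which is only one of many ways to choose the portions, and the series and the names `slope_of` and `piecewise_slopes` are hypothetical.

```python
def slope_of(points):
    """Least-squares slope of a list of (t, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return sxy / sxx

def piecewise_slopes(series, segment_length):
    """Split the series into consecutive chunks and fit a line to each chunk."""
    result = []
    for start in range(0, len(series), segment_length):
        chunk = series[start:start + segment_length]
        if len(chunk) >= 2:  # a line needs at least two points
            result.append((chunk[0][0], chunk[-1][0], slope_of(chunk)))
    return result

# Made-up series: rises by 2 per step for t = 0..4, then falls by 1 per step for t = 5..9.
series = [(t, 2.0 * t) for t in range(5)] + [(t, 8.0 - (t - 4)) for t in range(5, 10)]

# Each tuple is (start time, end time, fitted slope): one local abstraction per portion.
segments = piecewise_slopes(series, 5)
print(segments)
```

No single line describes this series well, but two local linear models do, and each maps naturally to one graphical element or one phrase such as "rose steadily".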
Knowledge Acquisition (KA) • In a conventional exploratory data analysis context, data analysis is a bottom-up, data-driven process • In our case, data analysis is a top-down, goal-driven process • Knowledge acquisition (KA) studies (discussed next) identify the required data analysis tasks • For designing the data analysis modules we need • Knowledge about the application domain and • Knowledge of the user tasks and the user's informational requirements • KA studies are to be performed before the design phase • With experts • With users • Case studies • Exploratory data analysis (EDA) • Prototype development
KA Techniques • Techniques developed in the expert system community • Think aloud sessions • Direct interviews • Studying examples or case studies • Exploratory Data Analysis (EDA) • To understand the data set using data analysis methods from descriptive statistics • Analytical methods • Graphical methods • You used EDA in practical 1
Identification of data analysis tasks • KA studies normally produce a list of queries the user wants to ask the system, such as • What is the typical value in the data set? • What are the outliers? • What are the relationships among the various data items? • What are the portions of the data that fit a given pattern? • What is the model that describes the data? • Queries suggest the required data analysis tasks • System responses to queries (individual or grouped) can be viewed as messages about the underlying data • Please note that messages can be realized using either graphics or text
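Two of the queries above, the typical value and the outliers, map directly to small data analysis tasks. The sketch below assumes the median as the "typical value" and the common 1.5 × IQR rule for outliers; neither choice comes from the lecture, and the readings are made up.

```python
from statistics import median, quantiles

def typical_and_outliers(values):
    """Answer two queries: the typical value and the outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    spread = q3 - q1                    # interquartile range (IQR)
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in values if v < low or v > high]
    return median(values), outliers

# Made-up readings with one obviously unusual value.
readings = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]
typical, outliers = typical_and_outliers(readings)
print(f"Typical value: {typical}; outliers: {outliers}")
```

Each returned value is a candidate message that could be rendered as text or marked on a plot.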
Simple Example - Analysis of exam marks data • Simple questions to be answered in this case: • What are the maximum and minimum marks? • What are the class average and standard deviation? • Frequency counts • How many failed the exam? • How many got first class? • On which questions did students perform well or poorly? • And so on • Answers to the above questions are the different messages in this application • In this case, the different data analysis tasks are: • Compute maximum • Compute minimum • Compute average • And other statistics • We can similarly work out the questions users would ask of a system that helps them understand the world of digital cameras (example from the introduction lecture)
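The exam marks tasks can be sketched as one function that returns the messages. The marks are made up, and the pass mark of 40 and first-class mark of 70 are assumptions for illustration, not values given in the lecture.

```python
from statistics import mean, stdev

PASS_MARK = 40          # assumed threshold, not from the lecture
FIRST_CLASS_MARK = 70   # assumed threshold, not from the lecture

def exam_messages(marks):
    """Compute one message per data analysis task."""
    return {
        "maximum": max(marks),
        "minimum": min(marks),
        "average": mean(marks),
        "std_dev": stdev(marks),  # sample standard deviation
        "failed": sum(1 for m in marks if m < PASS_MARK),
        "first_class": sum(1 for m in marks if m >= FIRST_CLASS_MARK),
    }

# Made-up marks for a class of ten students.
marks = [35, 48, 52, 61, 67, 70, 74, 82, 39, 55]
messages = exam_messages(marks)
for name, value in messages.items():
    print(f"{name}: {value}")
```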
Design of Data Analysis Module • Main steps in the design process • Perform KA studies • Identify the HCI (Human Computer Interaction) features of the user's interaction with the full system • Single view of output • Interactive views of output • Identify the required data analysis tasks from KA studies • For each of the tasks design a data analysis method • Decide how these methods are controlled • Pipeline or • More sophisticated architectures • If the user wants to interact with the system freely (Loose coupling) • If the user wants to interact with the system according to an HCI theory (Theory driven)
Design of Data Analysis Module (II) • Consider the contextual effects of other tasks • Each method works in the context of other methods related to other tasks • Unknown territory: more studies are required here • Optionally design • A pre-processing method for preparing the raw input data for data analysis • A post-processing method that organizes the results of data analysis as required by the infovis module • Cycle through the above design steps many times • Evaluating the design at the end of every cycle • The above procedure relies heavily on the information from KA studies and evaluation studies • Quality of KA and evaluations is important • KA and evaluations are the hardest tasks of the system building activity
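The simplest control option, a pipeline with optional pre-processing and post-processing, might look like the sketch below. The stage names, the missing-value convention (`None`) and the message format are all assumptions chosen for illustration.

```python
from statistics import mean

def preprocess(raw_readings):
    """Pre-processing: drop missing readings (None) before analysis."""
    return [r for r in raw_readings if r is not None]

def analyse(readings):
    """Data analysis: compute two simple abstractions of the readings."""
    return {"mean": mean(readings), "max": max(readings)}

def postprocess(results):
    """Post-processing: organize results as (label, value) messages for display."""
    return [(name, round(value, 1)) for name, value in sorted(results.items())]

def run_pipeline(raw_readings):
    """Fixed pipeline: each stage feeds the next, with no user control in between."""
    return postprocess(analyse(preprocess(raw_readings)))

# Made-up raw input with two missing readings.
messages = run_pipeline([12.0, None, 15.5, 14.0, None, 18.5])
print(messages)
```

A more sophisticated architecture would let the user choose or re-run individual stages rather than always executing them in this fixed order.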
Evaluation • Independent evaluation of the data analysis module • Using known metrics such as precision and recall • Evaluation of the data analysis module in the context of the whole system • New metrics are required to measure the goodness of the system as a whole • Metrics may vary with improvements in technology • Task (user) based evaluations • Studied later in the course • Evaluations are costly • Multiple cheap evaluations are often better than one expensive evaluation
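Precision and recall for an independent evaluation can be computed as in this sketch. It assumes the module's detected items are compared against a reference set, for example patterns marked by an expert; the item identifiers are made up.

```python
def precision_recall(detected, relevant):
    """Precision: fraction of detected items that are relevant.
    Recall: fraction of relevant items that were detected."""
    detected, relevant = set(detected), set(relevant)
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up identifiers: patterns found by the module vs. patterns in the reference set.
detected_patterns = {1, 2, 3, 4}
reference_patterns = {2, 3, 5}
precision, recall = precision_recall(detected_patterns, reference_patterns)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```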
Design of Data Analysis Methods • For each identified data analysis task we need a data analysis method • The actual procedure or algorithm that achieves the task • Data analysis methods are developed in many fields: • Statistics • Data Mining • Pattern Recognition • Machine Learning • We reuse methods developed in the above fields • Sometimes, we assume a library of data analysis methods • Such as R/MatLab
Statistics • Time-tested techniques for primary data analysis • Most of the data analysis tasks in our exam marks example can be achieved by statistical techniques • Two types of techniques • Numerical • Compute statistics such as mean and standard deviation • Good at computing objective and precise descriptions of data • Graphical • Create histograms, stem and leaf displays and box plots • Good at presenting (communicating) the data to humans • Note: Statisticians exploited the power of combining numerical techniques (data analysis) and graphical techniques (infovis) • Work well for analysing smaller data sets: hundreds or thousands of data items, not millions or billions • There is a need for data analysis techniques that process large data sets of millions or billions of data items • Algorithmic implementations of many statistical procedures are available in the form of libraries (for example R)
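The pairing of a numerical technique with a graphical one can be shown in a small sketch: frequency counts per bin are the data analysis step, and a text histogram is a very plain stand-in for the graphical step. The marks and the bin width of 10 are made up for illustration.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Numerical step: count how many values fall into each bin."""
    return Counter((v // bin_width) * bin_width for v in values)

def text_histogram(values, bin_width):
    """Graphical step: draw one bar of '#' characters per bin."""
    counts = bin_counts(values, bin_width)
    lines = []
    for low in sorted(counts):
        label = f"{low:3d}-{low + bin_width - 1:<3d}"
        lines.append(f"{label} {'#' * counts[low]}")
    return lines

# Made-up exam marks.
marks = [35, 48, 52, 61, 67, 70, 74, 82, 39, 55]
counts = bin_counts(marks, 10)
lines = text_histogram(marks, 10)
print("\n".join(lines))
```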
Data Mining • Data-driven techniques for discovering unsuspected and useful patterns or models from very large data sets (megabytes and gigabytes) • Mainly used for secondary analysis of data, often without any specific goal • Pure statisticians might call it 'data fishing' • Largely made up of existing statistical ideas scaled up! • Data mining does not replace humans • Data mining offers tools to perform data analysis • As with all tools, the quality of the results of data mining depends on the skill of the user • Increases the productivity of the user • The user should be good at • Statistics • Computer Science and • Domain knowledge
Pattern Recognition • Techniques for solving perceptual problems • Image Processing • Speech Processing • In this course we are concerned with simple patterns such as rapid ascent in a scuba dive profile • We will design our own simple pattern detection methods • But in general pattern recognition methods are part of our technology
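A simple pattern detector for rapid ascent could look like the sketch below. It assumes the dive profile is a list of (time in seconds, depth in metres) samples and uses a hypothetical threshold of 10 metres per minute; the lecture does not specify either the data layout or the limit.

```python
def rapid_ascents(profile, max_rate_m_per_min=10.0):
    """Return (start_time, end_time, rate) for every interval where the
    ascent rate exceeds the threshold. The threshold is an assumed value."""
    events = []
    for (t0, d0), (t1, d1) in zip(profile, profile[1:]):
        minutes = (t1 - t0) / 60.0
        rate = (d0 - d1) / minutes  # positive when the diver moves towards the surface
        if rate > max_rate_m_per_min:
            events.append((t0, t1, rate))
    return events

# Made-up profile: descend to 20 m, stay, ascend quickly to 5 m, then surface slowly.
profile = [(0, 0.0), (60, 10.0), (120, 20.0), (180, 20.0),
           (240, 5.0), (300, 5.0), (360, 0.0)]
events = rapid_ascents(profile)
print(events)
```

Each detected interval is a local abstraction that spans only a portion of the dive, which is what the earlier slides call a pattern.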
Machine Learning • Data analysis techniques for automated learning • Usually the output of learning is used by machines, not humans • Not studied here (covered as part of CS5565 Data Mining)
Reusing data analysis methods • Data analysis methods are normally designed in an idealized mathematical context • The user of these methods (such as R/MatLab methods) is expected to know how to map information from real contexts to this idealized mathematical context • As a result, while reusing data analysis methods we need to map information from our context to the idealized context and map the results back from the idealized context to our context • Note that we use data analysis in an HCI context • This also means designing a data analysis module involves • A search for a method in a library of methods (such as R/MatLab) with a good match between • requirements due to the HCI context and • known features of the data analysis methods • Adapting an existing method to suit the user requirements
Requirements due to HCI • Interactivity: communication using information visualizations (studied later in the course) is essentially interactive • Here, communicating the system's internal context to the user is important • Multi-modality: based on the abilities or disabilities of users • Gaps in communication: depending on the output modality, certain abstractions might be hard to communicate • Users' informational requirements • Level of expertise or prior knowledge • Output size restrictions • Limited screen size etc.
Features of Data Analysis Methods • Configurability • Data analysis methods use parameters that allow users to configure their runtime behaviour • These parameters may not be suitable from the communication perspective • Users may not always be able to specify these parameters accurately • When an ideal fit of parameters is not available we modify these methods with a parameterisation suitable to our contexts • Or look for approximate fits • Level of Abstraction • Data analysis methods abstract the raw input data, as stated earlier, into either global models or local patterns • The level of abstraction achieved has important consequences for the infovis module • Because the level of abstraction determines the level of detail in the final output • The level of abstraction should be determined by the end-user tasks and the end-user's informational requirements • Size of the final output • One of the major constraints on the design of the data analysis module is the size of the output produced by the whole system • Users do not want • Large complicated graphics or • Large volumes of text • Again, the user's output size requirements should determine how much information is computed by the data analysis method
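One way to let an output size requirement drive the level of abstraction is to expose a single user-facing parameter, the maximum number of points to show, and average the series within that many segments (a technique often called piecewise aggregate approximation). This is a sketch under that assumption; the function name `summarise` and the data are hypothetical.

```python
def summarise(values, max_points):
    """Reduce a series to at most max_points values by averaging
    consecutive, nearly equal-sized segments."""
    n = len(values)
    if n <= max_points:
        return list(values)  # already small enough to show in full
    summary = []
    for i in range(max_points):
        start = i * n // max_points
        end = (i + 1) * n // max_points
        chunk = values[start:end]
        summary.append(sum(chunk) / len(chunk))
    return summary

# Made-up series of 12 readings, reduced for a display that fits only 3 points.
series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
summary = summarise(series, 3)
print(summary)
```

The parameter is expressed in terms the user cares about (how much fits on the screen) rather than in terms internal to the analysis method.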
Many alternative methods • The data mining community develops multiple methods for achieving the same data analysis task • When several data analysis methods are available, a method that generates abstractions which satisfy user requirements should be preferred • When an ideal fit is not available we choose the method that achieves the best result and make alternatives available for exploration • Making the exploration of multiple methods user friendly is challenging • For complex methods users need a black-box view • For simpler methods users require a glass-box view • We could also implement an adapted version of an existing method • Users could be offered features to control the adaptation process
Data types • Input data can be of different types: • Single variable • Multi-variable • Time series • Spatial • And others • Data analysis methods depend upon the type of the input data • In this course we focus on analysis of • Time series data • E.g. scuba dive profile data • Spatial data • E.g. georeferenced census data
Summary • The data analysis module needs to be integrated with the infovis module • Data analysis methods need to be controlled by the user • The success of user control depends upon the success of KA • Better understanding of the overall system requirements and the system's operational context leads to better design • Designing the data analysis module involves finding the best possible match between • Requirements due to the HCI context and • Features of data analysis methods • In this course we focus on analysis of • Time series data and • Spatial data