Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Peter Fox • Data Analytics – ITWS-4963/ITWS-6965 • Week 2a, January 28, 2014, SAGE 3101

Admin info (keep/ print this slide) • Class: ITWS-4963/ITWS 6965 • Hours: 12:00pm-1:50pm Tuesday/ Friday • Location: SAGE 3101 • Instructor: Peter Fox • Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg) • Contact hours: Monday** 3:00-4:00pm (or by email appt) • Contact location: Winslow 2120 (sometimes Lally 207A announced by email) • TA: Lakshmi Chenicheri chenil@rpi.edu • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014 • Schedule, lectures, syllabus, reading, assignments, etc.

Contents • Back to the data sources • Cyber • Human • “Munging” • Beginning with hypothesis -> synthesis • Distributions… • Scoping out analysis and model choices

Lower layers in the Analytics Stack

“Cyber Data” …

“Human Data” …

Descriptive / Inferential • Descriptive statistics: numerical summaries of samples • i.e., what was observed, distributions • The ‘sample’ may be exhaustive, i.e., identical to the population • Inferential statistics: from samples to populations • i.e., what could have been or will be observed in a larger population • Descriptive (report) to Inferential (model suggestion) is a key process in analytics • So often NOT a linear process • Sample bias – choice and awareness • Adapted from Marshall Ma (and other sources)
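A minimal sketch of the step from descriptive to inferential statistics. The population here is synthetic (an assumption for illustration, not course data), and the interval uses the usual normal approximation of 1.96 standard errors.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic measurements (not real data).
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Draw a simple random sample of 100 objects from the population.
sample = random.sample(population, 100)

# Descriptive: summarise what was actually observed in the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: say something about the population from the sample,
# here an approximate 95% interval for the population mean.
standard_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"approx. 95% interval for the population mean: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"population mean (known only because the data are synthetic): "
      f"{statistics.mean(population):.2f}")
```

With real data the last line is not available, which is the point: the interval is the inferential statement.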

Populations and samples • A population is defined • We must be able to say, for every object, if it is in the population or not • We must be able, in principle, to find every individual of the population • A geographic example of a population is all pixels in a multi-spectral satellite image • A sample is a subset of a population • We must be able to say, for every object in the population, if it is in the sample or not • Sampling is the process of selecting a sample from a population • E.g. 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
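The pixel example can be sketched as follows. The image size (20 x 30) and sample size (50) are hypothetical; the point is only that membership in both the population and the sample is decidable for every object.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: every pixel (row, col) of a 20 x 30 image.
N_ROWS, N_COLS = 20, 30
population = [(row, col) for row in range(N_ROWS) for col in range(N_COLS)]
population_set = set(population)

def in_population(obj):
    """For every object we can say whether it is in the population."""
    return obj in population_set

# Sampling: select 50 pixels at random, without replacement.
sample = set(random.sample(population, 50))

def in_sample(obj):
    """For every object in the population we can say whether it is in the sample."""
    return obj in sample

print(in_population((5, 7)))    # a pixel inside the image
print(in_population((25, 0)))   # row 25 does not exist in a 20-row image
print(sum(in_sample(p) for p in population), "of", len(population), "pixels sampled")
```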

Election prediction • Exit polls versus election results • Human versus cyber • How is the “population” defined here? • What is the sample, how chosen? • What is described and how is that used to predict? • Are results categorized? (where from, M/F, age) • What is the uncertainty? • It is reflected in the “sample distribution” • And controlled/constrained by “sampling theory”
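One way to put a number on the uncertainty question: a rough margin of error for a polled proportion. This is a sketch that assumes simple random sampling and the normal approximation; the poll figures are hypothetical, and real exit polls use more elaborate designs.

```python
import math

def proportion_margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n; z=1.96 corresponds to
    roughly 95% coverage under the normal approximation.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical exit poll: 1,000 respondents, 52% report voting for candidate A.
n = 1000
p_hat = 0.52
moe = proportion_margin_of_error(p_hat, n)

print(f"estimate = {p_hat:.3f} +/- {moe:.3f}")
print("lead is within the margin of error" if p_hat - moe < 0.5 else "lead exceeds the margin of error")
```

Sampling bias (who agrees to answer, which precincts are polled) is not captured by this number at all.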

Bias difference: between cyber and human data • Election results and exit polls • What are examples of bias in election results? • In exit polls?

Hypothesis • What are you exploring? • Regular data analytics features ~ well defined hypotheses • Big Data messes that up • E.g. Stock market performance / trends versus unusual events (crash/ boom): • Populations versus samples – which is which? • Why? • E.g. Election results are predictable from exit polls

Distributions • http://www.quantitativeskills.com/sisa/rojo/alldist.zip • Shape • Character • Parameter(s)

Plotting these distributions • Histograms and binning • Getting used to log scales • Going beyond 2-D • More of this on Friday (in more detail)
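A small sketch of why binning and log scales matter. The data are synthetic log-normal values (an assumption for illustration); the helper names are hypothetical and only the standard library is used, so no plotting is done here.

```python
import bisect
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data spanning several orders of magnitude.
values = [random.lognormvariate(0.0, 2.0) for _ in range(5000)]

def make_edges(lo, hi, n_bins, log_scale=False):
    """Return n_bins + 1 bin edges from lo to hi, equally spaced
    either linearly or in log10 space."""
    if log_scale:
        a, b = math.log10(lo), math.log10(hi)
        inner = [10 ** (a + (b - a) * i / n_bins) for i in range(1, n_bins)]
    else:
        inner = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]
    return [lo] + inner + [hi]

def histogram(data, edges):
    """Count values per bin; a value equal to the top edge goes in the last bin."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = bisect.bisect_right(edges, x) - 1
        i = min(max(i, 0), len(counts) - 1)
        counts[i] += 1
    return counts

lo, hi = min(values), max(values)
linear_counts = histogram(values, make_edges(lo, hi, 10))
log_counts = histogram(values, make_edges(lo, hi, 10, log_scale=True))

print("linear bins:", linear_counts)  # nearly everything lands in the first bin
print("log bins:   ", log_counts)     # the shape of the distribution becomes visible
```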

In applications • Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html • R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html • Matlab: http://www.mathworks.com/help/stats/_brn2irf.html • Excel: HAH!

Heavy-tail distributions • Are probability distributions whose tails are not exponentially bounded • Common – long-tail… human v. cyber… • [Figure labels: few that dominate; more that add up; equal areas] • http://en.wikipedia.org/wiki/Heavy-tailed_distribution
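To make "not exponentially bounded" concrete, the sketch below compares tail probabilities P(X > x) of an exponential distribution with those of a Pareto distribution, a standard heavy-tailed example. The parameter values are arbitrary choices for illustration.

```python
import math

def exponential_tail(x, rate=1.0):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def pareto_tail(x, alpha=1.5, x_min=1.0):
    """P(X > x) for a Pareto distribution with shape alpha and minimum x_min."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

# The Pareto tail decays as a power law, so relative to any exponential
# tail it eventually dominates, and the ratio keeps growing with x.
for x in (1, 10, 100):
    ratio = pareto_tail(x) / exponential_tail(x)
    print(f"x = {x:>3}: exponential tail = {exponential_tail(x):.3e}, "
          f"Pareto tail = {pareto_tail(x):.3e}, ratio = {ratio:.3e}")
```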

Spatial example

Spatial roughness…

Compare median, mean, mode
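A quick illustration with a small, made-up right-skewed sample: one extreme value pulls the mean well away from the median and mode.

```python
import statistics

# Hypothetical right-skewed sample (arbitrary units); 40 is the extreme value.
data = [1, 2, 2, 2, 3, 3, 4, 5, 8, 40]

mean = statistics.mean(data)      # sensitive to the extreme value
median = statistics.median(data)  # middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

For a symmetric, single-peaked distribution the three would be close; a gap between them is a simple hint of skew or outliers.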

Huh, we have Big Data? • Why would we care about samples? • Let’s take it all? • It gets messy == quality, gaps, … • Very often goes beyond known patterns, i.e. out of the range of previous values • Anyone remember the financial crisis in 2008? • Data becomes more subjective than objective and especially human v. cyber.. • To start: let’s take a look at EPI data that you started to explore last week (cyber)

Munging • Missing values, null values, etc. • E.g. in EPI_data – they use “--” • Most data applications provide built-ins for handling these – in R “NA” is used and functions such as is.na(var), etc. provide powerful filtering options (we’ll cover these on Friday) • Of course, different variables often are missing “different” values • In R – higher-order functions such as: Reduce, Filter, Map, Find, Position and Negate will become your enemies and then friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
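The same idea sketched in Python rather than R, using the built-in map and filter as rough analogues of R's Map and Filter. The column values are made up; "--" is the marker mentioned above, and treating "" and "NA" as missing too is an assumption.

```python
# "--" is the missing-value marker named above; "" and "NA" are assumed extras.
MISSING_MARKERS = {"--", "", "NA"}

def parse_value(raw):
    """Convert a raw text cell to float, or None if it marks a missing value."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)

# Hypothetical raw column as it might be read from a spreadsheet export.
raw_column = ["58.1", "--", "44.6", "", "71.4", "--"]

parsed = list(map(parse_value, raw_column))                 # Map: apply to every cell
observed = list(filter(lambda v: v is not None, parsed))    # Filter: keep non-missing
n_missing = sum(1 for v in parsed if v is None)

print(parsed)
print(observed)
print(f"{n_missing} of {len(parsed)} values missing")
```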

Patterns and Relationships • Stepping from elementary/ distribution analysis to algorithmic-based analysis • I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models • Relations – associations between/among populations • Outcome: model and an evaluation of its fitness for purpose

More munging • Bad values, outliers, corrupted entries, thresholds … • Noise reduction – low-pass filtering, binning • A few examples today but the labs will bring this into view soon • REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations…
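A sketch of one possible munging pass that also keeps a record of what was done. The data, the outlier rule (5 times the median absolute deviation) and the window size are all illustrative assumptions, not a recommended recipe.

```python
import statistics

# Hypothetical series; 250.0 is a planted bad value.
raw = [10.1, 10.3, 9.8, 10.0, 250.0, 10.2, 9.9, 10.4, 10.1, 9.7]
original = list(raw)   # keep an untouched pre-operation copy
munge_log = []         # record of what was done, and why

# Step 1: drop outliers using a robust rule based on the median absolute deviation.
med = statistics.median(raw)
mad = statistics.median([abs(x - med) for x in raw])
threshold = 5 * mad
cleaned = [x for x in raw if abs(x - med) <= threshold]
munge_log.append(
    f"removed {len(raw) - len(cleaned)} value(s) further than 5*MAD from the median"
)

# Step 2: simple low-pass filter, a moving average over a 3-point window.
def moving_average(values, window=3):
    """Return the mean of each full window of consecutive values."""
    return [statistics.mean(values[i:i + window])
            for i in range(len(values) - window + 1)]

smoothed = moving_average(cleaned, window=3)
munge_log.append("smoothed with a 3-point moving average to reduce noise")

print(cleaned)
print(smoothed)
for entry in munge_log:
    print("LOG:", entry)
```

In practice the log and the pre- and post-operation copies would be written to files alongside the data.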

Populations within populations • In the EPI example: • Geographic regions (GEO_subregion) • EPI_regions • Eco-regions (EDC v. LEDC – know what that is?) • Primary industry(ies) • Climate region • What would you do to start exploring?
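One possible first step is to split the records by a grouping variable and summarise each sub-population. The records below are made up; only the column name GEO_subregion comes from the EPI example.

```python
import statistics
from collections import defaultdict

# Hypothetical records; country names, regions and EPI values are invented.
records = [
    {"country": "Country A", "GEO_subregion": "Region X", "EPI": 70.0},
    {"country": "Country B", "GEO_subregion": "Region X", "EPI": 60.0},
    {"country": "Country C", "GEO_subregion": "Region Y", "EPI": 50.0},
    {"country": "Country D", "GEO_subregion": "Region Y", "EPI": 40.0},
    {"country": "Country E", "GEO_subregion": "Region Y", "EPI": 45.0},
]

# Group EPI values by sub-region: each group is a population within the population.
groups = defaultdict(list)
for rec in records:
    groups[rec["GEO_subregion"]].append(rec["EPI"])

# Descriptive summary per sub-population.
summary = {
    region: {"n": len(vals), "mean": statistics.mean(vals), "median": statistics.median(vals)}
    for region, vals in groups.items()
}

for region, stats in summary.items():
    print(region, stats)
```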

Or, a twist – n=1 but many attributes? The item of interest in relation to its attributes

Summary: explore • Going from preliminary to initial analysis… • Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution) • Fitting that distribution • Or NOT • A hybrid or • Non-parametric (statistics) approaches are needed – more on this to come
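A sketch contrasting the two routes on the same synthetic sample: a parametric estimate (assert a normal distribution, fit it, read a percentile from the fitted model) and a non-parametric one (read the percentile straight from the data). The sample and the choice of the 90th percentile are illustrative assumptions.

```python
import random
from statistics import NormalDist, quantiles

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical sample; generated as normal so that the parametric route is appropriate.
sample = [random.gauss(20.0, 4.0) for _ in range(500)]

# Parametric: assume a normal distribution and fit its parameters.
fitted = NormalDist.from_samples(sample)
p90_parametric = fitted.inv_cdf(0.90)

# Non-parametric: take the 90th percentile directly from the data.
p90_empirical = quantiles(sample, n=10)[-1]

print(f"fitted mean = {fitted.mean:.2f}, fitted sd = {fitted.stdev:.2f}")
print(f"90th percentile: parametric = {p90_parametric:.2f}, empirical = {p90_empirical:.2f}")
```

When the asserted distribution is wrong (for example, heavy-tailed data fitted as normal), the two estimates can diverge badly, which is one signal that a non-parametric or hybrid approach is needed.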

Models • Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit) • Two key topics: • N=all and the open world assumption • Model of the thing of interest versus model of the data (data model; structural form) • “All models are wrong but some are useful” (generally attributed to the statistician George Box)

Conceptual, logical and physical models • However, our models will be mathematical, statistical, or a combination. • The concept of the model comes from the hypothesis. • The implementation of the physical model comes from the data ;-) • Applied to a database:

Art or science? • The form of the model, incorporating the hypothesis, determines a “form” • Thus, as much art as science because it depends both on your world view and what the data is telling you (or not) • We will, however, be giving the models nice mathematical properties; orthogonal/ orthonormal basis functions, etc…

Goodness of fit • And, we cannot take the models at face value, we must assess how fit they may be: • Chi-Square • One-sided and two-sided Kolmogorov-Smirnov tests • Lilliefors tests • Ansari-Bradley tests • Jarque-Bera tests • Just a preview…
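As a preview of one of these tests, here is a hand-rolled sketch of the one-sample, two-sided Kolmogorov-Smirnov statistic, compared against the common large-sample 5% critical value of about 1.36/sqrt(n). The samples are synthetic and the function name is hypothetical; the statistics packages listed earlier provide tested implementations with proper p-values.

```python
import math
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

random.seed(4)  # fixed seed so the sketch is repeatable
n = 300
normal_sample = [random.gauss(0.0, 1.0) for _ in range(n)]
skewed_sample = [random.expovariate(1.0) for _ in range(n)]

# Reference model with fully specified parameters (not estimated from the sample).
reference = NormalDist(0.0, 1.0)

d_normal = ks_statistic(normal_sample, reference.cdf)
d_skewed = ks_statistic(skewed_sample, reference.cdf)
critical = 1.36 / math.sqrt(n)  # approximate 5% critical value for large n

print(f"critical value ~ {critical:.3f}")
print(f"normal sample: D = {d_normal:.3f}")
print(f"skewed sample: D = {d_skewed:.3f}")
```

When the reference parameters are estimated from the same sample, the plain KS critical values no longer apply; that is the situation the Lilliefors test addresses.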

Summary • Cyber and Human data; quality, uncertainty and bias • Distributions – the common and not-so-common ones and how cyber and human data can have distinct distributions • How simple statistical distributions can mislead us • Populations and samples and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail) • Big Data and some consequences • Munging toward exploratory analysis • Toward models!

Tentative assignments • Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual); • Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual); • Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual); • Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual); • Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual); • Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

How are the software installs going? • R/Scipy (et al)/Matlab • Data infrastructure • Exercises? • More on Friday…

Assignment 1 – how is it going? • Choose a DA case study from a) readings, or b) your choice (must be approved by me) • Read it and provide a short written review/ critique (business case, area of application, approach/ methods, tools used, results, actions, benefits). • Be prepared to discuss it in class this Friday 31st. Hand in the written report by 5pm that day.
