Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices. Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, January 28, 2014, SAGE 3101. Admin info (keep/ print this slide). Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: SAGE 3101
Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, January 28, 2014, SAGE 3101
Admin info (keep/ print this slide) • Class: ITWS-4963/ITWS 6965 • Hours: 12:00pm-1:50pm Tuesday/ Friday • Location: SAGE 3101 • Instructor: Peter Fox • Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg) • Contact hours: Monday** 3:00-4:00pm (or by email appt) • Contact location: Winslow 2120 (sometimes Lally 207A announced by email) • TA: Lakshmi Chenicheri chenil@rpi.edu • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014 • Schedule, lectures, syllabus, reading, assignments, etc.
Contents • Back to the data sources • Cyber • Human • “Munging” • Beginning with hypothesis -> synthesis • Distributions… • Scoping out analysis and model choices
Descriptive / Inferential • Descriptive statistics: numerical summaries of samples • i.e., what was observed, distributions • The ‘sample’ may be exhaustive, i.e., identical to the population • Inferential statistics: from samples to populations • i.e., what could have been or will be observed in a larger population • Descriptive (report) to Inferential (model suggestion) is a key process in analytics • So often NOT a linear process.. • Sample bias – choice and awareness Adapted from Marshall Ma (and other sources)
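A minimal Python sketch of the descriptive-to-inferential step, assuming a synthetic population (the numbers are made up, not course data) and a normal-approximation interval with z = 1.96:

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic scores (not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]
sample = random.sample(population, 100)

# Descriptive: summarize what was actually observed in the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: approximate 95% confidence interval for the population mean,
# using the normal approximation (z = 1.96) and the standard error of the mean.
half_width = 1.96 * sample_sd / math.sqrt(len(sample))
ci = (sample_mean - half_width, sample_mean + half_width)

print(f"sample mean = {sample_mean:.2f}, sd = {sample_sd:.2f}")
print(f"approx. 95% CI for population mean = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The first two numbers only describe the sample; the interval is a statement about the population and depends on the sample being unbiased.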
Populations and samples • A population is defined • We must be able to say, for every object, if it is in the population or not • We must be able, in principle, to find every individual of the population A geographic example of a population is all pixels in a multi-spectral satellite image • A sample is a subset of a population • We must be able to say, for every object in the population, if it is in the sample or not • Sampling is the process of selecting a sample from a population • E.g. 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
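The two membership requirements above can be sketched in Python. The 100 x 100 image size and the helper names in_population and in_sample are assumptions for illustration only:

```python
import random

random.seed(1)

# Hypothetical population: every pixel (row, col) of a 100 x 100 image.
ROWS, COLS = 100, 100
population = [(r, c) for r in range(ROWS) for c in range(COLS)]

# Sampling: simple random sample of 500 pixels, without replacement.
sample = set(random.sample(population, 500))

def in_population(obj):
    """For every object we can say whether it belongs to the population."""
    r, c = obj
    return 0 <= r < ROWS and 0 <= c < COLS

def in_sample(obj):
    """For every object in the population we can say whether it was sampled."""
    return obj in sample

print(in_population((5, 7)), in_population((100, 0)))
print(len(sample), "of", len(population), "pixels sampled")
```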
Election prediction • Exit polls versus election results • Human versus cyber • How is the “population” defined here? • What is the sample, how chosen? • What is described and how is that used to predict? • Are results categorized? (where from, M/F, age) • What is the uncertainty? • It is reflected in the “sample distribution” • And controlled/constrained by “sampling theory”
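One way to make the uncertainty question concrete is the usual normal-approximation interval for a polled proportion. This is a sketch under assumptions: the poll counts (520 of 1000) are hypothetical, and the formula assumes a simple random sample, which real exit polls rarely are:

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Estimated proportion and an approximate 95% interval (normal approximation)."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - z * standard_error, p_hat + z * standard_error

# Hypothetical exit poll: 520 of 1000 respondents chose candidate A.
p_hat, low, high = proportion_interval(520, 1000)
print(f"estimate = {p_hat:.3f}, interval = ({low:.3f}, {high:.3f})")
```

With these made-up counts the interval spans 0.5, so the poll alone would not settle the outcome.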
Bias difference: between cyber and human data • Election results and exit polls • What are examples of bias in election results? • In exit polls?
Hypothesis • What are you exploring? • Regular data analytics features ~ well defined hypotheses • Big Data messes that up • E.g. Stock market performance / trends versus unusual events (crash/ boom): • Populations versus samples – which is which? • Why? • E.g. Election results are predictable from exit polls
Distributions • http://www.quantitativeskills.com/sisa/rojo/alldist.zip • Shape • Character • Parameter(s)
Plotting these distributions • Histograms and binning • Getting used to log scales • Going beyond 2-D • More of this on Friday (in more detail)
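A small sketch of why binning and log scales matter, assuming made-up values and a hand-written histogram helper (plotting libraries provide their own equivalents):

```python
def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

# Hypothetical skewed data spanning three orders of magnitude.
values = [1, 2, 3, 5, 8, 13, 40, 90, 250, 900]

linear_edges = [0, 250, 500, 750, 1000]
log_edges = [10 ** k for k in range(4)]  # 1, 10, 100, 1000

print("linear bins:", histogram(values, linear_edges))
print("log bins:   ", histogram(values, log_edges))
```

Equal-width bins pile nearly everything into the first bin, while bins that are equal-width on a log scale spread the same data out.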
In applications • Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html • R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html • Matlab: http://www.mathworks.com/help/stats/_brn2irf.html • Excel: HAH!
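As a dependency-free stand-in for the libraries listed above, Python's standard library includes statistics.NormalDist, which exposes the same basic operations (density, cumulative probability, quantile) for the normal distribution only. The parameter values here are chosen purely for illustration:

```python
from statistics import NormalDist

# A distribution object is defined by its parameters (here: mean and std. dev.).
standard = NormalDist(mu=0.0, sigma=1.0)

density_at_zero = standard.pdf(0.0)   # height of the density curve at x = 0
prob_below = standard.cdf(1.96)       # P(X <= 1.96)
quantile = standard.inv_cdf(0.975)    # x such that P(X <= x) = 0.975

print(f"pdf(0) = {density_at_zero:.4f}")
print(f"cdf(1.96) = {prob_below:.4f}")
print(f"inv_cdf(0.975) = {quantile:.4f}")
```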
Heavy-tail distributions • are probability distributions whose tails are not exponentially bounded • Common – long-tail… human v. cyber… • [Figure labels: Few that dominate; More that add up; Equal areas] • http://en.wikipedia.org/wiki/Heavy-tailed_distribution
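"Not exponentially bounded" can be illustrated by comparing tail probabilities P(X > x). This sketch assumes a Pareto distribution (alpha = 2, minimum 1) as the heavy-tailed example and a unit-rate exponential as the light-tailed one; both choices are illustrative:

```python
import math

def exponential_sf(x, rate=1.0):
    """P(X > x) for an exponential distribution (light tail)."""
    return math.exp(-rate * x)

def pareto_sf(x, alpha=2.0, x_min=1.0):
    """P(X > x) for a Pareto distribution (heavy tail)."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

points = (2, 10, 50)
ratios = [pareto_sf(x) / exponential_sf(x) for x in points]

for x, ratio in zip(points, ratios):
    print(f"x = {x:>2}: Pareto tail / exponential tail = {ratio:.3g}")
```

The ratio keeps growing with x: far out in the tail, the heavy-tailed distribution retains vastly more probability than any exponential bound allows.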
Huh, we have Big Data? • Why would we care about samples? • Let’s take it all? • It gets messy == quality, gaps, … • Very often goes beyond known patterns, i.e. out of the range of previous values • Anyone remember the financial crisis in 2008? • Data becomes more subjective than objective and especially human v. cyber.. • To start: let’s take a look at EPI data that you started to explore last week (cyber)
Munging • Missing values, null values, etc. • E.g. in EPI_data – they use “--” • Most data applications provide built ins for these higher-order functions – in R “NA” is used and functions such as is.na(var), etc. provide powerful filtering options (we’ll cover these on Friday) • Of course, different variables often are missing “different” values • In R – higher-order functions such as: Reduce, Filter, Map, Find, Position and Negate will become your enemies and then friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
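The slide describes this in R; a rough Python analogue of the same idea uses map, filter and functools.reduce. The raw values are made up, and treating an empty string as missing (in addition to the “--” marker) is an assumption of this sketch:

```python
from functools import reduce

MISSING = "--"  # missing-value marker used in EPI_data, per the slide

def parse_value(text):
    """Convert a raw cell to float, or None when it is missing."""
    text = text.strip()
    # Assumption: empty cells are also treated as missing.
    if text == MISSING or text == "":
        return None
    return float(text)

raw = ["67.4", "--", "55.1", "", "80.2"]  # hypothetical column of raw cells

values = list(map(parse_value, raw))                      # Map: parse every cell
present = list(filter(lambda v: v is not None, values))   # Filter: drop missing
total = reduce(lambda a, b: a + b, present, 0.0)          # Reduce: fold to a sum
mean = total / len(present)

print(values)
print(f"{len(present)} present values, mean = {mean:.2f}")
```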
Patterns and Relationships • Stepping from elementary/ distribution analysis to algorithmic-based analysis • I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models • Relations – associations between/among populations • Outcome: model and an evaluation of its fitness for purpose
More munging • Bad values, outliers, corrupted entries, thresholds … • Noise reduction – low-pass filtering, binning • A few examples today but the labs will bring this into view soon • REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations…
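A hedged sketch of these steps in Python, assuming a made-up series, the common 1.5 x IQR rule for flagging outliers, and a simple moving average as the low-pass filter. The munge_log structure is hypothetical; the point is that the raw copy is kept and each operation is recorded:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def moving_average(values, window=3):
    """Simple low-pass filter: mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

raw = [10, 11, 9, 10, 12, 95, 11, 10]   # hypothetical readings; raw copy is kept
flagged = iqr_outliers(raw)
cleaned = [v for v in raw if v not in flagged]
smoothed = moving_average(cleaned)

# Record what was done and why, alongside the pre- and post-operation data.
munge_log = [
    {"step": "drop outliers", "rule": "1.5 x IQR", "removed": flagged},
    {"step": "low-pass filter", "rule": "moving average, window=3"},
]

print("flagged:", flagged)
print("smoothed:", [round(v, 2) for v in smoothed])
```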
Populations within populations • In the EPI example: • Geographic regions (GEO_subregion) • EPI_regions • Eco-regions (EDC v. LEDC – know what that is?) • Primary industry(ies) • Climate region • What would you do to start exploring?
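One possible starting point is to split the data by a grouping attribute and summarize each sub-population separately. The rows below are invented (region names and scores are not EPI values), and the field names are assumptions:

```python
import statistics
from collections import defaultdict

# Hypothetical rows: not real EPI values.
rows = [
    {"region": "North", "score": 60.0},
    {"region": "North", "score": 70.0},
    {"region": "South", "score": 40.0},
    {"region": "South", "score": 50.0},
    {"region": "South", "score": 45.0},
]

# Group scores by region: each group is a population within the population.
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row["score"])

# Per-group count and mean.
summary = {region: (len(scores), statistics.mean(scores))
           for region, scores in groups.items()}

for region, (count, mean) in summary.items():
    print(f"{region}: n = {count}, mean = {mean:.1f}")
```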
Or, a twist – n=1 but many attributes? The item of interest in relation to its attributes
Summary: explore • Going from preliminary to initial analysis… • Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution) • Fitting that distribution • Or NOT • A hybrid or • Non-parametric (statistics) approaches are needed – more on this to come
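A minimal sketch of the parametric route, assuming invented data and a normal distribution as the candidate. NormalDist.from_samples estimates the two parameters from the data, and the median is shown as a simple non-parametric summary for comparison:

```python
import statistics
from statistics import NormalDist

data = [2.1, 2.5, 1.9, 2.3, 2.2, 2.0, 2.4]  # hypothetical measurements

# Parametric: assert a normal distribution and estimate its parameters.
fitted = NormalDist.from_samples(data)
print(f"fitted normal: mean = {fitted.mean:.3f}, sd = {fitted.stdev:.3f}")

# The fitted model makes statements beyond the observed values, e.g. P(X > 2.6).
prob_above = 1 - fitted.cdf(2.6)
print(f"model-based P(X > 2.6) = {prob_above:.3f}")

# Non-parametric: summarize without assuming any distribution.
median = statistics.median(data)
print(f"median = {median}")
```

Whether the normal assumption is reasonable is a separate question, which is what goodness-of-fit tests address.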
Models • Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit) • Two key topics: • N=all and the open world assumption • Model of the thing of interest versus model of the data (data model; structural form) • “All models are wrong but some are useful”(generally attributed to the statistician George Box)
Conceptual, logical and physical models However our models will be mathematical, statistical, or a combination. The concept of the model comes from the hypothesis The implementation of the physical model comes from the data ;-) Applied to a database:
Art or science? • The form of the model, incorporating the hypothesis, determines a “form” • Thus, as much art as science because it depends both on your world view and what the data is telling you (or not) • We will, however, be giving the models nice mathematical properties; orthogonal/ orthonormal basis functions, etc…
Goodness of fit • And, we cannot take the models at face value, we must assess how fit they may be: • Chi-Square • One-sided and two-sided Kolmogorov-Smirnov tests • Lilliefors tests • Ansari-Bradley tests • Jarque-Bera tests • Just a preview…
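As a preview of one of these, the one-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDF of the sample and the CDF of the candidate distribution. This sketch computes only the statistic (no p-value); the ks_statistic helper and the test samples are illustrative assumptions:

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Two-sided one-sample KS statistic: max |empirical CDF - model CDF|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

model = NormalDist(0.0, 1.0)
n = 10

# Hypothetical sample placed at evenly spaced quantiles of the model: good fit.
well_fitting = [model.inv_cdf((i + 0.5) / n) for i in range(n)]
# The same sample shifted by 2: poor fit to the same model.
shifted = [x + 2.0 for x in well_fitting]

d_good = ks_statistic(well_fitting, model.cdf)
d_bad = ks_statistic(shifted, model.cdf)
print(f"D (good fit) = {d_good:.3f}, D (shifted) = {d_bad:.3f}")
```

A smaller D means the sample tracks the model more closely; turning D into a formal test requires its sampling distribution, which statistical packages provide.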
Summary • Cyber and Human data; quality, uncertainty and bias • Distributions – the common and not-so-common ones and how cyber and human data can have distinct distributions • How simple statistical distributions can mislead us • Populations and samples and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail) • Big Data and some consequences • Munging toward exploratory analysis • Toward models!
Tentative assignments • Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual); • Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual); • Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual); • Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual); • Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual); • Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).
How are the software installs going? • R/Scipy (et al)/Matlab • Data infrastructure • Exercises? • More on Friday…
Assignment 1 – how is it going? • Choose a DA case study from a) readings, or b) your choice (must be approved by me) • Read it and provide a short written review/ critique (business case, area of application, approach/ methods, tools used, results, actions, benefits). • Be prepared to discuss it in class this Friday 31st. Hand in the written report by 5pm that day.