Lecture 2 – Modern Statistical Modeling, an Overview
Rice ELEC 697, Farinaz Koushanfar, Fall 2006
Summary • A little bit of history • The culture of statistical modeling • Classic • Modern • Exploratory data analysis • Exploratory vs. confirmatory • Examples
A little bit of history • Statistics is the science of learning from data to understand its meaning, structure, relationships, etc. – hundreds of years old • For a brief history of pre-20th century statistics, see: http://www.bized.ac.uk/timeweb/reference/statisticians.htm • Statistics started to separate from mathematics as an independent discipline ~70 years ago • Like many other disciplines in science and engineering, statistics has undergone a major revolution in the past 30 years • Earlier, most data was collected manually and data sets were small. Now we have databases holding terabytes of data that we would like to capture and model
The scientists behind what I will talk about today… • An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem – John Tukey • Wrote his PhD thesis on convergence and uniformity in topology at Princeton (1939) • Recognized the importance of statistics during World War II • Mathematics is just a tool to facilitate addressing sound problems • Many contributions, including the fast Fourier transform, the jackknife, and exploratory data analysis John Wilder Tukey 1915-2000 John W. Tukey, "We Need Both Exploratory and Confirmatory," The American Statistician, Vol. 34, No. 1 (Feb. 1980), pp. 23-25
The scientists behind what I will talk about today… • PhD in math in 1954 at Berkeley • Became a professor of probability in the UCLA math department • Left in 1967 – realized that abstract mathematics has very little to do with real life • Wrote a book, worked as an independent consultant for 13 years • Finally could solve interesting and important real-world problems! • Got a Berkeley position in 1980, this time in the right department for him Leo Breiman 1928-2005 Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, Vol. 16, No. 3 (Aug. 2001), pp. 199-215
The culture of statistical modeling • Statistics really starts with data • Two main goals • Prediction (estimation) • Information (detection) • Two different cultures • Stochastic models, e.g., response var = f(predictor var, random noise, parameters); model selection, prediction, evaluation (classic) • Algorithmic models, where the relating function is an algorithm that operates on the input x to predict the response y (modern); a small sketch follows [Diagram: x → Nature → y, nature as a black box mapping inputs x to responses y]
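To make the two cultures concrete, here is a minimal sketch (my illustration, not from the lecture; it assumes numpy and scikit-learn, with synthetic data standing in for "nature"). The same observations are handled first by a stochastic data model (a linear regression with assumed noise) and then by an algorithmic model (a decision tree treated purely as an x-to-y mapping):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(200)  # "nature" is unknown to the analyst

# Data-modeling culture: posit y = b0 + b1*x + noise, then estimate b0 and b1
data_model = LinearRegression().fit(x, y)

# Algorithmic-modeling culture: treat nature as a black box and search
# for an algorithm that maps x to y with good predictive accuracy
algo_model = DecisionTreeRegressor(max_depth=5).fit(x, y)

print("linear data model, R^2:", data_model.score(x, y))
print("algorithmic model, R^2:", algo_model.score(x, y))

On this sinusoidal "nature" the algorithmic model predicts far better, previewing Breiman's argument in the slides that follow.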
Breiman argues that the focus on classical data models and the neglect of modern methods has: • Led to irrelevant theory and questionable scientific conclusions • Kept statisticians from using more suitable models • Prevented classical statisticians from working on exciting new problems • In this course, we will cover mostly the classics and a few modern methods
Back to the history • Upon his return to academia, Breiman realized that all articles (at the time) began and ended with data models • Data models have had success in analyzing data and extracting information about the mechanisms producing the data • Misuse of data models has led to many questionable conclusions about the underlying system • Algorithmic models have mostly been developed in the machine learning community • Modern learning has led to changes in perception!
The model becomes the truth! • Invent or use a reasonably good parametric class of models for a complex mechanism • Estimate the parameters and draw conclusions: • The conclusions are about the model's mechanism, not about nature's mechanism • If the model is a poor approximation of nature, the conclusions are wrong! • Example: a linear model, y = b0 + Σm bm xm + ε • Assume that the data are iid, following the above model • The coefficients {bm} are to be estimated; ε ~ N(0, σ²) • Tests of hypothesis, confidence intervals, distribution of the residual sum of squares, etc. • Thousands of articles have been published on related proofs • Conclusions are drawn as if the model were valid (a sketch of this pitfall follows)
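A hypothetical sketch of this pitfall (my illustration, assuming numpy and statsmodels; the exponential "nature" is an arbitrary choice): data from a nonlinear mechanism is forced into the linear model above, and standard inference then yields confident statements about the model's coefficients rather than about nature:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=300)
y = np.exp(x) + rng.standard_normal(300)  # nature is exponential, not linear

X = sm.add_constant(x)                    # design matrix [1, x]
fit = sm.OLS(y, X).fit()                  # assumes y = b0 + b1*x + N(0, sigma^2)

# p-values, confidence intervals, and the residual sum of squares are
# valid statements about the assumed linear mechanism, not about nature's
print(fit.params)      # estimated b0, b1
print(fit.conf_int())  # 95% confidence intervals under the assumed model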
More problems with classical data models • Multiplicity of data models • Answering the question of which model is best • Each model gives a different picture of reality and leads to different conclusions • Predictive accuracy • Accuracy is a function of the number of parameters used, so it is not a good measure alone • Other limitations of data models (next slide)
Limitations of data models • Multivariate analysis is just not working • Nobody really believes in the multivariate Normal, but everybody uses it • If all a man has is a hammer, then every problem looks like a nail… As data become more complex, the appeal of the simple, model-based approach diminishes • Approaching every problem by looking for a data model keeps statisticians from dealing with more interesting and realistic problems
Algorithmic models • Have been around for some time; pioneers among statisticians include Olshen, Friedman, Wahba, Zhang, and Singer • Many new problems have been attacked, including speech, image, and handwriting recognition, nonlinear time series, and financial market prediction • Shift from data models to the properties of the algorithms • Characterizing convergence and complexity • Example: Vapnik constructed informative bounds on the generalization error of classification algorithms that depend on the capacity of the algorithm (support vector machines; a rough sketch follows)
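As a rough illustration of the algorithmic view (my sketch, assuming scikit-learn; the data and parameter values are arbitrary): an SVM makes no stochastic assumption about the data, its capacity is controlled through the kernel and the regularization parameter C, and generalization is assessed empirically:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # nonlinear boundary

# Capacity is tuned through C (and the kernel choice), not through a
# distributional model; generalization is estimated by cross-validation
for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5)
    print("C =", C, "mean CV accuracy =", round(scores.mean(), 3))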
Examples of recent advances • Multiplicity of good models (Rashomon) • Bagging is a solution (sketched below) • The conflict between simplicity and accuracy (Occam) • Occam dilemma: accuracy generally requires more complex predictors; simple and interpretable functions do not make the most accurate predictors • Dimensionality – curse or blessing (Bellman) • How to extract and put together many small pieces of information
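A minimal sketch of bagging as a response to the Rashomon effect (illustrative only; assumes scikit-learn and a synthetic dataset): many roughly equally good trees, each fit on a bootstrap resample, are averaged into a single more stable predictor:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=3)

single_tree = DecisionTreeClassifier(random_state=3)
bagged = BaggingClassifier(single_tree, n_estimators=100, random_state=3)

# Each bootstrap resample yields a different "good" tree (the Rashomon
# effect); averaging their votes reduces variance and improves accuracy
print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged     :", cross_val_score(bagged, X, y, cv=5).mean())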
Breiman's concluding remarks • Nowhere is it written in stone what kind of models should be used • Breiman is not against data models, but he thinks the emphasis has to be on the problem, not on the model • Find ways to manage complex environments • E.g., microarray data, Internet traffic, ad-hoc network complexity, ULSI variability, etc. • The root of science is checking theory against reality • We need this philosophy to address real-world problems
Exploratory data analysis (EDA) • Analysis can be done with various techniques • Mathematical • Logical • Tabular • Graphical • …
EDA • EDA mostly uses graphical techniques, but • It is really a different philosophy for approaching a problem • It differs from classical methods, which are also referred to as confirmatory data analysis (CDA)
EDA vs. CDA
CDA: (1) a general problem to explore → (2) collect some data → (3) make a hypothesis about the model → (4) carry out an analysis of the data based on the model → (5) draw conclusions based on the model's features
EDA: (1) a general problem to explore → (2) collect some data → (3) carry out an analysis of the data → (4) infer a model that is appropriate → (5) draw conclusions based on the data's features
EDA vs. CDA (Cont'd) • Rigor • CDA is rigorous, formal, and objective • EDA is suggestive, subject to the analyst's view • Data treatment • In CDA, a few numbers summarize the data's properties • In EDA, all of the data is in focus • Assumptions • In CDA, one discovers statistically significant deviations from the assumed model, assuming it was correct • In EDA, the assumptions are few; analysis of the data has priority
Why Exploratory Data Analysis? • EDA is oriented toward the future, rather than the past • Use the data to understand, rather than to summarize • Really important in research • A good feel for the data is invaluable • Gain insight into the process behind the data • Understand what is NOT in the data • Such insight can (almost) only be obtained with graphical techniques • Graphs give information that no number can replace • Rely on the human ability to recognize patterns and to compare
Typical Assumptions for Measurement Process • The data from a process consists of: • Random drawings (one data point should not influence another) • From a fixed distribution (and thus generalizable) • The distribution has a fixed location (the expectation is fixed) • And a fixed variation (the way the data differs from the expectation is fixed) • We measure the mean and variance to assess the last two assumptions (a sketch follows below)
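One simple numerical check of the last two assumptions (a minimal sketch, assuming numpy; the block count is an arbitrary choice): split the sequence into consecutive blocks and compare the block means and variances, which should be roughly equal if location and variation are fixed:

import numpy as np

def block_stats(y, n_blocks=4):
    # Mean and variance of consecutive blocks; roughly equal values across
    # blocks are consistent with fixed location and fixed variation
    blocks = np.array_split(np.asarray(y), n_blocks)
    return [(round(b.mean(), 2), round(b.var(), 2)) for b in blocks]

rng = np.random.default_rng(4)
stable = rng.standard_normal(400)                             # assumptions hold
drifting = rng.standard_normal(400) + np.linspace(0, 3, 400)  # drifting mean

print("stable  :", block_stats(stable))
print("drifting:", block_stats(drifting))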
EDA Techniques • Plot many aspects of the data using a variety of techniques, including scatter plots, bar plots, histograms, pie charts, and factor plots • E.g., the run sequence plot for the mean and variance assumptions • All values yi are plotted against their index i (yi on the y-axis, i on the x-axis) • Graphically check the fixed location • Graphically check the fixed variation
Example – EDA • Run sequence plot (compare the two) [the original pair of plots is not reproduced in this transcript; a sketch generating a comparable pair follows]
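Since the original figures are not reproduced, here is a sketch that generates a comparable pair (my illustration, assuming numpy and matplotlib): a run sequence plot is simply yi against the index i, and the drifting series visibly violates the fixed-location and fixed-variation assumptions while the stable one does not:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 200
stable = rng.standard_normal(n)  # fixed location and variation
drifting = rng.standard_normal(n) * np.linspace(1, 3, n) + np.linspace(0, 4, n)

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
titles = ("assumptions hold", "location and variation drift")
for ax, y, title in zip(axes, (stable, drifting), titles):
    ax.plot(np.arange(n), y, marker=".")  # y_i on the y-axis vs. index i
    ax.set_title(title)
    ax.set_xlabel("index i")
axes[0].set_ylabel("y_i")
plt.tight_layout()
plt.show()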