Statistical Hypothesis Testing (8th Session in “Gentle Introduction to Modeling Uncertainty”)

Lonnie Chrisman, Ph.D., Lumina Decision Systems. Analytica User Group, 15 July 2010

Scope of Today’s Webinar Included: • Conceptual underpinnings of classical hypothesis testing. • Interpretation of statistical significance (p-values). • General methodology for applying it in any scenario. • Intended to promote conceptual understanding. • Building on Monte Carlo tools. Not included: • Standard canned hypothesis tests (like t-tests, etc)

Outline • Motivating example • Statistical significance • The Statistic • Methodology • Modeling the Null hypothesis • Computing the pValue • Interpretation of results • Drawbacks of methodology • Additional exercise

Does Stock Market Volatility Vary with Day of Week? • Randomly selected 100 trading days (from 2000-2010). • Computed the day change, (close - open)/open, for the S&P 500 index on each of those days. Side note: Annualized volatility := SDeviation * sqrt(T), where T = # trading days/yr = 250. Total volatility: 18.1% • Alice: “This shows that the market volatility does depend on the day of the week.” • Bob: “No, the variation is just due to random sampling variation.”
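The side-note formula can be sketched in a few lines of Python. This is a minimal illustration, not the Analytica model from the talk: the function names and the sample prices are hypothetical, and it assumes SDeviation means the sample standard deviation.

```python
import math
import statistics

TRADING_DAYS_PER_YEAR = 250  # T from the side note


def day_change(open_price, close_price):
    """Fractional change over one trading day: (close - open) / open."""
    return (close_price - open_price) / open_price


def annualized_volatility(day_changes, trading_days=TRADING_DAYS_PER_YEAR):
    """Sample standard deviation of the daily changes, scaled by sqrt(T)."""
    return statistics.stdev(day_changes) * math.sqrt(trading_days)


# Hypothetical (open, close) pairs, for illustration only (not S&P 500 data).
prices = [(100.0, 101.0), (101.0, 100.0), (100.0, 100.5), (100.5, 99.0), (99.0, 100.0)]
changes = [day_change(o, c) for o, c in prices]
print(f"Annualized volatility: {annualized_volatility(changes):.1%}")
```

Applying annualized_volatility separately to the changes that fall on each weekday gives the per-day volatilities that Alice and Bob are arguing about.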

Download Model with S&P Data • Please download: “Hypothesis Test S&P Volatility.ana” (the download link is at the bottom of the talk abstract on the Analytica Wiki). • You’ll use this data for exercises…

Statistical Significance • Alice: “This shows that the market volatility depends on the day of the week.” • Alice’s mission: To show that the observed variation would be unlikely if it were due only to random sampling variation. • Null Hypothesis: The “true” underlying volatility is the same for every day of the week. • Level of significance: The probability that at least this much variation in volatility would be observed if the Null Hypothesis were true (termed the “p-value”).

Statistical Significance #2 • After her statistical analysis, Alice might say: “This shows at a significance level p=3% that market volatility varies with the day of the week.” • By convention, p ≤ 5% is usually considered to be “statistically significant”; p > 5% is said to be “not statistically significant”. • What can you conclude if the p-value turns out to be 20%?

The “Statistic” (Total volatility: 18.1%) • We need a scalar metric to summarize the degree of conflict with the Null-hypothesis (H0). • Smaller value: more consistent with H0. • Larger value: greater disagreement with H0. • Examples: • Max(vol,day) - Min(vol,day) • SDeviation(vol,day) • F = Variance(vol,day) / Total_volatility^2 • Exercise: Pick a statistic and compute its value for the S&P 500 dataset in your Analytica model.
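The three example statistics can be sketched in Python as follows. The function names and the per-day volatilities are hypothetical stand-ins (they are not the values in the S&P model), and sample variance and standard deviation are assumed.

```python
import statistics


def range_statistic(vol_by_day):
    """Max(vol,day) - Min(vol,day): spread between the extreme weekdays."""
    return max(vol_by_day) - min(vol_by_day)


def sd_statistic(vol_by_day):
    """SDeviation(vol,day): sample standard deviation across weekdays."""
    return statistics.stdev(vol_by_day)


def f_statistic(vol_by_day, total_volatility):
    """F = Variance(vol,day) / Total_volatility^2."""
    return statistics.variance(vol_by_day) / total_volatility ** 2


# Hypothetical annualized volatilities for Mon..Fri (illustration only).
vol_by_day = [0.20, 0.15, 0.18, 0.22, 0.16]
total_volatility = 0.181  # the 18.1% total from the slide

print("range:", range_statistic(vol_by_day))
print("sd:   ", sd_statistic(vol_by_day))
print("F:    ", f_statistic(vol_by_day, total_volatility))
```

All three are zero when every weekday has the same volatility and grow as the weekdays diverge, which is the property the slide asks for.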

Methodology • Construct a model that simulates measurements given that the null-hypothesis is true. • Typically makes various assumptions. • Use Monte Carlo simulation to produce several simulated data sets. Apply the statistic to each. • pValue: Pr(Stat_sim ≥ Stat_meas)
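In practice the pValue is estimated as a simple fraction over the Monte Carlo runs. As a sketch, where N (the number of simulated data sets) and the indicator notation are labels introduced here rather than taken from the slides:

```latex
\hat{p} \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \mathbf{1}\!\left[\, \mathrm{Stat}_{\mathrm{sim}}^{(i)} \ge \mathrm{Stat}_{\mathrm{meas}} \,\right]
\;\approx\; \Pr\!\left( \mathrm{Stat}_{\mathrm{sim}} \ge \mathrm{Stat}_{\mathrm{meas}} \right)
```

Here each simulated statistic is computed on one data set generated under the null-hypothesis, and the measured statistic is computed once on the real data.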

Modeling the Null Hypothesis • Null Hypothesis: The volatility is 18.1% (the total volatility) on every day of the week. • How could you simulate the data? (Hint: There are multiple possible approaches.) • What assumptions are you making? • Some ideas: • Randomly generate each day’s price change from a LogNormal distribution. • Shuffle existing data. • Exercise: Implement a model of the null-hypothesis in your Analytica model. (One random dataset for each item in Run)
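Both ideas can be sketched in Python. This is an illustration under stated assumptions, not the Analytica solution: the function names are hypothetical, the LogNormal version assumes zero drift and treats 18.1%/sqrt(250) as the daily sigma of the close/open ratio, and the shuffle version assumes the observed changes are exchangeable across weekdays.

```python
import math
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]


def simulate_lognormal(weekdays, annual_vol=0.181, trading_days=250, rng=random):
    """Draw every day's change from the same LogNormal, whatever the weekday."""
    daily_sigma = annual_vol / math.sqrt(trading_days)
    # close/open is modeled as LogNormal(0, daily_sigma); change = ratio - 1.
    return [(day, rng.lognormvariate(0.0, daily_sigma) - 1.0) for day in weekdays]


def simulate_shuffle(weekdays, changes, rng=random):
    """Reassign the observed changes to weekdays at random."""
    shuffled = list(changes)  # copy, so the measured data is left untouched
    rng.shuffle(shuffled)
    return list(zip(weekdays, shuffled))


rng = random.Random(0)
weekdays = [DAYS[i % 5] for i in range(100)]
one_dataset = simulate_lognormal(weekdays, rng=rng)

# Hypothetical stand-in for the measured changes, reused to show the shuffle.
observed = [change for _, change in one_dataset]
another_dataset = simulate_shuffle(weekdays, observed, rng=rng)
print(one_dataset[:3], another_dataset[:3])
```

The LogNormal approach assumes a distribution shape; the shuffle approach assumes nothing about shape but can only reproduce values that already occur in the sample.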

Computing the Statistic on Simulated Data • Exercise: Apply your statistic to each simulated dataset. • Note: Larger statistic values occur when the variation in volatility by day is largest. • Exercise: What fraction of simulated datasets have a statistic value at least as large as that of the actual data? • This is the p-value. • Is Alice’s hypothesis statistically significant?
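The two exercises can be put together in a short Python sketch. It assumes the shuffle approach to the null-hypothesis and uses the standard deviation of the per-weekday standard deviations as the statistic; the names and the randomly generated stand-in data are hypothetical, so the printed value says nothing about the real S&P sample.

```python
import random
import statistics


def vol_spread(weekdays, changes):
    """Statistic: SD across weekdays of each weekday's SD of daily changes."""
    by_day = {}
    for day, change in zip(weekdays, changes):
        by_day.setdefault(day, []).append(change)
    return statistics.stdev([statistics.stdev(v) for v in by_day.values()])


def p_value(weekdays, changes, runs=1000, seed=0):
    """Fraction of shuffled datasets whose statistic is >= the measured one."""
    rng = random.Random(seed)
    stat_meas = vol_spread(weekdays, changes)
    shuffled = list(changes)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # one simulated dataset under the null
        if vol_spread(weekdays, shuffled) >= stat_meas:
            hits += 1
    return hits / runs


# Hypothetical stand-in for the 100 sampled days (weekdays coded 0..4).
data_rng = random.Random(42)
weekdays = [i % 5 for i in range(100)]
changes = [data_rng.gauss(0.0, 0.0114) for _ in weekdays]

p = p_value(weekdays, changes)
print(f"p-value: {p:.3f}", "(significant)" if p <= 0.05 else "(not significant)")
```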

Common Misuse of Paradigm: Multiple Hypotheses • Scenario: • Alice identifies 20 other plausible hypotheses to test, e.g.: • Volatility on Tues is different from the other 4 days. • Volatility varies by month. • September has a higher volatility than other months. • … • She tests each of these individually and finds one of them to be statistically significant at a 5% level. • She publishes this result. • What’s wrong here? • What should she do differently?
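A rough calculation shows the scale of the problem. Assuming, for illustration only, that the 20 tests are independent and that every Null hypothesis is in fact true (neither assumption comes from the slide), the chance that at least one test comes out significant at the 5% level is:

```latex
\Pr(\text{at least one } p \le 0.05) \;=\; 1 - (1 - 0.05)^{20} \;\approx\; 0.64
```

One common remedy, not named on the slide, is a Bonferroni correction: require p ≤ 0.05/20 = 0.0025 on each individual test, which keeps the overall chance of a false positive at or below 5%.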

Interpreting p-Value • Small value (≤ 5%) • Accept main hypothesis • Data is inconsistent with Null-hypothesis • Otherwise (p > 5%) • Conclude only that the data sample was too small to detect a relationship. • Hypothesis may still be true or false: “Larger research study required” • P-value is not: • A measure of the strength of relationship. • The probability that the hypothesis is true.

Drawbacks with Statistical Hypothesis Testing Paradigm • When the Null hypothesis is true, 1 in 20 tests still accepts the (false) main hypothesis (at 5% significance level). • Often abused by people testing many hypotheses. • Nearly any hypothesis is confirmed with a large enough sample. • Most hypotheses will have at least a minuscule “true” effect. • With enough data, even the most minuscule effect becomes statistically significant. • The “uncertainty” about the hypothesis is not available. Doesn’t provide P(H), which would be useful in models that use the results. • Numerous subjective components that are not recognized or reported explicitly. • “Cookbook tests” are very often misapplied when assumptions don’t hold, leading to greater confidence than is warranted by the data.

New Exercise [Table: number of subjects (purely fictional data)] • Hypothesis: TCE exposure is associated with an increased risk of getting Parkinson’s disease. • Null Hypothesis: • Parkinson’s rates are the same among those exposed and not exposed to TCE. • Exercise: • Identify an appropriate statistic. • Model the null-hypothesis. • Compute the p-Value.
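One possible approach to this exercise, sketched in Python. The subject counts below are made up for illustration (they are not the figures from the slide), the function names are hypothetical, the statistic is the difference in Parkinson’s rates between exposed and unexposed subjects, and the null-hypothesis is modeled by shuffling the disease labels across subjects.

```python
import random


def rate_difference(exposed, cases):
    """Statistic: Parkinson's rate among exposed minus rate among unexposed."""
    n_exposed = sum(exposed)
    n_unexposed = len(exposed) - n_exposed
    cases_exposed = sum(c for e, c in zip(exposed, cases) if e)
    cases_unexposed = sum(cases) - cases_exposed
    return cases_exposed / n_exposed - cases_unexposed / n_unexposed


def tce_p_value(exposed, cases, runs=2000, seed=0):
    """Fraction of label shuffles with a rate difference >= the measured one."""
    rng = random.Random(seed)
    stat_meas = rate_difference(exposed, cases)
    shuffled = list(cases)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # under the null, exposure does not matter
        if rate_difference(exposed, shuffled) >= stat_meas:
            hits += 1
    return hits / runs


# Made-up counts: 100 exposed subjects with 12 cases, 200 unexposed with 10.
exposed = [1] * 100 + [0] * 200
cases = [1] * 12 + [0] * 88 + [1] * 10 + [0] * 190

print("measured rate difference:", rate_difference(exposed, cases))
print("p-value:", tce_p_value(exposed, cases))
```

The test is one-sided because the hypothesis is about increased risk; a two-sided version would compare absolute differences instead.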

Summary • Statistical Hypothesis Testing tests: • Is the support for a hypothesis statistically significant given a dataset? • Significance level (p-value) is: • Probability of seeing data at least as extreme as the actual data when the Null hypothesis is true. • p-value ≤ 5% → accept hypothesis. • p-value > 5% → conclude nothing, need more data. • Methodology: • Identify statistic (scalar metric): A measure of divergence from null-hypothesis. • Build model of null-hypothesis to “simulate” data sets. • Compute p-value.
