PDI Data Literacy: Busting Myths of Big Data: Part 2 Nairanjana (Jan) Dasgupta Professor, Dept. of Math and Stats Boeing Distinguished Professor of Math and Science Director, Center of Interdisciplinary Statistics Education and Research (CISER) Washington State University, Pullman, WA Email: dasgupta@wsu.edu
Part 1: what we have covered in September 2018 • Types of Data • Population versus Sample • Exploratory and Confirmatory studies • Experiment versus observational studies • Distinction: Uni-variate, Bi-variate, Multi-variate, multiple • Graphical/Numerical Summary of data • Measures of Center and Spread • Measures of the Dimensionality
Synopsis of Part 1 • It matters what TYPE of data we have. • It matters how the data were collected. • It matters whether we have a population or a sample. • It matters if you randomized the process of data collection. • If the whole population is studied, all you need to do is summarize; with a sample we need to think of inference. • If "big data" really is a population, then all we need to do is visualize and summarize. No inference required.
Synopsis continued • Use pie charts, bar graphs for visualizing univariate categorical data • Use box plots, histograms for univariate numerical data • For bivariate data we can do scatter plots • For multivariate data we can do clusters • For numerical data use mean, median as measures of center • For categorical data use mode • Use standard deviation, IQR as measures of spread for numerical data • For categorical data use frequency plots or tables to summarize • Summarization allows us to make sense of raw data and is crucial before we analyze it.
Synopsis continued • Population versus sample: what do we have data on? • Big data is often opportunistic data and so an extreme form of observational study. • However, if we assume we have a sample and we want to answer questions about a population, we need to think inference. • Today’s discussion will be on inference.
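A minimal sketch of the numerical and categorical summaries listed above, using only Python's standard library. The data values and the function names `summarize_numerical` and `summarize_categorical` are hypothetical, chosen for illustration; they are not from the talk.

```python
import statistics
from collections import Counter

def summarize_numerical(values):
    """Center and spread measures for a numerical variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
    }

def summarize_categorical(labels):
    """Mode and frequency table for a categorical variable."""
    counts = Counter(labels)
    return {"mode": counts.most_common(1)[0][0], "frequencies": dict(counts)}

# Hypothetical data: statistics classes taken by ten students, and their programs.
classes_taken = [0, 1, 1, 2, 2, 2, 3, 3, 4, 12]
programs = ["MS", "PhD", "PhD", "MS", "PhD"]

print(summarize_numerical(classes_taken))
print(summarize_categorical(programs))
```

Note how the single large value (12) pulls the mean above the median, which is one reason to report both.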
Today’s topics: Inference: Making decisions from data: • Going from sample to population • Inference and decision making • Estimation and Intervals • Testing and Confidence Intervals • Errors in testing: Type I and Type II • Power • Statistical significance • P-value: good, bad or misused • ASA’s statement about p-values
Today’s topics continued: Big Data and its pros and cons • What are the advantages of big data? • What do we mean by big? Big n or big p? • Decision making with big data • Predictive analytics • Back to population versus sample • Overview and recap
Part 2: Section 1 Inference and analysis: Making Decisions from data
When and Why we need inference • IF the data we collected was really a population we do not need to do any inference. • Let us consider the question: what is the average number of statistics classes taken by students before entering graduate school? • If we consider the audience: would this be a population or a sample? • Would my current audience be a GOOD sample?
Inference: • Use data and statistical methods to infer (get an idea, resolve a belief, estimate) something about the population based on results of a sample
Estimation • We have no idea at all about the parameter of interest and we use the sample information to get an idea of (estimate) the population parameter • Point Estimation: Using a single point to estimate the population parameter • Interval Estimation: We use an interval of values to estimate the population parameter
Hypothesis Testing • We have some idea about a population parameter. • We want to test this claim based on data we collect. • Probably one of the most used and most misunderstood methods in science. • This provides us with the “dreaded p-value”.
Parameter • To understand inference, we really need to get a very clear idea about what a parameter is. • By definition: A parameter is a numerical characteristic (or characteristics if multivariate) of the population that is of interest. • Let us go back to our example: • We are interested in the average number of statistics classes students have taken when they come to graduate school here.
Population and Parameter: • Here the population is all graduate students at WSU. • We have to be careful here: is it all CURRENT graduate students or all students past, present and future? • To make matters easy let us say it is CURRENT graduate students. • The parameter is: the average number of statistics classes taken by the students.
Sample and Statistic • Our choices are to do a census and then compute the average from the entire census – this is the parameter • Or to take a sample and calculate the number FROM the sample. • If we use the sample and compute the number from the sample we call the sample average our STATISTIC.
How do we sample • Here we need to think of how we sample this very well defined population. • Thoughts? • Hallmarks of a good sample: representative, unbiased, reliable
Estimation: Complete Ignorance about parameter • If we use the sample statistic to get an idea of the population parameter, what we are doing is inference, specifically ESTIMATION • What assures us that the sample statistic will be a good estimate of the population parameter? • This leads us to Unbiasedness and Precision
Point Estimation • The idea of point estimation seems intuitive: we use the sample value for the population value. • We can do this because we make certain assumptions about the probability distribution of the sample statistic. • Generally we assume that the sampling scheme we pick gives the statistic a distribution that is unbiased and has high precision. • If our method is indeed unbiased then the sample mean should, on average, be a good estimator for the population mean.
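Unbiasedness and precision can be illustrated with a small simulation. This is only a sketch: the population below is made up (a hypothetical 5,000 current graduate students), and the helper names `make_population` and `sampling_distribution` are introduced here for illustration. It draws many simple random samples and looks at how the sample means behave.

```python
import random
import statistics

def make_population(size=5000, seed=1):
    """Hypothetical population of class counts (invented for illustration)."""
    rng = random.Random(seed)
    return [rng.choice([0, 1, 1, 2, 2, 2, 3, 3, 4, 6]) for _ in range(size)]

def sampling_distribution(population, n, draws=2000, seed=0):
    """Sample means from repeated simple random samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

population = make_population()
mu = statistics.fmean(population)  # the parameter (known here only because we built the population)

for n in (10, 100):
    means = sampling_distribution(population, n)
    bias = statistics.fmean(means) - mu   # close to 0: unbiased
    spread = statistics.stdev(means)      # shrinks as n grows: more precise
    print(f"n={n}: average bias {bias:+.3f}, spread of sample means {spread:.3f}")
```

The average of the sample means sits near the population mean for both sample sizes, while the spread is noticeably smaller for the larger sample.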
Interval Estimation • Even if we believe in the unbiasedness of our estimator, we still often want an interval rather than just a single value for the estimates. • This allows us to have interval estimation. • This technique takes into account the spread as well as the distribution in the estimation. It gives us an interval of values, in which we feel that our parameter is contained with high confidence.
Confidence Interval: • In general a confidence interval for the population mean is given by: • Sample mean ± margin of error • Question is: how does one calculate “margin of error”? • Answer: we need distributions and random variables to do that. • This means some mathematics and probability theory.
Confidence interval • Used quite a bit in the past • Gives similar information to hypothesis tests • Often can be inverted to construct tests • However, theoretically it is quite different, as here we talk about the SIZE of the effect rather than significance of the effect. • With all the bad press that p-values have received this might make a comeback.
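One common way to get the margin of error, sketched here under an assumption the slides do not spell out: for a large sample, the sample mean is treated as approximately normal, so the margin is a normal quantile times the standard error. The function name `z_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample interval: sample mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)               # margin of error
    return sample_mean - margin, sample_mean + margin

# Sample mean 4, standard deviation 2, n = 100 (numbers used later in the talk).
low, high = z_confidence_interval(4, 2, 100)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

For small samples a t quantile would usually replace the normal quantile; that variant is not shown here.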
Hypothesis Testing: some idea about the parameter • We have some knowledge about the parameter • A claim, a warranty, what we would like it to be • We test the claim using our data • First step: formulating the hypothesis • Always a pair: Research and Nullification of research • (affectionately called H0 and Ha)
How to formulate your hypothesis • First state your claim or research. • Let us say we believe that the average number of stats classes taken by graduate students coming into WSU is greater than 2. • Here our parameter is the population mean of statistics classes taken by graduate students at WSU • Claim: Mean > 2 • What nullifies this? • Mean ≤ 2 (Remember the “=” always resides with the null)
Logic of Testing • To test the hypothesis, we try to disprove (reject) the null hypothesis. • If we can reject the null, we conclude in favor of Ha (our research hypothesis); if we cannot, we simply fail to reject H0. • Think of how the legal system works: • H0: Not Guilty • Ha: Guilty
How do we do this? • We take a sample and look at the sample values. Then we ask: if the null were true, would this be a likely value of the sample statistic? • If our observed value is not a likely value, we reject the null. • How likely or unlikely a value is, is determined by the sampling distribution of that statistic.
Example • In our example we were interested in the hypothesis about the average number of classes taken by incoming graduate students: • H0: µ ≤ 2 • Ha: µ > 2 • If we observed a mean of 4 and a standard deviation of 1 from a sample of 100, would you consider the null likely? • How about if the mean was 4 and the standard deviation was 20 from a sample of 100?
Players in decision making • Your observed statistic • Your sample size • Your observed standard deviation • Your ability to determine, probabilistically, how likely your observed value is under the null.
Errors in testing: • Since we make our decisions about the parameter based on sample values we are likely to commit some errors. • Type I error: Rejecting H0 when it is true (False Positive) • Type II error: Failing to reject H0 when Ha is true (False Negative) • In any given situation we want to minimize these errors. • P(Type I error) = α, also called size or level of significance. • P(Type II error) = β. • Power = 1 - β. HERE we reject H0 when the claim is true. We want power to be LARGE. Power is the TRUE Positive we want.
Example • I am introducing a new drug into the market. The drug may have some serious side effects. Before I do so I will go through tests to see if it is effective in curing disease. • H0: not effective • Ha: drug is effective • What is Type I error and Type II error in this case? • Which is worse? More importantly think of the consequence of these errors.
One more example: • Ann Landers in her advice column on the reliability of DNA testing for determining paternity advises, “To get a completely accurate result you would have to be tested, so would the man and your mother.” • Consider the hypothesis: • H0: a particular man is the father • Ha: a particular man is not the father. • Discuss the probabilities of Type I and Type II errors.
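The two error rates can be estimated by simulation. The sketch below makes several assumptions that are not in the slides: normally distributed data, a known standard deviation of 2, n = 100, α = 0.05, and an alternative value of µ = 2.5. The function name `rejection_rate` is hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, mu0=2.0, sd=2.0, n=100, alpha=0.05,
                   trials=2000, seed=0):
    """Fraction of simulated samples in which H0: mu <= mu0 is rejected."""
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sd / math.sqrt(n))  # known-sd z statistic
        rejections += z > z_critical
    return rejections / trials

type_1 = rejection_rate(true_mean=2.0)      # H0 true (at its boundary): false positives
power = rejection_rate(true_mean=2.5)       # Ha true: true positives
type_2 = 1 - power                          # false negatives
print(f"Type I rate ~ {type_1:.3f}, Type II rate ~ {type_2:.3f}, power ~ {power:.3f}")
```

The estimated Type I rate should land near the chosen α, while the Type II rate depends on which alternative value is simulated.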
Decision Making using Hypotheses: • In general, this is the way we make decisions. • The idea is we want to minimize both Type I and II errors. • However, in practice we cannot minimize both the errors simultaneously. • What is done is we fix our Type I error at some small level, e.g. 0.1, 0.05 or 0.01. Then we find the test that will minimize our Type II error for this fixed level of Type I error. This gives us the most powerful test. • So in solving a hypothesis problem, we formulate our decision rule using the fixed value of Type I error. The cutoff that defines the decision rule is called the CRITICAL VALUE.
How does rejection of null work with Critical values? • First we calculate the value of the sample/test statistic. • Then we compare this value with the critical value, which is taken from the distribution of the statistic under the null so that we allow ourselves a Type I error of alpha. • Based on this, if our observed value is beyond our critical value, we feel justified in rejecting the null. • CRITICISM: Choice of alpha is arbitrary. We can make alpha big or small depending on what we want our outcome to be…
P-values: Elephant in the room • Sometimes hypothesis testing can be thought to be subjective. This is because the choice of α may alter a decision. Hence it is thought that one should report p-values and let the readers decide for themselves what the decision should be. • The p-value or probability value is the probability, computed assuming the null is true, of getting a value at least as extreme as the one observed. If this probability is small then our observed value is an unlikely value under the null and we should reject the null. Otherwise we cannot reject the null.
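The critical-value procedure can be sketched as a one-sided z test. This assumes a normal approximation for the sample mean; the function name `critical_value_test` is hypothetical, and the numbers are the ones used in the talk's running example (mean 4, n = 100, standard deviation 2 or 20).

```python
import math
from statistics import NormalDist

def critical_value_test(sample_mean, mu0, sd, n, alpha=0.05):
    """One-sided z test of H0: mu <= mu0 against Ha: mu > mu0."""
    z_observed = (sample_mean - mu0) / (sd / math.sqrt(n))
    z_critical = NormalDist().inv_cdf(1 - alpha)  # about 1.645 when alpha = 0.05
    return z_observed, z_critical, z_observed > z_critical

for sd in (2, 20):
    z_obs, z_crit, reject = critical_value_test(4, 2, sd, 100)
    print(f"sd={sd}: z={z_obs:.2f}, critical={z_crit:.3f}, reject H0: {reject}")
```

Changing `alpha` moves the critical value, which is exactly the arbitrariness the criticism points at.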
P-value for our example • For the hypothesis we talked about earlier: • H0: µ ≤ 2 • Ha: µ > 2 • If we observed a mean of 4 and a standard deviation of 2 from a sample of 100, would you consider the null likely? • How about if the mean was 4 and the standard deviation was 20 from a sample of 100? • P-value = P(Z > (4 - 2)/(2/sqrt(100)) | µ = 2) = P(Z > 10) < .001 • P-value = P(Z > (4 - 2)/(20/sqrt(100)) | µ = 2) = P(Z > 1) ≈ .16
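The two p-values can be checked numerically. This sketch assumes the normal (z) approximation and evaluates the null at its boundary, µ = 2; the function name `upper_tail_p_value` is hypothetical. It uses `math.erfc` so that very small tail probabilities are not rounded to zero.

```python
import math

def upper_tail_p_value(sample_mean, mu0, sd, n):
    """P(Z > z) for the one-sided test of H0: mu <= mu0, evaluated at mu = mu0."""
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

p_small_sd = upper_tail_p_value(4, 2, 2, 100)    # z = 10
p_large_sd = upper_tail_p_value(4, 2, 20, 100)   # z = 1
print(f"sd=2:  p = {p_small_sd:.2e}")
print(f"sd=20: p = {p_large_sd:.4f}")
```

The same observed mean gives very different evidence depending on the standard deviation, which is the point of the two cases.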
Criticism of p-values • As more and more people used p-values and with an effort to guard against the premise that “we can fool some of the people some of the time”, journals started having strict rules about p-values. • To publish you needed to show small p-values. • No SMALL p-values, no publication… • This has often led to publication of ONLY significant results. • It has also led to a “get the p-value small by hook or by crook” attitude.
ASA Statement about p-value • A p-value tells us how incompatible the data are with a specified statistical model. • P-values do not measure the probability that the studied hypothesis is true or that the data were produced by random chance alone. • Scientific conclusions and business policy decisions should not be based only on whether a p-value passes a specific threshold. • Proper inference requires full reporting and transparency. • A p-value does NOT measure the size of the effect or the importance of the result. • By itself the p-value cannot provide a good measure of evidence regarding the model.
Power: The other elephant in the room • Power is the TRUE positive • In other words: what is the probability you would reject the null under a specified value of the alternative? • So first we need to figure out what value of the alternative we choose to calculate the power. This choice is up to us and we often call it the effect size.
Example of power • For the hypothesis we talked about earlier: • H0: µ ≤ 2 • Ha: µ > 2 • Suppose we observed a mean of 4 and a standard deviation of 2 from a sample of 100. Calculate power when µ = 2.5, 3, 3.5, 4. • At µ = 2.5 the standardized shift is (2.5 - 2)/(2/10) = 2.5. Taking α = .05 as an assumed level (critical value 1.645): Power = P(Z > 1.645 - 2.5) = P(Z > -0.855) ≈ .80 • Etc.
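A sketch of the power calculation for a one-sided z test, for each of the alternative values listed. The significance level α = 0.05 is an assumption (the example does not state one), as is treating the standard deviation of 2 as known; the function name `power_one_sided` is hypothetical.

```python
import math
from statistics import NormalDist

def power_one_sided(mu_alt, mu0=2.0, sd=2.0, n=100, alpha=0.05):
    """P(reject H0: mu <= mu0) when the true mean is mu_alt (z test, known sd)."""
    z_critical = NormalDist().inv_cdf(1 - alpha)       # rejection cutoff under H0
    shift = (mu_alt - mu0) / (sd / math.sqrt(n))       # standardized effect
    return 1 - NormalDist().cdf(z_critical - shift)    # P(Z > z_critical - shift)

for mu_alt in (2.5, 3.0, 3.5, 4.0):
    print(f"mu = {mu_alt}: power = {power_one_sided(mu_alt):.4f}")
```

Power rises with the distance between the alternative and the null value; at the null value itself it equals α.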
What are the players for power? • Sample size • Effect size • Standard deviation. • So to really calculate power one needs to have data to understand the distribution and have a feel for the standard deviation. Pre-hoc power calculation is often “trying to fool some of the people some of the time”
Recap of Part 2: Section 1 • To make inferences our population of interest and parameter need to be well defined. • Errors exist in testing and have to be considered. • P-values are a measure of how incompatible the existing data are with the null hypothesis and cannot be used to PROVE anything. • To calculate power we need to look into values under the alternative and this can be subjective.
Worksheet for Section 1: • True or False: Type I error is always the worst, hence we should focus on controlling it rather than Type II error. • You believe that the average time it takes students to walk from one class to another at WSU is more than the 10 minutes you are allotted. • Write out your null and alternative hypotheses. • Write down what the Type I error would be in this context. • You test it and get a p-value of .13. Does this indicate that the null is true?
Part 4: Big data, its pros and cons
What determines big data: The 5 V’s • Volume: considered too large for regular software • Variety: often a mix of many different data types • Velocity: extreme speed at which data are generated • Variability: inconsistency of the data set • Veracity: how reliable the data are
How big is big? • By big we mean its volume is such that it is hard to analyze this on a single computer. • That in itself shouldn’t be problematic • But requiring specialized machines to analyze this has added to the myth and enigma of big data. • The problem with big data, at least as I see it, is some very pertinent statistical questions are bypassed when dealing with it.
Some statistical thoughts? • Is the big data a sample or a population? • If it is really a population: then analysis means constructing summary statistics. • This is bulky but not too difficult. • If it is a sample: what was the sampling frame? • If no population was considered when collecting this data, it is definitely not a representative sample. • So, should one really do inference on BIG data? • If one is allowed to do inference, wouldn’t the sheer size of the data give us so much power that almost any effect we test for, however small, comes out significant?
Structure of data • Generally most data sets are rectangular in nature with p variables and n observations collected. • In big data we often have many more predictors than observations (the big p problem), • or many more (orders of magnitude more) observations than predictors (the big n problem). • Often both n and p are big and are fluid as they are constantly updated and amassed.
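The concern about sheer size can be illustrated numerically. In this sketch all numbers are hypothetical: a null value of 2, an observed mean of 2.01 (a practically negligible difference), and a standard deviation of 2. Only n changes. The function name `p_value_for_n` is introduced here for illustration.

```python
import math

def p_value_for_n(n, sample_mean=2.01, mu0=2.0, sd=2.0):
    """One-sided z-test p-value for a fixed, tiny observed difference."""
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}: p-value = {p_value_for_n(n):.3g}")
```

The effect stays the same size throughout; only the p-value shrinks, which is why a small p-value from a huge data set says little about practical importance.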