Basic Statistics: Conceptual understanding and application

2011 Boot Camp
Basic Statistics: Conceptual understanding and application
Dae Joong Kim
John Glenn School of Public Affairs, Ohio State University
kim.2769@osu.edu

Introduction Quantitative Analysis Courses
• Boot Camp: Basic statistics (online: http://glennschool.osu.edu/bootcamp/index.html)
• 820: Data Analysis for Public Policy and Management (Autumn quarter)
• 822: Multivariate Data Analysis for Public Policy and Management (Spring quarter)

BASIC STATISTICS (overview)
The study of the collection, analysis, and interpretation of data related to your research questions or models
Example
• Research question: relation between weight and height
• Sampling: population = JGS master students; sample = 10 students
• Analysis: mean of weight; median of weight; correlation between weight and height
Data
• Observations (individuals or cases)
• Variables = observations’ attributes
• Level of measurement (scales of measure)
 - Discrete variables: nominal scale (e.g., sex (male or female)); ordinal scale (e.g., economic status (low, middle, high))
 - Continuous variables: interval scale (e.g., income); ratio scale (e.g., height; weight)
Collection of Data
• Sampling: population → sample; the sample is used to estimate, or infer, the population variables
• KEY CONDITIONS: randomness; representativeness
• Surveys: questionnaire; interview
• Experiments
Analysis of Data
• Descriptive statistics analysis (data distribution analysis): mean (average; expected value); variance; standard deviation; frequency and percentage, etc.
 - Mean: arithmetic average of a set of numbers
 - Median: the middle observation in a group of data when the data are ranked in order of magnitude
 - Mode: the most common value in any distribution
 - [Figure: distributions without an outlier and with an outlier]
• Inferential statistics analysis (relationship analysis between variables: explorative analysis or hypothesis testing)
 - Correlation (positive, negative, or none)
 - Mean difference analysis: t-test for two groups; analysis of variance (ANOVA) for multiple groups
 - Regression (independent variable and dependent variable): single regression model; multiple regression model; linear regression model (OLS); non-linear regression model
• KEY CONDITIONS: normal distribution; central limit theorem
• BASIC ASSUMPTIONS: linearity; normality; homoscedasticity; independence
• Correlation (r): just a relationship between variables (no direction; symmetric relationship); corr(weight, height), -1 ≤ r ≤ 1
• Regression: the relationship of one or more independent variables with a dependent variable (directional; asymmetric relationship); Weight = height + error, 0 ≤ r² ≤ 1
• Causation: a correlation or regression is not the same as causation unless it satisfies 1) time order between the variables (cause before effect), 2) no other variables between them, and 3) the variables changing together.
• Statistical hypothesis testing
 - Null hypothesis (H0): the hypothesis that researchers try to disprove
 - Alternative (or research) hypothesis (Ha): the hypothesis that researchers expect to support their models
• Statistics = probabilistic, random, or stochastic analysis → errors in equations or models; a useful tool, e.g., Y = b0 + b1X1 + b2X2 + e
• Mathematics = deterministic analysis → no errors in equations or models; mathematical language, e.g., Y = b0 + b1X1 + b2X2
Interpretation of Data Output
• Interpret data outputs based on your background knowledge and experience (different researchers can interpret the same data in different ways)
• Support your argument, or your theoretical model, based on your interpretation
• Statistical significance: whether an estimate is distinguishable from zero at a chosen confidence level (99%, 95%, or 90%)
• Substantial significance: whether relevant estimates have the expected sign (+ or -) and magnitude

Statistics? Definition
The study of the collection, analysis, and interpretation of DATA related to your research (questions or models)
Research question: e.g., Is there any relationship between weight and height?

Statistics? Definition
Statistics = probabilistic, random, or stochastic analysis → errors in equations or models; a useful tool
e.g., Y = b0 + b1X1 + b2X2 + e
Mathematics = deterministic analysis → no errors in equations or models; mathematical language
e.g., Y = b0 + b1X1 + b2X2
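The contrast between the two equations can be sketched in Python. This is only an illustration: the coefficient values and the normally distributed error term are assumptions, not values from the slides.

```python
import random

# Hypothetical coefficients, chosen only for illustration.
B0, B1, B2 = 1.0, 2.0, 0.5

def deterministic_y(x1, x2):
    """Mathematical model: the same inputs always give the same Y."""
    return B0 + B1 * x1 + B2 * x2

def stochastic_y(x1, x2, rng, sigma=1.0):
    """Statistical model: adds a random error e (assumed normal, mean 0)."""
    e = rng.gauss(0.0, sigma)
    return deterministic_y(x1, x2) + e

rng = random.Random(2011)  # fixed seed so the run is repeatable
print(deterministic_y(1.0, 2.0))    # 1.0 + 2.0*1.0 + 0.5*2.0 = 4.0 every time
print(stochastic_y(1.0, 2.0, rng))  # 4.0 plus a random error
print(stochastic_y(1.0, 2.0, rng))  # a different value on the next call
```

The deterministic function returns the same Y for the same X values, while the stochastic one scatters around that value, which is why statistical models carry an error term.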

DATA? Definition
Data refers to qualitative (e.g., female/male) or quantitative attributes of a variable or set of variables. Data are the results of measurements and can be the basis of graphs, images, or observations of a set of variables.
Raw data (= unprocessed data) refers to a collection of numbers and characters.

DATA? Example: Raw Data

DATA? Purpose
To get necessary information and knowledge
Data → (interpretation) → Information → (discussion; agreement) → Knowledge
“Data” is not “information” unless it is interpreted

DATA? Structure
Data (e.g., raw data)
• Observations (= individuals or cases)
• Variables = observations’ attributes
Level of measurement (scales of measure)
• Discrete variables
 1. Nominal scale, e.g., sex (male or female)
 2. Ordinal scale, e.g., economic status (low, middle, high)
• Continuous variables
 3. Interval scale, e.g., income ($100,000)
 4. Ratio scale, e.g., height; weight

Collection of Data Definition
The selection of a SAMPLE (a subset of individuals) from within a POPULATION to yield some information/knowledge about the whole population, especially for the purposes of making predictions based on statistical inference
*Population: all people or items with the characteristic that one wishes to understand

Collection of Data Structure
Population → (sampling) → Sample → (inference: estimation; prediction) → Population
• Randomness
• Representativeness
• Surveys (observation)
 - Questionnaire: paper, web, etc.
 - Interview: face-to-face, phone, etc.
• Experiments
 - Control group vs. experimental group

Collection of Data Survey or Interview Questionnaires
1. Nominal (= categorical or dummy) question
 e.g., Your gender? Male ___ Female ___
2. Ordinal-scale question
 e.g., How satisfied are you with your annual salary? a. very high b. high c. neutral d. low e. very low
3. Interval-scale question
 e.g., What is your annual salary? a. below 20,000 b. 20,000 – 50,000 c. 50,000 – 70,000 d. above 70,000
4. Ratio-scale question
 e.g., What is your height? ________

Collection of Data Randomness and Representativeness
The most important conditions to secure reliable sampling, or to eliminate bias
Randomness: every individual has an equal chance of selection (e.g., a national lottery)
Representativeness: the selected individuals are representative of the larger population
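Equal-chance selection can be sketched with Python's standard library. The population below is hypothetical: 200 ID numbers standing in for a student roster, which the slides do not provide.

```python
import random

# Hypothetical population: ID numbers standing in for 200 master's students.
population = list(range(1, 201))

rng = random.Random(42)  # fixed seed so the draw is repeatable
# Simple random sampling without replacement: every student has the same
# chance of ending up in the sample of 10.
sample = rng.sample(population, k=10)
print(sorted(sample))
```

Random selection removes systematic selection bias, but a sample this small can still differ from the population by chance, so representativeness should be checked rather than assumed.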

Collection of Data Randomness and Representativeness
Quality of Data: low bias and high precision

Collection of Data Normal Distribution
Symmetric distribution of values around the mean of a variable (bell-shaped distribution)
[Figure: three bell-shaped curves; labels on the slide: s.d. (s or σ) = 40, 24, 19; mean (x̄ or μ) = 70, 30, 10]

Collection of Data Normal Distribution (why important?)
1. Distributions of many variables are normal, or quite close to normal.
2. It is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions.
3. If the mean and standard deviation of a normal distribution are known, it is easy to convert back and forth between raw scores and percentiles.

Collection of Data Standard Normal Distribution
The standard normal distribution is called the “Z distribution” <probabilistic distribution>
Z ~ N(0, 1): mean 0 and variance σ² = 1
Z distribution (n ≥ 30) cf. t distribution (n < 30)

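Collection of Data Z distribution table and t distribution table
Z = (x - μ) / s.d.
t = (x̄ - μ) / s.e., where s.e. = s.d. / √n
s.e. (standard error): how close the mean you are estimating is likely to be to the true mean
e.g., Z = 1.13 (table row 1.1, column 0.03) → 87%
e.g., t = 1.26 (df = 9) → 87%
Example: when the mean time on a review is 100 min and the s.d. is 10 min, Z = (111.3 - 100) / 10 = 1.13, so about 87% of reviews take less than 111.3 min (about 13% take more).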
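The review-time example (mean 100 min, s.d. 10 min, cutoff 111.3 min) can be checked without a printed Z table. The sketch below assumes review times are normally distributed and uses the error function from Python's math module in place of the table lookup.

```python
import math

def normal_cdf(z):
    """P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_minutes = 100.0  # mean review time from the slide
sd_minutes = 10.0     # standard deviation from the slide
x = 111.3             # review time of interest

z = (x - mean_minutes) / sd_minutes  # standardize: about 1.13
share_below = normal_cdf(z)          # about 0.87
share_above = 1.0 - share_below      # about 0.13
print(round(z, 2), round(share_below, 2), round(share_above, 2))
```

The value read from the Z table at 1.13 is the share of the distribution below the cutoff; the share above it is the complement.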

Collection of Data Central Limit Theorem
A foundational concept in statistical inference: if a sampling distribution is made up of samples that each contain enough cases (a common rule of thumb is 30 or more), the sample means will be approximately normally distributed, whatever the shape of the population distribution
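A small simulation can illustrate the theorem. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and s.d. 1), draws 2,000 samples of 30 cases each, and summarizes the sample means; the population and the number of replications are illustrative choices, not part of the slides.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # cases per sample
replications = 2000     # number of samples drawn

# Population: exponential with mean 1 and s.d. 1 (strongly right-skewed).
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(replications)
]

center = statistics.fmean(sample_means)  # should be close to the population mean, 1
spread = statistics.stdev(sample_means)  # should be close to 1 / sqrt(30)
print(round(center, 3), round(spread, 3), round(n ** -0.5, 3))
```

Even though individual values are skewed, the sample means cluster symmetrically around the population mean, with a spread close to the population s.d. divided by √n.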

Collection of Data Normal distribution: Mean, Median, Mode
Mean: arithmetic average of a set of numbers
Median: the middle observation in a group of data when the data are ranked in order of magnitude
Mode: the most common value in any distribution
In a normal distribution the mean, median, and mode are equal.

Collection of Data Skewness
• Left-skewed: the left tail is longer; right-skewed: the right tail is longer
• Means are distorted by extreme values, or outliers
• Use the median instead of the mean
• If necessary, transform the variable toward normality, especially in regression analysis
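The effect of an outlier on the mean and the median can be sketched as follows. The ten heights are the ones used in the descriptive-analysis example of this deck; the extra value of 250 is a hypothetical outlier added only for illustration.

```python
import statistics

heights = [164, 166, 170, 172, 174, 174, 180, 182, 187, 190]
with_outlier = heights + [250]  # hypothetical extreme value

mean_before = statistics.mean(heights)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(heights)
median_after = statistics.median(with_outlier)

# The mean is pulled toward the outlier; the median does not move.
print(mean_before, round(mean_after, 1))
print(median_before, median_after)
```

This is why the median is the safer summary for skewed data.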

Analysis of Data Purpose
A step to find “a pattern of data” to get necessary information and knowledge

Analysis of Data Type
Descriptive (statistical) analysis: numerical information (such as mean, median, standard deviation) that summarizes and describes some of the properties of a set of data (sample) but does not infer the properties of the population from which the sample was drawn.
Inferential (statistical) analysis: inferring the properties of a population from the analysis of the properties of a data sample drawn from it.

Analysis of Data Descriptive Analysis
Data distribution analysis: it tells us what values the variable takes and how often each value occurs
• Mean (x̄ (sample); μ (population))
 - Arithmetic average or expected value of a variable x: x̄ = (Σ xᵢ) / n (n = number of observations)
• Variance (s² (sample); σ² (population))
 - The average of the squared differences from the mean: s² = Σ (xᵢ - x̄)² / (n - 1) for a sample; σ² = Σ (xᵢ - μ)² / N for a population
• Standard deviation (s (sample); σ (population))
 - A measure of dispersion, or variation: the square root of the variance

Analysis of Data Descriptive Analysis
• Range: difference between the maximum value and the minimum value
• Min: the lowest, or minimum, value of a variable
• Max: the highest, or maximum, value of a variable
• Q1: the first quartile (25th percentile)
• Q3: the third quartile (75th percentile)
[Figure: the ordered values 1 to 13 with Min, the 25th percentile, the 50th percentile, and Max marked]
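These summary values can be computed for the ordered values 1 to 13 with Python's statistics module. Quartile conventions differ between textbooks and software, so the quartiles below reflect the module's default method and should be read as one reasonable convention rather than the only answer.

```python
import statistics

values = list(range(1, 14))  # the ordered values 1..13

# statistics.quantiles with n=4 returns the three quartile cut points.
# Its default "exclusive" method is one of several conventions; other
# software may report slightly different quartiles.
q1, median, q3 = statistics.quantiles(values, n=4)

five_number_summary = (min(values), q1, median, q3, max(values))
value_range = max(values) - min(values)
print(five_number_summary, value_range)
```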

Analysis of Data Descriptive Analysis
• Frequency distribution
• A table that shows a body of your data grouped according to numerical values
• Example:

Analysis of Data Descriptive Analysis
Mean: arithmetic average of a set of numbers
Median: the middle observation in a group of data when the data are ranked in order of magnitude
Mode: the most common value in any distribution
• Height (ordered): 164 166 170 172 174 174 180 182 187 190
• Mean: 175.9
• Median: 174
• Mode: 174
• Variance: s² = 74.77
• Standard deviation: s = √74.77 ≈ 8.65
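The height figures can be reproduced with Python's statistics module. This is a sketch in Python rather than the Stata workflow the deck uses; note that `variance` and `stdev` are the sample versions, which divide by n - 1.

```python
import statistics

heights = [164, 166, 170, 172, 174, 174, 180, 182, 187, 190]

mean = statistics.mean(heights)
median = statistics.median(heights)
mode = statistics.mode(heights)
variance = statistics.variance(heights)  # sample variance, divides by n - 1
std_dev = statistics.stdev(heights)      # square root of the sample variance

print(round(mean, 1), median, mode, round(variance, 2), round(std_dev, 2))
```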

Analysis of Data Descriptive Analysis: Using “Stata”

Analysis of Data Inferential Analysis
Relationship analysis between variables (explorative analysis or hypothesis testing)
• Main analyses: mean difference analysis (t-test; ANOVA) and relationship analysis (correlation; regression), etc.
• Mean difference analysis
 - t-test for two groups
 - Analysis of variance (ANOVA) for multiple groups

Analysis of Data Inferential Analysis: Mean Diff
• Example: t-test
• Height difference between male and female
• Coding: male = 0; female = 1

Analysis of Data Inferential Analysis
• Relationship analysis
• Correlation
 - Correlation means linear association between two variables
 - Three types of correlation: negative, positive, zero
[Figure: three scatter plots of X2 against X1 showing negative, positive, and zero correlation]
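The three types of correlation can be sketched with a hand-written Pearson coefficient. The small data sets below are hypothetical, constructed so that r comes out at the positive, negative, and zero extremes.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x1 = [1, 2, 3, 4, 5]
r_positive = pearson_r(x1, [2, 4, 6, 8, 10])  # X2 rises with X1
r_negative = pearson_r(x1, [10, 8, 6, 4, 2])  # X2 falls as X1 rises
r_zero = pearson_r(x1, [1, 2, 3, 2, 1])       # no linear association
print(r_positive, r_negative, r_zero)
```

The last case shows that r measures only linear association: X2 clearly depends on X1 there, yet the rise and fall cancel out.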

Analysis of Data Inferential Analysis
• Regression (independent and dependent relationships among variables)
1. Number of independent variables
• Single regression model: association of one independent variable with one dependent variable
 e.g., Y = β0 + β1X1 + e, where Y is the dependent variable, X1 is the independent variable, e is the error, β0 is the intercept, and β1 is the slope of X1.
• Multiple regression model: association of two or more independent variables with one dependent variable
 e.g., Y = β0 + β1X1 + β2X2 + e
2. Shape of regression line
• Linear regression model (OLS)
• Non-linear regression model (MLE or GLS)
[Figure: two plots of Y against X, one for each shape of regression line]

Analysis of Data
Correlation (r): just a relationship between variables (no direction; symmetric relationship)
 corr(weight, height), -1 ≤ r ≤ 1
Regression: the relationship of one or more independent variables with a dependent variable (directional; asymmetric relationship)
 Weight (Y) = β0 + β1 × height (X1) + error, 0 ≤ r² ≤ 1
 r² is the fraction of the sample variance of weight (Y) explained by (or predicted by) height (X1).
Causation: a correlation or regression is not the same as causation unless it satisfies 1) time order between the variables (cause before effect), 2) no other variables between them, and 3) the variables changing together.
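A single-regression fit and its r² can be sketched in plain Python using the closed-form OLS formulas. The heights are the ten values from the descriptive-analysis example; the weights are hypothetical, because the raw weight data are not reproduced in this text, so the resulting numbers will not match the deck's output.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept b0, slope b1, r_squared) for Y = b0 + b1*X + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - residual_ss / syy  # share of variance in Y explained by X
    return b0, b1, r_squared

heights_cm = [164, 166, 170, 172, 174, 174, 180, 182, 187, 190]
weights_lb = [130, 150, 140, 165, 150, 170, 160, 185, 175, 180]  # hypothetical

b0, b1, r_squared = fit_simple_ols(heights_cm, weights_lb)
print(round(b0, 2), round(b1, 2), round(r_squared, 4))
```

Swapping the two variables leaves r unchanged but gives a different slope and intercept, which is the asymmetry the slide describes.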

Analysis of Data
• Correlation (r) between weight and height:
• Regression (r²) between height and weight:
• r² tells us that 22.78% of the variance in weight is explained by height
[Regression output labels: D.V. (dependent variable), I.V. (independent variable), slope (β1), intercept (β0)]

Analysis of Data Hypothesis Testing
Null hypothesis (H0): the hypothesis that researchers try to disprove
Alternative (or research) hypothesis (Ha): the hypothesis that researchers expect to support their models

Analysis of Data Hypothesis Testing: Example
H0: Males are not taller than females
Ha: Males are taller than females
Pr(T > t) = 0.0076 < 0.05
We can reject H0 and accept the hypothesis that males are taller than females, because the difference in height between males and females is statistically significant at the 5% significance level.
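The logic of a one-sided two-group t-test can be sketched in plain Python. The split of the ten heights into males and females below is hypothetical (the deck's raw data are not reproduced in this text), so the t statistic will not match the output behind Pr(T > t) = 0.0076. The sketch assumes equal variances (pooled t-test) and compares t with a tabulated critical value instead of computing a p-value.

```python
import math
import statistics

# Hypothetical grouping of the ten heights, for illustration only.
males = [174, 180, 182, 187, 190]
females = [164, 166, 170, 172, 174]

def pooled_t(a, b):
    """Two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))  # s.e. of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, df

t_stat, df = pooled_t(males, females)

# One-sided 5% critical value for df = 8, taken from a t distribution table.
CRITICAL_T_05_DF8 = 1.860
reject_h0 = t_stat > CRITICAL_T_05_DF8  # True means H0 is rejected in favor of Ha
print(round(t_stat, 2), df, reject_h0)
```

Rejecting H0 when t exceeds the critical value is equivalent to the p-value rule on the slide, Pr(T > t) < 0.05.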

Interpretation of Data Outputs
• Interpreting data outputs based on your background knowledge and experience is the last step of statistics in social science.
 - Different researchers can interpret the same data in different ways
• Support your argument, or your theoretical model, based on your interpretation

Interpretation of Data Outputs
Statistical significance: whether an estimate is distinguishable from zero at a chosen confidence level (99%, 95%, or 90%), that is, at a significance level of α = 0.01 (t value = 2.58), 0.05 (t value = 1.96), or 0.10 (t value = 1.64); these are two-sided critical values for large samples
Substantial significance: whether relevant estimates have the expected sign (+ or -) and magnitude
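The three critical values can be recovered from the standard normal distribution, which the t distribution approaches in large samples. The sketch below assumes a two-sided test, so each tail receives α/2.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Two-sided test: put alpha / 2 in each tail and read off the upper cutoff.
critical_values = {
    alpha: round(standard_normal.inv_cdf(1 - alpha / 2), 2)
    for alpha in (0.01, 0.05, 0.10)
}
print(critical_values)
```

With small samples the t distribution has heavier tails, so the cutoffs are larger than these and depend on the degrees of freedom.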

Interpretation of Data Outputs
• Not statistically significant at the 5% significance level, but significant at the 10% level; its sign is positive
• If we assume this estimate is statistically significant, we can interpret it as follows: for a one-centimeter increase in height, weight increases by 0.98 pounds

Practice

Practice: Questions
• Q1. n (number of observations)
• Q2. Σx (sum of the Xs)
• Q3. x̄ (mean)
• Q4. Median; mode
• Q5. Five-number summary:
 - Min (lowest value)
 - Q1 (first quartile; 25th percentile)
 - M (median)
 - Q3 (third quartile; 75th percentile)
 - Max (highest value)
• Q6. s² (variance)
• Q7. s (standard deviation)

Practice: Answers

Practice: Extra Qs
