INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS

INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS Developed by Jennifer Lewis Priestley, Ph.D. Kennesaw State University

Statistics Workshop Topics MODULE ONE: Concept Review • Review of Statistical Concepts • Data Analysis using EXCEL MODULE TWO: Inferential testing using EXCEL and SAS • Confidence Intervals • Ttests • ANOVA • Chi-Square MODULE THREE: Predictive Modeling using EXCEL and SAS • Regression Analysis • Logistic Analysis • Discriminant Analysis c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts Prior to analysis, a determination must be made of the type of variables in question. Variable type will, in many cases, dictate the analysis options. Variable types for review: Qualitative categorical (e.g., gender, race) ordinal (e.g., rankings, Likert data*) Quantitative interval (e.g., temperature) ratio (e.g., weight, height) * Mathematicians and Statisticians consider Likert data to be qualitative and therefore restrict its use to qualitative techniques such as Chi-square analysis. However, in practice, most people treat Likert data as quantitative and utilize quantitative techniques. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts • Descriptions of data typically include: • Measurements of Central Tendency • Mean, Median, Mode (do you know when to use each?) • Measurements of Dispersion • Standard Deviation and Variance • Outlier Detection: • For near bell-shaped data • Use 3-Sigma (Empirical) Rule : any value that is more than 3 standard deviations above or below the mean • For Skewed Data • Use Tukey’s Rule: any value that is more than one step below Q1 or more than 1 step above Q3; A step = 1.5*IQR c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts The Central Limit Theorem forms the basis for why inferential statistics (versus descriptive statistics) is possible. Prior to reviewing the Theorem, pull up this site: http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts Important concepts to remember about the Central Limit Theorem: • The distribution of sample means will, as the number of samples increases approach a normal distribution; • The mean of all sample means approximates the population mean; • The std of all sample means is the std of the population/the SQRT of the sample size; • If the population is NOT normally distributed, sample sizes must be greater than 30 to assume normality; • If the population IS normally distributed, samples can be of any size to assume normality (although greater than 30 is always preferred). c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts • A few points to remember about the Normal Distribution: • Bell shaped and symmetric about the mean μ. • Mean = μ, Median = μ, Mode = μ. • The area under the normal curve below μ is .5. • Probability that a Normal Random Variable Outcome: • Lies within +/- 1 std dev of the mean is .6826 • Lies within +/- 2 std dev of the mean is .9544 • Lies within +/- 3 std dev of the mean is .9974 • For all other probabilities, convert the relevant observation to a Z-score: Z=(x- μ)/ σ • Any observation that has a Z-score greater than 2 is typically considered to be a statistical outlier…since its probability of occurrence is less than 5%. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts Hypothesis Testing: c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Review of Statistical Concepts • Statement of Hypotheses • H0: Null Hypothesis – opposite of Alternative Hypothesis • H1: Alternative Hypothesis – what we are trying to prove • Evaluate Type I & Type II Errors • Set Significance level,  • Standard of Proof, or Level of Risk • Represents Probability of Type I Error • Calculate the test statistic from the sample. • Calculate the p-value (strength of the evidence) • 1. If the p-value < , Accept H1. • 2. If the p-value > , Accept H0. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL • In this section, we will use EXCEL to execute some of the most common types of univariate/bivariate and multivariate data analysis: • Descriptive Statistics • Histograms • Scatterplots and Charts • Pivot Tables • Using Formulas (fx) • Look Up Tables • Lagniappe c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Speed = Amount of time it takes to deliver a product (in days) once the order has been received. PriceLv = Perceived level of price charged by suppliers PriceFlx = Perceived willingness of COMP1 managers to negotiate price Man_Imag = Overall image of manufacturer Service = Overall level of service necessary to maintain a satisfactory relationship with customer Sal_Imag = Overall perceived image of salesforce Quality = Perceived level of product quality MODULE 1: COMP1 Dataset The dataset used throughout this workshop is the COMP1 dataset, with 14 variables and 100 observations. • Size = Large (1) or Small (0) • Usage = Percentage of total product purchased from COMP1 • Satisf = How satisfied purchaser is with Comp1 • SpecBuy = Extent to which a purchaser evaluates each purchase separately (1=each purchase evaluated separately, 0 = lot buying) • Procure = Centralization of purchase decisions (1= Centralized, 0=decentralized) • Ind_Type = Classification of Industry affiliation (1=Industry A, 0=other) • Buy_Sit = Type of buying situation (NEW = New, MOD = Modified purchase, REP = Repeat Purchase c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL • Using the COMP1 dataset, determine the following descriptive statistics for each quantitative variable: • The most appropriate measurement of central tendency; • Two measurements of dispersion; • For one variable, identify if any outliers exist. • Determine these statistics using the f(x) options: • =AVERAGE, =MEDIAN, =MODE, =STDEV, =VAR c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL • Using the COMP1 dataset, determine the following descriptive statistics for each quantitative variable: • The most appropriate measurement of central tendency; • Two measurements of dispersion; • For one variable, identify if any outliers exist. • This time, use the “Descriptive Statistics” option: • TOOLDATA ANALYSISDESCRIPTIVE STATISTICS c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL • Using the COMP1 dataset, develop a histogram for the quant variable of your choice (use three categories). • EXAMPLE: Take the Usage variable and subtract the min value from the max value (65-25 = 40). Divide that number (the range) by the number of desired categories (40/3 = 13.33). Now, “massage” that figure as necessary to create reasonable, logical categories of approximately equal size: • 25-39 • 40-54 • 55-65 c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL • A few points about histograms in EXCEL: • Create a column over to the right that contains only the TOP of each category that you have assigned • Label this category “BIN RANGE” • TOOLSDATA ANALYSISHISTOGRAM • Ensure that the “Labels” box is ticked • Ensure that the “Cumulative Percentage” and “Chart Output” boxes are ticked c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL Using the COMP1 dataset, develop a scatterplot of two quantitative variables of your choice. EXAMPLE: Chart the two variables Usage and Price Level. NOTE: it is helpful to move the two variables into columns next to each other c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL Using the COMP1 dataset, develop a pivot table of the entire dataset. DATAPIVOT TABLES NOTE: ensure that the labels are included in the table c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL A few more widely used examples of functions in EXCEL: =IF =AND =ABS =SUMPRODUCT =RAND =RANDBETWEEN c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 1: Data Analysis Using EXCEL Look up tables in EXCEL can be a very helpful way of converting quantitative data into categorized qualitative data…which sometimes can be easier to work with. Use the VLOOKUP function to categorize the Speed variable into “EXCELLENT” “AVERAGE” and “POOR” events. Clarification of EXCEL notation : VLOOKUP(enter the speed value here,enter the range of the two column table here,enter the column number of the desired category label here) c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lagniappe: This word derives from New World Spanish la ñapa, “the gift,”. The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase. One Lagniappe presented here is the process of selecting “the best” option from among several alternatives. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Lagniappe for everyone! c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

Statistics Workshop Topics • MODULE ONE: Concept Review • Review of Statistical Concepts • Data Analysis using EXCEL • MODULE TWO: Inferential testing using EXCEL and SAS • Confidence Intervals • Ttests • ANOVA • Chi-Square • MODULE THREE: Predictive Modeling using EXCEL and SAS • Regression Analysis • Logistic Analysis • Discriminant Analysis c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals We always prefer to use descriptive statistics. However, often we are forced to take a sample and use inferential statistics because of issues related to cost, time, money or access. When taking a sample, we can estimate a population parameter such as a mean or a proportion using the sample statistic (which is not very accurate) or we can calculate a confidence interval. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals Confidence Intervals are calculated using two formulas: CI for the population mean: x+ Z*(s)/SQRT(n) where, x = the sample mean Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575) s = sample standard deviation n = sample size Note: the part of the expression after the + is called the Margin of Error  Z*(s)/SQRT(n). c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals The CI for the population proportion is represented by: p+ Z*SQRT((pq)/n) where, p = the sample proportion Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575) q=1-p n = sample size Note: the part of the expression after the + is called the Margin of Error  Z*(s)/SQRT(n). c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals Fun Manual Calculation! using the Gallup Website: http://poll.gallup.com/ Replicate the Gallup prediction using the second CI formula. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals More fun calculations! Using the Comp1 dataset, calculate the 95% CI for a quantitative variable using formula 1. =CONFIDENCE Alpha is the accepted prob. of making a T1 error. It is also 1-the confidence level. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals The result generated from =CONFIDENCE(.05, 1.32, 100) is 0.258857. What does this number mean? This is the Margin of Error. In other words, if you were to report the 95% confidence interval for this company’s speed, you would report: 3.52 + .25 Or 3.25 to 3.77 c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals Lets execute CIs in SAS. First, calculate the 95% CI for the quantitative variable Speed. Use the following code: ProcImport datafile = "c:\COMP1.xls" OUT = COMP1 DBMS = "EXCEL97" Replace; run; ProcPrintdata=Comp1; Run; ProcMeansdata=Comp1 CLMalpha=.05; Var Speed; Run; Should you require a higher or lower alpha (.01 is more conservative and .10 is more risk tolerant), change the .05 as appropriate. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals The output is pretty simple: Notice that this is the same interval from the EXCEL output. Isnt it nice when numbers match? c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals Now, calculate the 95% CI for the qualitative variable Procure. Use the following code: ProcFreqdata=Comp1; Tables Procure/Binomialalpha=.05; Run; Note that EXCEL does not readily support this calculation. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing – Confidence Intervals The output is pretty simple: c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing • The Chi-Square Test is used to determine if two QUALITATIVE variables are related. • The Chi-Square statistic is computed using the following formula: • X2 = Σ(fo-fe)2/fe • Where: • fo is the frequency of the observed value • fe is the frequency of the expected value • This calculated test statistic is converted into a p-value to evaluate the presence of a relationship (or not). c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing • Using the COMP1 dataset, determine if a company’s buying situation (BuySit) is related to the centralization of their purchasing decisions (Procure). • The hypothesis statements for this test are: • Ho: BuySit and Procure are NOT related • H1: BuySit and Procure ARE related • Develop the appropriate testing matrix and identify the Type1 and Type2 errors. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Now, using EXCEL, develop the 2x3 matrix of these two variables using a pivot table (place “count of procure” in the center). You should see this: This table includes the “frequencies of the observed” values from our Chi-Square formula. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Now, we need to determine the expected values. If there is NO relationship, then we would expect to see exactly 32% of the decentralized procurement companies with a modified buying situation, 34% with a new buying situation, etc. The matrix of expected values looks like this: =(32/100)*50 c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Now use the =CHITEST formula in EXCEL. The actual range requested is the INTERIOR of the observed matrix and the expected range is the INTERIOR of the expected matrix (note that the marginal values are the same for the two matrices). The resulting value is: 1.61E-09 or .00000000161. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing This is a p-value. Given the rule : if p<a reject Ho and p>a accept Ho… Can we conclude that there is a relationship between the centralization of the purchasing decision and the buying situation? c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Lets execute the same test in SAS, using the following code: ProcFreqdata=Comp1; Tables Procure*BuySit/CHISQ; Run; c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing This code creates the following output: This legend provides the key to interpreting these numbers c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Here is the second part of the output: This is the same p-value that we obtained in EXCEL c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing Any additional questions on Chi-Square? c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing • Ttests represent the most common form of statistical testing. It involves either one sample or two independent samples. • One Sample Ttest - compares the mean of the sample to a given number. • e.g. Is average monthly revenue per customer who switches >$50 ? • Formal Hypothesis Statement examples: • H0:   $50 • H1:  > $50 H0:  = $50 H1:   $50 c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing • Two Sample Ttest - compares the mean of the first sample minus the mean of the second sample to a given number. • e.g. Is there a difference in the production output of the two facilities? • Formal Hypothesis Statement examples: • H0: a  b • H1: a > b H0: a = b H1: a  b c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing • Paired Sample Ttest - compares the mean of the differences in the observations to a given number. • e.g. Is there a difference in the production output of a facility after the implementation of new procedures? • Formal Hypothesis Statement example: • H0: post - pre <=0 • H1: post - pre > 0 c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using EXCEL • Using the COMP1 dataset and EXCEL, determine if there is a difference in the overall satisfaction of large versus small companies (the large companies will represent a sample and the small companies will represent a second sample). • Determine the null and alternative hypothesis statements for this question…then develop the 2x2 hypothesis matrix…including the Type1 and Type2 errors. • Now, sort the data by size. The satisfaction values associated with the 0s (the small firms) will be our first array of numbers and the satisfaction values associated with the 1s (the large firms) will be our second array of numbers. c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using EXCEL Satisfaction of small firms Satisfaction of large firms Two tailed test homoscedastic Your computed value should be 1.80363E-06…what do you conclude? c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

MODULE 2: Inferential Testing using SAS • Using the COMP1 dataset and SAS, determine if there is a difference in the overall satisfaction of large versus small companies. • Here is the necessary code: • ProcTtestdata=Comp1; • Var Satisf; • Class Size; • Run; the quantitative variable of interest the qualitative variable which identifies the two samples c 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics

INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS