Basic Concepts in Statistics: The Background Preparation for Data Analysis
by Teoh Sian Hoon
Statistical Analysis
• Describing Data: computing descriptive statistics; visualizing data (looking at the distribution)
• Inferential Statistics: testing hypotheses
Visualizing data lets us:
• Identify values that appear to be unusual.
• Check the original records to make sure that these values are not the results of errors in coding.
• Detect outliers and non-normality (many statistical procedures require that the distribution be more or less symmetric).
• Determine whether the statistical techniques that we are considering for data analysis are appropriate.
1. Basic Concept - statistical terms
• Parameter: a value describing the population, e.g. the population mean (μ)
• Statistic: a value computed from a sample, e.g. the sample mean (x̄)
1. Basic Concept (continued) - mean
Income: Jam RM1950, Jack RM2000, Man RM2200, Su RM2500, San RM2800, Tan RM3000, Jul RM3100, Mad RM3700, Ina RM3900, Shan RM4000
Mean = RM2,915
Median = RM2,900
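The two summary values on this slide can be reproduced with a short sketch. It uses only Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Monthly incomes (RM) of the ten people listed on the slide
incomes = [1950, 2000, 2200, 2500, 2800, 3000, 3100, 3700, 3900, 4000]

income_mean = mean(incomes)      # sum of the values divided by the count
income_median = median(incomes)  # average of the two middle values (n is even)

print(f"Mean = RM{income_mean:,.0f}, Median = RM{income_median:,.0f}")
```

With an even number of cases the median is the midpoint of the 5th and 6th sorted values (RM2800 and RM3000).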
1. Basic Concept (continued) - descriptive statistics
Income: Jam RM1950, Jack RM2000, Man RM2200, Su RM2500, San RM2800, Tan RM3000, Jul RM3100, Mad RM3700, Ina RM3900, Bill Gates ??????
Mean = ??????
1. Basic Concept (continued) - descriptive statistics
Income: Jam RM1950, Jack RM2000, Man RM2200, Su RM2500, San RM2800, Tan RM3000, Jul RM3100, Mad RM3700, Ina RM3900
How widely are the values in the dataset spread apart?
Mean = ????? Median = ???? if Bill Gates (RM500,000) is included
1. Basic Concept (continued) - descriptive statistics
Income: Jam RM1950, Jack RM2000, Man RM2200, Su RM2500, San RM2800, Tan RM3000, Jul RM3100, Mad RM3700, Ina RM3900
If Bill Gates (RM500,000) is included: Mean = RM52,515, Median = RM2,900
But those other nine people didn't become millionaires just because Bill Gates was included.
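The effect of one extreme value on the mean and the median can be checked directly. The sketch below assumes, as the slide does, that Bill Gates (RM500,000) takes the tenth place in the list; the variable names are illustrative.

```python
from statistics import mean, median

# Nine incomes from the slide (RM)
incomes = [1950, 2000, 2200, 2500, 2800, 3000, 3100, 3700, 3900]

# Add one extreme value: Bill Gates at RM500,000
with_gates = incomes + [500_000]

gates_mean = mean(with_gates)      # pulled far upward by the single extreme value
gates_median = median(with_gates)  # still the midpoint of RM2800 and RM3000

print(f"Mean = RM{gates_mean:,.0f}, Median = RM{gates_median:,.0f}")
```

The mean moves to RM52,515 while the median stays at RM2,900, which is why the median is the more robust description of a typical income here.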
1. Basic Concept (continued) - descriptive statistics
Standard deviation: a measure of dispersion around the mean.
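As a rough illustration of "dispersion around the mean", the sketch below computes the sample standard deviation of the ten incomes used earlier, once by hand and once with the standard library. Using the sample (n - 1) form is an assumption here, chosen because it is the form statistical packages usually report.

```python
from math import sqrt
from statistics import mean, stdev

# Monthly incomes (RM) from the earlier slide
incomes = [1950, 2000, 2200, 2500, 2800, 3000, 3100, 3700, 3900, 4000]

m = mean(incomes)

# Squared distance of every value from the mean
squared_deviations = [(x - m) ** 2 for x in incomes]

# Sample standard deviation: divide by n - 1, then take the square root
sd_manual = sqrt(sum(squared_deviations) / (len(incomes) - 1))

# The library function implements the same n - 1 formula
sd_library = stdev(incomes)

print(f"manual = {sd_manual:.2f}, statistics.stdev = {sd_library:.2f}")
```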
1. Basic Concept (continued) - Normal Distribution
In a normal distribution, 68.3% of cases fall within one SD of the mean and 95.4% of cases fall within 2 SD. For example, if the mean age is 45, with a standard deviation of 10, 95.4% of the cases would be between 25 and 65 in a normal distribution.
[Figure: normal curve for mean age 45 and standard deviation 10, marking the 68.3% and 95.4% regions]
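The 68.3% and 95.4% figures follow from the normal distribution itself: the share of cases within k standard deviations of the mean is erf(k/√2). A minimal sketch using the age example from the slide (the function name `coverage` is illustrative):

```python
from math import erf, sqrt

def coverage(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return erf(k / sqrt(2))

mean_age, sd_age = 45, 10  # values from the slide

for k in (1, 2):
    low, high = mean_age - k * sd_age, mean_age + k * sd_age
    print(f"within {k} SD ({low} to {high}): {coverage(k) * 100:.1f}%")
```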
1. Basic Concept (continued) - skewness
A normal distribution is symmetric, with a skewness value of zero.
Mean = Median = Mode
[Figure: symmetric frequency curve (f against the variable)]
1. Basic Concept (continued) - skewness
• Skewed to the left or negatively skewed:
• A distribution with a significant negative skewness has a long left tail.
• The value of the mean is the smallest and the mode is the largest, with the value of the median lying between these two values.
• Skewed to the right or positively skewed:
• A distribution with a significant positive skewness has a long right tail.
• The value of the mean is the largest, the mode is the smallest and the median lies between these two values.
[Figure: two frequency curves (f against the variable); left-skewed with Mean, Median, Mode in increasing order; right-skewed with Mode, Median, Mean in increasing order]
1. Basic Concept (continued) - skewness
How skewed can a distribution be before it is considered a problem?
As a rough guide, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
Example: skewness = -0.067, standard error = 0.687. Since |-0.067| < 2(0.687), there is no indication of a departure from symmetry.
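The rule of thumb can be written as a small check. The skewness (-0.067) and standard error (0.687) are the values shown on the slide. The standard-error formula below is an assumption: it is the formula commonly used by statistical packages, and for ten cases it gives about 0.687, which agrees with the slide.

```python
from math import sqrt

def skewness_standard_error(n):
    """Standard error of sample skewness for n cases (assumed package formula)."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def departs_from_symmetry(skewness, std_error):
    """Rough guide: |skewness| more than twice its standard error."""
    return abs(skewness) > 2 * std_error

se = skewness_standard_error(10)
print(f"standard error for n = 10: {se:.3f}")

# Values reported on the slide
print(departs_from_symmetry(-0.067, 0.687))
```

The guide is only a rough screen; with very small samples the standard error is large and the check has little power to detect asymmetry.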
1. Basic Concept (continued) - kurtosis
• Kurtosis = 0: mesokurtic
• Kurtosis > 0: leptokurtic
• Kurtosis < 0: platykurtic
In general, kurtosis is not very important for an understanding of statistics.
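1. Basic Concept (continued) - Central Limit Theorem
If X1, ..., Xn is a random sample from a population with mean μ and standard deviation σ, then for n large the sample mean X̄ is approximately normally distributed with mean μ and standard deviation σ/√n.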
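The Central Limit Theorem can be illustrated with a small simulation: means of repeated samples from a clearly non-normal population cluster around the population mean with spread close to σ/√n. The sketch below is illustrative only; the exponential population, the sample size of 50 and the 2000 replications are all assumptions, not part of the slides.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 50               # size of each sample
replications = 2000  # number of samples drawn

# Population: exponential with rate 1, which is right-skewed
# and has mean 1 and standard deviation 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replications)
]

print(f"mean of sample means: {mean(sample_means):.3f} (population mean 1)")
print(f"SD of sample means:   {stdev(sample_means):.3f} "
      f"(sigma / sqrt(n) = {1 / sqrt(n):.3f})")
```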
1. Basic Concept (continued) - SPSS
Data:
Engine: 307 350 318 304 302 429 454 440 455 390
weight: 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850
1. Basic Concept (continued) - SPSS
Steps:
• Analyze
• Descriptive Statistics
• Explore
• In Dependent List select ‘weight’
• In Statistics select Descriptives and Outliers
• In Plots select Histogram and Normality plots with tests
1. Basic Concept (continued) - SPSS
The 5% trimmed mean excludes the 5% largest and 5% smallest values. The trimmed mean provides an alternative to the median when there are some data values that are far removed from the rest.
Skewness check from the output: |-0.067| < 2(0.687)
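The idea behind a trimmed mean can be sketched as follows. This is a simplified version that drops a whole number of cases from each end; statistical packages may handle fractional cases differently, so it is not a reproduction of the SPSS output. Because 5% of ten cases is less than one case, the example trims 10% from each end instead, using the income data with Bill Gates from the earlier slides.

```python
from math import floor
from statistics import mean

def trimmed_mean(values, proportion):
    """Mean after dropping floor(n * proportion) cases from each end."""
    ordered = sorted(values)
    k = floor(len(ordered) * proportion)  # whole cases to drop per end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)

# Incomes (RM) including the extreme value from the earlier slides
with_gates = [1950, 2000, 2200, 2500, 2800, 3000, 3100, 3700, 3900, 500_000]

print(f"ordinary mean:    RM{mean(with_gates):,.0f}")
print(f"10% trimmed mean: RM{trimmed_mean(with_gates, 0.10):,.0f}")
```

Dropping the smallest and the largest income removes the extreme value, so the trimmed mean lands close to the median.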
1. Basic Concept (continued) - SPSS
The significance levels of the normality tests are reasonably large, indicating that normality is not an unreasonable assumption.
1. Basic Concept (continued) - Example of study
Refer to Appendix A: GRAIN-SIZE ANALYSIS (Geology)
http://darkwing.uoregon.edu/~dogsci/dorsey/geo334/Lab5.pdf#search='using%20skewness'
1. Basic Concept (continued) - GRAIN-SIZE ANALYSIS (Geology)
Grain-size distribution of sediments is important for characterizing substrate behavior in engineering and hazards applications, and is commonly analyzed as part of soil and sedimentation surveys in Quaternary and older sediments. It is especially important in studies of dam effects in regulated rivers, because the size distribution of the sediment determines to a large extent whether it is transported and where it will be stored under the regulated flow regime.
Attempts to use plots of statistical parameters to identify sediments of different depositional environments sparked great interest in the early 1960s. Parameters that are commonly used include the first four statistical moments: mean, standard deviation, skewness, and kurtosis, as described by Boggs (p. 64-71). Friedman (1961) attempted to distinguish between beach and river sands using skewness and standard deviation.
2. Graphs
Data:
Engine: 307 350 318 304 302 429 454 440 455 390
weight: 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850
2. Graphs - Histogram
Steps:
• Graphs
• Histogram
• Enter “vehicle weight”
• Mark “Display normal curve”
2. Graphs - Histogram
• Check whether the distribution is symmetric.
• Look for separate clumps of data values.
2. Graphs - Boxplots
• From the menus, choose: Graphs > Boxplot
• In the Boxplot initial dialog box, select the icon for Simple.
• Select an option under Data in Chart Are.
• Select Define.
• Select variables and options for the chart.
2. Graphs - Boxplots
Parts of the boxplot, from top to bottom:
• Extreme outlier (marked *)
• Mild outlier
• Largest observed value that is not an outlier
• 3rd Quartile (Q3)
• Median (Q2)
• 1st Quartile (Q1)
• Smallest observed value that is not an outlier
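The boxplot categories above can be sketched in code. The cut-offs used here are an assumption based on the common convention: a mild outlier lies more than 1.5 box lengths (interquartile ranges) beyond the box, and an extreme outlier more than 3 box lengths. The quartiles come from Python's default `statistics.quantiles` method, which may differ slightly from the quartiles a statistical package draws. The data are the incomes with Bill Gates from the earlier slides.

```python
from statistics import quantiles

def classify_outliers(values):
    """Label values lying beyond 1.5 or 3 box lengths from the box."""
    q1, _, q3 = quantiles(values, n=4)  # 1st quartile, median, 3rd quartile
    box_length = q3 - q1                # interquartile range
    labels = {}
    for x in values:
        # Distance outside the box (zero or negative when inside it)
        distance = max(q1 - x, x - q3)
        if distance > 3 * box_length:
            labels[x] = "extreme outlier"
        elif distance > 1.5 * box_length:
            labels[x] = "mild outlier"
    return labels

incomes = [1950, 2000, 2200, 2500, 2800, 3000, 3100, 3700, 3900, 500_000]
print(classify_outliers(incomes))
```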
2. Graphs - P-P Plot
Steps:
• Graphs
• P-P Plots
• Enter “vehicle weight”
• Test distribution “Normal”
3. Inferential Statistics
• Testing Hypotheses
Research Questions
Objectives:
• To describe a population
• To determine significant difference(s) by comparing sample(s)
• To analyze the significance of the relationship between 2 variables
Types of Analysis:
• Descriptive
• Inferential
Variables:
• Independent Variables
• Dependent Variables
Objectives
• To describe a population
• To determine significant difference(s) by comparing sample(s)
• t-test: in one group; between 2 groups
• ANOVA: > 2 groups
• To analyze the significance of the relationship between 2 variables
Z test and Chi Square
Describing a Population
Level of measurement for the dependent variable:
• Interval/ratio: mean, variance
• Ordinal (rank): median
• Nominal (count): proportion, tested with the Z-test or the Chi-square test
The program does not have an option for a one-proportion z-test. However, the Chi-Square goodness of fit test can be used to produce an equivalent result.
Example: Z test / Chi Square
To test whether the proportion of female engineers who attended the event is different from the proportion of male engineers.
Hypothesis:
H0: p = .5
H1: p ≠ .5
Steps:
• Data > Weight Cases > Weight cases by: freq
• Analyze > Nonparametric Tests > Chi-Square
• Test Variable List: gender
• Expected Values: All categories equal
Conclusion
Since p-value = 0.527 > 0.05, do not reject H0. There is not enough evidence that the proportion of female engineers who attended the event differs from the proportion of male engineers.
Statistical Terms
In many areas of research, the p-value of .05 is customarily treated as a "border-line acceptable" error level.
Example: Chi Square
• The marketing manager for an automobile manufacturer is interested in determining the proportion of new compact-car owners who would have purchased a passenger-side inflatable air bag if it had been available for an additional cost of RM300. The manager believes from previous information that the proportion is .30. Suppose that a sample of 200 new compact-car owners is surveyed and 79 indicate that they would have purchased the air bags.
Example: Chi Square (continued)
• Since this is a hypothesis test for the proportion, it will be a Z-test.
• At the .10 level of significance, is there enough evidence that the population proportion is different from .30?
Hypothesis:
H0: p = .30
H1: p ≠ .30
Conclusion
Since p-value = 0.003 < 0.10, we reject H0. Therefore, the population proportion is different from 0.30.
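The air-bag example can be worked through by hand to see why the one-proportion Z-test and the Chi-Square goodness of fit test give an equivalent result. The sketch below uses only the figures from the example (n = 200, 79 "yes" answers, hypothesized proportion .30) and the standard library; the two-sided p-values are obtained from the normal distribution via `erfc`, and the variable names are illustrative.

```python
from math import erfc, sqrt

n, successes, p0 = 200, 79, 0.30  # figures from the example

# One-proportion Z-test
p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value_z = erfc(abs(z) / sqrt(2))  # two-sided p-value

# Chi-Square goodness of fit with two categories (would buy, would not buy)
observed = [successes, n - successes]
expected = [n * p0, n * (1 - p0)]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value_chi = erfc(sqrt(chi_square / 2))  # upper tail of chi-square, 1 df

print(f"z = {z:.3f}, p-value = {p_value_z:.3f}")
print(f"chi-square = {chi_square:.3f}, p-value = {p_value_chi:.3f}")
```

With two categories the chi-square statistic equals z squared, so both tests lead to the same p-value and the same decision at the .10 level.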
Example: Chi-Square
The level of education attained by women from a rural region is divided into three categories: can read/write; primary; secondary and above. A demographer estimates that 28% of them can read/write, 61% have a primary education and 11% have a secondary education or above. In order to verify these percentages, a random sample of n = 100 women from the region was selected and their level of education recorded. The number of women whose level of education falls into each of the three categories is shown in the following table.
Example: Chi-Square (continued) by Teoh Sian Hoon