Join the Graduate Workshop in Statistics to learn about descriptive statistics, graphical and numerical summaries of data, and a brief demo of SPSS and R. Develop your skills in statistical analysis and data interpretation. Instructor: Kam Hamidieh.
Welcome to the Graduate Workshop in Statistics Instructor: Kam Hamidieh Monday July 11, 2005
Today’s Agenda • Workshop Introductions & Website Tour • Plan Ahead • Today: All About Descriptive Statistics • Brief Demo of SPSS & R (if time permits)
Before we start… When you see Bert on a slide, I will either go over the slide quickly or skip it entirely. However, you will need to read it on your own, since the subsequent sessions will depend on it. When you see Sir Isaac Newton, it means that the slide is of a technical/mathematical nature. Read it if you wish. I will not use the material in the subsequent sessions.
Workshop Plan • July 11: Descriptive Statistics - making graphical and numerical summaries of data • July 18: language of research studies in statistics & a crash course in Probability and Random Variables • July 25: Hypothesis Testing, lots of t-tests, & confidence intervals • August 1: Categorical data & chi-squared tests • August 8: Linear Regression • August 15: ANOVA and catch up.
What is Statistics? • Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analyses, which often leads to the drawing of conclusions. • Want to know more about what statistics is and how its meaning has evolved? See Vic Barnett’s Comparative Statistical Inference
Biostatistics • Statistics applied to biological (life) problems, including: • Public health • Medicine • Ecological and environmental problems • Much more statistics than biology; however, biostatisticians must also learn the biology.
Some Additional Terms • Bioinformatics - computerized and statistical analysis of biological data, particularly in studying the nucleotide sequences of DNA. • Microarray Data – Data with lots of variables and a few observations (more variables than cases). Mostly biological data.
Some Other Applications • Finance: statistical models used for analysis of stocks, bonds, and currencies to control risk or make money • Economics: statistical models used to forecast economic trends • Clinical trials: testing effectiveness of drugs • Information technology: network traffic analysis, pattern recognition, separation of noise from data • Business: fraud detection • Government: analysis of current economic situation, forecasting, opinion polling
Further Reading for Pleasure! • For a detailed history of statistics see Stephen M. Stigler’s History of Statistical Concepts and Methods. • For new approaches to statistics see Breiman’s Statistical Modeling: The Two Cultures.
The Big Picture in Statistics • Use a small group of units (the sample) to draw conclusions (inference) about a larger group (the population, whose characteristics are unknown)
Populations and Parameters • Population – a group of individuals (or things) that we would like to know something about • Parameter - a characteristic of the population in which we have a particular interest • Often denoted with Greek letters (e.g., µ) • Examples: • The proportion of the population that would respond to a certain drug • The population average height of males in Michigan
Samples and Statistics • Sample – a subset of a population (hopefully representative and random) • Statistic – a characteristic of the sample (any function of the sample data) • Examples: • The observed proportion of the sample that responds to treatment • The observed average height in a sample of males in Michigan
Example • A sample of 1000 women between the ages of 30 and 39 is randomly chosen across the US for a marketing study. The results are: 825 women prefer product A over product B (and 175 prefer B over A) • Population? All women between the ages of 30 and 39 living in the US • Sample? The 1000 women, ages 30-39, sampled in the survey • Parameter? The population proportion of women 30-39 in the US preferring A over B (this is unknown) • Statistic? The sample proportion of women 30-39 in the random sample of n = 1000 who preferred A over B. Here the value of this statistic is 825/1000 or 82.5%
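The arithmetic behind the statistic in this example can be written out as a short Python sketch. The counts come from the slide; the variable names are only illustrative.

```python
# Counts reported in the example: 825 of the 1000 sampled women preferred A.
n = 1000
prefer_a = 825

# The sample proportion is the statistic; it estimates the unknown parameter.
p_hat = prefer_a / n

print(f"Sample proportion preferring A: {p_hat:.1%}")
print(f"Sample proportion preferring B: {1 - p_hat:.1%}")
```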
Populations and Samples • Studying populations is too expensive and time-consuming, and thus impractical • If a sample is representative of the population, then by observing the sample we can learn something about the population • And thus by looking at the characteristics of the sample (statistics), we may learn something about the characteristics of the population (parameters).
Issues • Samples are random • If we had chosen a different sample, then we would obtain different values for the statistics (although we are trying to estimate the same, unchanged population parameters). • Samples should represent the population
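A small simulation can make the point about sample-to-sample variation concrete. This is only a sketch: the population proportion of 0.80, the seed, and the helper name `sample_proportion` are assumptions made for illustration, not values from the workshop.

```python
import random


def sample_proportion(true_p, n, rng):
    """Draw n yes/no responses with P(yes) = true_p; return the sample proportion."""
    yes = sum(1 for _ in range(n) if rng.random() < true_p)
    return yes / n


rng = random.Random(2005)  # fixed seed so the run is repeatable
TRUE_P = 0.80              # hypothetical parameter; unknown in a real study

# Five different random samples of size 1000 from the same population.
estimates = [sample_proportion(TRUE_P, 1000, rng) for _ in range(5)]
for i, est in enumerate(estimates, start=1):
    print(f"sample {i}: proportion = {est:.3f}")
```

Each run of the loop mimics choosing a different sample: the parameter stays fixed while the statistic moves around it.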
Explanatory and Response Variables • Many questions in statistics are about the relationship between two or more variables. • It is useful to identify one variable as the explanatory and the other variable as the response variable. • In general, the value of the explanatory variable for an individual is thought to partially explain or account for the value of the response variable.
Explanatory and Response Variable • Other names: • Explanatory: independent, factor, treatment, input, x • Response: dependent, y, output
Statistical Analyses • Descriptive Statistics • Describe the sample – use numerical and graphical summaries to characterize a data set • Inference • Make inferences about the population • Primarily performed in two ways: • Hypothesis testing • Estimation • Point estimation • Interval estimation
Descriptive Statistics - Data • Pieces of information • Types of Data • Categorical Data: • Nominal – unordered categories • Ordinal – ordered categories • Quantitative Data • Discrete – only whole numbers are possible; order and magnitude matter • Continuous – any value is conceivable
Summary of Data Types • Categorical: Nominal, Ordinal • Quantitative: Discrete, Continuous
Examples of Data Types • Age (years): quantitative, continuous • Car Manufacturer (GM, Ford, etc.): categorical, nominal • Starting Salary in Dollars: quantitative, continuous • Starting Salary (Low, Med., High): categorical, ordinal • Calcium Level (micrograms per liter): quantitative, continuous • Current Smoker (yes or no): categorical, nominal • Number on the roll of a die: quantitative, discrete
Data • The vast majority of errors in research arise from poor planning (e.g., of data collection) • Fancy statistical methods cannot rescue garbage data • Collect exact values whenever possible
On Descriptive Statistics • It is ALWAYS a good idea to summarize your data • You become familiar with the data and the characteristics of the people/things that you are studying • You can also identify problems or errors with the data • This is the first step in any statistical analysis
Dataset Structure • Think of data as a rectangular matrix of rows and columns. • Rows represent the “experimental unit” (e.g., person) • Columns represent variables measured on the experimental unit
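The rows-and-columns picture can be sketched in Python as a list of records. The variable names and all values below are invented for illustration and are not taken from any real data set.

```python
# Each row is one experimental unit (here, a respondent);
# each key is a variable measured on that unit.
rows = [
    {"sex": "Female", "age": 34, "tvhours": 2},
    {"sex": "Male", "age": 51, "tvhours": 4},
    {"sex": "Female", "age": 27, "tvhours": None},  # None marks a missing value
]

n_rows = len(rows)                   # number of experimental units
columns = list(rows[0].keys())       # the variables
ages = [row["age"] for row in rows]  # one column: a variable across all units

print(n_rows, columns, ages)
```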
Example Data Set • Data are for 11 variables and n = 1,606 respondents in the 1993 General Social Survey, a national survey done by the National Opinion Research Center at the University of Chicago. Some questions are only asked of about two-thirds of the survey participants, so there is quite a bit of missing data. (Source: SDA archive at UC Berkeley website, http://csa.berkeley.edu:7502) • I will be using a smaller version of it with only n=500.
Example Data Set
Column  Name      Description
C1      sex       Sex of respondent
C2      race      Race of respondent (White, African American, Other)
C3      degree    Highest educational degree received (five categories)
C4      relig     Religious preference (Catholic, Protestant, Jewish, Other)
C5      polparty  Does respondent think of self as Democrat, Republican, Indep. or Other?
C6      cappun    Does the respondent favor or oppose the death penalty?
C7      tvhours   Hours of watching television on a typical day
C8      marijuan  Whether the respondent thinks marijuana should be legalized or not
C9      owngun    Whether respondent owns a gun or not (Yes or No)
C10     gunlaw    Does respondent favor or oppose a law requiring a permit to buy a gun?
C11     age       Age of the respondent
Summarizing Categorical Data • Numerical Summaries • Frequency/Count tables • Visual Summaries • Pie Charts – good for summarizing a single categorical variable • Bar Charts – good for summarizing one or two categorical variables and useful for making comparisons when there are two categorical variables
Numerical Summary of Categorical Data • Count how many fall into each category • Calculate the percent in each category • If two variables, have the categories of the explanatory variable define the rows and compute the row percentages
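The three steps above (counts, percentages, and row percentages) might be sketched in Python as follows. The eight records are hypothetical (explanatory, response) pairs made up for illustration; they are not the survey data.

```python
from collections import Counter

# Hypothetical (explanatory, response) pairs, e.g. (sex, owns a gun).
records = [
    ("Female", "Yes"), ("Female", "No"), ("Female", "No"), ("Female", "No"),
    ("Male", "Yes"), ("Male", "Yes"), ("Male", "No"), ("Male", "Yes"),
]

# One variable: count and percent in each category.
sex_counts = Counter(sex for sex, _ in records)
sex_percent = {k: 100 * v / len(records) for k, v in sex_counts.items()}

# Two variables: the explanatory variable defines the rows,
# and the percentages add up to 100 within each row.
pair_counts = Counter(records)
row_percent = {
    (sex, answer): 100 * count / sex_counts[sex]
    for (sex, answer), count in pair_counts.items()
}

print(dict(sex_counts))
print(sex_percent)
for key in sorted(row_percent):
    print(key, f"{row_percent[key]:.1f}%")
```

Dividing by the row total rather than the grand total is what makes the rows directly comparable when the groups have different sizes.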
Example Numerical Summary of the Sex Variable
Example Bar Chart of the Sex Variable (axis: Percent)
Example Numerical Summary of political party affiliation Question: What percentage of people in the US identify themselves as Democrat/Independent/Republican/Other? At least we have some descriptive information from the table above: most people seem to identify themselves as Independents, while the percentages of Democrats and Republicans seem to be very close.
Example Visual Summary of the political party affiliation (axis: Percent)
Example Numerical Summary of the Sex Variable vs. Political Party Affiliation Question: Is there a difference in party affiliation (in %) between the men and the women? Again some descriptive information is available. There does not seem to be a big difference.
Example Numerical Summary of the political party affiliation vs. own a gun
Example Look at these! Question: Is there a relationship between gun ownership and party affiliation? Descriptively, there seems to be a relationship. Most Republicans seem to be gun owners.
Questions to Ask – 1 Categorical Variable Question: How many and what percentage of individuals fall into each category? Example: What percentage of college students favor legalization of marijuana? Question: Are individuals equally divided across categories or do the percentages across categories follow some other interesting pattern? Example: When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?
Questions to Ask – Categorical Variables Question: Is there a relationship between the two categorical variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable? Example: Is there a relationship between gun ownership and party affiliation? Another Example: The relationship between smoking and lung cancer was detected, in part, because someone noticed that the combination of being a smoker and having cancer is unusual.
Descriptive Statistics – Quantitative Data • We will use a new data set from http://www.infoplease.com/ipa/A0194030.html on the age of presidents at inauguration
Interesting Features of Quantitative Variables • Quick glance at the data values (Bloody Eyeball Test!) • Location: where most values lie, or the value that represents the data best, e.g. mean or median • Spread: variability in data • Shape: a bit later… • Five number summary: find the extremes (high, low), the median, and the quartiles (medians of the lower and upper halves of the values).
Location of a Data Set: Mean, Median, and Mode • Mean: the numerical average; sum the data and then divide by the number of data points • Formula: x̄ = (x1 + x2 + … + xn)/n • Median: the middle value (if n odd) or the average of the middle two values (n even) once the data have been ordered. 50% of data are above the median and 50% are below the median. • Mode: the measurement that occurs most often.
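The three measures of location can be computed with Python's standard `statistics` module. The five values below are made up so the results are easy to check by hand.

```python
import statistics

# Hypothetical data, chosen only so the answers are easy to verify.
data = [2, 3, 3, 5, 7]

mean_value = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7) / 5
median_value = statistics.median(data)  # middle of the ordered values
mode_value = statistics.mode(data)      # the value that occurs most often

print(mean_value, median_value, mode_value)
```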
Some Words about Notation • Notation for Data: n = number of individuals in a data set; x1, x2, x3, …, xn represent the individual raw data values • Example: A data set consists of the presidents’ ages at inauguration; the values are 57, 61, …, 46, and 54. Then n = 43, x1 = 57, x2 = 61, …, x42 = 46, and x43 = 54
Example of Mean • What is the average age of the US Presidents at inauguration? Mean age = (57 + 61 + … + 46 + 54)/43 = 55
Example of Median • What is the median age of the US Presidents at inauguration? Note: the data have been sorted. Since n = 43 is odd, take the (43 + 1)/2 = 22nd value, which is 55
A Bit More About Median • Median Calculations • If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list. • If n is even: M = average of middle two ordered values. Average the values that are (n/2) and (n/2) + 1 down from top of ordered list. • Say you have the following list of numbers: 18, 29, 33, 45, 88, 100. The median here is the average of 33 and 45, so (33 + 45)/2 = 39.
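The odd/even rule above translates directly into code. The function below is a minimal sketch of that rule (the name `median_by_rule` is only illustrative), applied to the list from the slide.

```python
def median_by_rule(values):
    """Median via the counting rule: sort, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th value (subtract 1 for 0-based indexing).
        return ordered[(n + 1) // 2 - 1]
    # n even: average the (n/2)-th and (n/2 + 1)-th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_by_rule([18, 29, 33, 45, 88, 100]))  # 39.0
```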
Describing Spread/Variability in Data • Range = highest/max value – lowest/min value • Interquartile Range (IQR) = upper quartile – lower quartile • Standard Deviation: a bit later….
Describe the Spread - Quartiles • Split the ordered values into the half that is below the median and the half that is above the median. • Q1 = lower quartile = median of data values that are below the median • Q3 = upper quartile = median of data values that are above the median • Q2 is just the median • IQR, Interquartile Range = Q3 - Q1 • Min, Max, Median, Q1, and Q3 are used in the creation of boxplots • (Diagram: Min, Q1, Med, Q3, and Max divide the ordered data into four parts, each holding about 25% of the values.)
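The quartile recipe above can be sketched in Python. Statistical packages use several slightly different quartile definitions, so this follows only the "median of each half" rule described here, splitting by position and leaving the middle value out of both halves when n is odd. The function name and the example list are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # positions below the median
    upper_half = ordered[(n + 1) // 2:]   # positions above the median
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


low, q1, med, q3, high = five_number_summary([18, 29, 33, 45, 88, 100])
iqr = q3 - q1
print(low, q1, med, q3, high, iqr)
```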
Example Using the Presidents’ Age Data • Five number summary: Min = 42, Q1 = 51, Median = 55, Q3 = 58, Max = 69 • About 25% of the presidents were 51 years old or younger. • About 75% were 58 or less. • About 50% (the middle 50%) were between the ages of 51 and 58. IQR = 58 – 51 = 7 • The oldest was 69 (Reagan) and the youngest 42 (T. Roosevelt). Range = 69 – 42 = 27. • About 50% were 55 or less or, equivalently, about 50% were 55 or older.
The Spread and Shape of Data are Important! • Suppose 20 people take an exam. Possible scores go from 0 to 100. The average score is 87. Bob got an 88. How well do you think he did? • (Figure: score distributions for Case I and Case II) • Case I: Bob is hot! • Case II: Bob is not so hot! • Just knowing the mean or the median is not enough. We need to know something about the spread and shape of the data.
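The two cases can be imitated with made-up scores. Both lists below are hypothetical: they were constructed so that each has 20 scores with a mean of exactly 87 and includes Bob's 88, but they are not the data behind the slide's figure.

```python
import statistics

BOB = 88

# Case I: scores packed tightly around the mean, so 88 is the top score.
case_1 = [87] * 18 + [86, BOB]

# Case II: most scores are high and a few very low ones pull the mean down.
case_2 = [95] * 15 + [60, 55, 57, 55, BOB]

for name, scores in (("Case I", case_1), ("Case II", case_2)):
    mean_score = statistics.mean(scores)
    spread = statistics.pstdev(scores)          # standard deviation of the 20 scores
    beat_bob = sum(1 for s in scores if s > BOB)
    print(f"{name}: mean = {mean_score}, sd = {spread:.1f}, scored above Bob = {beat_bob}")
```

The means match, yet nobody beats Bob in the first list while fifteen people do in the second, which is why the mean alone cannot say how well he did.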
Graphical Summaries for Quantitative Data • Histograms: similar to bar graphs, used for any number of data values • Stem and Leaf plot and dot plots: present all the individual values, useful for small to moderate sized data sets. • Boxplots: useful summary for comparing two or more groups. • Scatter Plot: very useful for exploring relationships between two variables