Chapter 1 Data Collection
Section 1.1 Introduction to the Practice of Statistics
Statistics • The science of statistics is • Collecting • Organizing • Summarizing • Analyzing information to draw conclusions or answer questions • Statistics provides a measure of confidence in any conclusion
Data • Solve 3x + 5 = 11 • Everyone (should) get the same answer • How long was your drive (or walk) to class today? • Different answers…this is why we need statistics! • We can then break the data down into meaningful information
Statistics and mathematics have similarities but are different • Mathematics • Solves problems with 100% certainty • Has only one correct answer • Statistics, because of variability • Does not solve problems with 100% certainty (95% certainty is much more common) • Frequently has multiple reasonable answers
Population vs. Sample • A population • Is the group to be studied • Includes all of the individuals in the group • A sample • Is a subset of the population • Is often used in analyses because getting access to the entire population is impractical
Population vs. Sample • Population Example • People 18 years and older • Sample Example • Students at SHU 18 and older
Parameter vs. Statistic • A statistic is a numerical summary of the sample • Descriptive statistics organize and summarize the data in ways such as tables and graphs • Inferential statistics use the sample results and extend them to the population so we can measure the reliability of the results • A Parameter is a numerical summary of a population
Example • Suppose the actual percentage of all students at SHU that own a car is 48.2% • This is a ________________________ • We surveyed 100 students and found 46% own a car • This is a _________________________
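The contrast in the example above can be sketched in a few lines of Python. This is only an illustration: the population size of 5,000 and the fixed seed are assumptions, not figures from the slides. It computes one summary from the whole (simulated) population and one from a random sample of 100.

```python
import random

random.seed(1)  # fixed seed (an assumption) so the sketch is repeatable

# Hypothetical population: 5,000 students, 48.2% of whom own a car.
N = 5000
owners = round(0.482 * N)
population = [1] * owners + [0] * (N - owners)

# Numerical summary computed from every individual in the population.
population_pct = 100 * sum(population) / N

# Numerical summary computed from a random sample of 100 students.
sample = random.sample(population, 100)
sample_pct = 100 * sum(sample) / len(sample)

print(f"Whole population: {population_pct:.1f}% own a car")
print(f"Sample of 100:    {sample_pct:.1f}% own a car")
```

The population figure stays fixed, while the sample figure changes from one sample to the next. That variability is what inferential statistics tries to measure.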
The Process of Statistics • Identify the research objective: what do you want answered • Collect the data needed to answer the question: Usually a sample (1.2 – 1.6) • Describe the data: the descriptive statistics (ch. 2 – 4) • Perform inference: Use appropriate techniques to test reliability for the population (ch. 9 – 12)
Variables • Characteristics of the individuals under study are called variables • Some variables have values that are attributes or characteristics … those are called qualitative or categorical variables • Some variables have values that are numeric measurements … those are called quantitative variables
Qualitative Variables • Examples of qualitative variables • Gender • Zip code • Blood type • States in the United States • Brands of televisions • Qualitative variables have category values … those values cannot be added, subtracted, etc.
Quantitative Variables • Examples of quantitative variables • Temperature • Height and weight • Sales of a product • Number of children in a family • Points achieved playing a video game • Quantitative variables have numeric values … those values can be added, subtracted, etc.
Discrete Vs. Continuous • Quantitative variables can be either discrete or continuous • Discrete variables • Variables that have a finite or a countable number of possibilities • Frequently variables that are counts • Continuous variables • Variables that have an infinite but not countable number of possibilities • Frequently variables that are measurements
Discrete Variables • Examples of discrete variables • The number of heads obtained in 5 coin flips • The number of cars arriving at a McDonald’s between 12:00 and 1:00 • The number of students in class • The number of points scored in a football game • The possible values of discrete variables can be listed
Continuous Variables • Examples of continuous variables • The distance that a particular model car can drive on a full tank of gas • Heights of college students • Sometimes the variable is discrete but has so many close values that it could be considered continuous • The number of DVDs rented per year at video stores • The number of ants in an ant colony
Section 1.2 Observational Studies Versus Designed Experiments
Observational Study • A survey sample is an example of an observational study • An observational study is one where there is no attempt to influence the value of the variable • An observational study is also called an ex post facto (after the fact) study • Advantages • It can detect associations between variables • Disadvantages • It cannot isolate causes to determine causation
Designed Experiment • A designed experiment is an experiment • That applies a treatment to individuals • Often compares the treated group to a control (untreated) group • Where the variables can be controlled • Advantages • Can analyze individual factors • Disadvantages • Cannot be done when the variables cannot be controlled • Cannot be applied in some cases for moral or ethical reasons
Lurking & Confounding Variables • A danger in observational studies is the presence of confounding and lurking variables • Confounding occurs when the effects of two or more explanatory variables are not separated, so any relation to the response may be due to a variable not accounted for in the study • Lurking variables are variables not initially considered in the study that affect the response variable • Association does not mean that one variable causes the other • A simple observational study may find that smoking and cancer are associated • From that study alone, we cannot conclude that smoking causes cancer • Nor can we conclude that cancer causes people to smoke • What are some lurking variables with smoking and cancer?
Types of Observational Studies • Cross-sectional • Case-control • Cohort
Cross-sectional Studies • Observational studies that collect information about individuals at a specific point in time, or over a very short period of time
Case-control Studies • These studies are retrospective, meaning that they require individuals to look back in time or require the researcher to look at existing records • In case-control studies, individuals that have certain characteristics are matched with those that do not
Cohort Studies • A cohort study first identifies a group of individuals to participate in the study (the cohort) • The cohort is then observed over a period of time • Over this time period, characteristics about the individuals are recorded • Because the data is collected over time, cohort studies are prospective
Census • A census is a list • Of all the individuals in a population • That records the characteristics of the individuals • An example is the US Census held every 10 years (this is only an example though) • Advantages • Answers have 100% certainty • Disadvantages • May be difficult or impossible to obtain • Costs may be prohibitive
Section 1.3 Simple Random Sampling
Simple Random Sample • A simple random sample is obtained when every possible sample of size n out of a population of N is equally likely to occur
Let’s Try It! • 5 Volunteers… • A simple (but not foolproof) method • Write each individual’s name on a separate piece of paper • Put all the papers into a hat • Draw 2 random papers from the hat • Physical methods have some issues • Are the papers sufficiently mixed? • Are some of the papers folded? • What else???
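The hat exercise above can be mimicked in Python as a rough sketch. The five volunteer names are made up for illustration. The code first lists every possible sample of size 2, then draws one of them the way a well-mixed hat would.

```python
import random
from itertools import combinations

# Hypothetical volunteers (names invented for this sketch).
volunteers = ["Ana", "Ben", "Cruz", "Dee", "Eli"]

# Every possible sample of size n = 2 from a population of N = 5.
all_samples = list(combinations(volunteers, 2))
print(len(all_samples), "possible samples")

# random.sample draws without replacement, so each pair of names
# is equally likely, like two papers pulled from a well-mixed hat.
drawn = random.sample(volunteers, 2)
print("Drawn:", drawn)
```

Unlike paper slips, the software draw does not depend on how well the hat was mixed or how the papers were folded.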
Random Numbers • A method using a table of random numbers • (Back pages Table 1) • List and number the individuals • Decide on a way to pick the random numbers (how to choose the starting point and what rule to use to select which digits to choose after that) • Select the random numbers • Match the numbers to the individuals • With the technology available today, this method is almost silly
Calculator • randInt(start #, end #, how many) • Leave the 3rd entry blank for 1 value • Table 3 Page 25: • Randomly survey 5 of their 30 clients. • Number them 1 – 30 • randInt(1,30,5) • Survey the clients corresponding to the generated values.
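For readers without the calculator, the same selection can be sketched with Python's standard `random` module. This is an analogue, not the calculator's own algorithm. One assumption worth flagging: a generator that picks each integer independently can return the same client number twice, in which case the repeat has to be redrawn. `random.sample` avoids that by drawing without replacement.

```python
import random

# Clients are numbered 1 through 30, as in the example above.
client_numbers = range(1, 31)

# Independent draws: values may repeat, so a repeat would need a redraw.
with_repeats = [random.randint(1, 30) for _ in range(5)]
print("Independent draws:", with_repeats)

# Drawing without replacement: five distinct clients in a single call.
chosen = sorted(random.sample(client_numbers, 5))
print("Survey clients:", chosen)
```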
Section 1.4 Other Effective Sampling Methods
Collecting Data • There are other effective ways to collect data • Stratified sampling • Systematic sampling • Cluster sampling • Each of these is particularly appropriate in certain specific circumstances
Stratified Sample • A stratified sample is obtained when we choose a simple random sample from each subgroup of a population • This is appropriate when the population is made up of nonoverlapping (distinct) groups called strata • Within each stratum, the individuals are likely to have a common attribute • Between the strata, the individuals are likely to have different common attributes
Stratified Sample • Example – polling a population about a political issue • It is reasonable to divide up the population into Democrats, Republicans, and Independents • It is reasonable to believe that the opinions of individuals within each party are similar • It is reasonable to believe that the opinions differ from group to group • Therefore it makes sense to consider each stratum separately • This method can help ensure all subgroups are represented so our data is more reliable
Stratified Sample • Example – a poll about safety within a university • Three identified strata • Resident students • Commuter students • Faculty and staff • It is reasonable to assume that the opinions within each group are similar • It is reasonable to assume that the opinions between each group are different
Stratified Sample • Assume that the sizes of the strata are • Resident students – 5,000 • Commuter students – 4,000 • Faculty and staff – 1,000 • If we wish to obtain a sample of size n = 100 that reflects the same relative proportions, we would want to choose • 50 resident students • 40 commuter students • 10 faculty and staff • Finally, conduct a simple random sample within each subgroup to obtain data.
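The allocation above can be checked with a short Python sketch. The stratum sizes and n = 100 come from the slide. Representing individuals by ID numbers 1 through the stratum size is an assumption made only so there is something to sample from.

```python
import random

strata_sizes = {"resident": 5000, "commuter": 4000, "faculty_staff": 1000}
n = 100
total = sum(strata_sizes.values())

# Proportional allocation: each stratum's share of the sample matches
# its share of the population. The division is exact for these numbers;
# other sizes would need a rounding rule.
allocation = {name: n * size // total for name, size in strata_sizes.items()}
print(allocation)

# Simple random sample within each stratum (hypothetical IDs 1..size).
sample = {
    name: random.sample(range(1, strata_sizes[name] + 1), k)
    for name, k in allocation.items()
}
print({name: len(ids) for name, ids in sample.items()})
```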
Systematic Sample • A systematic sample is obtained when we choose every kth individual in a population • The first individual selected corresponds to a random number between 1 and k • Systematic sampling is appropriate • When we do not have a frame, that is, a list of all the individuals in a population
Systematic Sampling • Example – polling customers about satisfaction with service • We do not have a list of customers arriving that day • We do not even know how many customers will arrive that day • Simple random sampling (and stratified sampling) cannot be implemented
Systematic Sampling • Assume that • We want to choose a sample of 40 customers • We believe that there will be about 350 customers • Values of k • k = 7 is reasonable because it is likely that enough customers will arrive to reach the 40 target • k = 2 is not reasonable because we will only interview the very early customers • k = 20 is not reasonable because it is unlikely that enough customers will arrive to reach the 40 target
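The three values of k above can be compared with a small simulation. This is a sketch that assumes exactly 350 customers arrive. The function name `systematic_sample` is made up for illustration.

```python
import random

def systematic_sample(arrivals, k, target):
    """Pick every k-th arrival, starting at a random position from 1 to k,
    and stop once `target` individuals have been selected."""
    start = random.randint(1, k)
    picked = list(range(start, arrivals + 1, k))
    return picked[:target]

arrivals = 350  # assumed number of customers for the day
for k in (2, 7, 20):
    picked = systematic_sample(arrivals, k, 40)
    print(f"k={k:2d}: {len(picked)} selected, last is arrival #{picked[-1]}")
```

With k = 2 the 40 interviews are finished by about the 80th customer, with k = 7 they are spread over most of the day, and with k = 20 fewer than 20 customers are ever selected.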
Cluster Sample • A cluster sample is obtained when we choose a random set of groups and then select all individuals within those groups • We can obtain a sample of size 50 by choosing 10 groups of 5 • Cluster sampling is appropriate when it is very time consuming or expensive to choose the individuals one at a time
Cluster Sample • Example – testing the fill of bottles • It is time consuming to pull individual bottles • It is expensive to waste an entire carton of 12 bottles to test just one bottle • If we would like to test 240 bottles, we could • Randomly select 20 cartons • Test all 12 bottles within each carton • This reduces the time and expense required
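The bottle example above can be sketched as follows. The total of 500 cartons is an assumption, since the slide does not say how many cartons are available.

```python
import random

BOTTLES_PER_CARTON = 12
cartons = list(range(1, 501))  # hypothetical: 500 cartons, numbered 1-500

# Stage 1: randomly select 20 whole cartons (the clusters).
chosen_cartons = random.sample(cartons, 20)

# Stage 2: test every bottle inside each selected carton.
bottles = [
    (carton, bottle)
    for carton in chosen_cartons
    for bottle in range(1, BOTTLES_PER_CARTON + 1)
]
print(len(bottles), "bottles tested from", len(chosen_cartons), "cartons")
```

Only 20 random selections are needed to reach 240 bottles, which is where the savings in time and expense come from.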
Convenience Sample • A convenience sample is obtained when we choose individuals in an easy, or convenient, way • Self-selecting samples are examples of convenience sampling • Individuals who respond to television or radio announcements • “Just asking around” is an example of convenience sampling • Individuals who are known to the pollster
Convenience Sample • Convenience sampling has little statistical validity • The design is poor • The results are suspect • However, there are times when convenience sampling could be useful as a rough guess
Multistage Sample • A multistage sample is obtained using a combination of • Simple random sampling • Stratified sampling • Systematic sampling • Cluster sampling • Many large scale samples (the US census in noncensus years) use multistage sampling
Section 1.5 Errors in Sampling
Bias • If the results of the sample are not representative of the population, then the sample has bias. • Three Sources of Bias • Sampling Bias • Nonresponse Bias • Response Bias
Sampling Bias • The technique used to obtain individuals tends to favor one part of the population over another • Occurs often in convenience sampling • Often results in undercoverage, where the proportion of a subgroup is lower in the sample than in the actual population
Nonresponse Bias • Occurs when the “nonresponders” to a survey have different opinions than those who respond • Frequent with surveys • Can be controlled using callbacks or incentives
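A small worked calculation shows how nonresponse can distort a result. All numbers here are invented for illustration: a population split evenly on an issue, where people in favor are assumed to reply twice as often as people opposed.

```python
# Hypothetical numbers, for illustration only.
population = {"favor": 500, "oppose": 500}      # true split: 50% in favor
response_rate = {"favor": 0.8, "oppose": 0.4}   # assumed chance each group replies

# Expected number of responders in each group.
responders = {g: population[g] * response_rate[g] for g in population}

true_share = population["favor"] / sum(population.values())
observed_share = responders["favor"] / sum(responders.values())

print(f"True share in favor:    {true_share:.1%}")
print(f"Share among responders: {observed_share:.1%}")
```

Under these assumptions the survey reports about two thirds in favor even though the true figure is one half. Callbacks and incentives aim to raise the lower response rate and narrow that gap.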