Introduction to Statistics

Prof. L Prado, OER, www.helpyourmath.com

Chapter 1
• Overview
• Nature of data
• Skills needed in statistics

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions from data or answer questions.

Statistical Methods
• Descriptive Statistics
• Inferential Statistics
  • Estimation
  • Hypothesis Testing

Overview
Statistics:
• Descriptive: collection, organization, summarization, and presentation of data.
• Inferential: draw conclusions about a population by using samples. (Draw = Infer)
Survey: a tool to collect data from a smaller group that is part of a larger group, in order to learn something about the larger group.
Key goal of statistics: learn about a large group (population) from data on a smaller subgroup (sample).

Overview
Definitions:
• Variable: a characteristic or attribute that varies.
• Data: the values collected for a variable (measurements, observations: gender, answers, …).
• Statistics: a collection of methods to study data.
• Population: the complete collection of all subjects (individuals, scores, measurements, …).
• Sample: a subcollection of members selected from a population.
• Census: a collection of data from every member of the population (e.g., the US Census).

Overview
Example:
• Poll: 1087 adults are asked whether or not they drink alcoholic beverages.
  • Sample: the 1087 adults polled.
  • Population: all US adults (150 million).
• Census: every 10 years, the Census Bureau tries to collect information from every member of the US population.
  • Impossible!
  • Very expensive! (time and money)
• Use sample data to draw conclusions about the whole population: inferential statistics!

Parameter:
• A numerical measurement describing some characteristic of the population.
• Lincoln elected: 39.82% of 1,865,908 votes counted.
• 39.82% is a parameter.
Statistic:
• A numerical measurement describing some characteristic of the sample.
• Based on a sample of 877 elected executives, 45% would not hire an applicant with a typographical error in the application.
• 45% is a statistic.
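The difference between a parameter and a statistic can be sketched with a small simulation. This is only an illustration: the population size (100,000) and the 60% rate below are invented, and only the sample size of 1087 echoes the poll example above.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: True means "drinks alcoholic beverages".
# The 60% rate is an assumption made up for this sketch.
population = [rng.random() < 0.60 for _ in range(100_000)]

# Parameter: computed from every member of the population.
parameter = sum(population) / len(population)

# Statistic: computed from a simple random sample of 1087 members.
sample = rng.sample(population, 1087)
statistic = sum(sample) / len(sample)

print(f"parameter (population proportion): {parameter:.4f}")
print(f"statistic (sample proportion):     {statistic:.4f}")
```

The statistic usually lands close to the parameter but not exactly on it, and it changes from sample to sample, while the parameter is fixed.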

Types of data
Quantitative data: numbers representing counts or measurements.
• Examples: number of children in a family, weights, heights, ages.
Qualitative data (categorical data): nonnumerical values, or numbers used only as labels.
• Examples: gender of an athlete, zip code, blood type, states in the U.S., brands of TV.
Discrete (count) variable vs. continuous (measure) variable:
• Example: number of people in a household vs. temperatures in May.
Nominal level of measurement: names, labels, or categories with no ordering.
• Examples: Yes/No/Undecided responses, colors, gender, jersey numbers of players.
Ordinal level of measurement: the values can be ordered (ranked), but differences between them are meaningless or nonexistent.
• Examples: grades A, B, C, D, F; intensity of pain (none, mild, moderate, severe).
Interval level of measurement: ordered, with meaningful differences, but no natural zero (a value of 0 does not mean "none of the quantity").
• Examples: temperature (in °F or °C), year, IQ score.
Ratio level of measurement: interval level with a meaningful zero.
• Examples: weights, prices (non-negative), number of phone calls received.

Summary
• The process of statistics is designed to collect and analyze data to reach conclusions.
• Variables can be classified by their type of data.
  • Qualitative variables: nominal or ordinal.
  • Quantitative variables: discrete (values counted) or continuous (values measured).

Basic skills
Samples should be:
• Representative:
  • "39 of 40 people polled will vote for A." Sampled in A's headquarters!
• Not too small:
  • CDF published "among HS students suspended, 67% were suspended more than 3 times." Sample size: 3!
Graphs: in which one does red do better? (graphs not shown)
Percentage of:
• 6% of 1200 = 6/100 * 1200 = 72
Fraction >>> percentage:
• 3/4 = 0.75 >>> 0.75 * 100% = 75%
Percentage >>> decimal:
• 27.3% = 27.3/100 = 0.273
Decimal >>> percentage:
• 0.852 >>> 0.852 * 100% = 85.2%
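The four percentage conversions above can be checked with a few one-line functions. This is a minimal sketch; the function names are made up for illustration.

```python
def percent_of(percent, total):
    """Percentage of a quantity, e.g. 6% of 1200."""
    return percent / 100 * total

def fraction_to_percent(numerator, denominator):
    """Fraction >>> percentage, e.g. 3/4."""
    return numerator / denominator * 100

def percent_to_decimal(percent):
    """Percentage >>> decimal, e.g. 27.3%."""
    return percent / 100

def decimal_to_percent(decimal):
    """Decimal >>> percentage, e.g. 0.852."""
    return decimal * 100

# Results are rounded for display because floating-point arithmetic
# can leave tiny errors in the last digits.
print(round(percent_of(6, 1200), 6))
print(round(fraction_to_percent(3, 4), 6))
print(round(percent_to_decimal(27.3), 6))
print(round(decimal_to_percent(0.852), 6))
```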

Basic skills 2 Calculator:

Statistical Study
Observational study: observe and measure characteristics without trying to modify the individuals.
• Examples: Gallup poll, Nielsen Media poll (TV shows).
• Cross-sectional: data are observed and measured at one point in time.
• Retrospective: data are collected from the past (records).
• Prospective: data are collected going forward in time from groups (smokers/non-smokers).
Experiment: apply a treatment to individuals, then observe and measure its effects.
• Example: clinical trial for Lipitor.
• Treatment group (Lipitor) and control group (placebo group).
• Control: comparison, single-blinding, double-blinding, placebo, blocks.
• Replication: the ability to repeat the experiment.
• Randomization: data need to be collected in an appropriate (random) way; otherwise they are completely useless!

A completely randomized design is when each experimental unit is assigned to a treatment completely at random.
• An example:
  • A farmer wants to test the effects of a fertilizer.
  • We choose a set of plants to receive the treatment.
  • We randomly assign plants to receive different levels of fertilizer.
• This has similarities to completely random sampling.

We control as many factors as we can:
• Amount of watering
• Method of tilling
• Soil acidity
Randomization decreases the effects of uncontrolled factors:
• Rainfall
• Sunlight
• Temperature
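The fertilizer example can be sketched as a completely randomized design in a few lines. The plant labels, the three fertilizer levels, and the choice to keep group sizes equal are assumptions made for illustration, not part of the original example.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn,
    so that every unit is placed purely by chance."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(unit)
    return assignment

# Hypothetical experimental units and treatment levels.
plants = [f"plant_{i}" for i in range(1, 13)]
levels = ["none", "low", "high"]

groups = completely_randomized_design(plants, levels, seed=1)
for level, members in groups.items():
    print(level, members)
```

Because the shuffle ignores everything about the plants, uncontrolled factors such as sunlight tend to be spread evenly across the fertilizer levels.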

A randomized block design is when the experimental units are first grouped by a shared characteristic, and then treatments are assigned at random within each group.
• The groups are called blocks.
• This design will reduce confounding.
• This has similarities to stratified sampling.
Remark: when the effects of two factors cannot be distinguished, this is called confounding.
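A randomized block design can be sketched as follows, using the usual formulation in which treatments are randomized separately inside each block. The two blocks (a sunny and a shady field) and the plant labels are hypothetical.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, shuffle the units and split them evenly
    among the treatments."""
    rng = random.Random(seed)
    design = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        # Every len(treatments)-th unit, starting at offset i, gets treatment i.
        design[block_name] = {
            t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)
        }
    return design

# Hypothetical blocks: plants grouped by the field they grow in.
blocks = {
    "sunny_field": [f"sunny_{i}" for i in range(1, 7)],
    "shady_field": [f"shady_{i}" for i in range(1, 7)],
}
treatments = ["fertilizer", "no_fertilizer"]

design = randomized_block_design(blocks, treatments, seed=2)
for block_name, by_treatment in design.items():
    print(block_name, by_treatment)
```

Each treatment is applied to the same number of sunny and shady plants, so the effect of sunlight cannot be mistaken for the effect of the fertilizer.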

Summary
• Planning a designed experiment is crucial to its success.
• Running an experiment double-blind reduces changes in behavior caused by knowing who receives the treatment.
• There are several sound methods for assigning treatments to experimental units:
  • Completely random
  • Randomized blocks
  • Matched-pairs (skipped)

Sampling Design
Sampling:
• Simple random sample (SRS) of size n: every possible sample of n individuals has the same chance of being chosen.
  • Note: an SRS also gives each individual an equal chance of being chosen (thus avoiding bias in the choice).
• Systematic: select a starting point, then choose every kth member.
• Convenience: use data that are easy to get.
• Stratified: subdivide the population into at least 2 subgroups that share a common characteristic (homogeneous) and draw a sample from each (e.g., gender, age, animal species).
• Cluster: divide the population into areas (clusters, intact groups that are each representative of the population) and draw the sample from randomly chosen clusters (e.g., city blocks, geographic areas).
Sampling error: the difference between a sample result and the true population result; it results from chance sample fluctuations.
Nonsampling error: occurs when data are incorrectly collected, measured, recorded, or analyzed.
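The four probability-based sampling methods above can be sketched with Python's standard `random` module. The population of 100 numbered individuals, the two strata, and the ten clusters are invented for illustration, and the cluster version shown here takes every member of each chosen cluster, which is one common variant.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Random starting point among the first k members, then every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)
population = list(range(1, 101))  # hypothetical individuals numbered 1..100
strata = {"under_30": population[:40], "30_and_over": population[40:]}
clusters = {f"block_{b}": population[b * 10:(b + 1) * 10] for b in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)

print("SRS:       ", sorted(srs))
print("Systematic:", systematic)
print("Stratified:", stratified)
print("Cluster:   ", clustered)
```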

Summary
• There are other sampling methods that are particularly useful in certain situations:
  • Stratified sampling to cover the different strata
  • Systematic sampling when the frame is unknown
  • Cluster sampling to reduce the time and expense required
  • Multistage sampling for effective large-scale samples
• The choice of sampling method depends on the structure of the population and the goals of the analyst.

Sources of Error in Sampling
• One type of error, sampling error, occurs because we use only part of the population in our study.
  • Samples consist of only part of the total data.
  • Samples are usually more practical to analyze.
• Because there are individuals in the population that are not in our sample, sampling errors are difficult to control.
• We will study sampling errors in future chapters.

Another type of error, nonsampling error, arises from the actual survey process.
• Preference is given to selecting some individuals over others.
• Individual answers are not accurate (for various reasons).
• Nonsampling errors can often be controlled or minimized with a well-designed survey and sampling technique.

Types of nonsampling error
• Using an incomplete frame
• Individuals who respond have different characteristics than individuals who do not respond
• Interviewer errors
• Misrepresented answers
• Data checks
• Questionnaire design
• Wording of questions
• Order of questions, words, and responses

The Literary Digest used its polls to predict the winner of presidential elections.
• Its previous polls were accurate.
• In 1936, the Literary Digest predicted that Alf Landon would defeat Franklin Roosevelt in a landslide.
• In the actual election, Roosevelt won in a landslide.

Why was the Literary Digest so far off?
• The 1936 frame was not representative of the total voting population.
• The sampling process was not completely random.
• The frame had too large a proportion of Republicans, who generally favored Landon.
• The frame had too small a proportion of Democrats, who generally favored Roosevelt.
• Republicans were overrepresented and Democrats were underrepresented!
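The effect of an unrepresentative frame can be illustrated with a toy simulation. None of the numbers below are historical: the electorate size, the 60% support for Roosevelt, and the rule that Landon voters are three times as likely to appear in the frame are all assumptions chosen only to show the mechanism.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate: 60% favor Roosevelt ("R"), 40% favor Landon ("L").
electorate = ["R"] * 120_000 + ["L"] * 80_000

# Hypothetical biased frame: a Landon voter has a 60% chance of being
# listed in the frame, a Roosevelt voter only a 20% chance.
frame = [v for v in electorate if rng.random() < (0.6 if v == "L" else 0.2)]

def share_for_roosevelt(sample):
    """Proportion of the sample that favors Roosevelt."""
    return sample.count("R") / len(sample)

# A very large sample drawn from the biased frame ...
big_biased_sample = rng.sample(frame, 20_000)
# ... versus a much smaller simple random sample from the whole electorate.
small_random_sample = rng.sample(electorate, 1_000)

print(f"true share for Roosevelt:     {share_for_roosevelt(electorate):.3f}")
print(f"large sample, biased frame:   {share_for_roosevelt(big_biased_sample):.3f}")
print(f"small sample, whole electorate: {share_for_roosevelt(small_random_sample):.3f}")
```

In this sketch the large sample points to the wrong winner while the small random sample does not: a bigger sample cannot repair a biased frame.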
