Introduction to Statistics

Topics 1 - 5 Nellie Hedrick Introduction to Statistics

Statistics
Statistics is the study of data: the science of reasoning from data. What do we mean by the term "data"? You will find that data vary, and variability abounds in everyday life.
• Observational units – the objects described by a set of data.
• Variability – the phenomenon of a variable taking on different values or categories from observational unit to observational unit.
• Quantitative variables – take numerical values for which numerical operations make sense, such as height, weight, time, …
• Categorical variables – place an individual into one of several groups or categories, such as gender, cities in Oklahoma, states in the USA, …
• Binary variables – categorical variables that can take only two possible values, such as male/female, yes/no, …
• Research question – often looks for patterns in a variable, compares a variable across different groups, or looks for a relationship between variables.

More on Observational Units and Variables
• The distinction between categorical and quantitative variables is very important because it determines which statistical tools to use for analyzing a given data set.
• Determine whether each variable is measured quantitatively or categorically:
  • How many hours you slept in the past 24 hours (quantitative)
  • Whether you have slept for at least 7 hours in the past 24 hours (categorical, binary)
• Watch for variables that take numerical values that are really just category labels, such as zip code, …
• Watch out: to determine whether something is actually a variable, ask yourself
  • whether it represents a question that can be asked of each observational unit, and
  • whether the values can potentially vary from observational unit to observational unit.
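The sleep example can be sketched in a few lines of Python. The five values below are made up for illustration; the point is only that the same observational units can yield a quantitative variable (hours slept) and a binary categorical variable (at least 7 hours or not), and that different summaries make sense for each.

```python
# Hypothetical hours of sleep reported by five students (made-up values).
hours_slept = [6.5, 8.0, 7.0, 5.25, 9.0]

# Quantitative variable: numerical operations such as averaging make sense.
mean_hours = sum(hours_slept) / len(hours_slept)

# Binary (categorical) variable built from the same observational units:
# did the student sleep at least 7 hours?
slept_at_least_7 = [h >= 7 for h in hours_slept]

# For a categorical variable we count categories instead of averaging.
count_yes = sum(slept_at_least_7)

print("mean hours slept:", mean_hours)
print("slept at least 7 hours:", slept_at_least_7)
print("number answering yes:", count_yes)
```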

Wrap Up
• Statistics is the science of data
• Data are not mere numbers
• Data are collected with a purpose and have meaning in some context
• The fundamental concept of statistics is variability
• As we go through the course, you will learn to classify variables and determine which statistical tools to apply to the data
• Always consider data in context and anticipate reasonable values for the data collected and analyzed
• A variable is a characteristic that varies from one observational unit (for example, one person) to another
• Identify variables as categorical (possibly binary) or quantitative

Activity 1-6 page 11 • Activity 1-9 page 12 • Activity 1-13 page 12

Topic 2 – Data and Distributions and the Graphing Calculator
Picturing Distributions with Graphs
• The distribution of a variable tells us what values it takes and how often it takes these values. We are looking for the pattern of variation.
• Categorical variables – place an individual into one of several groups or categories.
• Quantitative variables – take numerical values for which numerical operations make sense.
• Distribution of a variable – what values it takes and how often.

Graphical Representations of Data: Categorical Variable • Bar Chart

Activity 2-2 hand washing (page 17)
In August 2005, researchers for the American Society for Microbiology and the Soap and Detergent Association monitored the behavior of more than 6300 users of public restrooms. They observed people in public venues such as Turner Field in Atlanta and Grand Central Station in New York City. They found that 2393 of 3206 men washed their hands, compared to 2802 of 3130 women.
• What proportion of the men washed their hands? What proportion of the women washed their hands?
• Are these proportions consistent with the following pair of bar graphs?
• Comment on what your calculations and the bar graph reveal about whether or not one gender is more likely to wash their hands after using a public restroom.
• For each city, estimate the proportion of people who washed their hands as accurately as you can from the graph. Atlanta: Chicago: New York: San Francisco:
• Comment on what the bar graphs reveal about how these cities compare with regard to hand washing.
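As a quick check on the first question, the two proportions can be computed directly from the counts given in the activity. This is only a worked calculation; the variable names are chosen here for illustration.

```python
# Counts reported in the activity.
men_washed, men_total = 2393, 3206
women_washed, women_total = 2802, 3130

# A proportion is a number between 0 and 1: count in the category / group size.
prop_men = men_washed / men_total
prop_women = women_washed / women_total

print(f"proportion of men who washed:   {prop_men:.3f}")
print(f"proportion of women who washed: {prop_women:.3f}")

# The two groups together account for the "more than 6300" people observed.
total_observed = men_total + women_total
print("total observed:", total_observed)
```

Multiplying each proportion by 100 gives the corresponding percentage, roughly 75% of men and 90% of women.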

Activity 2-2 hand washing (page 17)
Studying whether people wash their hands after using a public restroom
• We can look at the percentage of all people observed who washed their hands
• Look at the variation between men and women
• Look at the variation between people in different cities
• Look at the variation between men and women within each city

Activity 2-4: Buckle Up (page 19)
The National Highway Traffic Safety Administration (NHTSA) reports the percentage of residents in each state who regularly wear a seatbelt in a car and also whether the state has a primary or secondary type of seatbelt law. A primary law means that motorists can be stopped based solely on belt usage, while a secondary law means that the motorist can be stopped only for another reason. The 2005 data appear in the next table (s = secondary, p = primary, and * = not known):
• What are the observational units for these data?
• Classify each of the variables in the table as categorical (also binary) or quantitative.
• What would you estimate is a typical usage percentage for a state with a primary-type seatbelt law? How about a state with a secondary-type law? (Do not perform any calculations; base your answers on a casual reading of the dotplots.) Primary: Secondary:
• Does a state with a primary law always have a higher usage percentage than a state with a secondary law? Explain. If not, identify a pair of states for which the state with a primary law has a lower usage percentage than the state with a secondary law.
• Do states with a primary law tend to have higher usage percentages than states with a secondary law? Explain how you can tell from the dotplots.
• Do the data seem to support the contention that tougher (primary) laws lead to more seatbelt usage? Can you draw this conclusion definitively? Explain.

Activity 2-4: Buckle Up (page 19)
• What type of variable is each one?
• Create a visual display: a dotplot is a useful method for displaying small data sets of a quantitative variable
• Label the axis, especially if there is more than one group
• Bar graphs or dotplots are usually more illuminating when we are comparing the distribution of a variable between two or more groups
• Statistical tendency – look for it when comparing two or more groups or analyzing a data set
• Use phrases like "tend to," "on average," and "lead to" to express the results
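A dotplot is simple enough to sketch as plain text. The helper below, `text_dotplot`, is a hypothetical function written for this illustration, and the two lists of percentages are made-up values, not the NHTSA figures. Each distinct value gets one row, with one dot per observational unit.

```python
from collections import Counter

def text_dotplot(values, label):
    """Return text lines for a simple dotplot: one row per distinct value."""
    counts = Counter(values)
    lines = [label]  # labeling each group matters when comparing groups
    for value in sorted(counts):
        lines.append(f"{value:>4} | " + "." * counts[value])
    return lines

# Made-up seatbelt usage percentages for two groups of states.
primary = [85, 88, 88, 90, 93]
secondary = [70, 75, 75, 80, 88]

for line in text_dotplot(primary, "Primary law (usage %)"):
    print(line)
for line in text_dotplot(secondary, "Secondary law (usage %)"):
    print(line)
```

With these made-up values, the primary-law group tends to have higher percentages, yet one secondary-law value (88) is higher than one primary-law value (85): a tendency, not a rule.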

Watch Out and In Brief
• Bar graphs or dotplots are usually more illuminating when we are comparing the distribution of a variable between two or more groups
• A statistical tendency, seen when comparing two or more groups or analyzing a data set, is not a hard-and-fast rule; this holds for both categorical and quantitative variables. Be careful with your language, especially for cause-and-effect conclusions.
• Label your graphs
• Be careful to notice whether a question asks for a proportion (0 to 1) or a percentage (0% to 100%)
• Bar graphs are easier to compare than raw data
• Always relate your comments to the context of the data and, ideally, to the question of interest

Watch Out and Wrap Up (continued)
• Consistency refers to how variable, or spread out, the values in a data set are for a quantitative variable.
• When describing a distribution, refer to both center (tendency) and spread (consistency).
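To see why both center and spread are needed, consider two made-up groups of values with the same center but very different consistency. The sketch below uses the median as one possible measure of center and the range (maximum minus minimum) as one possible measure of spread; these choices are assumptions for illustration, not the only options.

```python
import statistics

# Two made-up data sets with the same center but different spread.
group_a = [78, 80, 81, 82, 84]
group_b = [60, 72, 81, 90, 102]

median_a = statistics.median(group_a)
median_b = statistics.median(group_b)

# Range: a simple measure of how spread out the values are.
range_a = max(group_a) - min(group_a)
range_b = max(group_b) - min(group_b)

print("group A: median", median_a, "range", range_a)
print("group B: median", median_b, "range", range_b)
```

Reporting only the center would make the two groups look alike; the spread shows that group A is far more consistent.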

Exercise 2-9 page 27 • Exercise 2-12 page 28 • Exercise 2-16 page 30

Topic 3: Drawing Conclusions from Studies
• Data give you insight into interesting questions.
• A key idea is generalizing the results of a study to a larger group than the one you used in the study itself.
• Population – the entire group of people or objects (observational units) of interest in a study.
• Sample – a (typically small) part of the population from whom or about which data are gathered to learn about the population. If the sample is selected carefully (so that it is representative of the population), you can learn a lot about the population.
• Sample size – the number of observational units (people or objects) studied in a sample.
• Sampling bias – a sampling procedure is biased if it tends systematically to overrepresent certain segments of the population and underrepresent others.

More Definitions – Activity 3-1 page 35
• Convenience sample – a sample selected because its members are conveniently available.
• Voluntary response – a sample selected in such a way that members of the population decide for themselves whether or not to be part of the study.
• Nonresponse – a problem that arises when selected observational units do not respond to the study.
• Sampling frame – the list used to select the subjects; a problem arises when this list does not represent all of the variation in the population.
• Parameter – a number that describes the population (P-P).
• Statistic – a number that describes the sample (S-S).

Activity 3-1 page 34
Elvis Presley is reported to have died in his Graceland mansion on August 16, 1977. On the 12th anniversary of this event, a Dallas record company wanted to learn the opinions of all adult Americans on the issue of whether Elvis was really dead. But of course they could not ask every adult American this question, so they sponsored a national call-in survey. Listeners of more than 100 radio stations were asked to call a 1-900 number (at a charge of $2.50) to voice an opinion concerning whether Elvis was really dead. It turned out that 56% of the callers thought that Elvis was alive. This scenario is very common in statistics: wanting to learn about a large group based on data from a smaller group.

Activity 3-1 page 34 (cont)
• In 1936, Literary Digest magazine conducted the most extensive (to that date) public opinion poll in history. They mailed out questionnaires to over 10 million people whose names and addresses they had obtained from telephone books and vehicle registration lists. More than 2.4 million people responded, with 57% indicating they would vote for Republican Alf Landon in the upcoming presidential election. (Incumbent Democrat Franklin Roosevelt won the actual election, carrying 63% of the popular vote.)
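The size of the miss in the Literary Digest poll can be worked out from the figures quoted in the activity. The sketch below treats "over 10 million" and "more than 2.4 million" as round numbers, so the response rate is only approximate, and it takes the 63% figure as stated in the activity.

```python
# Round figures from the activity (both are stated as lower bounds).
mailed = 10_000_000
responded = 2_400_000

# Approximate response rate, as a percentage.
response_rate_pct = 100 * responded / mailed

# Sample statistic: percentage of respondents favoring Landon.
poll_landon_pct = 57

# Roosevelt's share as stated in the activity; Landon's actual share
# can be no larger than what remains.
actual_roosevelt_pct = 63
max_actual_landon_pct = 100 - actual_roosevelt_pct

# The poll overestimated Landon's support by at least this many points.
min_overestimate_pts = poll_landon_pct - max_actual_landon_pct

print("approximate response rate (%):", response_rate_pct)
print("Landon's actual share is at most (%):", max_actual_landon_pct)
print("poll overestimate, at least (points):", min_overestimate_pts)
```

A sample of millions still missed by about 20 percentage points, which illustrates that a large sample size does not cure a biased sampling method.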

More Definitions – Activity 3-4 page 39
• Explanatory variable – the variable whose effect you want to study.
• Response variable – the variable that you suspect is affected by the explanatory variable.
• Observational study – a study in which researchers passively observe and record information about the observational units.
• Lurking variable – a variable that is not recorded in the study but may affect the response variable.
• Confounding variable – a lurking variable whose effects on the response variable are indistinguishable from the effects of the explanatory variable.

Activity 3-4 page 39 • Exercise 3-8 page 46

Wrap Up: Key questions to consider
• What two things can prevent you from drawing certain conclusions from a study?
  • Bias and confounding
• To what population can you reasonably generalize the results of a study?
  • It depends on how the sample was selected
• Can you reasonably draw a cause-and-effect connection between the explanatory and response variables?
  • It depends on whether the explanatory variable was assigned to the observational units by the researchers

Topic 4 – Random Sampling
• One way to avoid a biased sampling method is to give every member of the population the same chance of being selected for the sample.
• Your selection method should also ensure that every possible sample (of the desired sample size) has an equal chance of being the sample ultimately selected.
• Such sampling is called simple random sampling (SRS).
• Unbiased – a statistic is said to provide unbiased estimates of a population parameter if the values of the statistic from different random samples are centered at the actual parameter value.
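A simple random sample can be sketched with Python's standard `random` module, whose `sample` method draws without replacement so that every subset of the requested size is equally likely. The population of 50 labeled students and the helper name `simple_random_sample` are assumptions made up for this illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so every possible sample of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: 50 students identified by label.
population = [f"student_{i:02d}" for i in range(1, 51)]

sample = simple_random_sample(population, 5, seed=1)
print(sample)
```

In practice the seed would be omitted (or recorded) so that the selection is not under the researcher's control.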

Definitions
• Sampling variability – the fact that the values of sample statistics vary from sample to sample.
• Precision – the precision of a sample statistic refers to how much its values vary from sample to sample.
• The larger the sample size, the more precise the statistic: its values are closer together than those from smaller samples.
• Statistics from larger random samples provide more accurate estimates of the corresponding population parameter.
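Sampling variability and precision can be illustrated with a small simulation. The setup is entirely hypothetical: a population of 100,000 units in which 60% are "yes", from which 500 simple random samples are drawn at each of three sample sizes. The sample proportion is the statistic, and the standard deviation of its values across samples is used here as one way to measure precision.

```python
import random
import statistics

def sample_proportions(pop_size, true_prop, n, reps, seed):
    """Sample proportions of 'yes' from repeated simple random samples of size n."""
    rng = random.Random(seed)
    cutoff = int(pop_size * true_prop)  # units numbered below cutoff are 'yes'
    props = []
    for _ in range(reps):
        chosen = rng.sample(range(pop_size), n)  # SRS without replacement
        props.append(sum(1 for unit in chosen if unit < cutoff) / n)
    return props

mean_by_n = {}
spread_by_n = {}
for n in (10, 100, 1000):
    props = sample_proportions(pop_size=100_000, true_prop=0.6, n=n, reps=500, seed=42)
    mean_by_n[n] = statistics.mean(props)      # center of the sample statistics
    spread_by_n[n] = statistics.pstdev(props)  # how much they vary between samples
    print(n, round(mean_by_n[n], 3), round(spread_by_n[n], 3))
```

At every sample size the sample proportions are centered near the parameter value 0.6, while their spread shrinks as the sample size grows.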

Activity 4-1 page 54 • Activity 4-2 page 57 • Exercise 4-18 page 73

Wrap Up
• Do not confuse the sample size with the number of samples taken in a study.
• The sample size is crucial in assessing how much a sample statistic varies from sample to sample.
• The size of the population, by contrast, does not affect the sampling variability.
• As long as the population is large relative to the sample size (at least 10 times as large), the precision of a sample statistic depends on the sample size and not on the population size.
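The claim about population size can be checked with a simulation sketch. Two hypothetical populations are assumed, one with 10,000 units and one with 1,000,000, each with half of its units "yes". Samples of the same size (100) are drawn repeatedly from each, and the spread of the sample proportions is compared.

```python
import random
import statistics

def spread_of_sample_proportion(pop_size, n, reps=1000, true_prop=0.5, seed=7):
    """Standard deviation of the sample proportion over repeated SRSs of size n."""
    rng = random.Random(seed)
    cutoff = int(pop_size * true_prop)  # units numbered below cutoff are 'yes'
    props = []
    for _ in range(reps):
        chosen = rng.sample(range(pop_size), n)  # SRS without replacement
        props.append(sum(unit < cutoff for unit in chosen) / n)
    return statistics.pstdev(props)

# Same sample size, very different population sizes (both at least 10 times n).
spread_small_pop = spread_of_sample_proportion(pop_size=10_000, n=100)
spread_large_pop = spread_of_sample_proportion(pop_size=1_000_000, n=100)

print("population 10,000:   ", round(spread_small_pop, 3))
print("population 1,000,000:", round(spread_large_pop, 3))
```

The two spreads come out nearly the same even though one population is 100 times larger, which is what the last bullet describes.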

Topic 5: Designing Experiments • SELF STUDY • QUIZ – Assignment 1
