Learn the basic definitions and concepts used in statistics, including data analysis methods, populations, samples, parameters, and statistics. Explore the differences between quantitative and qualitative data, and understand the importance of statistics in drawing conclusions. Master statistical analysis techniques to make informed decisions based on data.
Introduction to Statistics Section 1A William Christensen, Ph.D.
“If I have ever made any valuable discoveries, it has been owing more to patient attention, than to any other talent.” Isaac Newton (1642 - 1727) tells you how to succeed in this Statistics class.
What is Statistics? Two Meanings • A “Statistic” can be a Specific Number • A “Statistic” can be a Method of Analysis
A “Statistic” can be a Specific Number • A number that represents some measure of a set of data • Example: The average hourly wage in Washington County is $15 / hour • Example: 23% of the people polled believe there are too many polls
A “Statistic” can be a Method of Analysis • Methods of analysis have to do with: planning experiments, collecting data, organizing & summarizing data, analyzing & presenting data, and interpreting & drawing conclusions from data • Example: Linear regression is one type of statistical analysis that is used to examine the relationship between things (variables)
Data – factual information • factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation
Population - the complete collection (every member) of the things we are studying. • Example: If we are studying Grizzly Bears then the “population” is every Grizzly Bear everywhere. • DO NOT confuse “population” with a “sample”. We rarely have data on every member of a population, so we often use “statistics” (methods of analysis) to analyze a “sample” in order to understand things (make inferences) about the population. For example, we might study 12 grizzly bears (a “sample”) in order to make inferences about all grizzly bears (the “population”)
Census - the collection of data from every member of a population. • Although it is rare, due to the time and expense involved, to collect data from every member of a “population,” when and if we do, this collection of data is called a “Census”. • Example: Once every decade the U.S. government conducts a census, attempting to collect data from everyone who lives in the United States.
Sample – a subcollection of members drawn from a “population” and used to draw conclusions or make inferences about the population. • In statistics, the most common approach is to use data from a “sample” in order to make inferences or draw conclusions about the larger “population”. • Example: We might give an experimental drug to 1,000 people (our “sample”) in order to draw conclusions about what would happen if we made the drug available to all people everywhere (our “population”).
Parameter – a numerical measurement describing something about an entire “population”. • Example: If our “population” consists of all DSC students, then one “parameter” would be the average GPA of all DSC students. Another “parameter” would be the percentage of female students among all DSC students. As long as we are talking about the entire “population” we are interested in, any numerical measurement (e.g., average, percentage, etc.) that describes something about the entire population (not just a sample from the population) would be considered a “parameter”.
population parameter
Statistic – a numerical measurement describing something about a “sample” (see definition of sample). • Example: If our “sample” consists of 30 DSC students, then one “statistic” would be the average GPA of our sample of 30 DSC students. Another “statistic” would be the percentage of female students among our sample of 30 DSC students. As long as we are talking about only our sample, any numerical measurement that describes something about that sample (not the entire population) would be considered a “statistic”.
sample statistic
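To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The GPA values are randomly generated and purely hypothetical (they are not real student data), and the names `population`, `sample`, `parameter_mean`, and `statistic_mean` are chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: made-up GPAs for 5,000 students (not real data).
population = [round(random.uniform(1.5, 4.0), 2) for _ in range(5000)]

# Parameter: a number that describes the ENTIRE population.
parameter_mean = statistics.mean(population)

# Statistic: a number that describes only a SAMPLE (here, 30 students).
sample = random.sample(population, 30)
statistic_mean = statistics.mean(sample)

print(f"Population mean GPA (parameter): {parameter_mean:.3f}")
print(f"Sample mean GPA (statistic):     {statistic_mean:.3f}")
```

The two numbers will usually be close but not identical; that gap is why we need methods of analysis before drawing conclusions about a population from a sample.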
What is the difference between a “parameter” and a “statistic”? • If you don’t know, then GO BACK AND REVIEW
Quantitative Data – numbers that represent counts or measurements. • If you can express the data as a number then it is usually (not always) quantitative data • Example: income levels, ages, weights, and lengths can all be expressed as meaningful numbers. These are all examples of quantitative data. • Gender, opinions, and relationships cannot be expressed as numbers and ARE NOT quantitative data. • Even some numbers, such as zip codes or phone numbers ARE NOT quantitative data because you cannot mathematically manipulate them (e.g., add or subtract them) • Quantitative Data can be subdivided into two groups: • Discrete data • Continuous data
Quantitative Data • Discrete data – when the possible values are finite in number or ‘countable’ (they can be listed one by one, such as 0, 1, 2, 3, and so on) • Example: The number of eggs a chicken lays is “discrete”. You get 0, 1, 2, 3, or more eggs. The number of eggs is always a whole number. You can never get an in-between number of eggs, like 1.23654 eggs. • Continuous data – when the possible values cover a range without gaps, so there are infinitely many of them and they cannot be listed one by one. Scales that cover a range of values, without gaps, produce continuous data. • Example: a thermometer is an example of a scale without gaps that covers a range of temperatures. There can be an infinite number of temperatures between say 0 degrees and 120 degrees Fahrenheit. This is because you can have any number of in-between temperatures such as 101.54658 degrees.
Qualitative Data – information that can be put into categories or distinguished by some nonnumeric characteristic. Examples: • Gender (male/female) • Age categories (not individual ages, but age brackets) • Party affiliation (democrat/republican/independent) • Zip codes • Social security numbers
Qualitative (or categorical or attribute) data • can be separated into different categories that are distinguished by some nonnumeric characteristics (e.g., male / female)
Nominal data • Data that consists of names, labels, or categories • Cannot be arranged in any meaningful order • You cannot say that one value is bigger, better, or greater than any other value • Examples: • Gender (male/female) • Party affiliation (democrat / republican / independent) • Zip codes
Ordinal data • Data that may be arranged in some order, but the precise differences between values either cannot be determined or are meaningless • Examples: • Poor / Average / Good / Excellent • Letter grades ( A, B, C, D, F ) • Subcompact, Compact, Mid-size, and Full-size Automobiles
Interval data • Like ordinal data but with the additional property that the difference between any two data values is meaningful (evenly spaced). However, there is no natural zero starting point at which there is zero quantity. • Example: Calendar years. The difference between the year 2000 and the year 1990 is the same as the difference between the year 1990 and the year 1980 (a difference of 10 years in each case). We can add and subtract years. However, we CANNOT SAY that the year 2000 is two times (or twice as much time) as the year 1000. Also, the year 0 does not represent the starting point of time. Fahrenheit temperature is another example of an interval scale because 100 degrees F is NOT twice as hot as 50 degrees F, and 0 degrees F does NOT represent the absence of all heat.
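The Fahrenheit example can be checked with a little arithmetic. The sketch below is only an illustration: it converts both temperatures to Kelvin, a scale with a true zero (Kelvin is an addition here, not part of the slide), and compares the ratios. The function name `fahrenheit_to_kelvin` is made up for this example.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert Fahrenheit to Kelvin, a scale whose zero is absolute zero."""
    return (temp_f - 32) * 5 / 9 + 273.15

hot_f, cool_f = 100.0, 50.0

# Ratio taken directly on the interval scale: looks like "twice as hot".
naive_ratio = hot_f / cool_f

# Ratio on a scale with a true zero: only about 10% more.
kelvin_ratio = fahrenheit_to_kelvin(hot_f) / fahrenheit_to_kelvin(cool_f)

# Differences, by contrast, are meaningful on an interval scale.
difference_a = 100.0 - 90.0
difference_b = 60.0 - 50.0

print(f"Naive Fahrenheit ratio: {naive_ratio:.2f}")
print(f"Ratio in Kelvin:        {kelvin_ratio:.2f}")
print(f"Equal differences:      {difference_a == difference_b}")
```

Because the ratio changes when the zero point changes, ratios are not meaningful for interval data, while differences stay comparable.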
Ratio data • Like interval data with the addition that there is now an absolute or true starting point where zero truly means there is no quantity present. • Ratio data can be fully manipulated using mathematics. We can add/subtract or multiply/divide, or whatever. • Examples: • Money (e.g., prices of college textbooks). $50 is half of $100 and $0 is truly zero or no money. • Distance (e.g., miles from home to school). 20 miles is really twice as far as 10 miles and zero distance is truly no distance.
Ratio level of measurement – the interval level modified to include the natural zero starting point (where zero indicates that none of the quantity is present). For values at this level, differences and ratios are meaningful. • Example: Prices of college textbooks
Can you name and explain the 2 types of data and the 4 levels of data measurement? • If not then GO BACK AND REVIEW
“There are three kinds of lies: lies, damned lies, and statistics.” Benjamin Disraeli (1804-81) British Prime Minister
Why Statistics? • Statistics is used to explore and explain things not explainable by the physical sciences. Things like: • Human behavior • Useful for understanding marketing, business, consumer behavior, social science, psychology and politics • Nature • Useful for understanding natural phenomena and animal behavior • Medicine • Because we are all different, the physical sciences alone cannot fully explain how our bodies react to drugs and other stimuli, so statistics plays a valuable role in medicine
Why Statistics? • #1 Practical Reason • People will try to sell you all kinds of things using statistics, from vitamins to investments to political agendas • If you do not have a working knowledge of statistics you are fodder for the merciless.
Bad Statistics • The self-selected survey (voluntary response sample) • respondents themselves decide whether to be included • For example, you get an email asking you to respond to a survey on something (often in order to save the world from evil) • Studies that use small samples and/or samples that do not provide a true representation of the population being studied
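A small simulation can suggest why a voluntary response sample misleads. Everything in this sketch is an assumption made for illustration: the population size, the 30% share holding the opinion, and the reply probabilities (50% versus 5%) are invented, not taken from any study.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 people; about 30% hold the opinion.
population = [random.random() < 0.30 for _ in range(10_000)]
true_share = sum(population) / len(population)

def responds(holds_opinion):
    """Assumed behavior: people who hold the opinion reply far more often."""
    reply_probability = 0.50 if holds_opinion else 0.05
    return random.random() < reply_probability

# Self-selected sample: whoever chooses to reply.
volunteers = [person for person in population if responds(person)]
volunteer_share = sum(volunteers) / len(volunteers)

# Random sample of the same size, for comparison.
random_sample = random.sample(population, len(volunteers))
random_share = sum(random_sample) / len(random_sample)

print(f"True share in population:  {true_share:.1%}")
print(f"Share among volunteers:    {volunteer_share:.1%}")
print(f"Share in a random sample:  {random_share:.1%}")
```

Under these assumed reply rates the volunteers badly overstate the share, while a random sample of the same size lands close to the true value.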
Bad Statistics • Surveys with confusing or misleading questions • Here is an example of two different ways to ask the “same” question and how the phrasing of the question can alter the response. Is it really the same question being asked? • Should the president have the line item veto? (response was 57% yes) • Should the president have the line item veto to eliminate waste? (response was 97% yes)
Bad Statistics • Presentations that include misleading data and graphs • Take a look at the following graphs and pictures and see if you can find the deception in each • Hint: focus on the numerical information given, which is usually accurate, versus the general shape of the graph/picture which often misleads us.
[Figure: two bar charts of the same data. Bachelor’s Degree: $40,500; High School Diploma: $24,400. One vertical axis runs from 0 to $40,000; the other runs from $20,000 to $40,000.] Salaries of People with Bachelor’s Degrees compared to those with only High School Diplomas (same graph but using different starting points – how does this affect your perception?)
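The effect of the starting point can be put into numbers. This sketch assumes the second graph’s vertical axis starts at $20,000, as its tick labels suggest, and compares how tall the two bars are drawn in each version. The function name `drawn_height_ratio` is invented for this illustration.

```python
bachelor_salary = 40_500
high_school_salary = 24_400

def drawn_height_ratio(axis_start):
    """Ratio of the two bar heights as drawn when the axis starts at axis_start."""
    return (bachelor_salary - axis_start) / (high_school_salary - axis_start)

honest_ratio = drawn_height_ratio(0)          # axis starts at zero
truncated_ratio = drawn_height_ratio(20_000)  # assumed truncated axis

print(f"Axis from 0:       bachelor bar is {honest_ratio:.2f}x as tall")
print(f"Axis from $20,000: bachelor bar is {truncated_ratio:.2f}x as tall")
```

The salaries are identical in both graphs, yet the truncated axis makes the bachelor’s bar look several times taller instead of about two-thirds taller.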
The caption for this graph might suggest that savings have doubled (i.e., save twice as much). Yet, in fact, by doubling the length, width and height, the actual size (volume) has increased by a factor of eight!
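The factor of eight follows directly from the volume formula. A minimal check, using a unit box as a stand-in for the picture (the dimensions are arbitrary):

```python
def volume(length, width, height):
    """Volume of a rectangular box."""
    return length * width * height

original = volume(1, 1, 1)
doubled = volume(2, 2, 2)  # every dimension doubled

growth_factor = doubled / original
print(f"Volume grows by a factor of {growth_factor:.0f}")
```

Doubling all three dimensions multiplies the volume by 2 × 2 × 2, so a picture meant to show "twice as much" looks eight times as large.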
Bad Statistics • Using precise numbers that are not accurate • For example, if I gave you a number like 1257391.546 you might assume it was quite accurate because it seems so precise (not rounded). Yet, in fact, this number may have been generated from very poor/inaccurate data. Don’t be fooled, investigate the actual data.
Bad Statistics • Distorted percentages • For example, if an airline was losing 80% of the luggage they processed and they improved this to “only” losing 40%, they might claim something like a 100% improvement in baggage handling. This may sound very impressive and make you think they are doing a great job. Yet, in fact, if they are still losing 40% of passengers’ luggage they are actually doing a very poor job. Don’t fall for misleading statistics, especially when someone is trying to sell you something.
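The same change in the luggage figures can be described by several different percentages, depending on what you divide by. The sketch below works through them. How the airline would actually arrive at "100%" is not stated in the example; dividing by the new loss rate is one guess, labeled as such in the code.

```python
loss_before = 0.80  # share of luggage lost before
loss_after = 0.40   # share of luggage lost after

# Drop measured in percentage points.
point_drop = loss_before - loss_after

# Relative reduction in the loss rate (divide by the OLD rate).
relative_reduction = point_drop / loss_before

# Assumed source of the "100%" claim: divide by the NEW rate instead.
inflated_claim = point_drop / loss_after

print(f"Drop in percentage points: {point_drop * 100:.0f}")
print(f"Relative reduction:        {relative_reduction:.0%}")
print(f"Inflated 'improvement':    {inflated_claim:.0%}")
print(f"Luggage still lost:        {loss_after:.0%}")
```

Whichever percentage is advertised, the number that matters to a passenger is the last one: 40% of bags are still lost.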
Bad Statistics • Partial Pictures (not the whole story) • Example: An overseas automaker makes the (true) claim that “90% of all our cars sold in the USA in the last 10 years are still on the road.” • This claim is designed, of course, to make you think they have really good quality cars. And, in fact, their stated claim is true. However, what they don’t tell you is that they have only been selling cars in the USA for the last 3 years. • Watch out for misleading statistics, especially when someone is trying to sell you something.
Bad Statistics • Deliberate Distortions or outright lies • Unfortunately, there are people willing to tell outright lies and use statistics to make those lies appear legitimate. If you are interested in learning how to better detect these untruths, here are some references you can check out. • Tainted Truth by Cynthia Crossen • How to Lie with Statistics by Darrell Huff • The Figure Finaglers by Robert Reichard
What’s Wrong Here? • In a study of college campus crimes committed by students high on alcohol or drugs, a mail survey of 1875 students was conducted. A USA Today article noted, “8% of the students responding anonymously say they’ve committed a campus crime, and 62% of that group say they did so under the influence of alcohol or drugs.” • Hint: They never told us the actual number of students who responded to the survey. By telling us they sent it to 1875 students it makes it sound like they had a large sample, but in fact, they never said how many responded. What if only 5 or 6 students responded – would that be representative of the population of all college students? Also, they use a percentage of a percentage (62% of 8% = 5%), but it kind of fools you, at first, into thinking that 62% of college students committed a crime while under the influence, while in fact only about 5% of those responding to the survey said they committed a crime while under the influence. This actually appeared in USA Today, yet it is quite misleading. Perhaps you can find even more things wrong with this study (e.g., what constitutes a “crime”? Does it include parking tickets?).
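The "percentage of a percentage" in the hint can be worked out directly. In this sketch the 8% and 62% come from the quoted article, while the respondent counts of 50 and 500 are hypothetical, since the article never reported how many students replied.

```python
surveys_mailed = 1875
share_committed_crime = 0.08   # of respondents
share_under_influence = 0.62   # of those who committed a crime

# Share of ALL respondents who committed a crime under the influence.
overall_share = share_committed_crime * share_under_influence

def implied_counts(respondents):
    """Number of respondents behind each percentage for a given response count."""
    committed = respondents * share_committed_crime
    under_influence = committed * share_under_influence
    return committed, under_influence

print(f"62% of 8% = {overall_share:.1%} of respondents")

# Hypothetical response counts; only the number mailed (1875) is known.
for respondents in (50, 500, surveys_mailed):
    committed, under_influence = implied_counts(respondents)
    print(f"{respondents:>5} respondents -> about {committed:.0f} committed a crime, "
          f"about {under_influence:.0f} under the influence")
```

With only 50 respondents, the headline 62% would rest on about 2 or 3 students, which is why the missing response count matters.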
What’s Wrong Here? • The Newport Chronicle, a newspaper in New England, reported that pregnant mothers can increase their chances of having healthy babies by eating lobsters. That claim is based on a study showing that babies born to lobster-eating mothers have fewer health problems than babies born to mothers who don’t eat lobster. • Hint: In statistics we can “prove” a relationship between two things, in this case, healthy babies and lobster-eating mothers, but that DOES NOT mean that one causes the other. In fact, this study did show a statistical relationship, but as we will learn later in the course, that relationship does NOT IMPLY causality. Can you think of any reasons why lobster-eating mothers might have healthier-than-average babies besides the fact that they eat lobsters? One theory that might explain this relationship is that lobster is quite expensive and therefore, those that eat lobster are probably well-off and can probably afford the best health care. This might be a better explanation of the results than to suggest that eating lobsters is good prenatal care. Perhaps you can think of other reasons for these results.
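The wealth explanation in the hint can be illustrated with a simulation. All probabilities below are invented for the sake of the sketch. By construction, eating lobster has no effect on baby health here; wealth alone drives both lobster eating and health.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

def healthy_rate(rows):
    """Share of healthy babies among (wealthy, eats_lobster, healthy) rows."""
    return sum(healthy for _, _, healthy in rows) / len(rows)

mothers = []
for _ in range(50_000):
    wealthy = random.random() < 0.30
    # Assumed: wealthy mothers eat lobster more often.
    eats_lobster = random.random() < (0.60 if wealthy else 0.10)
    # Assumed: health depends on wealth only, NOT on lobster.
    healthy_baby = random.random() < (0.90 if wealthy else 0.70)
    mothers.append((wealthy, eats_lobster, healthy_baby))

def lobster_gap(rows):
    """Healthy rate of lobster eaters minus healthy rate of non-eaters."""
    eaters = [m for m in rows if m[1]]
    non_eaters = [m for m in rows if not m[1]]
    return healthy_rate(eaters) - healthy_rate(non_eaters)

overall_gap = lobster_gap(mothers)
gap_among_wealthy = lobster_gap([m for m in mothers if m[0]])
gap_among_others = lobster_gap([m for m in mothers if not m[0]])

print(f"Gap across all mothers:    {overall_gap:+.1%}")
print(f"Gap among wealthy mothers: {gap_among_wealthy:+.1%}")
print(f"Gap among other mothers:   {gap_among_others:+.1%}")
```

Lobster eaters show healthier babies overall, yet the gap nearly vanishes once wealthy and non-wealthy mothers are compared separately. A relationship appears even though, in this made-up world, lobster does nothing.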
What’s Wrong Here? • A survey includes this item: “Enter your height in inches _____” • What might be some of the problems in asking this question? • Hint: You can probably think of a number of reasons yourself. One problem might be that people usually think of their height in terms of feet and inches (e.g., 5’ 10”) and may have trouble figuring out how many inches that is. Also, many people tend to exaggerate their heights, so if you were really looking for accurate information, you might want to actually measure people rather than ask them how tall they are.
What’s Wrong Here? • True story: A researcher at the Sloan-Kettering Cancer Research Center was once criticized for falsifying data. Among his data were figures obtained from six groups of mice, with 20 individual mice in each group. These values were given for the percentage of successes in each group: 53%, 58%, 63%, 46%, 48%, 67%. • How did someone figure out that this researcher lied about their data? I’ll let you figure this one out on your own. You can check with me if you like to see if you got it.
Where do we get data? • Data usually comes from one of two sources: • Observation • Experiments
Data from Observation • The “observation” method of data collection suggests that we only observe what we are studying, without in any way trying to change or affect it. • For example, we might use the “observation” method of study by making a survey and asking people in a mall questions about what kind of pizza they like best and why. • Surveys and actually sitting there and watching things are types of “observation”
Data from an Experiment • An “experiment” requires that we apply some sort of treatment and then see what kind of effect we get. Experiments usually involve two groups: the treatment or test group, and the placebo or control group. (If you don’t understand what these or any other terms mean then you need to look them up in the dictionary.) • Example: we conduct an experiment to test the effectiveness of a new drug by giving the drug to one group (the treatment or test group), and giving a “sugar pill”, or something that looks like the drug but really isn’t any kind of drug at all, to another group called the placebo or control group. We then see if the test group did significantly better than the placebo group. • Experiments need to be carefully planned or designed in order to get accurate results.
Designing an Experiment • Steps in designing an experiment • Identify your objective and identify the relevant population • Collect (representative) sample data from your test group and placebo group • Use a random procedure in selecting subjects for your treatment and placebo group to avoid bias • Analyze the data and form conclusions
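The random-procedure step above can be sketched in a few lines. This is one simple way to do it (shuffle, then split in half), not the only valid design; the subject labels and the function name `assign_groups` are hypothetical.

```python
import random

def assign_groups(subjects, seed=None):
    """Randomly split subjects into a treatment group and a placebo group."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels for illustration.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
treatment_group, placebo_group = assign_groups(subjects, seed=42)

print("Treatment:", treatment_group)
print("Placebo:  ", placebo_group)
```

Because chance alone decides who lands in which group, the researcher cannot (even unintentionally) steer healthier or more motivated subjects into the treatment group.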
Controlling for Effects • When doing an experiment we must be careful to avoid any interference from things outside of what we are studying. • This basically means that the ONLY thing we want to be different between our test group and our control group is the treatment itself. We don’t want anything (physical or psychological) to interfere. • These “effects” or unwanted interference can be controlled through good experimental design