Math 1231: Spring 2014 Chapter 1, Part One Course Introduction 1.2: Statistical Thinking 1.3: Types of Data
Recent Real-World Statistics Average 2012 Wage for GA: $43,310 [U.S. Bureau of Labor Statistics, Dec. 2013]. 91% of American adults own a cellphone. 56% of American adults own a smartphone [Pew Internet Project, May 2013]. 22% of Americans think the new healthcare law will make their family’s healthcare situation better. 37% think the law will make their situation worse. [Gallup, 1/10/14].
More Real-World Statistics Gallup data from Jan.-July 2013: Among Americans who are “actively disengaged” from their job, 18% smoke. Among those who are not “actively disengaged” from their job, 15% smoke. Question: Are these percents “significantly different” from one another? Could we conclude that those who are “actively disengaged” are more likely to smoke?
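Whether a gap of 18% versus 15% is convincing depends heavily on how many people were surveyed, which the summary above does not state. As a rough preview of methods covered later in the course, the sketch below runs a standard two-proportion z-test for two hypothetical group sizes. The sizes 200 and 20,000 per group are made-up numbers, not Gallup's actual sample sizes, and the function name `two_proportion_z` is only illustrative.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two sample proportions."""
    # Pooled proportion: the overall rate if both groups really were the same.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical group sizes (NOT the real Gallup sample sizes).
z_small, p_small = two_proportion_z(0.18, 200, 0.15, 200)
z_large, p_large = two_proportion_z(0.18, 20000, 0.15, 20000)

print(f"n = 200 per group:    z = {z_small:.2f}, p-value = {p_small:.3f}")
print(f"n = 20000 per group:  z = {z_large:.2f}, p-value = {p_large:.2e}")
```

Under these assumptions the same 3-point gap is easily explained by chance with small groups but very hard to explain by chance with large ones. Even then, a significant difference would not by itself show that disengagement causes smoking.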
Links for Previous Examples
• BLS Salary Data: http://bls.gov/oes/current/oes_ga.htm#00-0000
• Pew Cellphone/Smartphone Ownership: http://www.pewinternet.org/Trend-Data-%28Adults%29/Device-Ownership.aspx
• Gallup Healthcare Law Survey: http://www.gallup.com/poll/166793/americans-say-health-law-harmful-helpful.aspx
• Gallup Smoking vs. Job Satisfaction Data: http://www.gallup.com/poll/164162/americans-hate-jobs-likely-smoke.aspx
Some Interesting Questions… Gallup’s poll on the healthcare law was conducted by phone between Jan. 3 - 4. Were you contacted for this poll? Gallup asked “only” 1020 American adults. There are more than 230 million total. Question: How can Gallup make such a claim about all American adults, using information from such a small percentage?
A Typical Problem The previous scenarios illustrate a very common situation in real-world statistics: We have a VERY LARGE set of individuals (all American adults, for example). It is EFFECTIVELY IMPOSSIBLE to gather information from every single individual. By gathering information from a small number of individuals, we can draw some conclusions about the VERY LARGE set.
Statistical Thinking Section 1.2
*** Basic Terminology ***
Population: The complete set of individuals that we would like to study (Gallup Healthcare Poll: All American adults; this is more than 230 million individuals!).
Data: Collections of observations about members of the population (Gallup: Each individual’s opinion about the healthcare law).
Sample: A subset of individuals for which we have obtained data. The Sample is a subset of the Population. (Gallup: 1020 adults, contacted by telephone during Jan. 3 - 4, 2014).
Descriptive / Summary Statistics • Goal: Take a LARGE amount of data and organize/summarize it in a useful way. • Example: Your current GPA summarizes your academic performance, without needing to see your entire transcript. • Is this a fair/accurate summary? It doesn’t matter; many people will use it anyway. • We will look at several different graphical and numerical summaries over the next few classes.
Statistical Inference Goal: Use (known) data from a sample in order to draw conclusions about the entire population. Example: Based on data from 1020 adults, Gallup claims that between 33% and 41% of all American adults believe that the new healthcare law will make their family’s situation worse. It may surprise you that such a small amount of data can be used to make a conclusion about a large population. We’ll discuss the underlying methods later in the course. For now: It is MORE IMPORTANT that the sample data are collected in an “appropriate” way, otherwise our methods will give inaccurate results.
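As a preview of those later methods, here is a minimal sketch of where a range like this can come from. It assumes a simple random sample and uses the textbook 95% margin-of-error formula for a proportion, z·sqrt(p(1−p)/n); the function name `margin_of_error` and the choice z = 1.96 are illustrative assumptions, not Gallup's documented procedure.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion,
    assuming a simple random sample (an idealization)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.37   # sample proportion who said "worse" (from the poll)
n = 1020       # number of adults surveyed (from the poll)

moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"margin of error: about {moe:.1%}")
print(f"interval: {low:.1%} to {high:.1%}")
```

Under these idealized assumptions the margin works out to roughly 3 percentage points, a little narrower than the 33% to 41% range quoted above. One plausible reason for the gap is that a pollster's published margin also reflects details of how the survey was designed and weighted, which this sketch ignores. Note also that the formula depends on the sample size n, not on the size of the population.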
Data: Important Considerations Context: What do the data represent? The same numbers can have completely different meanings/interpretations:
Data: Important Considerations Source: Where do the data come from? Who gathered the data? Who summarized or analyzed the data? Who sponsored or funded the research? Are those responsible for collecting/analyzing the data reliable? Is there any incentive to distort results and/or favor a particular type of result?
Data: Important Considerations Sampling Method: What process was used to choose the sample and collect data? Was sample selection limited to individuals who volunteered to provide data? Was sample selection limited to individuals who were convenient? Was data collection based on subjective judgment or ambiguous terminology? Example: Do you spend a lot of time studying?
Important Considerations Conclusions: What are the results of the statistical analysis/inference? What is the intended population? Are the results valid for the entire population? Can you restate results in a way that can be understood by someone with little or no knowledge of statistical terminology? Is there a cause-and-effect relationship, or merely a statistical relationship? (“Correlation does not imply causality”; see Chapter 10.)
Some Other Considerations Practical Implications: Are the conclusions useful or relevant in a real-world context? A “Statistically Significant” claim comes from analyzing the data using numerical methods, without any context (see the next slide). “Practically Significant” means useful or relevant to the real world. These are not necessarily the same thing!
Statistical Significance When doing statistical inference, there is always some degree of randomness in how we gather the sample data. If we wind up with results that are unlikely to occur by random chance, we say the results are statistically significant. Simple Example: How likely is it that a fair coin would come up heads in 95/100 flips?
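The coin example can be worked out exactly. The sketch below counts the equally likely sequences of 100 fair flips that contain exactly 95 heads, or at least 95 heads, using the binomial coefficient; the helper names `prob_exactly` and `prob_at_least` are illustrative.

```python
from fractions import Fraction
from math import comb

def prob_exactly(k, n):
    """Probability of exactly k heads in n flips of a fair coin."""
    return Fraction(comb(n, k), 2 ** n)

def prob_at_least(k, n):
    """Probability of k or more heads in n flips of a fair coin."""
    return sum(prob_exactly(i, n) for i in range(k, n + 1))

p_exact = prob_exactly(95, 100)
p_tail = prob_at_least(95, 100)

print(f"P(exactly 95 heads)  = {float(p_exact):.3e}")
print(f"P(at least 95 heads) = {float(p_tail):.3e}")
```

Both probabilities are astronomically small, so if we actually saw 95 heads in 100 flips, "the coin is fair and this happened by chance" would be a very hard explanation to defend. That is the reasoning behind calling a result statistically significant.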
Example: Class Attendance In analyzing grade data among Statistics students from a previous semester, the average course grade for students with “many” absences was 15 points (out of 100) less than the average for students with “few or no” absences. My Claim: Students with “many” absences tend to have lower course grades than students with “few or no” absences. Is this statistically and/or practically significant? Is frequent absence the cause of lower grades?
Types of Data Section 1.3
*** Parameters vs. Statistics *** The goal of statistical inference is to use sample data to draw conclusions about some VERY LARGE population. A parameter is a numerical value describing some aspect of the population. A statistic is a numerical value describing some aspect of a sample. The value of a statistic (computed from sample data) can be used to estimate the value of a parameter (almost always unknown).
Parameters vs. Statistics Example: I want to estimate the average height of all students currently in class. I choose four students “at random” and compute the average height for those four. Average class height: This is a parameter; its value is unknown to us (the population is the entire class). Average height for the group of four: This is a statistic (the sample consists of these four students). Question for later: Is it reasonable to claim that the sample average is “close” to the population average, based on our sample?
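The height example can be imitated with a short simulation. The heights below are invented purely for illustration (they are not real class data), and the seed is fixed only so the run is repeatable. In the sketch we can compute the parameter because we made up the whole population; in practice it would be unknown.

```python
import random
import statistics

# Hypothetical heights (in inches) for a class of 20 students: invented data.
heights = [61, 62, 63, 64, 64, 65, 66, 66, 67, 67,
           68, 68, 69, 70, 70, 71, 72, 73, 74, 76]

# Parameter: describes the whole population (the entire class).
population_mean = statistics.mean(heights)

# Statistic: describes a sample of four students chosen at random.
rng = random.Random(2014)          # fixed seed, for repeatability only
sample = rng.sample(heights, 4)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample: {sample}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

Changing the seed gives a different sample and usually a different sample mean, while the population mean stays fixed. How far a statistic tends to fall from the parameter is exactly the "question for later."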
Quantitative vs. Categorical Quantitative data consist of numbers that represent counts or measurements. All quantitative data is numerical, but not all numerical data is quantitative. Data with a unit of measurement (seconds, feet, pounds, dollars, etc.) is quantitative. Numerical data used as a label or range of values (Student ID Number, 20-25 years) is not quantitative.
Examples: Quantitative Data The University keeps the following quantitative data about each student:
• Grade Point Average
• Number of Credit Hours Completed
• Age
• Amount of money owed for tuition
Other examples?
Categorical Data Data that are not quantitative are called categorical. Non-numerical data must be categorical. Numerical data that serves to label or identify individuals are categorical (Example: Social Security Number). A useful guide: Would it make sense to consider an average value? If not, treat the data as categorical.
Examples: Categorical Data The University keeps the following categorical data about each student:
• Name
• Laker ID Number
• Date of Birth
• Gender
• Residency (“in-state” or “out-of-state”)
Other?
Discrete vs. Continuous Quantitative (number) data can be classified as: Discrete: Finitely many possible values, or infinitely many values with clearly-defined “next” and “previous” values. Discrete values can be put into a list. Continuous: Infinitely many values anywhere in a given range/interval, with no holes or gaps. A useful guide: Is it theoretically possible to make your measurements more accurate/precise? If so, then you probably have continuous data.
Examples: Discrete or Continuous?
• Number of siblings.
• Amount of time it takes to run one mile.
• Resting pulse rate (beats per minute).
• Distance you live from this building.
• Grade point average.
• Credit card balance (in dollars/cents).
Note: The answers may depend on how the data are measured and/or used.
Levels of Measurement An alternate way to classify data, based on what can be done to summarize/analyze it. There is some debate on how many levels are needed; these four are commonly used:
• Nominal (qualitative)
• Ordinal (ordering is meaningful)
• Interval (differences are meaningful)
• Ratio (ratios are meaningful)
Nominal Level Consists of names, labels, or well-defined categories. There is no meaningful way to order values (alphabetical is often used).
• Colors (Red, Green, Yellow, etc.)
• Gender (Female, Male)
• Party Affiliation (Democrat, Republican, Other)
• State of Residence
Nominal data is always categorical.
Ordinal Level Data can be arranged in some meaningful order, but differences between values cannot be computed or are useless. Course Grades (A, B, C, D, F) Competitive Rankings (Gold > Silver > Bronze, but “Gold minus Silver” is useless, even if we represent these as numbers 1, 2, 3). Ordinal data is often categorical (notable exceptions are IQ and Body Mass Index).
Interval Level Numerical values that can be put in order, and the difference between two values has some useful meaning. However, there is no “natural zero” level and ratios do not have any practical meaning. Examples: Temperature (Fahrenheit or Celsius): 15 is colder than 30, but zero degrees does not mean an absence of temperature (unless you use Kelvin). Calendar Data: Aug. 7th < Aug. 21st, with a difference of 14 days (but the 21st is not “three times” the 7th). Interval data is the least common of the four levels.
Ratio Level Numerical values that can be put in order, the difference between two values has meaning, and there is a natural, non-arbitrary “zero level.” Ratio data measures “amount of stuff.” The zero level means that “no stuff” is present. Distance, amount of time, mass/weight, many other physical quantities. Price, Checking account balance, many other monetary quantities. If “twice as much” or “half as much” make sense, then you have ratio data. Ratio data is always quantitative.
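One way to see the difference between the interval and ratio levels is to change units and check whether “twice as much” survives. The sketch below uses the standard Celsius-to-Fahrenheit and mile-to-kilometer conversions; the specific temperatures and distances are just sample values.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def miles_to_km(miles):
    return miles * 1.609344

# Interval level: temperature. The zero point is arbitrary.
cold_c, warm_c = 15.0, 30.0
cold_f, warm_f = celsius_to_fahrenheit(cold_c), celsius_to_fahrenheit(warm_c)

temp_ratio_c = warm_c / cold_c      # looks like "twice as hot" in Celsius
temp_ratio_f = warm_f / cold_f      # but not in Fahrenheit
temp_diff_c = warm_c - cold_c       # differences convert consistently
temp_diff_f = warm_f - cold_f       # (each Celsius degree is 9/5 Fahrenheit degrees)

# Ratio level: distance. Zero means "no distance" in every unit.
near_mi, far_mi = 2.0, 4.0
dist_ratio_mi = far_mi / near_mi
dist_ratio_km = miles_to_km(far_mi) / miles_to_km(near_mi)

print(f"temperature ratio: {temp_ratio_c:.2f} (C) vs {temp_ratio_f:.2f} (F)")
print(f"temperature difference: {temp_diff_c:.0f} C = {temp_diff_f:.0f} F")
print(f"distance ratio: {dist_ratio_mi:.2f} (mi) vs {dist_ratio_km:.2f} (km)")
```

The temperature ratio changes with the unit, so “30 degrees is twice as hot as 15 degrees” is not a meaningful statement, while the distance ratio is the same in miles and kilometers.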
Examples Classify the following data (about students):
• Age
• Year of birth
• Academic major
• Weight
• Transfer student? (yes/no)
• Currently seated in which row?
• SAT score
Examples Answers to the previous slide:
• Age: Quantitative, Discrete(?), Ratio
• Year of birth: Quantitative(?), Discrete, Interval
• Academic major: Categorical, Nominal
• Weight: Quantitative, Continuous, Ratio
• Transfer student?: Categorical, Nominal
• Current row?: Categorical, Ordinal
• SAT score: Quantitative, Discrete, Ordinal(?)