Misconception – Jessica Utts, UC Irvine • “Statistics is a boring subject and has little relevance in daily life, so it does not matter if you remember anything you learn about it.” – Anonymous Student (not from Stat 226!)
Statistics affects our lives in ways most people never realize. • Cost of insurance for your car • When new movies will be released • Whether a store has your size in stock • Best way to grow, process, ship and sell food • Risk assessment for a credit card company • Determining if email is spam (i.e. spam filters) • What type of calling plan for your phone will appeal to you and make money for the company • Which incoming students are most likely to succeed at ISU (based on High School GPA, ACT/SAT, etc.)
Chapter 1 Examining Distributions
Introduction • Statistics is the science of collecting, organizing, and interpreting data in the presence of variation • The fact that variation exists is the reason why you are taking this class (in addition to “my advisor told me to”) • Statistics aids us in finding the truth.
Introduction • Steps for Statistical Problem Solving • Question Formulation: Articulate a research question or a hypothesis to be tested • Data Production: Collect defensible and relevant data. • Data Summarization: Graph data and compute numerical summaries. • Statistical Inference: Draw conclusions about how results apply in a broader context. (We will talk about this in later chapters)
Definitions • Measurement: The value of a variable obtained and recorded on an individual • Examples: 145 recorded as a person’s weight; 65 recorded as the height of a tree; “purple” as the color you dyed your dog’s hair • Data is a set of measurements made on a group of individuals
The Three W’s • Any set of data is accompanied by background information that helps us understand the data • Three questions to ask when planning a statistical study or exploring data from someone else’s work 1) Who?: Individuals 2) What?: Variables 3) Why?: Purpose
Definitions • Individuals are the objects described by a set of data • Employees, lab mice, states… • A variable is any characteristic of an individual • Age, salary, weight, location… • The purpose is why you have the data (or why someone is giving you money to do research) • Cure cancer, marketing survey, time machine…
Two Types of Variables • A categorical variable places an individual into one of several groups or categories • A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
The distribution of a variable tells us what values the variable takes and how often it takes those values • Ex. • The way we visualize the distribution depends on the type of variable...
Recognizing the type • Age: • Gender: • Race: • Salary: • Job Type:
Section 1.1 Displaying Distributions with Graphs
Introduction • Statistical tools and ideas help us examine data in order to describe their main features • This is called exploratory data analysis • We first want to simply describe what we see • Basic strategy to help us organize our exploration • Step 1: Examine variables one by one, • Step 2: Look at the relationships among variables • In both steps, we begin by visualizing (with graphs/pictures), then we focus on specific aspects of the data using the appropriate numerical summaries
Categorical Variables • The values of a categorical variable are labels for the categories • examples: Gender → male or female Location → NE, SE, SW, NW or MW • The distribution of a categorical variable lists the categories and gives either the count (frequency) or percent (aka proportion or relative frequency) of individuals who fall into each category • Three types of graphs • Bar, Pie, and Pareto
Summary Table • We will summarize categorical data using a table. Note that proportions (a.k.a. percentages in decimal form) are often called Relative Frequencies.
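A summary table of this kind can be sketched in a few lines of Python. The degree labels below are made-up illustration data, not values from the course; the point is only to show how counts and relative frequencies relate.

```python
from collections import Counter

# Hypothetical degree categories for ten individuals (made-up data).
degrees = ["BS", "BA", "BS", "MS", "BS", "BA", "PhD", "BS", "MS", "BA"]

counts = Counter(degrees)       # count (frequency) of each category
total = sum(counts.values())    # number of individuals

# Relative frequency = count / total, a proportion between 0 and 1.
summary = {cat: (n, n / total) for cat, n in counts.items()}

print(f"{'Category':<10}{'Count':>6}{'Rel. freq.':>12}")
for cat, (n, p) in summary.items():
    print(f"{cat:<10}{n:>6}{p:>12.2f}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no individual was missed or counted twice.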
Bar Graph • The bar graph quickly compares the degrees of the four groups • The heights of the four bars show the counts for the four degree categories
Pie Chart • The pie chart helps us see what part of the whole each group forms • To make a pie chart, you must include all the categories that make up the whole • What if I don’t have color?
Pareto Chart • A bar graph whose categories are ordered from most frequent to least frequent is called a Pareto chart • Pareto charts identify the “vital few” categories that contain most of our observations • Many categories
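As a rough sketch, the ordering behind a Pareto chart can be reproduced in Python. The brand labels are made-up data, and the running cumulative percent is an added convenience (an assumption, not something the slide requires) that makes the “vital few” easier to spot.

```python
from collections import Counter

# Hypothetical brand preferences (made-up data).
brands = ["A", "C", "A", "B", "A", "D", "B", "A", "C", "A", "B", "E"]

counts = Counter(brands)
total = sum(counts.values())

# most_common() returns the categories ordered from most to least frequent,
# which is exactly the bar order of a Pareto chart.
ordered = counts.most_common()

pareto = []
cumulative = 0
for brand, n in ordered:
    cumulative += n
    pareto.append((brand, n, cumulative / total))

# Text version of the chart: one bar of '#' characters per category.
for brand, n, cum in pareto:
    print(f"{brand:<3}{'#' * n:<8}{n:>3}  cumulative {cum:.0%}")
```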
Brand Preference Example The distribution:
Summary for Categorical Variables • Bar graphs, pie charts, and Pareto charts help an audience grasp a distribution quickly • A bar graph is nearly always preferable to a pie chart because it is easier to compare bars than slices of pie • (although comparing pie is much tastier...) • These graphs are of limited use for data analysis because it is usually easy to understand categorical data on a single variable without a graph
Quantitative Variables: • Graphical Summary • Histogram • Stem plot • Time plot
Creating a Histogram • A histogram is the most common way to graph a quantitative variable • Step 1: • Divide the range of the data into classes of equal width • Be sure to specify the classes precisely so that each individual falls into exactly one class • Step 2: • Count the number of individuals in each class • Step 3: • Draw the histogram by making the heights of the bars for each class equal to the number of individuals that fall in that class
Creating a Histogram • The vertical axis contains the scale of counts (or percents), and each bar represents a class • The base of the bar covers the class, and the bar height is the class count
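The three steps can be sketched in Python as follows. The data values and the choice of four classes of width 1.0 are assumptions made for illustration only; the bars are drawn sideways as text so the sketch needs no plotting library.

```python
# Hypothetical quantitative data (made-up values).
values = [1.2, 1.7, 2.1, 2.4, 2.4, 2.9, 3.0, 3.3, 3.8, 4.6]

low, width, n_classes = 1.0, 1.0, 4   # assumed class layout

# Step 1: classes of equal width. Each class includes its lower bound and
# excludes its upper bound, so every value falls into exactly one class.
classes = [(low + i * width, low + (i + 1) * width) for i in range(n_classes)]

# Step 2: count the number of individuals in each class.
counts = [sum(1 for v in values if lo <= v < hi) for lo, hi in classes]

# Step 3: draw one bar per class whose length equals the class count.
for (lo, hi), n in zip(classes, counts):
    print(f"{lo:.1f} <= x < {hi:.1f} | {'#' * n} ({n})")
```

Using “lower bound included, upper bound excluded” is one common convention; what matters is that a value sitting exactly on a boundary, such as 3.0, is assigned to one class only.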
The bars of a histogram should cover the entire range of values of a variable, with no space between bars unless a class is empty. • When the possible values of a variable have gaps between them, extend the bases of the bars to meet halfway between two adjacent possible values. • Ex. pants sizes
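A small sketch of the “meet halfway” rule, assuming a hypothetical set of pants waist sizes that come only in even numbers. Extending the two outermost bars by half a gap is an assumption of this sketch so that all bars have the same width.

```python
# Hypothetical possible values: waist sizes sold only in even numbers.
sizes = [28, 30, 32, 34, 36]

# Interior bar edges sit halfway between adjacent possible values.
midpoints = [(a + b) / 2 for a, b in zip(sizes, sizes[1:])]

# Assumption: extend the outer bars by half the neighboring gap.
left_edge = sizes[0] - (sizes[1] - sizes[0]) / 2
right_edge = sizes[-1] + (sizes[-1] - sizes[-2]) / 2

edges = [left_edge] + midpoints + [right_edge]

# Each size gets a bar whose base runs from one edge to the next.
bars = list(zip(sizes, edges, edges[1:]))
for size, lo, hi in bars:
    print(f"size {size}: bar base from {lo} to {hi}")
```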
CAUTION!! • A few cautions about choosing classes: • Too few classes will give a skyscraper graph, with all values in a few classes with tall bars • Too many classes will produce a pancake graph, with most classes having one or no observations
Example continued… The classes are: • 1.0 ≤ rate < 1.5 • 1.5 ≤ rate < 2.0 • 2.0 ≤ rate < 2.5 • 2.5 ≤ rate < 3.0 • 3.0 ≤ rate < 3.5 • 3.5 ≤ rate < 4.0 • 4.0 ≤ rate < 4.5 • 4.5 ≤ rate < 5.0 • 5.0 ≤ rate < 5.5 • 5.5 ≤ rate < 6.0 • 6.0 ≤ rate < 6.5
Interpreting Histograms • The purpose of the graph is to help us understand the data • After you make a graph, always ask, “What do I see?” • Once you have displayed a distribution, you can see its important features
Interpreting Histograms • Examining a distribution • You can describe the overall pattern of a histogram by its Shape, Center, and Spread • An important kind of deviation is an outlier, an individual value that falls outside the overall pattern • Concentrate on the main features • look for rough symmetry or clear skewness • look for major peaks • look for clear outliers
Shapes • Symmetric: the right and left sides are approximately mirror images of each other • Skewed right: the right side extends farther out than the left side • Skewed left: the left side extends farther out than the right side
…Shape • Some types of data regularly produce distributions that are symmetric or skewed • symmetric example: • IQ scores • right skewed example: • Income • left skewed example: • year found on a penny • Some types of data produce distributions that are neither symmetric nor skewed • Stat 226 Test scores…
Center and Spread • Center • For now, we can describe the center by its midpoint • Where are most of the values found? • Spread • For now, we can describe the spread by giving the smallest and largest values
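A minimal sketch of these two descriptions in Python, using made-up values. It assumes that “midpoint” means the value with about half of the observations below it and half above it, which is what `statistics.median` computes.

```python
from statistics import median

# Hypothetical observations (made-up values).
values = [4.1, 3.2, 5.6, 2.8, 4.4, 3.9, 6.0, 4.7, 3.5]

# Center: the midpoint, assumed here to be the median.
center = median(values)

# Spread: for now, simply the smallest and largest values.
spread = (min(values), max(values))

print(f"Center (midpoint): {center}")
print(f"Spread: from {spread[0]} to {spread[1]}")
```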
Unemployment revisited… • Shape • Center • Spread • Outliers
Stem (and leaf) plots • For small data sets (fewer than 100 observations), a stemplot is quicker to make than a histogram and presents more detailed information about a quantitative variable • When the observed values have many digits, it is best to round the numbers to just a few digits before making a stemplot
Stem (and leaf) plots • Step 1--Order the data from least to greatest • Step 2--Separate each observation into a stem, consisting of all except the final digit, and a leaf, the final digit • Step 3--Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column • Step 4--Write each leaf in the row to the right of its stem, in increasing order out from the stem • Step 5--Label
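The steps above can be sketched in Python for non-negative whole numbers. The data are made up, the function name `stemplot` is hypothetical, and listing stems that have no leaves is a choice of this sketch so that gaps in the distribution stay visible.

```python
from collections import defaultdict

# Hypothetical two-digit observations (made-up data).
data = [23, 31, 27, 45, 38, 31, 29, 40, 36, 22]

def stemplot(observations):
    """Return the rows of a stemplot as strings, one row per stem."""
    rows = defaultdict(list)
    for x in sorted(observations):      # Step 1: order the data
        stem, leaf = divmod(x, 10)      # Step 2: stem = all but last digit
        rows[stem].append(leaf)         # Step 4: leaves arrive in order
    lines = []
    # Step 3: stems in a column, smallest at the top, including empty stems.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

plot = stemplot(data)
print("\n".join(plot))
print("Key: 2 | 3 means 23")            # Step 5: label the plot
```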
Unemployment Rate: Stemplot:
You can also split stems to double the number of stems when all the leaves would otherwise fall on just a few stems • each stem then appears twice • leaves 0 to 4 go on the upper stem • leaves 5 to 9 go on the lower stem • The greater number of stems might give a clearer picture of the distribution
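Splitting stems can be sketched the same way. This is a self-contained illustration with made-up data and a hypothetical function name `split_stemplot`; it assumes non-negative whole numbers.

```python
# Hypothetical two-digit observations (made-up data).
data = [23, 31, 27, 45, 38, 31, 29, 40, 36, 22]

def split_stemplot(observations):
    """Return stemplot rows where every stem appears twice."""
    rows = {}
    for x in sorted(observations):
        stem, leaf = divmod(x, 10)
        half = 0 if leaf <= 4 else 1    # 0 = upper stem (leaves 0 to 4),
                                        # 1 = lower stem (leaves 5 to 9)
        rows.setdefault((stem, half), []).append(leaf)

    lines = []
    stem, half = min(rows)
    last = max(rows)
    # Walk through every half-stem from the first to the last, so that
    # half-stems without leaves still show up as empty rows.
    while (stem, half) <= last:
        leaves = "".join(str(leaf) for leaf in rows.get((stem, half), []))
        lines.append(f"{stem} | {leaves}")
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return lines

split_plot = split_stemplot(data)
print("\n".join(split_plot))
```

With these ten values, three ordinary stems become six rows, which spreads the leaves out and can make the shape easier to read.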
…Unemployment split-stem stemplot
Histogram vs. Stemplot • Advantages • Stemplot: Able to see individual data values • Histogram: Quicker and neater with a large data set • Disadvantages • Stemplot: With a large number of data points, the plot quickly becomes messy • Histogram: Not able to see individual data values
Complete Histogram Example • Henry Cavendish (1798) • 29 measurements of the density of the earth as a multiple of the density of water Ordered data:
Step 2 • Class Count • 1) 1 • 2) 0 • 3) 0 • 4) 1 • 5) 1 • 6) 13 • 7) 9 • 8) 4
Step 3 • Shape: • Center: • Spread: • Outliers: • Histogram:
Split-Stem plot (Notice: These values have been rounded to one decimal place)
Time Plots • Many variables are measured at intervals over time: --Closing stock prices (each day) --Number of hurricanes (each year) --Unemployment rates