Stat 226 Note Set 1 James D. Abbey
Statistics? • The age-old question of all courses: What is ______? For our purposes, we need to know what statistics does for us. Perhaps we will have an answer by the end of the course! • Your text's definition: Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. What does this definition mean to us?
The inevitable terms • As with any study, we must learn certain terms. A few early ones: • Individuals: The objects described by a set of data • A person, a monkey, a person who acts like a monkey, etc. • Variables: A characteristic of an individual or set of individuals • Purchasing budget, a demographic, etc.
Definitions of definitions! • Now that we know what a variable represents, we need to know what types of variables we will encounter: • Categorical: A categorical variable differentiates individuals into categories • City, suburb or rural locations • Coke or Pepsi? (What about Mountain Dew!) • Quantitative: A quantitative variable has an intuitive numerical value • Budget amounts in dollars, total time to complete a project in days, etc.
Analyzing the variables • We need to see the distribution of our variables. • Raw data? 76, 77, 78, 82, 83, 84, 85, 86, 86, 88, 89, 92, 93, 95, 96, 96, 98, 99, … Not easy to see patterns. • Graphical analysis?
Analyzing Distributions • Describing a distribution (exploratory analysis) • Method 1: Look at each variable on its own, then expand to comparisons among multiple variables • Jump straight to the data. Works well for very simple data sets. • Method 2: Start with graphs of the data. Next, examine specific numerical summaries • Use graphs to tame the data and give intuition about variables/topics worth investigating.
Graphical Analysis • Categorical: • Bar graph: One bar per category, with height equal to the count or percent within that category • Pareto graph: A reorganized bar graph with categories ordered from greatest to least relative frequency • Pie graph: Slices of pie represent the percent or count for each category • Quantitative: • Bar graph: Use the bar graph to give counts within pre-specified ranges of the quantitative variable • Also known as a histogram
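The counts behind a bar graph and a Pareto graph can be sketched in a few lines of Python. This is only an illustration: the responses are made up, and the helper name `pareto_table` is hypothetical rather than something from the course or from JMP.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration, not course data).
responses = ["Coke", "Pepsi", "Coke", "Mountain Dew", "Coke", "Pepsi"]

def pareto_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() sorts by count, largest first: the Pareto ordering.
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]

for category, count, percent in pareto_table(responses):
    # One '#' per individual gives a crude text bar graph.
    print(f"{category:<13}{'#' * count:<5}{count:>3} {percent:5.1f}%")
```

Dropping the sort (iterating over `counts.items()` instead) would give the plain bar graph; the percent column is what a pie graph would display as slices.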
Categorical Graphical Analysis • Bar Graph: • Analyze > Distribution
Categorical Graphical Analysis • Pareto Graph: Graph > Pareto Plot
Categorical Graphical Analysis • Pie Graph: Graph > Chart
Quantitative Graphical Analysis • The Histogram • Steps • 1) Divide the individuals into non-overlapping, equal width groups • 2) Count the number of individuals in each group • 3) Break out the old, dusty drawing skills to draw the histogram (or use a computer!)
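Steps 1 and 2 can be sketched in Python as follows. The function name `histogram_counts` and the choice of groups (width 5, starting at 75) are assumptions made for this example; the data are only the values listed on the raw-data slide, which is itself truncated.

```python
def histogram_counts(data, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width groups.

    Group i covers start + i*width <= x < start + (i+1)*width,
    so the groups never overlap.
    """
    counts = [0] * n_bins
    for x in data:
        i = int((x - start) // width)
        if 0 <= i < n_bins:          # ignore values outside the chosen range
            counts[i] += 1
    return counts

# Only the values shown on the raw-data slide (the original list is truncated).
scores = [76, 77, 78, 82, 83, 84, 85, 86, 86, 88, 89, 92, 93, 95, 96, 96, 98, 99]

start, width, n_bins = 75, 5, 5      # assumed groups: 75-79, 80-84, ..., 95-99
counts = histogram_counts(scores, start, width, n_bins)
for i, n in enumerate(counts):
    low = start + i * width
    # Step 3, text style: one '*' per individual in the group.
    print(f"{low}-{low + width - 1}: {'*' * n}")
```

Changing `width` is the quickest way to see how the choice of groups changes the picture.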
The Histogram • Notes before we see a histogram • The base or horizontal axis (x-axis) contains the pre-defined categories of individuals • The vertical axis (y-axis) contains the count or relative frequency of the individuals within each category • Caution: Avoid histograms with all the individuals in one category (skyscrapers) or with individuals spread among too many categories (the wide blob)
The Histogram • The data: 1,1,1,1,2,2,3,3,3,3,3,4,4,4 • JMP output from Analyze > Distribution • Figure labels: good enough; bad (too wide)
Another Histogram • Histogram of Unemployment Data on pages 10-11 of text (ex 1.3)
Histograms show us what? • Histograms show us three prime attributes • Shape: Symmetric, Skewed • Symmetric | Skewed Right | Skewed Left (minor) • Skewed right means the right tail (the larger values) extends much farther out than the left tail • Skewed left means the left tail (the smaller values) extends much farther out than the right tail • Center: Where the data clump most heavily • Spread: How low and how high the data values go • Outliers also fall under the category of spread
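Center and spread can also be checked numerically once the graph has been examined. The sketch below uses the small data set from the earlier histogram slide; the helper name `describe` is hypothetical, and reporting both the median and the mean as "center" is a choice made for this example, not something prescribed on the slides. Shape is still best judged from the graph itself.

```python
from statistics import mean, median

def describe(data):
    """Return simple numerical companions to a histogram's center and spread."""
    return {
        "median": median(data),   # one measure of center
        "mean": mean(data),       # another measure of center
        "low": min(data),         # spread: smallest value
        "high": max(data),        # spread: largest value
    }

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4]
summary = describe(data)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```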
Identifying the traits • Unemployment Data (pgs 10-11)
Keeping the data: Stemplots • The stemplot keeps the data visible while giving the benefits of a histogram • However, you should only use stemplots for data sets with fewer than 100 observations • If the values have many digits, you may wish to round them to a few significant digits • Ex: 9.54234, 10.12341 become 9.5, 10.1
The Stemplot • Creating a stemplot • 1) Split each individual/observation into • A stem: all but the final digit • A leaf: the final digit • 2) List the stems vertically, starting with the smallest value at the top • 3) Place the leaves in the row for their stem, in increasing order
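The three steps can be sketched in Python. This is a rough illustration under stated assumptions: the function name `stemplot` is hypothetical, the values are made up, and the code assumes non-negative values that are rounded to one decimal place so that the leaf is the tenths digit.

```python
from collections import defaultdict

def stemplot(values):
    """Build stemplot rows for non-negative values, rounded to one decimal.

    Stem = all but the final digit, leaf = the final digit.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # 9.54234 -> 95, 10.12341 -> 101
        stem, leaf = divmod(tenths, 10)   # 95 -> stem 9, leaf 5
        rows[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, including empty ones,
    # so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical values, made up for illustration.
rates = [1.5, 2.0, 2.3, 2.7, 3.1, 3.1, 5.2]
print("\n".join(stemplot(rates)))
```

Because the values are sorted before they are split, the leaves in each row come out in increasing order automatically.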
The Stemplot • The unemployment data • The stem is the ones digit while the leaf is the tenths digit • Breakdown: • 1.5% becomes 1|5 • 2.0% becomes 2|0
Expanding the Stemplot • The split-stemplot • The last stemplot had some very long categories (not good) • So, we can split the stems into sub-categories • When split, each stem will appear twice • Ex: • 1.0 to 1.9 becomes 1.0 to 1.4 and 1.5 to 1.9 • 2.0 to 2.9 becomes 2.0 to 2.4 and 2.5 to 2.9 • Etc.
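A split stemplot needs only one extra decision per observation: whether its leaf belongs to the low half (0 to 4) or the high half (5 to 9) of its stem. A minimal sketch, again assuming non-negative values rounded to one decimal; the function name `split_stemplot` and the data are hypothetical.

```python
def split_stemplot(values):
    """Stemplot rows where each stem appears twice: leaves 0-4, then 5-9."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)
        half = 0 if leaf <= 4 else 1          # 0 = low half, 1 = high half
        rows.setdefault((stem, half), []).append(leaf)
    first, last = min(rows), max(rows)        # first and last non-empty rows
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in (0, 1):
            # Skip empty rows before the first or after the last observation,
            # but keep empty rows in between so gaps stay visible.
            if first <= (stem, half) <= last:
                leaves = "".join(str(leaf) for leaf in rows.get((stem, half), []))
                lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical values, made up for illustration.
rates = [1.5, 2.0, 2.3, 2.7, 3.1, 3.1]
print("\n".join(split_stemplot(rates)))
```

With these values the 1.0 to 1.4 row is empty and sits before the first observation, so only the 1.5 to 1.9 row is printed for stem 1.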
The split stemplot • Split stemplot of the unemployment data • The category 1.0 to 1.9 became 1.5 to 1.9 only (1.0 to 1.4 was empty) • 2.0 to 2.9 became 2.0 to 2.4 and 2.5 to 2.9
Examining Quantitative Data • Henry Cavendish's measurements of the density of the earth, as a multiple of the density of water. • The ordered data
Cavendish does dirt (cont.) • Step 1? Categorize!
Cavendish plays with dirt (cont) • Step 2? Count the individuals in each category!
Cavendish’s dirt again • Step 3: Graphical Analysis. Histogram or stemplot time • The histogram: • Shape: Skewed left • Center: 5.46 • Spread: 4.07 to 5.86 • Is 4.07 an outlier? Perhaps…
Cavendish’s dirt on a stemplot • Time for a stemplot! • What’s wrong with this stemplot?
Cavendish’s dirt on a stemplot • A better stemplot (split stemplot): • Aside from the split of the numbers, what else is different in this stemplot?
An overview of Time Plots • At times, data are based on time. For example: • Stock prices at different times of the day or at closing across days • Temperatures in Ames, Iowa by day across an entire year • Your GPA by semester until you finish college
Time plots defined • Time plot traits: • The x-axis (horizontal) is always some unit of time (days, weeks, semesters, etc.) • The y-axis (vertical) is the variable of interest with an appropriate scale • The customary time plot has connected points to better see trends • If the intervals are consistent, then you have a time series (a fairly complex area of statistics)
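The "consistent intervals" condition is easy to check before treating a time plot as a time series. A small sketch, assuming daily observations; the function name `is_evenly_spaced` and the prices are made up for illustration.

```python
from datetime import date

def is_evenly_spaced(times):
    """True when every gap between consecutive time points is the same."""
    gaps = {later - earlier for earlier, later in zip(times, times[1:])}
    return len(gaps) == 1

# Hypothetical closing prices on five consecutive days (not real data).
days = [date(2024, 1, d) for d in (1, 2, 3, 4, 5)]
prices = [10.0, 10.4, 10.1, 10.9, 11.2]

# A time plot is these (time, value) pairs in time order:
# time on the x-axis, the variable of interest on the y-axis.
for day, price in sorted(zip(days, prices)):
    print(day.isoformat(), price)

print("Consistent intervals:", is_evenly_spaced(days))
```

Connecting consecutive points with line segments, as the customary time plot does, is then a job for whatever plotting tool is at hand.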
Time plots (and time series) • In time series, we may find seasonal patterns • Shopping revenues spike during the holiday seasons • Time series may also show trends • Energy prices over the last decade show a general uptrend
Time plots • A time plot of seasonally adjusted unemployment data from 1948 to 1993:
Section 1.1 Summary • Datasets contain: • Individuals: Objects of data including people, prices or any object of interest • Variables: A characteristic of an individual or set of individuals such as a budget or salary • Variables: • Categorical: Places individuals into categories such as undergraduate or graduate student • Quantitative: A variable with a native numerical measure such as price of gasoline
Section 1.1 Summary • Exploratory data analysis (Graphical): • Categorical: Bar graph, Pareto graph, Pie graph • Quantitative: Histogram, Stemplot • When examining graphics, look for • Shape, center and spread, as well as outliers. • Remember to use time plots for data that occur over time (e.g., stock prices at closing over the past month)