STATISTICS. Summarizing, Visualizing and Understanding Data. I. Populations, Variables, and Data. Populations and Samples.
STATISTICS Summarizing, Visualizing and Understanding Data
Populations and Samples To a statistician, the population is the set or collection under investigation. Individual members of the population are not usually of interest. Rather, investigators try to infer with some degree of confidence the general features of the population.
Examples • Students currently enrolled at a certain university. • Registered voters in a certain Congressional district. • The population of large-mouthed bass in a certain lake. • The population of all decay times of a radioactive isotope.
Statistical Inference • Drawing and quantifying the reliability of conclusions about a population from observations on a smaller subset of the population. • Sample: The subset observed.
Variables and Data • A population variable is a descriptive number or label associated with each member of a population. • The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population. • Values of variables that have been recorded for a population or a sample from a population constitute data.
Types of Data • Nominal variables are variables whose values are labels. • Ordinal variables are variables whose values have a natural order. • Interval variables have values represented by numbers referring to a scale of measurement. • Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.
Guess the Type • Age • Questionnaire responses: 1 = "strongly agree", 2 = "agree", …, 5 = "strongly disagree" • Letter grades • Reading comprehension scores • Gender • Zip codes • Molecular velocities
Location Measures (Measures of Central Tendency) A location measure or measure of central tendency for a variable is a single value or number that is taken as representing all the values of the variable. Different location measures are appropriate for different types of data.
The Mean • For interval or ratio variables x • N individuals in the sample or population • $x_i$ = value of x for the ith individual • Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ The mean of a population variable is denoted by μ (the Greek letter mu).
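The Mean with Repeated Values • Distinct values of x: $v_1, v_2, \dots, v_k$ • $n_j$ = frequency of occurrence of $v_j$ • Mean: $\mu = \frac{1}{N}\sum_{j=1}^{k} n_j v_j$
The Mean with Repeated Values • Relative frequencies: $f_j = n_j / N$ • Mean: $\mu = \sum_{j=1}^{k} f_j v_j$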
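The three ways of writing the mean can be checked against each other numerically. The sketch below is only an illustration: the data list is made up, and the names `v_j`, `n_j` follow the notation assumed for distinct values and their frequencies.

```python
from collections import Counter
from math import isclose

# Illustrative data with a repeated value (not taken from the slides' tables).
x = [-2.0, 1.5, 3.1, 3.1, 3.1]
N = len(x)

# Definition: sum every individual's value and divide by N.
mean_direct = sum(x) / N

# Frequencies: each distinct value v_j is weighted by its count n_j.
counts = Counter(x)
mean_freq = sum(n_j * v_j for v_j, n_j in counts.items()) / N

# Relative frequencies: each distinct value is weighted by f_j = n_j / N.
mean_rel = sum((n_j / N) * v_j for v_j, n_j in counts.items())

print(mean_direct, mean_freq, mean_rel)
```

All three expressions agree up to floating-point rounding, because the frequency forms only group identical terms of the original sum.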
The Median • Informally, the "middle" value when all the values are arranged in order • A number m is a median of x if at least half the individuals i in the population have $x_i \le m$ and at least half of them have $x_i \ge m$
The Median – Example 1 • x: -2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions) • median(x) = 2.2
The Median – Example 2 • x: -2.0, 1.5, 3.1, 3.1, 3.1 • median(x) = 3.1
The Median – Example 3 • x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1 • median(x) = any number in [3.1, 5.7] • By convention, for an even number of individuals choose the midpoint between the smallest and largest medians, e.g., $(3.1 + 5.7)/2 = 4.4$
Example • Change 7.1 to 71. What happens to the mean and the median? • The mean changes from 3.55 to 14.2 • No change in the median • The median is much less sensitive to outliers (which may be mistakes in recording data)
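The effect of the outlier can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the six values from Example 3. Note that `statistics.median` applies the midpoint convention for an even number of values.

```python
from statistics import mean, median

x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]           # data from Example 3
x_outlier = [-2.0, 1.5, 3.1, 5.7, 5.9, 71.0]  # 7.1 mistyped as 71

# The mean moves a long way; the median does not move at all.
print("mean:  ", round(mean(x), 2), "->", round(mean(x_outlier), 2))
print("median:", median(x), "->", median(x_outlier))
```

Only the largest value changed, so the two middle values (3.1 and 5.7) that determine the median are untouched.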
The Median for Ordered Categories N=100. The median grade is B-.
The Mode • The data value with the greatest frequency • Not useful for interval or ratio data recorded with high precision, since most values then occur only once • The only useful location measure for strictly nominal data
Example The modes are B and B-.
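A data set can have more than one mode, as in the example above. The grade table from the slide is not available here, so the sketch below uses a hypothetical list of grades chosen so that two grades tie for the greatest frequency. `statistics.multimode` (Python 3.8+) returns every tied value.

```python
from statistics import multimode

# Hypothetical grade list (NOT the slide's data): B and B- each occur 5 times.
grades = ["A"] * 3 + ["B"] * 5 + ["B-"] * 5 + ["C"] * 2

# multimode returns all values that share the greatest frequency.
modes = multimode(grades)
print(modes)
```

Because the mode only counts occurrences, it works for labels with no numeric meaning at all.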
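Cumulative Frequencies and Percentiles • x is an interval or ratio variable. • Ordered distinct values: $v_1 < v_2 < \dots < v_k$ • Relative frequencies: $f_j = n_j / N$, where $n_j$ is the frequency of $v_j$
Cumulative Frequencies and Percentiles • Cumulative frequencies: $N_j = n_1 + n_2 + \dots + n_j$ • Cumulative relative frequencies: $F_j = f_1 + f_2 + \dots + f_j = N_j / N$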
Exercise From the table above, what fraction of the data is less than 1? What fraction is greater than 3? What fraction is greater than or equal to 3?
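Questions of this kind can be answered directly from cumulative relative frequencies. The slide's table is not reproduced in this transcript, so the sketch below uses made-up integer data. It only illustrates the bookkeeping and is not a solution to the exercise.

```python
from collections import Counter
from itertools import accumulate

# Made-up data (NOT the table from the slides).
x = [0, 1, 1, 2, 3, 3, 3, 4]
N = len(x)

counts = Counter(x)
values = sorted(counts)                      # ordered distinct values
freqs = [counts[v] for v in values]          # frequencies
cum_freqs = list(accumulate(freqs))          # cumulative frequencies
cum_rel = [c / N for c in cum_freqs]         # cumulative relative frequencies

# Look up the cumulative relative frequency at each distinct value.
F = dict(zip(values, cum_rel))

frac_lt_1 = F[0]        # "less than 1" means "at most 0" for this data
frac_gt_3 = 1 - F[3]    # "greater than 3" is the complement of "at most 3"
frac_ge_3 = 1 - F[2]    # "at least 3" is the complement of "at most 2"
print(frac_lt_1, frac_gt_3, frac_ge_3)
```

The difference between "greater than 3" and "greater than or equal to 3" is exactly the relative frequency of the value 3.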
Percentiles • x: an interval or ratio variable • A number a is a pth percentile of x if at least p% of the values of x are less than or equal to a and at least (100 - p)% of the values of x are greater than or equal to a. • The 25th percentile is called the first quartile of x and the 75th percentile is the third quartile of x. • The 50th percentile is the second quartile or median.
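The definition can be turned into a direct check. The function name `is_percentile` is an assumption made for this sketch. It counts values on each side of a candidate number and tests both conditions, using integer comparisons to avoid rounding problems.

```python
def is_percentile(a, p, xs):
    """Return True if a is a p-th percentile of xs under the definition above."""
    n = len(xs)
    at_or_below = sum(1 for v in xs if v <= a)
    at_or_above = sum(1 for v in xs if v >= a)
    # "at least p%" and "at least (100 - p)%", written without division.
    return 100 * at_or_below >= p * n and 100 * at_or_above >= (100 - p) * n


x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]  # data from The Median, Example 3

print(is_percentile(4.4, 50, x))  # the midpoint convention gives a valid median
print(is_percentile(3.1, 50, x))  # so does any number in [3.1, 5.7]
print(is_percentile(5.8, 50, x))  # but not a number outside that interval
print(is_percentile(1.5, 25, x))  # 1.5 is a first quartile of this data
```

The check makes it visible that a percentile need not be unique: every number in an interval may satisfy the definition.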
Example For the weather person’s errors, the 25th percentile is 3. The 50th percentile and third quartile are both 4.
Measures of Variability Statisticians are not only interested in describing the values of a variable by a single measure of location. They also want to describe how much the values of the variable are dispersed about that location.
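Population Variance and Standard Deviation • x: an interval or ratio variable. • N = number of individuals in population. • Variance of x: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$, where $\mu$ is the population mean • Standard deviation of x: $\sigma = \sqrt{\sigma^2}$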
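Sample Variance and Standard Deviation • n: the number of individuals in a sample from a population • Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean • Sample standard deviation: $s = \sqrt{s^2}$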
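Alternative Formulas for the Variance • Using frequencies: $\sigma^2 = \frac{1}{N}\sum_{j=1}^{k} n_j (v_j - \mu)^2$, where $n_j$ is the frequency of the distinct value $v_j$ • Using relative frequencies: $\sigma^2 = \sum_{j=1}^{k} f_j (v_j - \mu)^2$, where $f_j = n_j / N$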
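The variance formulas can be compared on a small example. This sketch assumes the usual conventions (divide by N for a population, by n - 1 for a sample) and checks the hand-written sums against `statistics.pvariance` and `statistics.variance`. The data list is illustrative only.

```python
from collections import Counter
from math import isclose, sqrt
from statistics import pvariance, variance

x = [-2.0, 1.5, 3.1, 3.1, 3.1]  # illustrative data with a repeated value
N = len(x)
mu = sum(x) / N

# Population variance from the definition (divide by N).
pop_var = sum((xi - mu) ** 2 for xi in x) / N

# Same quantity using frequencies n_j of the distinct values v_j.
counts = Counter(x)
pop_var_freq = sum(n_j * (v_j - mu) ** 2 for v_j, n_j in counts.items()) / N

# Same quantity using relative frequencies f_j = n_j / N.
pop_var_rel = sum((n_j / N) * (v_j - mu) ** 2 for v_j, n_j in counts.items())

# Treating x as a sample instead: divide by n - 1.
sample_var = sum((xi - mu) ** 2 for xi in x) / (N - 1)

print(pop_var, sqrt(pop_var))        # population variance and standard deviation
print(sample_var, sqrt(sample_var))  # sample variance and standard deviation
```

The sample variance is always slightly larger than the population formula applied to the same numbers, and the gap shrinks as n grows.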
The Interquartile Range • Q1, Q3: 1st and 3rd quartiles, respectively • Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$ • Not influenced by a few extremely large or small observations (outliers)
The Range • The difference between the largest data value and the smallest • Range of sample values is not a reliable indicator of the range of a population variable
Pie Charts (Circle Graphs) Sources: AT&T (1961), The World's Telephones; R: A Language and Environment for Statistical Computing, R Core Development Team.
Pros and Cons • Bar chart has a scale of measurement – more precise information • Pie chart gives more vivid impression of relative proportions, e.g., obvious at a glance that N. America had more than half the telephones in the world.
Stemplots (Stem and Leaf Diagrams)
Grades of 50 students on a test
Stem | Leaves                  | Cumulative Frequency
   4 | 7                       |  1
   5 | 448889                  |  7
   6 | 34789                   | 12
   7 | 012234455666888889999   | 33
   8 | 0022234457799           | 46
   9 | 0457                    | 50
Find the Median • In the stemplot above, N = 50, so the median is the midpoint of the 25th and 26th values. • The cumulative frequency is 12 at the end of stem 6, so these are the 13th and 14th leaves on stem 7, both of which are 8. • Median = 78
Exercise • Find the quartiles from the stemplot above. • The 1st quartile is 70 and the 3rd quartile is 82.
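The median and quartiles read off the stemplot can be verified by rebuilding the 50 grades from the stems and leaves. The helper `percentile` below is a sketch written for this check: it returns the midpoint of the smallest and largest numbers that satisfy the percentile definition given earlier, which is one convention among several used by statistical software.

```python
# Stems and leaves copied from the stemplot of 50 test grades.
stems = {
    4: "7",
    5: "448889",
    6: "34789",
    7: "012234455666888889999",
    8: "0022234457799",
    9: "0457",
}
grades = sorted(10 * stem + int(leaf) for stem, leaves in stems.items() for leaf in leaves)


def percentile(xs, p):
    """Midpoint of the smallest and largest p-th percentiles (integer p, 0 < p < 100)."""
    xs = sorted(xs)
    n = len(xs)
    k = -(-p * n // 100)          # ceil(p * n / 100): values needed at or below
    m = -(-(100 - p) * n // 100)  # ceil((100 - p) * n / 100): values needed at or above
    lo = xs[k - 1]                # smallest number satisfying the definition
    hi = xs[n - m]                # largest number satisfying the definition
    return (lo + hi) / 2


q1, med, q3 = percentile(grades, 25), percentile(grades, 50), percentile(grades, 75)
print(len(grades), q1, med, q3)            # 50 70.0 78.0 82.0
print("IQR:", q3 - q1)                     # 12.0
print("range:", grades[-1] - grades[0])    # 50
```

The results match the values stated on the slides: median 78, first quartile 70, third quartile 82.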
Elements of a Boxplot (figure labels: largest outlier, box, whisker, quartiles, median)
Histograms • For interval or ratio data • Data is grouped into class intervals • Superficially like a bar chart
Frequency Histogram • Height of bar = bin frequency • Each bar spans a class interval (bin) Source: R: A Language and Environment for Statistical Computing, R Core Development Team.
Probability Histogram Area of bar = relative bin frequency E.g., .011×25=.275
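In a probability histogram the bar height is the relative bin frequency divided by the bin width, so that area rather than height carries the frequency. The sketch below uses made-up data and bin edges (neither comes from the slides) with bins of width 25 that include their left endpoint.

```python
from math import isclose

# Made-up data and class intervals, for illustration only.
data = [12, 30, 33, 41, 47, 52, 55, 58, 61, 63,
        66, 70, 72, 74, 78, 81, 85, 88, 93, 97]
edges = [0, 25, 50, 75, 100]
bins = list(zip(edges, edges[1:]))   # [(0, 25), (25, 50), ...], each bin is [lo, hi)
n = len(data)

# Frequency histogram: bar height = number of values in the bin.
counts = [sum(1 for v in data if lo <= v < hi) for lo, hi in bins]

# Probability histogram: bar height = relative frequency / bin width.
heights = [c / n / (hi - lo) for c, (lo, hi) in zip(counts, bins)]

# Bar area recovers the relative frequency, and the areas sum to 1.
areas = [h * (hi - lo) for h, (lo, hi) in zip(heights, bins)]
print(counts)
print(heights)
print(sum(areas))
```

With unequal bin widths the two kinds of histogram can look quite different, which is why the probability histogram is the one interpreted as a density.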
Ogives (Cumulative Frequency Polygons) • Related to probability histograms • Ogives are examples of cumulative distribution functions • Probability histograms are examples of density functions
Relationship Between Probability Histogram and Ogive The height of the ogive is the cumulative area under the histogram
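This relationship can be sketched numerically: accumulate the bar areas from left to right to get the ogive's height at each bin edge, then join those points with straight lines. The bin edges and bar heights below are hypothetical values chosen for illustration, and `ogive` is a name assumed for this sketch.

```python
from itertools import accumulate
from math import isclose

# Hypothetical probability histogram: bin edges and bar heights (densities).
edges = [0, 25, 50, 75, 100]
heights = [0.002, 0.008, 0.018, 0.012]
bins = list(zip(edges, edges[1:]))

# Area of each bar = relative frequency of its bin.
areas = [h * (hi - lo) for h, (lo, hi) in zip(heights, bins)]

# Ogive height at each bin edge = cumulative area to the left of that edge.
ogive_heights = [0.0] + list(accumulate(areas))


def ogive(t):
    """Cumulative area under the histogram to the left of t (piecewise linear)."""
    if t <= edges[0]:
        return 0.0
    if t >= edges[-1]:
        return ogive_heights[-1]
    for j, (lo, hi) in enumerate(bins):
        if lo <= t < hi:
            # Full bars to the left, plus the partial bar up to t.
            return ogive_heights[j] + heights[j] * (t - lo)


print(ogive_heights)
print(ogive(62.5))
```

The slope of the ogive within a bin equals the height of the histogram bar over that bin, which is the discrete counterpart of a density being the derivative of a cumulative distribution function.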