TR 555 Statistics “Refresher” Lecture 1: Probability Concepts

References: • Penn State University, Dept. of Statistics, Statistical Education Resource Kit: a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003 • Statistics: Making Sense of Data (MIT), William Stout, John Marden and Kenneth Travers, http://www.introductorystatistics.com/, Sept. 2003 • Tom Maze, stat course prepared for KDOT, 2003

Outline • Overview of statistics • Types of data • Describing data numerically and graphically • Probability and random variables

Probability and Statistics • Probability is the likelihood of an event occurring relative to all other possible events • Example: • If a coin is flipped, what is the probability of getting heads? • 0.5 • Given that the last flip was heads, what is the probability that the next will be heads? • 0.5 (the flips are independent) • Statistics is the measurement and modeling of random variables • Example: • If our state averages 200 fatal crashes per year, what is the probability of having exactly one fatal crash today? • Poisson distribution with k = average number of events per time period: k = 200/365 = 0.55 per day • P(X = x) = ((kt)^x / x!) e^(-kt), so P(X = 1) = ((0.55 × 1)^1 / 1!) e^(-0.55 × 1) = 0.32
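The fatal-crash example above can be checked with a few lines of Python. This is a minimal sketch, assuming crashes follow a Poisson process with a constant daily rate; the function name `poisson_pmf` and its parameters are illustrative and do not come from the lecture.

```python
import math


def poisson_pmf(x: int, k: float, t: float = 1.0) -> float:
    """Probability of exactly x events in t periods when the rate is k per period."""
    mean_count = k * t  # expected number of events in the window
    return (mean_count ** x) / math.factorial(x) * math.exp(-mean_count)


# 200 fatal crashes per year, expressed as an average per day.
k = 200 / 365

# Probability of exactly one fatal crash in one day (t = 1).
p_one = poisson_pmf(1, k)
print(round(p_one, 2))
```

With the unrounded rate 200/365 the result still rounds to 0.32, matching the slide.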

Data Collection • Designing experiments • Does aspirin help reduce the risk of heart attacks? • Observational studies • Polls - Clinton’s approval rating

Variable Types • Deterministic • Assume away variation and randomness • Known with certainty • One to one mapping of independent variable to dependent variable • [Figure: relationship mapping X1 to Y1]

Variable Types Continued • Random or Stochastic • Recognized uncertainty of an event • One to one distribution mapping of independent variable to dependent variable • [Figure: distribution of possible values, labeled "Less Likely", "Most Likely", "Less Likely"; probability that it could be any of these values]

Population The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought

Sample A subset of the population data that are actually collected in the course of a study.

WHO CARES? In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.

Organization and Description of Data • Qualitative vs. Quantitative data • Discrete vs. Continuous Data • Graphical Displays • Measures of Center • Measures of Variation

Qualitative (Categorical) Data: The raw (unsummarized) data are merely labels or categories. Quantitative (Numerical) Data: The raw (unsummarized) data are numerical.

Qualitative Data Examples • Class Standing (Fr, So, Ju, Sr) • Section # (1,2,3,4,5,6) • Automobile Make (Ford, Chevrolet, Nissan) • Questionnaire response (disagree, neutral, agree)

Quantitative Data Examples (measures) • Voltage • Height • Weight • SAT Score • Number of students arriving late for class • Time to complete a task

Discrete Data: Only certain values are possible (there are gaps between the possible values). Continuous Data: Theoretically, any value within an interval is possible with a fine enough measuring device.

Discrete Data Examples • Number of students late for class • Number of crimes reported to SC police • Number of times the word number is used (generally, discrete data are counts)

Discrete Variable Model: Poisson Distribution • P(X = x) = ((0.55 t)^x / x!) e^(-0.55 t)

Continuous Data Examples • Voltage • Height • Weight • Time to complete a homework assignment

Continuous Variable Model: Exponential Distribution • Probability density of the time t to the first fatal crash: f(t) = k e^(-kt)

Continuous Probability Function • Cumulative probability of the time until the first fatal crash: F(t) = 1 - e^(-kt)
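The exponential density and cumulative probability for the time to the first fatal crash can be sketched as follows. This assumes the same daily rate used in the earlier crash example (k = 200/365, t measured in days); the function names `exp_pdf` and `exp_cdf` are illustrative only.

```python
import math


def exp_pdf(t: float, k: float) -> float:
    """Density of the waiting time to the first event: k * e^(-k t)."""
    return k * math.exp(-k * t)


def exp_cdf(t: float, k: float) -> float:
    """Probability that the first event has occurred by time t: 1 - e^(-k t)."""
    return 1.0 - math.exp(-k * t)


k = 200 / 365  # assumed average fatal crashes per day

# Chance that at least one fatal crash has happened within 1, 2, and 7 days.
for days in (1, 2, 7):
    print(days, round(exp_cdf(days, k), 3))
```

Note that 1 - F(t) = e^(-kt) is the Poisson probability of zero crashes in t days, which is how the discrete and continuous models connect.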

Nominal Data • A type of categorical data in which objects fall into unordered categories, for example: • Hair color • blonde, brown, red, black, etc. • Race • Caucasian, African-American, Asian, etc. • Smoking status • smoker, non-smoker

Ordinal Data • A type of categorical data in which order is important. For example … • Class • freshman, sophomore, junior, senior, super senior • Degree of illness • none, mild, moderate, severe, …, going, going, gone • Opinion of students about riots • ticked off, neutral, happy

Binary Data • A type of categorical data in which there are only two categories. • Binary data can either be nominal or ordinal, for example … • Smoking status • smoker, non-smoker • Attendance • present, absent • Class • lower classman, upper classman

Interval and Ratio Data • Interval • Interval is important, but no meaningful zero • e.g., temperature in Fahrenheit • Ratio • has a meaningful zero value • e.g., temperature in Kelvin, crash rate

Who Cares? The type(s) of data collected in a study determine the type of statistical analysis used.

Proportions • Categorical data are commonly summarized using “percentages” (or “proportions”). • 11% of students have a tattoo • 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors

Averages • Measurement data are typically summarized using “averages” (or “means”). • Average number of siblings Fall 1998 Stat 250 students have is 1.9. • Average weight of male Fall 1998 Stat 250 students is 173 pounds. • Average weight of female Fall 1998 Stat 250 students is 138 pounds.

Descriptive statistics Describing data with numbers: measures of location

Mean • Another name for average. • If describing a population, denoted as μ, the Greek letter “mu”. • If describing a sample, denoted as x̄, called “x-bar”. • Appropriate for describing measurement data. • Seriously affected by unusual values called “outliers”.

Calculating Sample Mean Formula: x̄ = (x_1 + x_2 + … + x_n) / n. That is, add up all of the data points and divide by the number of data points. Data (# of classes skipped): 2 8 3 4 1 Sample Mean = (2+8+3+4+1)/5 = 3.6 Do not round! The mean need not be a whole number.
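The sample mean calculation above takes one line in Python. A minimal sketch using the slide's data; the variable names are illustrative.

```python
from statistics import mean

# Number of classes skipped, from the slide.
skipped = [2, 8, 3, 4, 1]

# Add up all of the data points and divide by the number of data points.
xbar = sum(skipped) / len(skipped)
print(xbar)  # 3.6

# The standard library gives the same value.
print(mean(skipped))
```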

Population Mean The mean of a random variable X is called the population mean and is denoted μ. It is also called the expected value of X or the expectation of X and is denoted E(X).

Median • Another name for 50th percentile. • Appropriate for describing measurement data. • “Robust to outliers,” that is, not affected much by unusual values.

Calculating Sample Median Order data from smallest to largest. If odd number of data points, the median is the middle value. Data (# of classes skipped): 2 8 3 4 1 Ordered Data: 1 2 3 4 8 Median = 3

Calculating Sample Median Order data from smallest to largest. If even number of data points, the median is the average of the two middle values. Data (# of classes skipped): 2 8 3 4 1 8 Ordered Data: 1 2 3 4 8 8 Median = (3+4)/2 = 3.5
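The odd and even cases for the sample median can be sketched in one small function. The name `sample_median` is illustrative; the data are the two sets from the slides.

```python
def sample_median(data):
    """Median of a list: middle value if n is odd, average of the two middle values if even."""
    ordered = sorted(data)  # order data from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([2, 8, 3, 4, 1]))     # 3
print(sample_median([2, 8, 3, 4, 1, 8]))  # 3.5
```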

Mode • The value that occurs most frequently. • One data set can have many modes. • Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
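Because one data set can have several modes, a mode function should return a list rather than a single value. A minimal sketch follows; the function name `modes` and the list of automobile makes are made up for illustration.

```python
from collections import Counter


def modes(data):
    """Return every value that ties for the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Discrete data from the median slide: 8 appears twice, everything else once.
print(modes([2, 8, 3, 4, 1, 8]))

# Hypothetical categorical data with two modes.
makes = ["Ford", "Nissan", "Ford", "Chevrolet", "Nissan"]
print(modes(makes))
```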

Most appropriate measure of location • Depends on whether or not data are “symmetric” or “skewed”. • Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Symmetric and Unimodal

Symmetric and Bimodal

Skewed Right

Skewed Left

Choosing Appropriate Measure of Location • If data are symmetric, the mean, median, and mode will be approximately the same. • If data are multimodal, report the mean, median and/or mode for each subgroup. • If data are skewed, report the median.
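A short comparison shows why the median is preferred for skewed data. Both data sets below are hypothetical, invented only to illustrate the guideline above.

```python
from statistics import mean, median

# Hypothetical symmetric data: mean and median agree.
symmetric = [1, 2, 3, 4, 5]

# Hypothetical right-skewed data: one very large value pulls the mean upward.
skewed = [1, 2, 2, 3, 3, 4, 30]

for name, values in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, "mean =", round(mean(values), 2), "median =", median(values))
```

In the skewed set the median stays near the bulk of the values while the mean is roughly twice as large.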

Descriptive statistics Describing data with numbers: measures of variability

Range • The difference between largest and smallest data point. • Highly affected by outliers. • Best for symmetric data with no outliers.

Interquartile range • The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values. • IQR = Q3-Q1 • Robust to outliers or extreme observations. • Works well for skewed data.
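The range and interquartile range can be computed as below. This is a sketch under one assumption worth stating loudly: software packages use different rules for interpolating quartiles, so Q1 and Q3 (and therefore the IQR) may differ slightly from a hand calculation or from another package. The data are the six "classes skipped" values used earlier.

```python
from statistics import quantiles

data = sorted([2, 8, 3, 4, 1, 8])

# Range: difference between the largest and smallest data point.
data_range = max(data) - min(data)

# Quartiles using the standard library's default interpolation rule
# (an assumption; other conventions give slightly different cut points).
q1, q2, q3 = quantiles(data, n=4)

# IQR = Q3 - Q1, the spread of the middle half of the values.
iqr = q3 - q1

print("range =", data_range)
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```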

Variance 1. Find difference between each data point and mean. 2. Square the differences, and add them up. 3. Divide by one less than the number of data points.

Variance • If measuring variance of population, denoted by σ^2 (“sigma-squared”). • If measuring variance of sample, denoted by s^2 (“s-squared”). • Measures average squared deviation of data points from their mean. • Highly affected by outliers. Best for symmetric data. • Problem is units are squared.

Population Variance The variance of a random variable X is called the population variance and is denoted σ^2.

Standard deviation • Sample standard deviation is square root of sample variance, and so is denoted by s. • Units are the original units. • Measures average deviation of data points from their mean. • Also, highly affected by outliers.
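The three variance steps and the square-root relationship can be sketched directly. The data are the "classes skipped" values from the sample mean slide; the variable names are illustrative.

```python
import math

data = [2, 8, 3, 4, 1]
n = len(data)
xbar = sum(data) / n

# Step 1: difference between each data point and the mean.
deviations = [x - xbar for x in data]

# Step 2: square the differences and add them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by one less than the number of data points.
s2 = sum_of_squares / (n - 1)

# The sample standard deviation is the square root of the sample variance.
s = math.sqrt(s2)

print(round(s2, 2), round(s, 2))
```

For this data the sample variance is 7.3 (in classes squared) and the standard deviation is about 2.7 classes, back in the original units.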

Population Standard Deviation The population standard deviation is the square root of the population variance and is denoted σ.

What is the variance or standard deviation? (MPH)

Variance or standard deviation
Sex     N    Mean    Median  TrMean  StDev  SE Mean
female  126  91.23   90.00   90.83   11.32  1.01
male    100  106.79  110.00  105.62  17.39  1.74
Sex     Minimum  Maximum  Q1     Q3
female  65.00    120.00   85.00  98.25
male    75.00    162.00   95.00  118.75
Females: s = 11.32 mph and s^2 = 11.32^2 = 128.1 mph^2
Males: s = 17.39 mph and s^2 = 17.39^2 = 302.4 mph^2
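The variance figures can be recomputed from the reported standard deviations. The sketch below also recomputes the "SE Mean" column under the assumption that it is the usual standard error of the mean, s / sqrt(N); the slide does not state that formula. The dictionary layout is illustrative.

```python
import math

# N and StDev (mph) as reported in the summary table.
summary = {
    "female": {"n": 126, "s": 11.32},
    "male": {"n": 100, "s": 17.39},
}

results = {}
for sex, row in summary.items():
    variance = row["s"] ** 2                  # s^2, in mph^2
    se_mean = row["s"] / math.sqrt(row["n"])  # assumed formula for SE Mean
    results[sex] = (variance, se_mean)
    print(f"{sex}: s^2 = {variance:.1f} mph^2, SE mean = {se_mean:.2f} mph")
```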
