IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS. INTRODUCTION. WHAT IS STATISTICS?. Statistics is a science of collecting data, organizing and describing it and drawing conclusions from it. That is, statistics is a way to get information from data. It is the science of uncertainty.
IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS
INTRODUCTION
WHAT IS STATISTICS? • Statistics is a science of collecting data, organizing and describing it and drawing conclusions from it. That is, statistics is a way to get information from data. It is the science of uncertainty.
WHAT IS STATISTICS?
• A pharmaceutical CEO wants to know whether a new drug is superior to existing drugs, and what side effects it may have.
• How fuel efficient is a certain car model?
• Is there any relationship between your GPA and employment opportunities?
• Actuaries want to identify “risky” customers for insurance companies.
STEPS OF STATISTICAL PRACTICE
• Preparation: Set clearly defined goals and questions of interest for the investigation.
• Data collection: Make a plan of which data to collect and how to collect it.
• Data analysis: Apply appropriate statistical methods to extract information from the data.
• Data interpretation: Interpret the information and draw conclusions.
STATISTICAL METHODS
• Descriptive statistics includes the collection, presentation and description of numerical data.
• Inferential statistics includes making inferences and decisions by applying appropriate statistical methods to the collected data.
• Model building includes developing prediction equations to understand a complex system.
BASIC DEFINITIONS • POPULATION: The collection of all items of interest in a particular study. • SAMPLE: A set of data drawn from the population; a subset of the population available for observation • PARAMETER: A descriptive measure of the population, e.g., mean • STATISTIC: A descriptive measure of a sample • VARIABLE: A characteristic of interest about each element of a population or sample.
EXAMPLE
| Population | Unit | Sample | Variable |
|---|---|---|---|
| All students currently enrolled in school | Student | Any department | GPA; hours of work per week |
| All books in library | Book | Statistics books | Replacement cost; frequency of check out; repair needs |
| All campus fast food restaurants | Restaurant | Burger King | Number of employees; seating capacity; hiring/not hiring |
Note that some samples are not representative of the population and shouldn’t be used to draw conclusions about the population.
How not to run a presidential poll For the 1936 election, the Literary Digest picked names at random out of telephone books in some cities and sent these people some ballots, attempting to predict the election results, Roosevelt versus Landon, by the returns. Now, even if 100% returned the ballots, even if all told how they really felt, even if all would vote, even if none would change their minds by election day, still this method could be (and was) in trouble: They estimated a conditional probability in that part of the American population which had phones and showed that that part was not typical of the total population. [Dudewicz & Mishra, 1988]
STATISTIC
• A statistic (or estimator) is any function of the random variables (r.v.s) in a random sample (r.s.) that does not contain any unknown quantity.
• For example, the sample total $\sum_{i=1}^{n} X_i$ and the sample mean $\bar{X}$ are statistics.
• Expressions that involve unknown parameters, such as $\bar{X} - \mu$ with $\mu$ unknown, are NOT.
• Any observed or particular value of an estimator is an estimate.
RANDOM VARIABLES
• Variables whose observed value is determined by chance.
• A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.
• A r.v. is a function defined on the sample space S that associates a real number with each outcome in S.
• R.v.s are denoted by uppercase letters, and their observed values by lowercase letters.
• Example: Consider the random variable X, the number of brown-eyed children born to a couple heterozygous for eye color (each with genes for both brown and blue eyes). If the couple is assumed to have 2 children, X can assume any of the values 0, 1, or 2. The variable is random in that brown eyes depend on the chance inheritance of a dominant gene at conception. If a particular couple has two brown-eyed children, we have x = 2.
RANDOM VARIABLES
• Toss a coin: the possible outcomes are Heads and Tails.
• Let's give them the values Heads = 0 and Tails = 1, and we have a random variable "X".
• In short: X takes values in {0, 1}.
CONTINUOUS OR NON-CONTINUOUS VARIABLE
• A continuous variable is one that can theoretically assume any value between the lowest and highest points on the scale on which it is measured (e.g. speed, price, time, height).
• Non-continuous variables, also known as discrete variables, can take on only a finite or countably infinite number of values.
• All qualitative variables are discrete.
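The coin example can be sketched in Python by treating the random variable as a function from outcomes to numbers. The names `SAMPLE_SPACE` and `X` are illustrative assumptions, not part of the slides.

```python
# Sample space of a single coin toss.
SAMPLE_SPACE = ["Heads", "Tails"]

def X(outcome):
    """Random variable X: maps each outcome in the sample space to a real number."""
    return {"Heads": 0, "Tails": 1}[outcome]

# The set of values X can take.
possible_values = sorted({X(outcome) for outcome in SAMPLE_SPACE})
print(possible_values)
```

The point of the sketch is that X is the mapping itself; {0, 1} is only the set of values it can produce.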
COLLECTING DATA • Target Population: The population about which we want to draw inferences. • Sampled Population: The actual population from which the sample has been taken.
SAMPLING PLAN
• Simple Random Sample (SRS): All possible samples with the same number of observations are equally likely to be selected.
• Stratified Sampling: The population is separated into mutually exclusive sets (strata), and then a simple random sample is drawn from each stratum.
• Convenience Sample: It is obtained by selecting individuals or objects without systematic randomization.
EXAMPLE
• A manufacturer of computer chips claims that less than 10% of his products are defective. When 1000 chips were drawn from a large production run, 7.5% were found to be defective.
• What is the population of interest? The complete production run of the computer chips.
• What is the sample? The 1000 chips drawn.
• What is the parameter? The proportion of all chips that are defective.
• What is the statistic? The proportion of sample chips that are defective.
• Does the value 10% refer to a parameter or a statistic? A parameter.
• Explain briefly how the statistic can be used to make inferences about the parameter to test the claim. Because the sample proportion (7.5%) is less than 10%, we can conclude that the claim may be true.
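The first two sampling plans can be sketched with Python's standard `random` module. The population of 60 students, the department names, and the sample sizes below are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct units; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size from each stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

rng = random.Random(530)  # fixed seed so the sketch is reproducible

# Hypothetical population: 60 students grouped by department (the strata).
strata = {
    "Math": [f"M{i}" for i in range(30)],
    "Physics": [f"P{i}" for i in range(20)],
    "Biology": [f"B{i}" for i in range(10)],
}
population = [unit for units in strata.values() for unit in units]

srs = simple_random_sample(population, 6, rng)
stratified = stratified_sample(strata, 2, rng)
print(srs)
print(stratified)
```

An SRS of 6 may happen to contain no Biology students at all, whereas the stratified sample guarantees 2 units from every department.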
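As a minimal sketch of the chip example, the statistic can be computed from the sample and compared with the claimed bound on the parameter. The count of 75 defective chips is simply 7.5% of 1000; the variable names are illustrative.

```python
sample_size = 1000
sample_defective = 75        # 7.5% of the 1000 sampled chips
claimed_upper_bound = 0.10   # the manufacturer's claim about the parameter

# The statistic: proportion of defective chips in the sample.
p_hat = sample_defective / sample_size

print(p_hat)
print(p_hat < claimed_upper_bound)
```

The comparison is only descriptive: it shows the sample is consistent with the claim, not that the claim is proven.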
DESCRIPTIVE STATISTICS • Descriptive statistics involves the arrangement, summary, and presentation of data, to enable meaningful interpretation, and to support decision making. • Descriptive statistics methods make use of • graphical techniques • numerical descriptive measures. • The methods presented apply both to • the entire population • the sample
Types of data and information
• A variable: a characteristic of a population or sample that is of interest to us.
  • Cereal choice
  • Expenditure
  • The waiting time for medical services
• Data: the observed values of variables.
  • Interval and ratio data are numerical observations. In ratio data, the ratio of two observations is meaningful and the value 0 means that none of the quantity is present. Example of ratio data: weight. Example of interval data: temperature.
  • Nominal data are categorical observations.
  • Ordinal data are ordered categorical observations.
QUALITATIVE VS. QUANTITATIVE DATA • A qualitative variable is one in which the “true” or naturally occurring levels or categories taken by that variable are not described as numbers but rather by verbal groupings • Example: levels or categories of hair color (black, brown, blond) • Quantitative variables on the other hand are those in which the natural levels take on certain quantities (e.g. price, travel time) • That is, quantitative variables are measurable in some numerical unit (e.g. pesos, minutes, inches, etc.)
SCALES OF MEASUREMENT
Scales of measurement describe the relationships between the characteristics of the numbers or levels assigned to objects under study. The four classificatory scales of measurement are nominal, ordinal, interval, and ratio.
NOMINAL SCALED DATA
A nominal scaled variable is a variable in which the levels observed for that variable are assigned unique values: values which provide classification but which do not provide any indication of order. For example, we may assign the value zero to represent males and one to represent females, but we are not saying that females are better than males or vice versa. Nominal scaled variables are also termed categorical variables. Nominal data must be discrete. All arithmetic operations (+, −, ×, and ÷) are meaningless.
ORDINAL SCALED DATA
Ordinal scaled data are data in which the values assigned to levels observed for an object are (1) unique and (2) provide an indication of order. An example of this is ranking of products in order of preference. The highest-ranked product is more preferred than the second-highest-ranked product, which in turn is more preferred than the third-ranked product, etc. While we may now place the objects of measure in some order, we cannot determine distances between the objects. For example, we might know that product A is preferred to product B; however, we do not know by how much product A is preferred to product B. For ordinal scales, again, +, −, ×, and ÷ are meaningless.
INTERVAL SCALED DATA
Interval scaled data are data in which the levels of an object under study are assigned values which are (1) unique, (2) provide an indication of order, and (3) have an equal distance between scale points. The usual example is temperature (centigrade or Fahrenheit). In either scale, 41 degrees is higher than 40 degrees. However, zero degrees is an arbitrary figure: it does not represent an absolute absence of heat. We can add or subtract interval scale variables meaningfully, but ratios are not meaningful (that is, 40 degrees is not exactly twice as hot as 20 degrees).
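A short sketch makes the point about ratios concrete: converting the same two temperatures from centigrade to Fahrenheit changes their ratio, while equal differences stay equal. The function name is illustrative.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between the two interval scales."""
    return c * 9 / 5 + 32

# The ratio of the same two temperatures depends on the scale used.
ratio_celsius = 40 / 20
ratio_fahrenheit = celsius_to_fahrenheit(40) / celsius_to_fahrenheit(20)
print(ratio_celsius, ratio_fahrenheit)

# Equal distances between scale points remain equal on both scales.
step_low = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)
step_high = celsius_to_fahrenheit(40) - celsius_to_fahrenheit(30)
print(step_low, step_high)
```

Because the zero point is arbitrary, "twice as hot" is an artifact of the chosen scale, whereas "10 degrees warmer" is not.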
RATIO SCALED DATA Ratio scaled data are data in which the values assigned to levels of an object are (1) unique, (2) provide an indication of order, (3) have an equal distance between scale points, and (4) the zero point on the scale of measure used represents an absence of the object being observed. We can add, subtract, divide and multiply such variables meaningfully.
OTHER DATA TYPES
Dummy Variables from Quantitative Variables
A quantitative variable can be transformed into a categorical variable, called a dummy variable, by recoding the values. Consider the following example: the quantitative variable Age can be classified into five intervals. The values of the associated categorical (dummy) variable are 1, 2, 3, 4, 5.
Types of data – summary
• Knowing the type of data is necessary to properly select the technique to be used when analyzing data.
• Types of descriptive analysis allowed for each type of data:
  • Numerical data: arithmetic calculations
  • Nominal data: counting the number of observations in each category
  • Ordinal data: computations based on an ordering process
Types of data - examples
Numerical data:
| Age | Income |
|---|---|
| 55 | 75000 |
| 42 | 68000 |
| . | . |

| Weight gain |
|---|
| +10 |
| +5 |
| . |

Nominal data:
| Person | Marital status |
|---|---|
| 1 | married |
| 2 | single |
| 3 | single |
| . | . |

| Computer | Brand |
|---|---|
| 1 | IBM |
| 2 | Dell |
| 3 | IBM |
| . | . |
Types of data - examples
Numerical data: Age and income (55, 75000; 42, 68000; ...) and weight gain (+10, +5, ...), as in the previous example.
Nominal data: A descriptive statistic for nominal data is the proportion of data that falls into each category.
| Brand | IBM | Dell | Compaq | Other | Total |
|---|---|---|---|---|---|
| Count | 25 | 11 | 8 | 6 | 50 |
| Proportion | 50% | 22% | 16% | 12% | |
Cross-Sectional/Time-Series/Panel Data
• Cross-sectional data is collected at a certain point in time.
  • Test score in a statistics course
  • Starting salaries of an MBA program's graduates
• Time series data is collected over successive points in time.
  • Weekly closing price of gold
  • Amount of crude oil imported monthly
• Panel data combines the two: the same cross-sectional units are observed over successive points in time.
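The recoding can be sketched as follows. The slide does not state the interval boundaries, so the cut points 20, 30, 40 and 50 below are hypothetical, as is the function name `age_code`.

```python
from bisect import bisect_right

# Hypothetical cut points defining five age intervals:
# 1: under 20, 2: 20-29, 3: 30-39, 4: 40-49, 5: 50 and over.
CUT_POINTS = [20, 30, 40, 50]

def age_code(age):
    """Recode a quantitative age into a categorical code from 1 to 5."""
    return bisect_right(CUT_POINTS, age) + 1

ages = [19, 20, 42, 55]
codes = [age_code(a) for a in ages]
print(codes)
```

Once recoded, the values 1 to 5 are labels for ordered categories, so arithmetic on them (for example averaging the codes) no longer has the meaning it had for Age.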
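The proportions in the brand example follow directly from the category counts. A minimal sketch, with illustrative variable names:

```python
# Category counts from the computer-brand example.
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}

total = sum(counts.values())
proportions = {brand: count / total for brand, count in counts.items()}

for brand, p in proportions.items():
    print(f"{brand}: {p:.0%}")
```

Counting and dividing by the total is the only arithmetic applied here, which is what makes it valid for nominal data.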
CIRCULAR (DIRECTIONAL) DATA • Directional or circular distributions are those that have no true zero and any designation of high or low values is arbitrary: • Compass direction • Hours of the day • Months of the year
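One consequence of having no true zero is that ordinary averages can mislead. The sketch below compares the arithmetic mean of two clock times, 23:00 and 01:00, with a circular mean computed from angles on the 24-hour circle. The circular mean is a standard technique that the slides do not cover, and the function name is an assumption.

```python
import math

def circular_mean_hours(hours, period=24.0):
    """Mean direction of clock times, treating them as angles on a circle.

    Undefined when the times cancel out exactly (e.g. 0:00 and 12:00).
    """
    angles = [2 * math.pi * h / period for h in hours]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    mean_angle = math.atan2(sin_sum, cos_sum)
    return (mean_angle * period / (2 * math.pi)) % period

hours = [23, 1]
arithmetic = sum(hours) / len(hours)
circular = circular_mean_hours(hours)
print(arithmetic)  # noon, which is far from both observations
print(circular)    # midnight, up to floating-point rounding
```

The arithmetic mean depends on where the clock is "cut" to make a number line; the circular mean does not.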
FUNCTIONAL DATA • Functional data is made up of repeated measurements, taken as a function of something (e.g., time) • For example, a trajectory is an example of functional data - we have the position or velocity sampled at many time points
Example • One expects temperature to be primarily sinusoidal in character, and certainly periodic over the annual cycle. • There is much variation in level and some variation in phase. • A model of the form
• Unlike time series analyses, no assumptions of stationarity are made, and data are not sampled at equally spaced time points. • Unlike most longitudinal data, a large number of time points are available, and the signal-to-noise ratio is medium to high. • The data can support the accurate estimate of one or more derivatives, and these play several critical roles. • Phase variation is recognized and separated from amplitude variation. • Familiar multivariate methods have functional counterparts, and the smoothness of functional parameter estimates is explicitly controlled. • Differential equations are new modelling tools.
Summary Measures: Describing Data Numerically
• Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
• Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
• Shape: Skewness
Measures of Central Location • Usually, we focus our attention on two types of measures when describing population characteristics: • Central location • Variability or spread The measure of central location reflects the locations of all the actual data points.
Measures of Central Location
• The measure of central location reflects the locations of all the actual data points.
• How? With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if a third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
• This is the most popular and useful measure of central location.
Mean = (Sum of the observations) / (Number of observations)
The Arithmetic Mean
• Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
• Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.
The Arithmetic Mean
• Example 1: The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
• Example 2: Suppose the telephone bills represent the population of measurements. The population mean is
$\mu = \frac{42.19 + 38.45 + \dots + 45.77}{N} = 43.59$
The Arithmetic Mean • Drawback of the mean: It can be influenced by unusual observations, because it uses all the information in the data set.
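The drawback can be seen on the Internet-time data of Example 1 by recomputing the mean with and without the longest time (33 hours). This is a sketch for illustration; treating 33 as the unusual observation is an assumption.

```python
def arithmetic_mean(observations):
    """Sum of the observations divided by the number of observations."""
    return sum(observations) / len(observations)

internet_hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
without_longest = [h for h in internet_hours if h != 33]

mean_all = arithmetic_mean(internet_hours)
mean_trimmed = arithmetic_mean(without_longest)
print(mean_all)
print(round(mean_trimmed, 2))
```

Dropping a single observation out of ten moves the mean by roughly two and a half hours, which is the sensitivity the slide warns about.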
The Median
• The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude. It divides the data in half.
• Example 3: Find the median of the time on the Internet for the 10 adults of Example 1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle values, (8 + 9)/2 = 8.5.
• Comment: Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.
The Median
• Find the median of 8, 2, 9, 11, 1, 6, 3. Here n = 7 (odd sample size).
• First order the data: 1, 2, 3, 6, 8, 9, 11. The median is the 4th ordered observation, 6.
• For an odd sample size, the median is the {(n+1)/2}th ordered observation.
The Median
• The engineering group receives e-mail requests for technical information from sales and service personnel. The daily numbers for 6 days were 11, 9, 17, 19, 4, and 15. What is the central location of the data?
• For even sample sizes, the median is the average of the {n/2}th and {n/2+1}th ordered observations.
• Ordered data: 4, 9, 11, 15, 17, 19. The median is (11 + 15)/2 = 13.
The Mode
• The mode of a set of observations is the value that occurs most frequently.
• A set of data may have one mode (or modal class), or two or more modes.
• For large data sets the modal class is much more relevant than a single-value mode.
The Mode
• Find the mode for the data in Example 1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
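The odd and even rules for the median can be sketched as one small function and checked against the three data sets used above. The function name `median` is illustrative; Python's standard `statistics.median` applies the same rule.

```python
def median(observations):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)-th ordered observation (index mid, counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2+1)-th ordered observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([0, 7, 12, 5, 33, 14, 8, 0, 9, 22]))  # Internet times, n = 10
print(median([8, 2, 9, 11, 1, 6, 3]))              # n = 7
print(median([11, 9, 17, 19, 4, 15]))              # daily e-mail requests, n = 6
```

Sorting first is essential: the rule refers to positions in the ordered data, not in the order the values were recorded.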
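Finding the mode amounts to counting how often each value occurs. The sketch below returns every value tied for the highest count, so it also covers data with two or more modes; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(observations):
    """Return all values that occur most frequently, in ascending order."""
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

internet_hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(modes(internet_hours))
```

For these data the only repeated value is 0, so the mode sits at the edge of the data rather than near its center.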