Probability and Statistics Part 2: Statistics Lecture 1
Statistics – what is this? Scientific study of data describing natural variation
Scientific study: • We are concerned with the commonly accepted criteria of validity of scientific evidence. • Objectivity in presenting and evaluating data and the general ethical code of scientific methodology must constantly be in evidence. "Figures never lie, only statisticians do."
Data: • Statistics generally deals with populations or groups of individuals; it deals with quantities of information, not with a single datum. Thus the measurement of a single individual will generally not be of interest. • The data can be measurements or counts.
Natural variation: • All those events that happen in nature and are not under the direct control of the investigator are analyzed (for example, the number of peas in a pod). • We also allow for events partly controlled by the investigator (for example, we measure insulin secretion in response to sugar intake).
The primary objective of statistical analysis is to infer characteristics of a group of data by analyzing the characteristics of a small sampling of the group. This generalization requires the consideration of such important concepts as population and sample.
Some definitions • The data are generally based on individual observations, which are observations or measurements taken on the smallest sampling unit. If we measure height in 100 people, then the height of each person is an individual observation.
Some definitions • Sample of observations is a collection of individual observations selected by a specific procedure. The hundred human heights together represent the sample of observations.
Some definitions • The actual property measured by the individual observations is the variable. • More than one variable can be measured on each smallest sampling unit. We can measure not only height but also weight, and age of each person.
Some definitions • Population means the totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time. For example: all people aged 18-25 in Gliwice
More about the variables ... • We can say that a variable is a property with respect to which individuals in a sample differ in some ascertainable way. • If the property does not differ within a sample, it cannot be of statistical interest.
More about the variables ... Warmbloodedness in a group of mammals is not a variable because they are all alike in this regard, although body temperature of individual mammals would, of course, be a variable.
More about the variables ... Variables: • Measurement variables (continuous variables, discrete variables) • Ranked variables • Attributes
Measurement variables • Measurement variables are those whose differing states can be expressed in a numerically ordered fashion. • They can be expressed on a ratio or interval scale.
Measurement variables There are two fundamentally important characteristics of the data on a ratio scale: • There is a constant size interval between any adjacent units on the measurement scale. • It is important that there exists a zero point on the measurement scale and that there is a physical significance to this zero.
What does it mean? • Constant size interval: The difference in height between a 166 cm and a 167 cm person is the same as the difference between a 180 cm and a 181 cm person. • Zero point: This enables us to say something meaningful about the ratio of measurements. We can say that a 90 cm tall person is half as tall as a 180 cm person.
Measurement variables • Some measurement scales possess a constant interval size but not a true zero; they are called interval scales. An outstanding example is that of the temperature scales Celsius (ºC) and Fahrenheit (ºF). The same difference exists between 20ºC and 25ºC as between 5ºC and 10ºC. But it cannot be said that a temperature of 40ºC is twice as hot as a temperature of 20ºC, because the zero point is arbitrary. (There is no such problem with the Kelvin scale, whose zero is absolute.)
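The difference between an interval and a ratio scale can be checked numerically. The following is a minimal sketch; the helper name `celsius_to_kelvin` is an assumption introduced here for illustration.

```python
def celsius_to_kelvin(t_c):
    """Convert degrees Celsius to kelvins (absolute zero is -273.15 degrees C)."""
    return t_c + 273.15

# Differences are the same on both scales: the interval size is constant.
diff_c = 25.0 - 20.0
diff_k = celsius_to_kelvin(25.0) - celsius_to_kelvin(20.0)

# The Celsius ratio suggests "twice as hot", but the Celsius zero is arbitrary.
ratio_c = 40.0 / 20.0
# On the Kelvin scale (true zero) the ratio is physically meaningful.
ratio_k = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)

print("difference:", diff_c, round(diff_k, 2))
print("ratio:", ratio_c, round(ratio_k, 3))
```

The Kelvin ratio is only slightly above 1, which is why 40ºC cannot be called twice as hot as 20ºC.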
Measurement variables • Some interval scales encountered, for example, in biological data collection are circular scales. Time of day and time of the year are examples of such scales. The interval between 14:00 and 15:30 is the same as the interval between 8:00 and 9:30. But one cannot speak of ratios of times of day.
Measurement variables There are two types of measurement variables: • Continuous variables at least theoretically can assume an infinite number of values between any two fixed points. • Discrete (discontinuous, meristic) variables are variables that have only certain fixed numerical values, with no intermediate values possible
Continuous versus discrete Continuous: • Lengths (cm, in), weights (mg, lb), areas (sq cm, sq ft), capacities (ml, qt), rates (cm/sec, mph, mg/min), lengths of time (hr, yr), angles (grad, rad), temperature (º), percentages Discrete: • Number of a certain structure (leaves, segments, teeth), number of offspring, number of white blood cells in 1 mm³ of blood, number of giraffes visiting a water hole, number of eggs laid by a grasshopper
Ranked variables • Some variables cannot be measured but at least can be ordered or ranked by their magnitude. Such data are said to be on an ordinal scale of measurement and describe more relative than quantitative differences.
Ranked variables • By expressing a variable by a series of ranks, such as 1, 2, 3, 4, 5 we do not imply that the difference in magnitude between ranks 1 and 2 is identical to or even proportional to the difference between 2 and 3. • Ordinal scale data contain and convey less information than ratio or interval data.
Attributes • Variables that cannot be measured but must be expressed qualitatively are called attributes and are said to be on a nominal scale (from "name"). Attributes are properties such as dead or alive, left- or right-handed, male or female, eye color (which may be blue or brown), or human hair color (which may be black, brown, blonde, or red).
Data preprocessing After data have been obtained in a given study, they must be arranged in a form suitable for computation and interpretation. The first step is usually to draw a frequency distribution and calculate some descriptive statistics.
Frequency distributions • Quantitative: plots of measurement variables, both continuous and discrete. • Qualitative: for attributes only.
Example 462 children were diagnosed with type 1 diabetes mellitus in 1989-1996. The following data were collected: • Gender (male/female) • Child number in family • Birth year • Birth weight
Example – Gender We may present the data as counts or percentages.
Example – child number [Figure: child number plotted as a discrete measurement variable and as a ranked variable] Sometimes it is necessary to redefine data.
Example – birth year Grouping the classes of a frequency distribution helps to obtain a more cohesive and smooth-looking distribution.
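The effect of grouping classes can be sketched as follows. The birth years below are hypothetical values invented for illustration (they are not the study data), and `grouped_frequencies` is a helper name assumed here.

```python
from collections import Counter

# Hypothetical birth years, invented for illustration only.
birth_years = [1975, 1976, 1976, 1978, 1979, 1979,
               1980, 1981, 1981, 1981, 1983, 1984]

def grouped_frequencies(values, start, width):
    """Count values per class of the given width, keyed by the lower class limit."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))

# One class per year: an irregular, gappy distribution.
ungrouped = dict(sorted(Counter(birth_years).items()))
# Five-year classes: fewer, fuller classes and a smoother picture.
grouped = grouped_frequencies(birth_years, start=1975, width=5)

print(ungrouped)
print(grouped)
```

The class width is a choice made by the analyst; classes that are too wide hide detail, while classes that are too narrow leave the distribution ragged.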
Descriptive statistics • We need some form of summary to deal with the data in manageable form, as well as to share our findings with others. A histogram of the frequency distribution is one type of summary. A numerical summary is needed to describe the properties of the observed frequency distribution concisely and accurately. Quantities providing such a summary are called descriptive statistics.
Descriptive statistics Two kinds of descriptive statistics will be discussed: • Statistics of location (measures of central tendency): describe the position of a sample along a given dimension representing a variable. • Statistics of dispersion (measures of dispersion, variability): indicate the spread of measurements around the center of the distribution.
Arithmetic mean • The most widely used statistic of location is the arithmetic mean, commonly called the mean or average. Each measurement in a sample may be referred to as an $X_i$. The subscript i can take any integer value from 1 up to N, the total number of individuals in the sample.
Arithmetic mean The mean is denoted $\bar{X}$ and is computed as $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
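The arithmetic mean (the sum of the N observations divided by N) can be sketched in a few lines. The heights below are hypothetical values chosen for illustration.

```python
def arithmetic_mean(xs):
    """Sum of the observations divided by their number N."""
    if not xs:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(xs) / len(xs)

# Hypothetical heights in cm, invented for illustration.
heights_cm = [166, 167, 180, 181]
mean_height = arithmetic_mean(heights_cm)
print(mean_height)
```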
Weighted average • It is often necessary to average means or other statistics that may differ in their reliabilities because, for example, they are based on different sample sizes. In such cases a weighted average needs to be computed.
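Example If three means $\bar{X}_1, \bar{X}_2, \bar{X}_3$ are based on differing sample sizes $n_1, n_2, n_3$, their weighted average is $\bar{X}_w = \frac{\sum_{i} n_i \bar{X}_i}{\sum_{i} n_i}$, which in general differs from the unweighted average $(\bar{X}_1 + \bar{X}_2 + \bar{X}_3)/3$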
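A weighted average with sample sizes as weights can be sketched as below. The three means and their sample sizes are hypothetical numbers invented for illustration, not the values from the original slide.

```python
def weighted_average(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight

# Hypothetical means and the sample sizes they are based on.
means = [3.0, 4.0, 8.0]
sample_sizes = [10, 20, 70]

weighted = weighted_average(means, sample_sizes)
unweighted = sum(means) / len(means)
print(weighted, unweighted)
```

The mean based on the largest sample pulls the weighted average toward itself, so the two results differ.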
Geometric mean • Variables are sometimes transformed into their logarithms. If we calculate the mean of such a transformed variable and then change the mean back into the original scale, this mean will not be the same as if we had computed the arithmetic mean of the original variable. The back-transformed mean of logarithmically transformed variable is called the geometric mean.
Geometric mean Since adding logarithms is equivalent to multiplying their antilogarithms, another way of representing this quantity is $GM = \sqrt[N]{X_1 X_2 \cdots X_N}$
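Both routes to the geometric mean (back-transforming the mean of the logarithms, and taking the N-th root of the product) can be compared in a short sketch. The data values are hypothetical, and the function names are assumptions introduced here.

```python
import math

def geometric_mean_via_logs(xs):
    """Back-transform the arithmetic mean of the natural logarithms."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geometric_mean_via_product(xs):
    """N-th root of the product of the observations."""
    return math.prod(xs) ** (1.0 / len(xs))

# Hypothetical positive measurements, invented for illustration.
data = [1.0, 10.0, 100.0]

gm_logs = geometric_mean_via_logs(data)
gm_product = geometric_mean_via_product(data)
arithmetic = sum(data) / len(data)
print(gm_logs, gm_product, arithmetic)
```

Both functions require strictly positive values, because the logarithm is undefined otherwise. The geometric mean is much smaller than the arithmetic mean for data this skewed.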
Harmonic mean • The reciprocal of the arithmetic mean of reciprocals is called the harmonic mean and is symbolized by H: $\frac{1}{H} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{X_i}$
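The harmonic mean can be sketched directly from its definition. The two speeds below are hypothetical values; the standard library function `statistics.harmonic_mean` is used only as a cross-check.

```python
import statistics

def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not xs or any(x <= 0 for x in xs):
        raise ValueError("this sketch expects a non-empty sample of positive values")
    return len(xs) / sum(1.0 / x for x in xs)

# Hypothetical speeds in km/h over two legs of equal distance.
speeds = [40.0, 60.0]

h = harmonic_mean(speeds)
h_stdlib = statistics.harmonic_mean(speeds)
print(h, h_stdlib)
```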
Median • The median M is defined as the value of the variable (in an ordered array) that has an equal number of items on either side of it. • If the sample size N is odd, then $M = X_{(N+1)/2}$
Median • If N is even, then the subscript $(N+1)/2$ is a half-integer. This indicates that there is no middle value in the ordered list of data; instead there are two middle values, and the median is defined as the midpoint between them: $M = \frac{X_{N/2} + X_{N/2+1}}{2}$
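The odd and even cases of the median can be sketched as one function. Note that Python lists are 0-based, so the 1-based position (N+1)/2 corresponds to index N // 2.

```python
def median(xs):
    """Middle value of the ordered sample, or the midpoint of the two middle values."""
    if not xs:
        raise ValueError("the median of an empty sample is undefined")
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: 1-based position (N + 1) / 2 is 0-based index N // 2.
        return ordered[mid]
    # Even N: midpoint between the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

median_odd = median([3, 1, 2])
median_even = median([4, 1, 3, 2])
print(median_odd, median_even)
```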
Other quartiles • The median is just one of a family of statistics dividing a frequency distribution into equal proportions. It divides the distribution into two halves. Quartiles, on the other hand, cut the distribution at the 25%, 50%, and 75% points – that is at points dividing the distribution into first, second, third, and fourth quarters by area. They are usually symbolized by Q1 (lower quartile), M (median), Q3 (upper quartile).
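Quartiles can be computed with the standard library, as in the sketch below. Several conventions exist for interpolating quartiles in small samples; the `method="inclusive"` setting used here is one such convention and an assumption of this example, not something prescribed by the lecture.

```python
import statistics

# Hypothetical sample, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points dividing the data into four quarters.
q1, m, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, m, q3)
```

The middle cut point coincides with the median of the sample.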
Mode • The mode is commonly defined as the most frequently occurring measurement in a set of data. But sometimes it is better to define a mode as a measurement of relatively great concentration, because some frequency distributions have more than one such point of concentration, even though these concentrations might not contain precisely the same frequencies.
Example • Let us assume a sample consisting of the data: 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10, 11, 12, 12, 12, 12, 12, 12, 13, 13, and 14 mm. This is a bimodal distribution: the major mode is 8 mm (7 occurrences) and the minor mode is 12 mm (6 occurrences).
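The modes of this sample can be found by counting frequencies. In the sketch below, a "point of concentration" is taken to be a value whose frequency exceeds that of both neighbouring observed values; this local-peak rule and the helper name `local_peaks` are simplifying assumptions made for illustration.

```python
from collections import Counter

# The sample from the example, in mm.
data_mm = [6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10, 11,
           12, 12, 12, 12, 12, 12, 13, 13, 14]

counts = Counter(data_mm)

# Strict definition: the single most frequent value.
major_mode, major_count = counts.most_common(1)[0]

def local_peaks(freq):
    """Values whose frequency exceeds that of both neighbouring observed values."""
    values = sorted(freq)
    peaks = []
    for i, v in enumerate(values):
        left = freq[values[i - 1]] if i > 0 else 0
        right = freq[values[i + 1]] if i < len(values) - 1 else 0
        if freq[v] > left and freq[v] > right:
            peaks.append(v)
    return peaks

peaks = local_peaks(counts)
print(major_mode, major_count, peaks)
```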
A few remarks • The mean is generally preferred in statistics, but it is markedly affected by outlying observations, whereas the median and mode are not. • In a unimodal, symmetrical distribution the mean, the median, and the mode are all identical.
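Both remarks can be checked on small samples. The numbers below are hypothetical and were chosen so that the arithmetic is easy to follow.

```python
import statistics

# Hypothetical sample and the same sample with one outlying observation.
clean = [6, 7, 8, 9, 10]
with_outlier = [6, 7, 8, 9, 100]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# A unimodal, symmetrical sample: mean, median, and mode coincide.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
sym_stats = (statistics.mean(symmetric),
             statistics.median(symmetric),
             statistics.mode(symmetric))

print(mean_clean, mean_outlier, median_clean, median_outlier, sym_stats)
```

The single outlier more than triples the mean while leaving the median unchanged.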
Measures of dispersion - the range • The range is a measure of the span of variates along the scale of the variable. • It is clearly affected by even a single outlying value; for this reason it is only a rough estimate of the dispersion of all items in the sample.
Interquartile range • The distance between Q1 and Q3, the first and third quartiles (i.e., the 25th and 75th percentiles), is known as the interquartile range: $IQR = Q_3 - Q_1$. Half of this distance, $(Q_3 - Q_1)/2$, is called the quartile deviation (semi-interquartile range).
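The range and the interquartile range react very differently to a single outlying value, as the sketch below suggests. The data are hypothetical, and the quartiles use the `method="inclusive"` convention of `statistics.quantiles`, which is an assumption of this example.

```python
import statistics

def value_range(xs):
    """Difference between the largest and the smallest observation."""
    return max(xs) - min(xs)

def interquartile_range(xs):
    """Q3 - Q1, using the 'inclusive' quartile convention (one of several)."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return q3 - q1

# Hypothetical sample and the same sample with one outlying value.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

range_clean, range_outlier = value_range(clean), value_range(with_outlier)
iqr_clean, iqr_outlier = interquartile_range(clean), interquartile_range(with_outlier)
print(range_clean, range_outlier, iqr_clean, iqr_outlier)
```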
Mean deviation • As the mean is so useful a measure of central tendency, one might express dispersion in terms of deviations from the mean. Summing the absolute values of the deviations from the mean and dividing the sum by N results in a measure known as the mean deviation (AD): $AD = \frac{1}{N} \sum_{i=1}^{N} |X_i - \bar{X}|$
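The mean deviation can be sketched directly from its verbal definition. The sample below is hypothetical.

```python
def mean_deviation(xs):
    """Average absolute deviation of the observations from their mean."""
    if not xs:
        raise ValueError("the mean deviation of an empty sample is undefined")
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
ad = mean_deviation(data)
print(ad)
```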
Variance • An alternative way of measuring deviations from the mean is in terms of squared distances from the mean. The sum of these squared deviations, $SS = \sum_{i=1}^{N} (X_i - \bar{X})^2$, is a very important quantity in statistics, known as the sum of squares (SS). The variance is the average of the squared deviations, $SS / N$.
Standard deviation • The standard deviation is the positive square root of the variance; therefore, it has the same units as the original measurements.
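The sum of squares, variance, and standard deviation can be sketched together. The sample is hypothetical. The variance here divides SS by N, following the "average of the squared deviations" wording; dividing by N - 1 instead is the common convention for estimating a population variance from a sample, and is shown as an additional assumption of this example.

```python
import math

def sum_of_squares(xs):
    """Sum of squared deviations from the mean (SS)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

ss = sum_of_squares(data)
variance = ss / n                 # average of the squared deviations
std_dev = math.sqrt(variance)     # same units as the original measurements
sample_variance = ss / (n - 1)    # assumption: N - 1 divisor for sample estimates

print(ss, variance, std_dev, sample_variance)
```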