Data and central tendency. Integrated Disease Surveillance Programme (IDSP) district surveillance officers (DSO) course. Outline of the session. Type of data Central tendency. Epidemiological process. We collect data We use criteria and definitions We analyze data into information
Data and central tendency Integrated Disease Surveillance Programme (IDSP) district surveillance officers (DSO) course 1
Outline of the session • Type of data • Central tendency 2
Epidemiological process • We collect data • We use criteria and definitions • We analyze data into information • “Data reduction / condensation” • We interpret the information for decision making • What does the information mean to us? 3
Surveillance: A role of the public health system • The systematic process of collection, transmission, analysis and feedback of public health data for decision making • Surveillance cycle (figure): Data, Analysis, Information, Interpretation, Action • Today we will focus on DATA: The starting point 4
Data: A definition • Set of related numbers • Raw material for statistics • Example: • Temperature of a patient over time • Date of onset of patients 5
Types of data • Qualitative data • No magnitude / size • Classified by counting the units that have the same attribute • Types • Binary • Nominal • Ordinal • Quantitative data 6
Qualitative, binary data • The variable can only take two values • 1,0 often used (or 1,2) • Yes, No • Example: • Sex • Male, Female • Female sex • Yes, No 7
REC SEX --- ---- 1 M 2 M 3 M 4 F 5 M 6 F 7 F 8 M 9 M 10 M 11 F 12 M 13 M 14 M 15 F 16 F 17 F 18 M 19 M 20 M 21 F 22 M 23 M 24 F 25 M 26 M 27 M 28 F 29 M 30 M Frequency distribution for a qualitative binary variable 8
Using a pie chart to display a qualitative binary variable • Figure: pie chart, distribution of cases by sex (Female, Male) 9
Qualitative, nominal data • The variable can take more than two values • Any value • The information fits into one of the categories • The categories cannot be ranked • Example: • Nationality • Language spoken • Blood group 10
REC State --- ----- 1 Punjab 2 Bihar 3 Rajasthan 4 Punjab 5 Bihar 6 Punjab 7 Bihar 8 Bihar 9 UP 10 Rajasthan 11 Bihar 12 Rajasthan 13 Punjab 14 UP 15 Rajasthan 16 UP 17 Punjab 18 UP 19 Rajasthan 20 Bihar 21 UP 22 Bihar 23 UP 24 Rajasthan 25 Bihar 26 Bihar 27 Bihar 28 UP 29 Bihar 30 UP Frequency distribution for a qualitative nominal variable 11
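The frequency distribution for the 30 records above can be tallied by counting how many records share each category. The sketch below is illustrative only: it assumes Python and its standard `collections.Counter`, which the course material does not prescribe, and the variable names are hypothetical.

```python
from collections import Counter

# State of each of the 30 records, in record order (copied from the listing above)
states = [
    "Punjab", "Bihar", "Rajasthan", "Punjab", "Bihar", "Punjab",
    "Bihar", "Bihar", "UP", "Rajasthan", "Bihar", "Rajasthan",
    "Punjab", "UP", "Rajasthan", "UP", "Punjab", "UP",
    "Rajasthan", "Bihar", "UP", "Bihar", "UP", "Rajasthan",
    "Bihar", "Bihar", "Bihar", "UP", "Bihar", "UP",
]

# Count the units that have the same attribute
frequency = Counter(states)
total = len(states)

# List categories from most to least frequent, with their share of all records
for state, count in frequency.most_common():
    print(f"{state:<10} {count:>3} {100 * count / total:5.1f}%")
# Counts: Bihar 11, UP 8, Rajasthan 6, Punjab 5
```

Because the categories of a nominal variable cannot be ranked, sorting by frequency is only a display choice.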
Using a horizontal bar chart to display a qualitative nominal variable • Figure: horizontal bar chart, distribution of cases by state (Bihar, UP, Rajasthan, Punjab; x-axis: frequency, 0 to 15) 12
Qualitative, ordinal data • The variable can only take a number of values that can be ranked along some gradient • Example: • Birth order • First, second, third … • Severity • Mild, moderate, severe • Vaccination status • Unvaccinated, partially vaccinated, fully vaccinated 13
REC Status --- ------- 1 1 2 1 3 2 4 2 5 1 6 2 7 1 8 2 9 3 10 2 11 1 12 3 13 1 14 3 15 1 16 3 17 1 18 1 19 3 20 1 21 1 22 2 23 1 24 2 25 2 26 1 27 2 28 3 29 2 30 2 Frequency distribution for a qualitative ordinal variable Clinical status: 1: Mild; 2 : Moderate; 3 : Severe 14
Using a vertical bar chart to display a qualitative ordinal variable • Figure: vertical bar chart, distribution of cases by severity (Mild, Moderate, Severe; y-axis: frequency, 0 to 15) 15
Key issues • Qualitative data • Quantitative data • We are not simply counting • We are also measuring • Discrete • Continuous 16
Quantitative, discrete data • Values are distinct and separated • Normally, values have no decimals • Example: • Number of sexual partners • Parity • Number of persons who died from measles 17
REC CHILDREN --- ------- 1 1 2 2 3 5 4 6 5 3 6 4 7 1 8 1 9 2 10 3 11 1 12 2 13 7 14 3 15 4 16 2 17 1 18 1 19 1 20 1 21 2 22 3 23 1 24 4 25 2 26 1 27 6 28 4 29 3 30 1 Frequency distribution for a quantitative, discrete variable 18
Using a histogram to display a discrete quantitative variable • Figure: histogram, distribution of households by number of children (x-axis: number of children, 1 to 7; y-axis: frequency, 0 to 12) 19
Quantitative, continuous data • Continuous variable • Can assume continuous uninterrupted range of values • Values may have decimals • Example: • Weight • Height • Hb level • What about temperature? 20
REC WEIGHT --- ------ 1 10.5 2 23.7 3 21.8 4 33.1 5 38.0 6 34.5 7 38.5 8 38.4 9 30.1 10 34.7 11 37.9 12 38.0 13 39.2 14 30.1 15 43.2 16 45.7 17 40.4 18 56.4 19 55.1 20 55.4 21 66.7 22 82.9 23 109.7 24 120.2 25 10.4 26 10.8 27 25.5 28 20.2 29 27.3 30 38.7 Frequency distribution for a continuous quantitative variable: The tally mark 21
REC WEIGHT --- ------ 1 10.5 2 23.7 3 21.8 4 33.1 5 38.0 6 34.5 7 38.5 8 38.4 9 30.1 10 34.7 11 37.9 12 38.0 13 39.2 14 30.1 15 43.2 16 45.7 17 40.4 18 56.4 19 55.1 20 55.4 21 66.7 22 82.9 23 109.7 24 120.2 25 10.4 26 10.8 27 25.5 28 20.2 29 27.3 30 38.7 Frequency distribution for a continuous quantitative variable, after aggregation 22
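Aggregation means grouping each continuous value into a class interval and counting the values per class. The sketch below assumes Python and a class width of 10, chosen to match the weight categories on the histogram slide; the function and variable names are hypothetical.

```python
from collections import Counter

# Weights of the 30 records, in record order (copied from the listing above)
weights = [
    10.5, 23.7, 21.8, 33.1, 38.0, 34.5, 38.5, 38.4, 30.1, 34.7,
    37.9, 38.0, 39.2, 30.1, 43.2, 45.7, 40.4, 56.4, 55.1, 55.4,
    66.7, 82.9, 109.7, 120.2, 10.4, 10.8, 25.5, 20.2, 27.3, 38.7,
]

def class_lower_bound(weight, width=10):
    """Return the lower bound of the class interval that contains weight."""
    return int(weight // width) * width

# Tally the number of observations falling in each class
class_counts = Counter(class_lower_bound(w) for w in weights)

# Print only the classes that contain at least one observation
for lower in sorted(class_counts):
    print(f"{lower} to <{lower + 10}: {class_counts[lower]}")
```

With these data the 30 to <40 class is the fullest, with 12 of the 30 observations.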
Using a histogram to display a frequency distribution for a continuous quantitative variable, after aggregation • Figure: histogram, distribution of cases by weight (x-axis: weight categories, 0-9 to 110-119; y-axis: frequency, 0 to 14) 23
Summary statistics • A single value that summarizes the observed values of a variable • Part of the data reduction process • Two types: • Measures of location/central tendency/average • Measures of dispersion/variability/spread • Describe the shape of the distribution of a set of observations • Necessary for precise and efficient comparisons of different sets of data • The location (average) and shape (variability) of different distributions may be different 24
Describing a distribution • Position • Dispersion 25
Measures of central tendency • Mode • Median • Arithmetic mean 28
The mode • Definition • The mode of a distribution is the value that is observed most frequently in a given set of data • How to obtain it? • Arrange the data in sequence from low to high • Count the number of times each value occurs • The most frequently occurring value is the mode 29
The mode • Figure: frequency distribution with the mode marked (y-axis: N, 0 to 20) 30
Example of mode: annual salary (in 10,000 rupees) • 4, 3, 3, 2, 3, 8, 4, 3, 7, 2 • Arranging the values in order: • 2, 2, 3, 3, 3, 3, 4, 4, 7, 8 • The mode is “3”, which occurs four times 31
Specific features of the mode • There may be no mode • When each value is unique • There may be more than one mode • When more than 1 peak occurs • Bimodal distribution • The mode is not amenable to statistical tests • The mode is not based on all the observations 32
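The counting procedure for the mode, including the cases of no mode and more than one mode, can be sketched as follows. This assumes Python; `find_modes` is a hypothetical helper written for illustration, not part of the course material.

```python
from collections import Counter

def find_modes(values):
    """Return (modes, frequency). modes is empty when every value is unique."""
    counts = Counter(values)          # number of times each value occurs
    if not counts:
        return [], 0                  # no data, no mode
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return [], 1                  # each value is unique: no mode
    modes = sorted(v for v, c in counts.items() if c == highest)
    return modes, highest

# Annual salaries (in 10,000 rupees) from the example slide
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]
modes, frequency = find_modes(salaries)
print(modes, frequency)               # [3] 4

print(find_modes([1, 2, 3]))          # ([], 1): no mode
print(find_modes([1, 1, 2, 2, 3]))    # ([1, 2], 2): two modes
```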
The median • The median describes literally the middle value of the data • It is defined as the value above and below which half (50%) of the observations fall 33
Computing the median • Arrange the observations in order from smallest to largest (ascending order) or vice versa • Count the number of observations, “n” • If “n” is an odd number • Median = value of the ((n+1)/2)th observation (the middle value) • If “n” is an even number • Median = the average of the (n/2)th and ((n/2)+1)th observations (the average of the two middle values) 34
Example of median calculation • What is the median of the following values: • 10, 20, 12, 3, 18, 16, 14, 25, 2 • Arrange the numbers in increasing order • 2 , 3, 10, 12, 14, 16, 18, 20, 25 • Median = 14 • Suppose there is one more observation (8) • 2 , 3, 8, 10, 12, 14, 16, 18, 20, 25 • Median = Mean of 12 & 14 = 13 35
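The odd and even rules can be checked against this example with a short sketch. It assumes Python; `median_of` is a hypothetical name, and the positions are shifted by one because Python lists are indexed from 0.

```python
def median_of(values):
    """Median by the rule above: the middle value, or the mean of the two middle values."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)                  # count the observations
    middle = n // 2
    if n % 2 == 1:
        # odd n: the ((n+1)/2)th observation, which is index n // 2 when counting from 0
        return ordered[middle]
    # even n: average of the (n/2)th and ((n/2)+1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2

observations = [10, 20, 12, 3, 18, 16, 14, 25, 2]
print(median_of(observations))        # 14

# Adding one more observation (8) makes n even
print(median_of(observations + [8]))  # 13.0
```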
Advantages and disadvantages of the median • Advantages • The median is unaffected by extreme values • Disadvantages • The median does not contain information on the other values of the distribution • Only selected by its rank • You can change 50% of the values without affecting the median • The median is less amenable to statistical tests 36
The median is not sensitive to extreme values • Figure: two distributions with the same median 37
Mean (Arithmetic mean / Average) • Most commonly used measure of location • Definition • Calculated by adding all observed values and dividing by the total number of observations • Notations • Each observation is denoted as $x_1, x_2, \dots, x_n$ • The total number of observations: $n$ • Summation process = Sigma: $\sum$ • The mean: $\bar{X}$ • $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$ 38
Computation of the mean • Duration of stay in days in a hospital • 8,25,7,5,8,3,10,12,9 • 9 observations (n=9) • Sum of all observations =87 • Mean duration of stay = 87 / 9 = 9.67 • Incubation period in days of a disease • 8,45,7,5,8,3,10,12,9 • 9 observations (n=9) • Sum of all observations =107 • Mean incubation period = 107 / 9 = 11.89 39
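Both computations can be reproduced with a few lines. This is a minimal sketch assuming Python; the helper name `arithmetic_mean` and the variable names are hypothetical.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

# Duration of stay in days in a hospital
stay_days = [8, 25, 7, 5, 8, 3, 10, 12, 9]
print(sum(stay_days), len(stay_days))              # 87 9
print(round(arithmetic_mean(stay_days), 2))        # 9.67

# Incubation period in days of a disease
incubation_days = [8, 45, 7, 5, 8, 3, 10, 12, 9]
print(sum(incubation_days), len(incubation_days))  # 107 9
print(round(arithmetic_mean(incubation_days), 2))  # 11.89
```

The two series differ in a single observation (25 versus 45), yet the mean moves from 9.67 to 11.89, which shows how one large value pulls the mean upward.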
Advantages and disadvantages of the mean • Advantages • Has a lot of good theoretical properties • Used as the basis of many statistical tests • Good summary statistic for a symmetrical distribution • Disadvantages • Less useful for an asymmetric distribution • Can be distorted by outliers, therefore giving a less “typical” value 40
Figure: frequency distribution comparing the three measures (x-axis: values 0 to 19; y-axis: N, 0 to 14) • Mode = 13.5 • Median = 10 • Mean = 10.8 41
Ideal characteristics of a measure of central tendency • Easy to understand • Simple to compute • Not unduly affected by extreme values • Rigidly defined • Clear guidelines for calculation • Capable of further mathematical treatment • Sample stability • Different samples generate same measure 42
What measure of location to use? • Consider the duration (days) of absence from work of 21 labourers owing to sickness • 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 59, 80 • Mean = 11 days • Not typical of the series as 19 of the 21 labourers were absent for less than 11 days • Distorted by extreme values • Median = 5 days • Better measure 43
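The comparison can be verified with Python's standard `statistics` module. This is a sketch under that assumption; the variable names are hypothetical.

```python
import statistics

# Duration (days) of absence from work of 21 labourers owing to sickness
days_absent = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5,
               6, 6, 6, 7, 8, 9, 10, 10, 59, 80]

mean_days = statistics.mean(days_absent)
median_days = statistics.median(days_absent)

# How many labourers were absent for fewer days than the mean?
below_mean = sum(1 for d in days_absent if d < mean_days)

print(f"Mean   = {mean_days:.1f} days")    # Mean   = 11.1 days
print(f"Median = {median_days} days")      # Median = 5 days
print(f"{below_mean} of {len(days_absent)} values are below the mean")

# Dropping the two extreme values (59 and 80) shows how much they distort the mean
without_extremes = [d for d in days_absent if d < 59]
print(statistics.mean(without_extremes), statistics.median(without_extremes))
```

Without the two extreme values the mean falls from about 11 to 5 days, while the median only moves from 5 to 4 days.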
Type of data: Summary
Qualitative                     Quantitative
Binary  Nominal    Ordinal      Discrete  Continuous
Sex     State      Status       Children  Weight
M       Bihar      Mild         1         56.4
M       Punjab     Moderate     1         47.8
F       Bihar      Severe       2         59.9
M       Punjab     Mild         3         13.1
F       UP         Moderate     1         25.7
F       Bihar      Mild         1         23.0
M       UP         Moderate     2         30.0
M       Rajasthan  Severe       3         13.7
F       Punjab     Severe       2         15.4
M       Rajasthan  Mild         2         52.5
F       Bihar      Moderate     1         26.6
F       UP         Moderate     1         38.2
M       Rajasthan  Mild         1         59.0
M       Bihar      Severe       2         57.9
M       Punjab     Severe       2         19.6
F       Punjab     Moderate     3         31.7
M       Rajasthan  Mild         2         15.1
F       UP         Mild         3         33.9
M       Bihar      Mild         1         45.6
44
Definitions of measures of central tendency • Mode • The most frequently occurring observation • Median • The mid-point of a set of ordered observations • Arithmetic mean • Aggregate / sum of the given observations divided by the number of observations 45