Uses of Biostatistics in Epidemiology (1)

Amornrath Podhipak, Ph.D.
Department of Epidemiology, Faculty of Public Health, Mahidol University
2006

Medical doctors and public health personnel
Why statistics? Why computers? Why software?
They are tools for calculation.

Why do we need “statistics” in medicine and public health (particularly in epidemiology)?
• Medicine is becoming increasingly quantitative in describing a condition.
  - "Most of the malaria patients are infected with P. falciparum." becomes "82.5% had P. falciparum."
  - "Those patients look pale." becomes "The haemoglobin level was 9.89 g/dL on average."
• Epidemiology is concerned with describing disease patterns in a group of people. Descriptive statistics give a clearer picture of what we want to describe.
• The answer to a research question needs to be more definite.
  - Is the new treatment better? How much better? In what respect? Is there any evidence? Could it be a real difference?
• Inferential statistics give an answer in a world of uncertainty.

Before using statistics, we need some kind of measurement in order to get more detailed information.
Measurement of characteristics (variables vs constants)
4 scales of measurement
Qualitative variables
- Nominal scale (group classification only)
- Ordinal scale (classification with ordering / ranking)
Quantitative variables
- Interval scale (magnitude + constant distance between points)
- Ratio scale (magnitude + constant distance between points + true zero)

BP? 140/90
Intelligent?
Handsome?
Income? 100,000
Weight? 80 kg
Married?
Height? 160 cm
HIV?

Nominal scale (e.g. Male / Female, coded 1 and 2): the values have no meaning.
Ordinal scale (e.g. 1, 2, 3): equal distance between points does not reflect equal interval value.

Interval scale, e.g. degrees Celsius (0, 10, 20, 30): the freezing point of water was set as zero degrees Celsius, which is not the true ZERO of temperature (no heat). Equal distance between points means equal interval value.
Ratio scale, e.g. weight (0, 10, 20, 30): there is a true ZERO (nothing here). Equal distance between points means equal interval value.
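The difference between the two scales can be illustrated with a small sketch. The kelvin conversion is not part of the slides; it is brought in here only as an example of a temperature scale that does have a true zero.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvin (a temperature scale with a true zero)."""
    return c + 273.15

# Interval scale: differences are meaningful, ratios are not.
ratio_celsius = 20 / 10  # 2.0, yet 20 degrees C is not "twice as hot" as 10 degrees C
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)  # close to 1, not 2

# Ratio scale: weight has a true zero, so 20 kg really is twice 10 kg.
ratio_weight = 20 / 10

print(ratio_celsius, round(ratio_kelvin, 3), ratio_weight)
```

On the interval scale the apparent ratio of 2 disappears once the zero point is moved, whereas on the ratio scale it does not depend on any arbitrary zero.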

Questionnaire (TB and passive smoking)
Sex: [ ] Male [ ] Female
Education: [ ] 1-6 yr [ ] 7-9 yr [ ] 9+ yr
Family income: ……………………. Baht/m
Passive smoking: ……...
Record form
Result from tuberculin test: ……………………. mm
X-ray: [ ] +ve [ ] -ve
Weight: …………. kg, Height: ………………….. cm

Variable (characteristic being measured): result of measurement (type)
- Marital status: single / married / divorced (nominal)
- Gender: male / female (nominal)
- Smoking: yes / no (nominal)
- Smoking: nonsmoker / light smoker / moderate smoker / heavy smoker (ordinal)
- Smoking: number of cigarettes per day (ratio)
- Feeling of pain: yes / no (nominal)
- Feeling of pain: none / light / moderate / high (ordinal)
- Feeling of pain: 0 ---------> 10 (ordinal)
- Attitude toward selective abortion: strongly agree / agree / not sure / disagree / strongly disagree (ordinal)
- Blood pressure: mmHg (ratio)
- Temperature: degrees Celsius (interval)
- Weight: grams (ratio)
- Tumor stage: I, II, III, IV (ordinal)

Quantitative (numeric, metric) variables are classified as:
- continuous: can take all values in an interval, e.g. weight, temperature, etc.
- discrete: can take only certain values (often integers), e.g. parity, number of sex partners, etc.
Continuous data can be categorised into groups, for which one needs to define the “lower boundary” and “upper boundary” of each value (or class). The boundaries lie halfway between adjacent recorded values:
- Recorded as 120 121 122 123 124 125 126 127: boundaries 120.5, 121.5, 122.5, 123.5, 124.5 …
- Recorded as 120.1 120.2 120.3 120.4 120.5 120.6 120.7 120.8: boundaries 120.15, 120.25, 120.35, 120.45, 120.55 …
- Recorded as 120.11 120.12 120.13 120.14 120.15 120.16 120.17 120.18: boundaries 120.115, 120.125, 120.135, 120.145, 120.155 …
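The boundary rule above can be sketched in a few lines of Python. The function name `class_boundaries` and its `precision` parameter are hypothetical, chosen for this illustration; `Decimal` is used only so that values such as 120.15 print exactly.

```python
from decimal import Decimal

def class_boundaries(value: str, precision: str) -> tuple[Decimal, Decimal]:
    """Return (lower, upper) boundaries of a recorded value.

    `precision` is the recording unit, e.g. "1", "0.1" or "0.01".
    Each boundary lies half a recording unit away from the value.
    """
    half = Decimal(precision) / 2
    v = Decimal(value)
    return v - half, v + half

for value, precision in [("120", "1"), ("120.1", "0.1"), ("120.11", "0.01")]:
    lower, upper = class_boundaries(value, precision)
    print(f"{value}: {lower} to {upper}")
```

The upper boundary of each recorded value is the lower boundary of the next one, which is why the slide lists a single sequence of boundaries.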

Descriptive statistics - a way to summarize a dataset (a group of measurements)
Example: Height (cm) of 100 children, 10-12 years of age.
140 140 140 136 141 123 125 134 125 129
123 161 142 155 129 130 139 129 134 130
140 132 138 142 155 125 136 129 136 153
151 141 138 125 123 134 135 135 135 130
155 130 134 146 135 139 134 142 139 149
147 155 158 135 141 136 136 147 139 132
134 140 141 153 142 127 147 142 146 127
151 140 151 140 141 147 139 134 140 149
132 140 141 142 165 153 146 134 151 151
134 141 138 130 141 132 140 138 127 129
What are the values that best describe the height of these 100 persons?

1) Rearrange the data:
123 123 123 125 125 125 125 127 127 127
129 129 129 129 129 130 130 130 130 130
132 132 132 132 134 134 134 134 134 134
134 134 134 135 135 135 135 135 136 136
136 136 136 138 138 138 138 139 139 139
139 139 140 140 140 140 140 140 140 140
140 140 141 141 141 141 141 141 141 141
142 142 142 142 142 142 146 146 146 147
147 147 147 149 149 151 151 151 151 151
153 153 153 155 155 155 155 158 161 165
Minimum = 123, Maximum = 165, Range (Max - Min) = 42, Median (value in the middle) = 139, Mode (most repeated value) = 140
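As a rough check, these summary values can be recomputed with Python's standard `statistics` module. The list below is copied from the raw listing of 100 heights; the variable names are chosen for this sketch only.

```python
import statistics

# Heights (cm) of the 100 children, in the order given in the raw listing.
heights = [
    140, 140, 140, 136, 141, 123, 125, 134, 125, 129,
    123, 161, 142, 155, 129, 130, 139, 129, 134, 130,
    140, 132, 138, 142, 155, 125, 136, 129, 136, 153,
    151, 141, 138, 125, 123, 134, 135, 135, 135, 130,
    155, 130, 134, 146, 135, 139, 134, 142, 139, 149,
    147, 155, 158, 135, 141, 136, 136, 147, 139, 132,
    134, 140, 141, 153, 142, 127, 147, 142, 146, 127,
    151, 140, 151, 140, 141, 147, 139, 134, 140, 149,
    132, 140, 141, 142, 165, 153, 146, 134, 151, 151,
    134, 141, 138, 130, 141, 132, 140, 138, 127, 129,
]

ordered = sorted(heights)             # step 1: rearrange the data
minimum, maximum = ordered[0], ordered[-1]
value_range = maximum - minimum       # Max - Min
median = statistics.median(ordered)   # mean of the 50th and 51st values
mode = statistics.mode(ordered)       # most repeated value

print(minimum, maximum, value_range, median, mode)
```

With an even number of observations the median is the average of the two middle values; here both are 139.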

3) Present in a graph (histogram)
[Figure: histogram of frequency against height (cm)]
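The counts behind such a histogram can be sketched as follows. The frequency table is tallied from the 100 raw heights. The class width of 5 cm is an assumption made for this illustration, since the bins used in the original figure are not recoverable.

```python
# Frequency of each recorded height (cm), tallied from the 100 raw values.
freq = {123: 3, 125: 4, 127: 3, 129: 5, 130: 5, 132: 4, 134: 9, 135: 5,
        136: 5, 138: 4, 139: 5, 140: 10, 141: 8, 142: 6, 146: 3, 147: 4,
        149: 2, 151: 5, 153: 3, 155: 4, 158: 1, 161: 1, 165: 1}

WIDTH = 5  # class width in cm (assumed; the slide's own bins are unknown)

def grouped_frequencies(freq: dict[int, int], width: int) -> dict[int, int]:
    """Sum the frequencies into classes starting at multiples of `width`."""
    classes: dict[int, int] = {}
    for height, count in freq.items():
        start = (height // width) * width
        classes[start] = classes.get(start, 0) + count
    return dict(sorted(classes.items()))

classes = grouped_frequencies(freq, WIDTH)
for start, count in classes.items():
    # One '#' per child gives a simple text histogram.
    print(f"{start}-{start + WIDTH - 1} cm | {'#' * count} {count}")
```

A different class width would change the shape of the picture, so the choice of classes should be reported together with the graph.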

Methods of data presentation
1. Table
2. Graph
- line graph
- bar chart
- pie chart
- scatter plot
- area graph
- error bar
- histogram

Another set of values for describing a dataset is the MEAN and STANDARD DEVIATION. The mean indicates the location. The standard deviation indicates (roughly) the spread of the data.
Example: Dataset 1: age of 6 children: 4 4 4 4 4 4. Mean = 4.0 years, sd = 0 years (no variation).
Example: Dataset 2: age of 6 children: 2 2 4 4 6 6. Mean = 4.0 years, sd = 1.79 years (with variation).
Or, another example:
The average body height of these children was 138.9 cm with a standard deviation of 8.9 cm.
The average body height of these children was 138.9 cm with a standard deviation of 0.2 cm.
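The two age datasets can be checked with a short sketch. It assumes the sample standard deviation (dividing by n - 1), which is what reproduces the value of 1.79 quoted for Dataset 2.

```python
import statistics

dataset_1 = [4, 4, 4, 4, 4, 4]  # ages in years, no variation
dataset_2 = [2, 2, 4, 4, 6, 6]  # ages in years, same mean but with variation

for name, ages in [("Dataset 1", dataset_1), ("Dataset 2", dataset_2)]:
    mean = statistics.mean(ages)
    sd = statistics.stdev(ages)  # sample SD: sqrt(sum((x - mean)^2) / (n - 1))
    print(f"{name}: mean = {mean:.1f} years, sd = {sd:.2f} years")
```

For Dataset 2 the squared deviations sum to 16, so the sample variance is 16 / 5 = 3.2 and its square root is about 1.79. Both datasets share the same mean, so only the standard deviation tells them apart.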

If we categorize the data into qualitative groups (tall/short), a proportion can then be calculated.
Descriptive statistics (proportion and/or percentage):
Most of the children were less than 150 cm tall.
85% of them had a height of less than 150 cm.
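A proportion of this kind can be computed directly from a frequency table. The sketch below uses the frequencies tallied from the 100 heights listed earlier; the function name `proportion_below` is hypothetical. With those data, 85 of the 100 children are shorter than 150 cm.

```python
# Frequency of each recorded height (cm), tallied from the 100 raw values.
freq = {123: 3, 125: 4, 127: 3, 129: 5, 130: 5, 132: 4, 134: 9, 135: 5,
        136: 5, 138: 4, 139: 5, 140: 10, 141: 8, 142: 6, 146: 3, 147: 4,
        149: 2, 151: 5, 153: 3, 155: 4, 158: 1, 161: 1, 165: 1}

def proportion_below(freq: dict[int, int], cutoff: float) -> float:
    """Proportion of observations strictly below `cutoff`."""
    total = sum(freq.values())
    below = sum(count for height, count in freq.items() if height < cutoff)
    return below / total

print(f"{proportion_below(freq, 150):.0%} of the children were shorter than 150 cm")
```

The cutoff that separates "short" from "tall" is a choice made by the analyst, and the resulting percentage changes with it.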

A final note on defining a variable and a measurement. Important things to consider before making any measurement:
1. Do we measure the right thing?
   Example: fatty food and CVD.
2. What is the tool that can actually measure what we want to measure?
   - Morphology (measure); indicators:
     - % standard weight
     - body mass index (wt/ht²)
     - triceps skinfold thickness
     - weight for age
     - weight for height
     - etc.
   - Food intake (ask)
   - Protein calorie intake (ask & calculate)
3. How valid is the instrument?
   Does the questionnaire actually get the fatty food intake information? (scope of questions, recall of subjects, certainty of reported amount of food, variability of ingredients, etc.) Does the information obtained actually reflect fatty food intake?
4. How precise is the instrument?
   Does the information precisely estimate the amount of fatty food intake for each individual?
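As an illustration of one of the indicators listed above, body mass index can be computed as weight divided by height squared. The slides give only "wt/ht2"; the units used here (kilograms and metres) are the conventional ones and are stated as an assumption. The example values of 80 kg and 160 cm are the ones that appear on the earlier measurement slide.

```python
def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """BMI = weight (kg) divided by the square of height (m)."""
    height_m = height_cm / 100  # convert centimetres to metres
    return weight_kg / height_m ** 2

print(round(body_mass_index(80, 160), 2))
```

BMI is a derived ratio-scale variable: its quality depends on how validly and precisely weight and height were measured in the first place.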

In summary:
Statistics (and epidemiology) deal with a group of persons (the bigger the group, the better the result), not with one individual patient. We look for the characteristics that are most common in the group.
Descriptive statistics are used for explaining our sample (or findings), e.g.
• Most of the patients were anemic.
• 80% of them had a haemoglobin level of less than 10 g/dL.
• The average haemoglobin level was 9.5 g/dL with a standard deviation of 1.5 g/dL.
Inferential statistics are used to infer to the general population of interest.
