Introduction to Biostatistics. Prof Haroon Saloojee Division of Community Paediatrics. Introduction to Biostatistics Lecture 1. Summarising your data 1. The evidence-based clinician’s motto. In God we trust. All others must bring data. Challenges.
Introduction to Biostatistics Prof Haroon Saloojee Division of Community Paediatrics
Introduction to Biostatistics Lecture 1 Summarising your data 1
The evidence-based clinician’s motto In God we trust. All others must bring data.
Challenges • Statistical ideas can be difficult and intimidating • Thus: • Statistical results are often “skipped-over” when reading scientific literature • Data is often misinterpreted
Misinterpretation of Data “Celebrating birthdays is healthy” Statistics show that those who celebrate the most birthdays live the longest.
You may think that: • A Bar Chart is a map of the locations of the nearest taverns • A p-value is the result of a urinalysis • A t-test is a taste test between rooibos tea and Five Roses tea
Course Structure “BIO-SADISTICS” • Four 45-minute lectures • PowerPoint presentations on student web site • Some text (content) also on web page • Plus, additional internet “links”
Syllabus for the Course
SESSION 1: Summarizing your data 1
• Types of data (quantitative and categorical variables)
• Describing data: average (mean, median, and mode)
• Displaying data graphically (box plots, histograms, bar charts, pie diagrams)
• Frequency distributions
SESSION 2: Summarizing your data 2
• The normal distribution
• Describing data: spread (range, variance, standard deviation, z score)
• Quartiles, percentiles
• Standard error of the mean
• Confidence intervals
SESSION 3: Sampling principles
• Study population
• The sample
• Random sampling
• Non-random sampling
• Sampling bias
• Sample size and power
SESSION 4: Statistical tests and the concept of significance
• Hypothesis testing
• p value
• Statistical versus clinical significance
• Parametric versus non-parametric methods
Free textbook on-line Statistics at Square One http://bmj.bmjjournals.com/collections/statsbk/index.shtml
http://www.medstatsaag.com/mcqs.asp Relevant topics Handling data 1, 4, 5, 6, 7 Sampling 10, 11 Hypothesis testing 17, 18
Today’s Lecture • What types of data are there? (numerical vs. categorical variables) • Describing data - measures of central tendency (mean, median and mode) • Summarising data graphically (histograms, box plots, bar charts, pie diagrams)
Types of Data Numerical data Discrete Examples • No. of children • No. asthma attacks in a week • No. of rooms in home
Types of Data Numerical data Continuous Any value on the continuum is possible (even fractions or decimals) Examples • Weight • Age • Temperature • Heart rate
Types of Data Categorical data Nominal • Mutually exclusive unordered categories • Examples • Sex (male, female) • Eye colour (brown, grey, green, blue) • Are you happy? (Yes, No) • Diarrhoea (Present, absent) • Can summarize in: • Tables – using counts and percentages • Bar Chart
Types of Data Categorical data Ordinal (ordered categories) Examples • Degree of agreement • (Strongly Agree, Agree, Disagree, Strongly disagree) • Severity of injury • Severe, Moderate, Mild • Income level • High, medium, low
PRACTICE Discrete or Continuous? Nominal or Ordinal?
• mg of tar in cigarettes: Continuous
• number of people in a car: Discrete
• high to low temperature in any day: Continuous
• weight: Continuous
• time: Continuous
• number of children in the average family: Discrete
• Average / above average / below average: Ordinal
• Colours of Smarties: Nominal
• Grades (A, B, C, D, F): Ordinal
Data Summaries • It is ALWAYS a good idea to summarise your data • You become familiar with the data and the characteristics of the people that you are studying • You can also identify problems or errors with the data (data management issues).
Summarising and Describing Continuous Data Measures of the centre of data (central tendency) • Mean • Median • Mode
Definitions • The arithmetic mean is what is commonly called the average. The mean is the sum of all the scores divided by the number of scores. • The median is the middle of a distribution: half the scores are above the median and half are below the median. • The mode is the most frequently occurring score in a distribution
“It has been said that a fellow with one leg frozen in ice and the other leg in boiling water is comfortable… …on average.” J.M. Yancy
Sample Mean (X̄) The Average or Arithmetic Mean • Add up data, then divide by sample size (n) • The sample size n is the number of observations (pieces of data) Example Systolic blood pressures (mmHg) • X1 = 120 • X2 = 80 • X3 = 90 • X4 = 110 • X5 = 95 • n = 5
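The slide's recipe (add up the data, divide by n) can be worked through for the five systolic blood pressures listed. This is a minimal sketch; the variable names are illustrative and not from the lecture.

```python
# Systolic blood pressures (mmHg) from the slide: X1..X5
pressures = [120, 80, 90, 110, 95]

n = len(pressures)          # sample size
total = sum(pressures)      # add up the data
sample_mean = total / n     # divide by the sample size

print(f"n = {n}, sum = {total}, mean = {sample_mean} mmHg")
```

For these values the sum is 495, so the sample mean is 495 / 5 = 99 mmHg.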
Notation
• Σ (sigma) denotes the summation of a set of values
• x is the variable usually used to represent the individual data values
• n represents the number of data values in a sample
• N represents the number of data values in a population
• µ is pronounced ‘mu’ and denotes the mean of all values in a population
• x̄ is pronounced ‘x-bar’ and denotes the mean of a set of sample values
Definitions Mean: the value obtained by adding the scores and dividing the total by the number of scores
• Sample: $\bar{x} = \dfrac{\sum x}{n}$
• Population: $\mu = \dfrac{\sum x}{N}$
Notes on Sample Mean • Also called sample average or arithmetic mean • Sensitive to extreme values - One data point could make a great change in sample mean • Why is it called the sample mean? – To distinguish it from population mean
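The sensitivity to extreme values can be seen with a small sketch. It reuses the five blood pressures from the earlier example and assumes, purely for illustration, that the value 95 was mistyped as 950. The median is shown alongside for comparison.

```python
from statistics import mean, median

# Blood pressures (mmHg) from the earlier example
original = [120, 80, 90, 110, 95]

# Hypothetical data-entry error: 95 recorded as 950 (an assumption for illustration)
with_error = [120, 80, 90, 110, 950]

for label, data in [("original", original), ("with error", with_error)]:
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

One wrong data point moves the mean from 99 to 270, while the median only moves from 95 to 110.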
Population Versus Sample • Population - The entire group you want information about – For example: The blood pressure of all 20-year-old male university students in South Africa • Sample - A part of the population from which we actually collect information and draw conclusions about the whole population – For example: Sample of blood pressures (n=50) of 20-year-old male university students in South Africa • The sample mean X̄ is not the population mean µ
Population Versus Sample • We don’t know the population mean µ but would like to know it • We draw a sample from the population • We calculate the sample mean X̄ • How close is X̄ to µ? • Statistical theory will tell us how close X̄ is to µ • Statistical inference is the process of trying to draw conclusions about the population from the sample
Weighted Mean $\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$ Your grades in many courses are weighted means (averages). In other words, some things count (are weighted) more than others.
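As a rough illustration of the weighted mean, the sketch below computes a course grade. The marks and weights are made up for the example and do not come from the lecture.

```python
# Hypothetical course components: (mark out of 100, weight in percent)
components = {
    "assignment": (60, 20),
    "test": (70, 30),
    "exam": (80, 50),
}

# Weighted mean = sum of (weight * value) divided by sum of weights
weighted_sum = sum(w * x for x, w in components.values())
total_weight = sum(w for _, w in components.values())
weighted_mean = weighted_sum / total_weight

# Ordinary (unweighted) mean for comparison
plain_mean = sum(x for x, _ in components.values()) / len(components)

print(f"weighted mean = {weighted_mean}, unweighted mean = {plain_mean}")
```

Because the exam carries half the weight, the weighted mean (73) sits closer to the exam mark than the unweighted mean (70) does.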
Geometric Means These are histograms rotated 90º, and box plots. Note how the log transformation gives a symmetric distribution.
Median
• 5 5 5 3 1 5 1 4 3 5 2
• 1 1 2 3 3 4 5 5 5 5 5 (in order)
• exact middle: MEDIAN is 4
• 1 1 3 3 4 5 5 5 5 5
• no exact middle: shared by two numbers
• MEDIAN is (4 + 5) / 2 = 4.5
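The two cases on this slide (an odd number of values with an exact middle, and an even number where two values share the middle) can be sketched as a small function. The function name is illustrative; Python's standard `statistics.median` gives the same results.

```python
def median_of(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # put the data in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: one exact middle value
        return ordered[mid]
    # even count: the middle is shared by two numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5, 2]   # 11 values from the slide
even_data = [1, 1, 3, 3, 4, 5, 5, 5, 5, 5]     # 10 values from the slide

print(median_of(odd_data))
print(median_of(even_data))
```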
Mode • The score that occurs most frequently • Variants: bimodal, multimodal, no mode • The only measure of central tendency that can be used with nominal data
Examples
• a. 5 5 5 3 1 5 1 4 3 5: Mode is 5
• b. 2 2 2 3 4 5 6 6 6 7 9: Bimodal (2 & 6)
• c. 2 3 6 7 8 9 10: No Mode
• d. 2 2 3 3 3 4: Mode is 3
• e. 2 2 3 3 4 4 5 5: No Mode
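The examples above can be checked with a short sketch. It assumes the convention the slide appears to use: when every value occurs equally often, the data set is reported as having no mode. The function name is illustrative.

```python
from collections import Counter


def modes_of(values):
    """Return the list of modes, or an empty list when all values tie (no mode)."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if every distinct value is equally frequent, there is no mode
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


examples = {
    "a": [5, 5, 5, 3, 1, 5, 1, 4, 3, 5],
    "b": [2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9],
    "c": [2, 3, 6, 7, 8, 9, 10],
    "d": [2, 2, 3, 3, 3, 4],
    "e": [2, 2, 3, 3, 4, 4, 5, 5],
}

for label, data in examples.items():
    result = modes_of(data)
    print(label, result if result else "no mode")
```

Note that Python's `statistics.multimode` follows a different convention: for sets such as c and e it returns every tied value rather than reporting no mode.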
Shapes of the Distribution Example: Height of students in the class
Shapes of the Distribution Example: Serum cholesterol level
Shapes of the Distribution Example: Birth weight of newborn babies
Some visual ways to summarize data • Tables • Frequency table • Graphs • Histograms • Bar graphs • Box plots • Line plots • Scatter graphs • Charts • Bar chart • Pie diagram
Frequency Tables • Summarizes a variable with counts and percentages • The variable is categorical • Note that you can take a continuous variable and create categories with it • How do you create categories for a continuous variable? • Choose cutoffs that are biologically meaningful • Natural breaks in the data
Example of frequency table When raw data are arranged with frequencies, they are said to form a frequency table for ungrouped data. When the data are divided into groups/ classes, they are called grouped data. The classes have to be decided according to the range of data and size of class. The number of observations lying in a particular class is called its frequency and the table showing classes with frequencies is called a frequency table. The total of frequencies of a particular class and of all classes prior to that class is called the cumulative frequency of that class.
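The terms defined above (classes, frequency, cumulative frequency) can be illustrated with a minimal sketch. The ages and the class width of 10 years are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical raw data: ages in years (not from the lecture)
ages = [23, 35, 41, 28, 52, 37, 45, 31, 29, 48, 55, 33]
class_width = 10  # assumed class size

# Frequency: number of observations falling in each class
frequency = Counter((age // class_width) * class_width for age in ages)

# Cumulative frequency: frequency of a class plus all classes before it
rows = []
running_total = 0
for lower in sorted(frequency):
    running_total += frequency[lower]
    label = f"{lower}-{lower + class_width - 1}"
    rows.append((label, frequency[lower], running_total))

print(f"{'Class':<8}{'Freq':>6}{'Cum. freq':>11}")
for label, freq, cum in rows:
    print(f"{label:<8}{freq:>6}{cum:>11}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check that no value was dropped when grouping.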
Graphical Summaries • Histograms • Continuous or ordinal data on horizontal axis • Bar Graphs • Nominal data • No order to horizontal axis • Box Plots • Continuous data
Histogram A histogram is a graphic representation of the frequency distribution of a variable. Vertical rectangles (bars) are drawn in such a way that their bases lie on a linear scale representing different intervals, and their heights are proportional to the frequencies of the values within each of the intervals.
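A text-only sketch of the idea: continuous values are assigned to equal-width intervals on a linear scale, and each bar's length is proportional to the frequency in that interval. The birth weights, starting point and interval width are hypothetical, and the bars are drawn horizontally for simplicity.

```python
# Hypothetical birth weights in grams (not from the lecture)
weights = [2400, 3100, 3500, 2900, 3800, 3300, 4100, 3000, 3600, 2700]

start = 2000      # assumed lower edge of the first interval
width = 500       # assumed interval width
num_bins = 5      # intervals cover 2000 up to (but not including) 4500

# Count how many values fall in each interval
counts = [0] * num_bins
for w in weights:
    index = (w - start) // width
    counts[index] += 1

# Draw one bar per interval; bar length is proportional to frequency
for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower}-{lower + width - 1} g | {'#' * count}")
```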
Bar Chart A bar chart is a method of presenting discrete data organized in such a way that each observation falls into exactly one of several mutually exclusive categories. The frequencies (or percentages) are listed along the Y axis and the categories of the variable along the X axis. The heights of the bars correspond to the frequencies. The bars should be of equal width and should not touch one another.
Difference between bar chart and histogram • Use bar charts for categories that are separate • Use histograms if the categories were obtained by dividing up continuous data • Bars do not touch; histogram rectangles do touch.
Line graph If the mid-points of the tops of the bars of a histogram are connected by a line and the bars are omitted from the display, the resulting graph is a line graph (also called a frequency polygon). Line graphs are good at showing trends over a period of time. Trends in rates (e.g. death rate, infant mortality rate) are better displayed with line graphs than with histograms.