Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc
Department of Surgery
Department of Clinical Epidemiology and Biostatistics
March 18, 2009
Objectives
• To understand and recognize different types of variables
• To learn how to explore your data
• To learn how to display data with numbers and tables
• To learn how to display data using graphs
• To understand the fundamental concept of variability
• To learn the notion of the distribution of a variable
Why and how are statistics relevant to medicine?
• Prevention – What causes a disease?
• Diagnosis – What symptoms and signs do patients with a given disease present with?
• Treatment – What treatments are effective for a given disease and for which patients?
• Prognosis – How will specific patients with a given disease fare in the long term?
How many 'E's?
B A E W D S A Q P B B W E O N F O H E E R D T TY E D T E Q O N E G G O L T S D G F E W G E G G V B A Y A O E E D Y H E J U E G D E T E W W E T H E F E O P L U M R
Descriptive and Inferential statistics?
• Descriptive statistics are concerned with the presentation, organization, and summarization of data.
• Inferential statistics allow us to generalize from a sample to a larger group of subjects.
What is data?
• Data are collected for some purpose, and each piece of information collected has a meaning in some context.
• Data are a set of information or observations about a group of individuals or subjects.
• This information is organized in the form of variables.
• A variable is any characteristic of a person or subject that can be measured or categorized and whose value varies from individual to individual.
Dependent and Independent Variables?
Dependent variable
• Is the outcome of interest, which changes in response to some intervention or exposure.
- Examples: mortality, survival, post-op pain, quality of life, post-op complications
Independent variable
• Is the explanatory variable that explains the changes in the dependent variable.
- Examples: demographics (age, gender, height), risk factors (diabetes, CAD)
• Is the intervention or exposure that causes the changes in the dependent variable.
- Examples: drug, surgery, radiation, smoking …
Type of variables …?
Qualitative (categorical or attribute) variable
• Nonnumeric: gender, severity of injury, type of injury, tumour grade
Quantitative variable
• Numeric
• A discrete variable can assume only whole numbers: number of accidents, number of injuries, pain score (e.g. on an integer rating scale)
• A continuous variable may take any value within a defined range: weight, height, age, blood pressure, level of cholesterol, pain score (e.g. on a visual analogue scale)
Level of measurement …
• There are four levels of measurement:
- Qualitative/Categorical: Nominal, Ordinal
- Quantitative/Numeric: Interval, Ratio
Level of measurement … cont’d
Variable type: Assumptions
• Nominal: Named categories
• Ordinal: Same as nominal plus ordered categories
• Interval: Same as ordinal plus equal intervals
• Ratio: Same as interval plus meaningful zero
Level of measurement … cont’d
• A nominal variable consists of named categories, with no implied order among the categories.
- gender, mortality (dichotomous or binary)
- type of injury, type of fracture, blood type
• An ordinal variable consists of ordered categories, where the differences between categories cannot be considered to be equal.
- Tumour stage – I, II, III, IV; tumour grade – I, II, III, IV
- Likert scale – excellent, very good, good, fair, poor
Level of measurement … cont’d
• An interval variable has equal distances between values but no meaningful ‘zero’ value.
- IQ test (the differences between numbers are meaningful but the ratios between them are not)
• A ratio variable has equal intervals between values and a meaningful zero point, so the ratio between two values makes sense.
- height, weight, laboratory test values, age
For example
• Primary objective: To compare post-operative pain between laparoscopic and open surgery in patients with colorectal cancer
• Secondary objective: To compare post-operative complications between laparoscopic and open surgery in patients with colorectal cancer
• Independent (Explanatory) variables: Age, Sex, Pre-op pain Severity
• Dependent/outcome variables: Changes in pain, Complication
• Independent (Comparison) variable: Type of surgery (laparoscopic vs open)
Data Editing
• Validity edits: Ensure that:
- essential fields have been completed and there is no missing information
- specified units of measure have been properly used and the measurements are within the acceptable range.
• Duplication edits: Ensure that each case/patient has been entered into the database only once.
• Statistical edits: Identify and double-check all extreme values, suspicious data and outliers.
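The validity and duplication edits above can be sketched as a few automated checks. This is only a minimal illustration: the record layout, the field names (`id`, `age`, `pain`) and the acceptable ranges are assumptions made for the example, not part of the original slides.

```python
def edit_checks(records, required, ranges):
    """Return a list of (record index, problem) tuples for a list of dict records."""
    problems, seen_ids = [], set()
    for i, rec in enumerate(records):
        # Validity edit: essential fields must be completed
        for field in required:
            if rec.get(field) is None:
                problems.append((i, f"missing {field}"))
        # Validity edit: measurements must lie within the acceptable range
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((i, f"{field} out of range"))
        # Duplication edit: each patient should be entered only once
        if rec.get("id") in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(rec.get("id"))
    return problems

# Hypothetical records with deliberate errors
records = [
    {"id": 1, "age": 64, "pain": 3},
    {"id": 2, "age": 640, "pain": 5},    # likely a data-entry error
    {"id": 2, "age": 58, "pain": None},  # duplicate id and missing pain score
]
problems = edit_checks(records, required=["id", "age", "pain"],
                       ranges={"age": (18, 110), "pain": (0, 10)})
for index, problem in problems:
    print(index, problem)
```

Flagged records still need a human decision: an out-of-range value may be a typing error or a genuine extreme observation.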
Descriptive Statistics
… are a means of organizing and summarizing observations. We examine variables in order to describe their main features. Some basic strategies help us organize our exploration of a set of data:
• Begin by examining each variable.
• Examine the distribution of each variable by creating frequency tables, numerical summaries and graphs.
• Study the relationships between the variables.
Examining Distributions: Categorical …
• Numbers
- Frequencies (counts), cumulative frequencies
- Relative frequencies (%), cumulative relative frequencies (%)
• Graphs
- Bar charts
- Pie charts
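As a rough sketch of how such a frequency table can be built, the helper below counts each category and accumulates the counts and percentages. The function name `frequency_table` and the tumour-stage data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def frequency_table(values, order=None):
    """Rows of (category, count, cumulative count, percent, cumulative percent)."""
    counts = Counter(values)
    categories = order if order is not None else sorted(counts)
    total = sum(counts[c] for c in categories)
    rows, running = [], 0
    for category in categories:
        running += counts[category]
        rows.append((category, counts[category], running,
                     100 * counts[category] / total, 100 * running / total))
    return rows

# Hypothetical tumour stages for 10 patients (illustrative data only)
stages = ["I", "II", "II", "III", "I", "IV", "II", "III", "II", "I"]
table = frequency_table(stages, order=["I", "II", "III", "IV"])
for row in table:
    print("{:<4}{:>3}{:>4}{:>7.1f}{:>7.1f}".format(*row))
```

Cumulative columns are only meaningful when the categories have an order (ordinal data), which is why the order is passed explicitly here.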
Bar charts …
• A bar chart can be used to depict any level of measurement (nominal, ordinal, interval, or ratio).
• It is a series of separated bars (vertical or horizontal), one per category.
• Bars represent the frequency (count) or relative frequency (percent or proportion) of each category.
• A bar chart is also useful for showing data for more than one group.
Pie charts …
• Used primarily for nominal and ordinal data.
• Used to display a relative frequency distribution.
• The circle is divided proportionally using the relative frequency of each category.
• A pie chart is useful for showing data for one group, but it is poorly suited to graphic illustration of two or more groups.
Examining Distributions: Quantitative …
• Numbers
- Measures of central tendency – mean, median, mode
- Measures of variation around mean – variance, standard deviation, standard error of mean
- Measures of variation around median – percentiles, quintiles, quartiles
• Graphs
- Histograms
- Box plots (the five-number summary)
Measures of central tendency
• Mean: the sum of the observations divided by the number of observations
• Median: the midpoint of a distribution after arranging all observations in order of size, from smallest to largest
• Mode: the most frequent value – the highest peak
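A small worked example may make the three definitions concrete. The pain scores below are made up for illustration; the calculations use Python's standard `statistics` module.

```python
import statistics

# Hypothetical post-operative pain scores (0-10) for nine patients
pain = [2, 3, 3, 4, 5, 3, 6, 7, 3]

mean_pain = statistics.mean(pain)      # sum of observations / number of observations
median_pain = statistics.median(pain)  # middle value after sorting
mode_pain = statistics.mode(pain)      # most frequent value

print(mean_pain, median_pain, mode_pain)
```

Here the sum is 36 over 9 observations, so the mean is 4, while the sorted middle value and the most frequent value are both 3.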
Properties of mean …
• It is used for interval or ratio data.
• A set of data has only one mean.
• All values are included in the computation.
• It is the only measure of central tendency for which the sum of the deviations of each value from it is always zero.
• The mean is a useful measure for comparing two or more sets of data.
• The mean is sensitive to extreme values.
Properties of median …
• It is used for ordinal, interval or ratio data.
• There is a unique median for each data set.
• The median is not necessarily equal to one of the sample values.
• It is resistant (insensitive) to extreme values.
• It is useful for summarising skewed data.
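The contrast between the two measures can be illustrated with a single extreme value. The lengths of stay below are invented for the example.

```python
import statistics

# Hypothetical lengths of hospital stay in days
stay = [3, 4, 4, 5, 6]
stay_with_outlier = stay + [60]   # one patient with a very long stay

mean_before = statistics.mean(stay)
mean_after = statistics.mean(stay_with_outlier)
median_before = statistics.median(stay)
median_after = statistics.median(stay_with_outlier)

# Deviations from the mean sum to zero (up to floating-point rounding)
deviation_sum = sum(x - mean_before for x in stay)

print(mean_before, mean_after)      # the mean is pulled towards the extreme value
print(median_before, median_after)  # the median barely moves
```

With an even number of observations the median is the average of the two middle values, so it need not be one of the observed values.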
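Measures of variation around mean
• Variance: the average of the squared deviations of the data from their mean (for a sample, the sum of squared deviations is divided by n – 1)
• Standard deviation: the square root of the variance
• Standard error of the mean: the standard deviation divided by the square root of the number of observations, s/√n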
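The following sketch computes the three measures step by step for a sample, assuming the usual sample formulas (divisor n - 1 for the variance, and standard error = standard deviation / √n). The ages are hypothetical.

```python
import math
import statistics

# Hypothetical ages (years) of eight patients
ages = [59, 62, 65, 66, 68, 70, 72, 74]
n = len(ages)
mean_age = statistics.mean(ages)

# Sample variance: sum of squared deviations divided by n - 1 (units: years squared)
variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)
sd = math.sqrt(variance)   # standard deviation, back in years
se = sd / math.sqrt(n)     # standard error of the mean

print(round(variance, 2), round(sd, 2), round(se, 2))
```

The hand calculation agrees with `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor.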
Properties of variance …
• All values are used in the calculation.
• The units are not the same as those of the data; they are the square of the original units.
Properties of standard deviation …
• The units are the same as those of the data.
• It is used in the Empirical Rule. For any symmetrical, bell-shaped (approximately normal) distribution:
- About 68% of the observations will lie within 1 s.d. of the mean.
- About 95% of the observations will lie within 2 s.d. of the mean.
- About 99.7% of the observations will lie within 3 s.d. of the mean.
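The Empirical Rule percentages can be checked against the normal distribution itself. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the helper name `coverage` is introduced only for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Proportion of a normal distribution lying within k s.d. of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * coverage(k), 1))
```

The rule applies to roughly normal data; for strongly skewed data these proportions can be quite different.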
Measures of variation around median
• Percentiles:
- Arrange the observations from smallest to largest.
- Divide them into 100 equal parts.
- For example, the 5th percentile of a distribution is the value below which 5% of the observations fall and above which 95% fall.
• Quartiles: 25th, 50th and 75th percentiles
• Quintiles: 20th, 40th, 60th and 80th percentiles
• Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles
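As an illustration, quartiles and quintiles can be obtained with `statistics.quantiles` (Python 3.8+). The ages are hypothetical. Note that software packages use slightly different interpolation rules, so cut points from other tools may differ a little for the same data.

```python
import statistics

# Hypothetical ages (years) of eleven patients, chosen only for illustration
ages = [44, 52, 59, 61, 63, 65, 67, 68, 70, 75, 82]

# n=4 returns the three cut points Q1, Q2 (median) and Q3
q1, q2, q3 = statistics.quantiles(ages, n=4)

# n=5 returns the four quintile cut points (20th, 40th, 60th, 80th percentiles)
quintiles = statistics.quantiles(ages, n=5)

print(q1, q2, q3)
print(quintiles)
```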
[Figure: Histogram. Are there outliers?]
Histograms …
• Used for interval and ratio data.
• A histogram is a graph in which each bar on the horizontal axis represents a range of values (a class interval); the size of that range is the interval width. The vertical axis represents the frequency of each interval.
• There are no spaces between bars.
• A histogram is useful for graphic illustration of one group.
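The counting behind a histogram can be sketched in a few lines: each observation is assigned to an equal-width interval and the bar height is the number of observations in that interval. The function name, the ages, and the choice of 10-year intervals starting at 40 are assumptions for the example.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals starting at `start`.

    Interval i covers start + i*width up to, but not including, start + (i+1)*width.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical ages in whole years
ages = [44, 52, 59, 61, 63, 65, 67, 68, 70, 75, 82]
counts = histogram_counts(ages, start=40, width=10, n_bins=5)

# Text rendering: one row per interval, one '#' per observation
for i, c in enumerate(counts):
    low = 40 + 10 * i
    print(f"{low}-{low + 9}: {'#' * c}")
```

The choice of interval width changes the apparent shape, so it is worth trying more than one width.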
[Figure: Box plot of the five-number summary, showing the minimum (1st), Q1, median (Q2), Q3 and maximum (100th), with whiskers, inner fences and outliers marked.]
Range = Max – Min
IQR = Q3 – Q1
Box Plots …
• Used for interval and ratio data.
• Uses the five-number summary: minimum, Q1, median, Q3 and maximum.
• It is useful for detecting outliers.
• It is useful for illustrating the distribution of more than one group.
What are outliers … ?
Outliers are extreme data values that fall far outside the overall pattern of the distribution of the data set.
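1.5 IQR Criterion for Outliers
• The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 – Q1
• From the data: Q1 = 59 yrs, Q3 = 70 yrs
- IQR = 70 – 59 = 11
- 1.5 × IQR = 1.5 × 11 = 16.5
- Lower fence: Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
- Upper fence: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
• From the data: Min = 44 and Max = 82. Both lie within the fences (42.5 to 86.5), so no observation is flagged as an outlier.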
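The criterion can be written as a short function. The quartiles, minimum and maximum are the ones quoted on the slide; the function names and the extra value 95 are hypothetical, added only to show what a flagged observation looks like.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k x IQR outlier criterion."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

lower, upper = iqr_fences(59, 70)            # quartiles from the slide
print(lower, upper)
print(find_outliers([44, 82], 59, 70))       # the slide's minimum and maximum
print(find_outliers([44, 82, 95], 59, 70))   # 95 is a hypothetical extra value
```

A value outside the fences is only a candidate outlier; it should be checked against the source records before being corrected or excluded.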
Properties of quartiles, quintiles…
• They are used for interval or ratio data.
• They are resistant (insensitive) to extreme values.
• They are useful for summarising skewed data.
How to deal with skewed data
• Transform the data:
- Square/square root – (Poisson) count data
- Log(x) or ln(x) – data skewed to the right
- Reciprocal (1/x) – data skewed to the left
• Transformation can:
- Make skewed data more symmetric
- Make the distribution more normal
- Stabilize variability
- Linearize a relationship between two or more variables
• Report summary statistics in the original units, but carry out the analysis on the transformed data
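A brief sketch of the log transformation on right-skewed data follows. The data are invented, and the `skewness` helper uses one common moment-based definition; other definitions exist. The last step shows one way to report a result in the original units after analysing on the log scale: back-transforming the mean of the logs, which gives the geometric mean.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data, e.g. length of hospital stay in days
stay = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
log_stay = [math.log(x) for x in stay]   # natural log; requires all values > 0

skew_raw = skewness(stay)
skew_log = skewness(log_stay)

# Back-transform the mean of the logs to the original units (geometric mean)
geometric_mean = math.exp(statistics.mean(log_stay))

print(round(skew_raw, 2), round(skew_log, 2), round(geometric_mean, 2))
```

A positive skewness indicates a longer right tail; after the log transformation the value is closer to zero, meaning the distribution is more symmetric.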
Summary of what we have learned …
• Always plot your data: make a graph, e.g. a histogram or box plot
• Look for the overall pattern (shape, centre and spread) and for striking deviations such as outliers
• Check whether the overall pattern of the distribution can be described by a normal distribution
• If it cannot, transform the data to make skewed data more symmetric
• Calculate an appropriate numerical summary to describe centre and spread