Learn about the basics of descriptive statistics, including measures of central tendency, variability, and levels of measurement. Understand the difference between quantitative and categorical variables. Explore the concept of skewness in distributions.
In the name of God, the Most Gracious, the Most Merciful www.biostat.ir
Biostatistics Academic Preview: Descriptive Statistics
What Is Statistics?
• Statistics is the science of describing or making inferences about the world from a sample of data.
• Descriptive statistics are numerical estimates that organize, summarize, or present the data.
• Inferential statistics is the process of inferring from a sample to the population.
Statistics has two major branches:
• Descriptive statistics
• Inferential statistics
Two types of Statistics
• Descriptive statistics
  • Used to summarize, organize and simplify data
  • What was the average height?
  • What were the highest and lowest scores?
  • What is the most common response to a question?
• Inferential statistics
  • Techniques that allow us to study samples and then make generalizations about the populations from which they were selected
  • Are 5th grade boys taller than 5th grade girls?
  • Is a treatment suitable?
Population and Samples
The population under study is the set of all individuals of interest for the research. The part of the population for which we collect measurements is called the sample. The number of individuals in a sample is denoted by n.
Variables
Definitions
• Variable: a characteristic that changes or varies over time and/or across the different subjects under consideration.
  • Changing over time: blood pressure, height, weight
  • Changing across a population: gender, race
Types of variables
Types of variables: Definitions
• Quantitative variables (numeric): measure a numerical quantity or amount on each experimental unit
• Qualitative variables (categorical): measure a non-numeric quality or characteristic on each experimental unit by classifying each subject into a category
Types of variables: Quantitative variables
• Discrete variables: can only take values from a list of possible values
  • Number of tooth brushings per day
• Continuous variables: can assume the infinitely many values corresponding to the points on a line interval
  • Weight, height
Types of variables: Categorical variables
• Nominal: unordered categories
  • Race
  • Gender
• Ordinal: ordered categories
  • Likert scales (disagree, neutral, agree)
  • Income categories
Types of Variables
• A discrete variable has gaps between its values. For example, the number of tooth brushings per day is a discrete variable.
• A continuous variable has no gaps between its values. All values or fractions of values have meaning. Age is an example of a continuous variable.
Levels of Measurement
• The level of measurement reflects the type of information measured and helps determine which descriptive statistics and which statistical tests can be used.
Four Levels of Measurement
• Nominal: lowest level; categories, no rank
• Ordinal: second lowest; ranked categories
• Interval: next to highest; ranked categories with known units between rankings
• Ratio: highest level; ranked categories with known intervals and an absolute zero
Scales of Measurement
• Temperature: Interval
• Men/Women: Nominal
• Good/Better/Best: Ordinal
• Weight: Ratio
• Republicans/Democrats/Independents: Nominal
• Volume: Ratio
• IQ: Interval
• Not at all/A little/A lot: Ordinal
Descriptive Measures
• Central tendency measures: computed to give a "center" around which the measurements in the data are distributed.
• Relative standing measures: describe the relative position of a specific measurement in the data.
• Variation or variability measures: describe the "data spread", or how far the measurements are from the center.
Measures of Central Tendency
• Mean: sum of all measurements in the data divided by the number of measurements.
• Median: a number such that at most half of the measurements are below it and at most half of the measurements are above it.
• Mode: the most frequent measurement in the data.
Summary Statistics: Measures of central tendency (location)
• Mean: the mean of a data set is the sum of the observations divided by the number of observations
  • Population mean (N individuals in the population): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
  • Sample mean (n individuals in the sample): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: the median of a data set is the "middle value"
  • For an odd number of observations, the median is the observation exactly in the middle of the ordered list
  • For an even number of observations, the median is the mean of the two middle observations in the ordered list
• Mode: the mode is the single most frequently occurring data value
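The three measures of location can be tried out with Python's standard `statistics` module. This is a minimal sketch; the data set is hypothetical (it is not taken from the slides), and `multimode` is used so that ties for the most frequent value are visible rather than hidden.

```python
from statistics import mean, median, multimode

# Hypothetical sample: number of tooth brushings per day for 9 people.
brushings = [1, 2, 2, 2, 3, 3, 1, 2, 4]

sample_mean = mean(brushings)        # sum of observations / number of observations
sample_median = median(brushings)    # middle value of the ordered list (odd n)
sample_modes = multimode(brushings)  # list of every value tied for most frequent

# With an even number of observations the median is the mean of the two middle values.
even_median = median([1, 2, 3, 4])

print(sample_mean, sample_median, sample_modes, even_median)
```

Here the ordered list is 1, 1, 2, 2, 2, 2, 3, 3, 4, so the median is the fifth value and the mode is the value that occurs four times.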
Skewness
• The skewness of a distribution can be judged by comparing the relative positions of the mean, median and mode.
• Distribution is symmetrical
  • Mean = Median = Mode
• Distribution skewed right
  • Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
  • Median lies between mode and mean, and mode is greater than mean
Relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions
Note: The mean is a good summary of location when the data are roughly symmetric (for example, normally distributed). If this is not the case, it is better to report the median as the measure of location, because the mean is pulled toward the long tail.
Frequency Distributions and Histograms
Figure: histograms for symmetric and skewed distributions.
Normal curves: same mean but different standard deviations
Further Notes
• When the mean is greater than the median, the data distribution is skewed to the right.
• When the median is greater than the mean, the data distribution is skewed to the left.
• When the mean and median are very close to each other, the data distribution is approximately symmetric.
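The mean-versus-median rule above can be sketched as a small helper. This is an illustration only: the function name `skew_direction`, the `tolerance` cutoff for "very close", and the three data sets are assumptions made for the example, not part of the slides.

```python
from statistics import mean, median

def skew_direction(data, tolerance=0.05):
    """Classify skew by comparing mean and median (rule of thumb only).

    `tolerance` is a hypothetical cutoff, expressed as a fraction of the
    data range, below which mean and median count as "very close".
    """
    spread = max(data) - min(data)
    gap = mean(data) - median(data)
    if spread == 0 or abs(gap) <= tolerance * spread:
        return "approximately symmetric"
    return "skewed right" if gap > 0 else "skewed left"

# Hypothetical data sets chosen to show each case.
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 4, 5]))    # mean equals median
print(skew_direction([1, 1, 2, 2, 3, 4, 20]))         # long tail to the right
print(skew_direction([1, 17, 18, 19, 19, 20, 20]))    # long tail to the left
```

The comparison is a quick screen, not a formal test; a histogram or box plot of the same data is still worth a look.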
Summary statistics: measures of spread (scale)
• Variance: the average of the squared deviations of each sample value from the sample mean, except that the sum of the squared deviations is divided by n-1 instead of the sample size n.
• Standard deviation: the square root of the sample variance.
• Range: the difference between the maximum and minimum values in the sample.
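As a worked illustration of these three definitions, the sketch below computes the sample variance by hand (dividing by n-1) and cross-checks it against `statistics.variance` from the standard library. The weights are hypothetical values invented for the example.

```python
from math import sqrt
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

# Hypothetical sample of five body weights in kg.
weights = [60.0, 72.0, 68.0, 80.0, 70.0]

s2 = sample_variance(weights)              # squared deviations 100+4+4+100+0, over 4
s = sqrt(s2)                               # standard deviation
data_range = max(weights) - min(weights)   # maximum minus minimum

print(s2, s, data_range, variance(weights))
```

The variance is in squared units (kg² here), which is why the standard deviation, in the original units, is usually the one reported.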
Summary statistics: measures of spread (scale)
• We can describe the spread of a distribution by using percentiles.
• The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
• Median = 50th percentile
• Quartiles divide the data into four equal parts.
  • First quartile (Q1): 25% of observations are below Q1 and 75% above Q1
  • Second quartile (Q2): 50% of observations are below Q2 and 50% above Q2
  • Third quartile (Q3): 75% of observations are below Q3 and 25% above Q3
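One way to turn the percentile definition into code is the nearest-rank rule: take the smallest observation that has at least p percent of the data at or below it. This is only one of several conventions in use (statistical packages often interpolate between observations instead), and both the function name and the ages below are hypothetical.

```python
from math import ceil

def percentile_nearest_rank(data, p):
    """Smallest observation with at least p percent of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(data)
    rank = ceil(p * len(ordered) / 100)   # 1-based position in the ordered list
    return ordered[rank - 1]

# Hypothetical sample of nine ages.
ages = [41, 21, 53, 29, 37, 25, 49, 33, 45]

q1 = percentile_nearest_rank(ages, 25)
q2 = percentile_nearest_rank(ages, 50)   # the median
q3 = percentile_nearest_rank(ages, 75)
print(q1, q2, q3)
```

With small samples the "25% below, 75% above" statements hold only approximately, because a single observation already accounts for several percent of the data.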
Quartiles
Figure: Q1, Q2 and Q3 split the ordered data into four parts, each containing 25% of the observations.
Five-number summary
• Minimum
• Lower quartile Q1 = 25th percentile
• Median = 50th percentile
• Upper quartile Q3 = 75th percentile
• Maximum
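The five numbers can be collected in one pass with the standard library. The sketch below assumes Python 3.8 or later for `statistics.quantiles`, and it picks the `"inclusive"` method as one reasonable convention; other quartile conventions can give slightly different Q1 and Q3. The function name and the data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return minimum, Q1, median, Q3 and maximum as a dictionary."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "minimum": min(data),
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "maximum": max(data),
    }

# Hypothetical sample of nine ages.
ages = [21, 25, 29, 33, 37, 41, 45, 49, 53]
summary = five_number_summary(ages)
print(summary)
```

These are the same five values that a box plot draws: the box spans Q1 to Q3, with a line at the median and whiskers reaching toward the minimum and maximum.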
Graphical display of numerical variables (histogram)
Class Interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
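A frequency table like this one maps directly onto a histogram. The sketch below prints a text histogram from the class frequencies listed in the table and adds a hypothetical helper, `class_label`, that shows how a raw value would be assigned to a class, assuming "20-under 30" means 20 ≤ value < 30.

```python
# Class intervals and frequencies as listed in the frequency table.
classes = [
    ("20-under 30", 6),
    ("30-under 40", 18),
    ("40-under 50", 11),
    ("50-under 60", 11),
    ("60-under 70", 3),
    ("70-under 80", 1),
]

def class_label(value, start=20, width=10):
    """Return the class a raw value falls in (hypothetical helper).

    `start` and `width` are read off the table; the lower limit is
    included in the class and the upper limit is excluded.
    """
    lower = start + width * int((value - start) // width)
    return f"{lower}-under {lower + width}"

total = sum(freq for _, freq in classes)

# Text histogram: one '#' per observation, then the count and relative frequency.
for label, freq in classes:
    print(f"{label:<12} {'#' * freq:<18} {freq:>2} ({100 * freq / total:.0f}%)")
```

Changing `width` changes the number of bins, which is why the same data can look quite different in histograms drawn with different bin counts.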
Frequency Distributions and Histograms: a histogram of the compressive strength data with 17 bins.
Frequency Distributions and Histograms: a histogram of the compressive strength data with nine bins.
Frequency Distributions and Histograms: histogram of compressive strength data.
Graphical display of numerical variables (box plot)
Figure: box plot marking the minimum, Q1, Q2 (median), Q3 and maximum.
Graphical display of numerical variables (box plot)
Figure: box plots for three shapes of distribution.
• S < 0: negatively skewed
• S = 0: symmetric (not skewed)
• S > 0: positively skewed
Univariate statistics (categorical variables)
• Summary measures
  • Count = frequency
  • Percent = 100 × frequency / total sample size
• The distribution of a categorical variable lists the categories and gives either a count or a percent of individuals who fall in each category
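Counts and percents for a categorical variable take only a few lines with `collections.Counter`. The responses below are hypothetical, loosely modeled on the disagree/neutral/agree scale mentioned earlier.

```python
from collections import Counter

# Hypothetical survey responses on a three-point ordinal scale.
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "disagree"]

counts = Counter(responses)   # count = frequency of each category
total = len(responses)        # total sample size

# Distribution: category -> (count, percent of the sample).
distribution = {
    category: (count, 100 * count / total)
    for category, count in counts.items()
}

for category, (count, percent) in distribution.items():
    print(f"{category:<10} {count:>3} {percent:6.1f}%")
```

The percents of all categories add up to 100, which is a quick check that no response was dropped.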
Displaying categorical variables
Response and explanatory variables
• Response variable: the variable that we intend to model
  • It is the variable we intend to explain through statistical modeling
• Explanatory variable: the variable or variables that may be used to model the response variable
  • Its values may be related to the response variable
Bivariate relationships
• An extension of univariate descriptive statistics
• Used to detect evidence of association in the sample
• Two variables are said to be associated if the distribution of one variable differs across groups or values defined by the other variable
Bivariate Relationships
• Two quantitative variables
  • Scatter plot
  • Side by side stem and leaf plots
• Two qualitative variables
  • Tables
  • Bar charts
• One quantitative and one qualitative variable
  • Side by side box plots
  • Bar chart
Two quantitative variables: Correlation
A relationship between two variables.
Explanatory (Independent) Variable x    Response (Dependent) Variable y
Hours of Training                       Number of Accidents
Shoe Size                               Height
Cigarettes smoked per day               Lung Capacity
Height                                  IQ
What type of relationship exists between the two variables, and is the correlation significant?
Scatter Plots and Types of Correlation
x = hours of training, y = number of accidents
Negative correlation: as x increases, y decreases.
Scatter Plots and Types of Correlation
x = SAT score, y = GPA
Positive correlation: as x increases, y increases.
Scatter Plots and Types of Correlation
x = height, y = IQ
No linear correlation.
Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables.
The range of r is from -1 to 1.
If r is close to -1, there is a strong negative correlation.
If r is close to 1, there is a strong positive correlation.
If r is close to 0, there is little or no linear correlation.
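The slides do not give a formula for r. Assuming r is the usual Pearson product-moment coefficient, a minimal sketch of the calculation looks like this. The function name `pearson_r` and the training/accident numbers are hypothetical and chosen only to mirror the hours-of-training example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled to lie in [-1, 1]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: more training hours, fewer accidents.
hours_of_training = [1, 2, 3, 4, 5, 6]
accidents = [9, 8, 8, 6, 5, 3]

r = pearson_r(hours_of_training, accidents)
print(round(r, 3))
```

A value near -1 here reflects the downward trend in the made-up data. Note that r measures linear association only, so a strong curved relationship can still give r close to 0.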
Positive and negative correlation
1. If two variables x and y are positively correlated, this means that:
  • large values of x are associated with large values of y, and
  • small values of x are associated with small values of y
2. If two variables x and y are negatively correlated, this means that:
  • large values of x are associated with small values of y, and
  • small values of x are associated with large values of y
Positive correlation
Negative correlation
Two qualitative variables (Contingency Tables)
• Categorical data are usually displayed using a contingency table, which shows the frequency of each combination of categories observed in the data
• The rows correspond to the categories of the explanatory variable
• The columns correspond to the categories of the response variable
Example
• Aspirin and Heart Attacks
• Explanatory variable = drug received
  • Placebo
  • Aspirin
• Response variable = heart attack status
  • Yes
  • No
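To show how such a table is built, the sketch below tallies (drug, heart attack status) pairs into a 2×2 contingency table, with the explanatory variable in the rows and the response variable in the columns. The records are hypothetical and made up for illustration; they are NOT results from any aspirin study.

```python
from collections import Counter

# Hypothetical records (drug received, heart attack status); NOT real study data.
records = (
    [("placebo", "yes")] * 3 + [("placebo", "no")] * 7
    + [("aspirin", "yes")] * 1 + [("aspirin", "no")] * 9
)

table = Counter(records)          # frequency of each combination of categories
rows = ["placebo", "aspirin"]     # explanatory variable: drug received
cols = ["yes", "no"]              # response variable: heart attack status

print(f"{'':<10}" + "".join(f"{c:>6}" for c in cols) + f"{'total':>8}")
for drug in rows:
    row_total = sum(table[(drug, status)] for status in cols)
    cells = "".join(f"{table[(drug, status)]:>6}" for status in cols)
    print(f"{drug:<10}{cells}{row_total:>8}")
```

Comparing the proportion of "yes" within each row (3 of 10 versus 1 of 10 in this made-up data) is the descriptive check for association between the two variables.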