Exploratory Data Analysis (Descriptive Statistics) Martina Litschmannová martina.litschmannova@vsb.cz EA 538
Statistics has two major chapters: • Descriptive Statistics • Inferential statistics
Statistics • Descriptive Statistics: gives numerical and graphical procedures to summarize a collection of data in a clear and understandable way. • Inferential Statistics: provides procedures to draw inferences about a population from a sample.
Population vs. Sample • A population includes every element from the set of observations that can be made. • A sample consists only of observations drawn from the population. [Figure: population and sample linked by sampling, with the labels Exploratory Data Analysis and Inferential Statistics]
Variable A variable has two defining characteristics: • A variable is an attribute that describes a person, place, thing, or idea. • The value of the variable can "vary" from one entity to another.
Exploratory data analysis • Statistical tools that help examine data in order to describe their main features. • Basic strategy • Examine variables one by one, then look at the relationships among the different variables. • Start with graphs, then add numerical summaries of specific aspects of the data.
Exploratory data analysis - One variable • Graphical displays • Qualitative/categorical data: bar chart, pie chart, etc. • Quantitative data: histogram, box plot, etc. • Summary statistics • Qualitative/categorical: frequency tables • Quantitative: mean, median, standard deviation, range, etc.
Summary of categorical variables • Numerically: tables with total counts and percentages, mode • Graphically • Bar graphs, pie charts • A bar graph is nearly always preferable to a pie chart: it is easier to compare bar heights than slices of a pie.
Statistical characteristics We summarize categorical data using a table. Note that percentages are often called relative frequencies. In addition, we report the mode (the variant that occurs most frequently).
Statistical characteristics Mode = Male
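A frequency table with relative frequencies and the mode can be sketched in a few lines of Python. The observations below are hypothetical (they are not the data behind the slide's table), and the names `sex`, `freq_table`, and `mode` are chosen only for illustration.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (not from the slides).
sex = ["Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"]

counts = Counter(sex)        # absolute frequencies per variant
n = sum(counts.values())     # total number of observations

# Frequency table: variant -> (absolute frequency, relative frequency)
freq_table = {variant: (count, count / n) for variant, count in counts.items()}

# The mode is the variant with the highest absolute frequency.
mode = max(counts, key=counts.get)

for variant, (count, rel) in freq_table.items():
    print(f"{variant:<8}{count:>4}{rel:>8.1%}")
print("Mode =", mode)
```

With these made-up values the mode is "Male", matching the kind of statement shown on the slide.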
Graphical Methods of Presenting Qualitative Variables • A bar chart is a standard graph where variants of the variable are represented on one axis and their frequencies on the other axis. Individual frequencies are then displayed as bars (boxes, vectors, squared logs, cones, etc.)
A bar chart is made up of columns plotted on a graph. • The columns are positioned over a label that represents a categorical variable. • The height of the column indicates the size of the group defined by the column label. Attention! We subjectively notice the volume rather than the height of the shape!
Graphical Methods of Presenting Qualitative Variables • A pie chart represents relative frequencies of individual variants of a variable. Frequencies are presented as proportions, each as a sector of a circle.
Use the data at http://homel.vsb.cz/%7Elit40/STA1/STA-EN/pima-indians-diabetes.xlsx Build a frequency table describing the structure of the dataset according to whether or not patients have diabetes. Draw the bar chart and the pie chart.
Quantitative variables • Numerical summary • Mean • Median • Quartiles • Range • Standard deviation… • Graphical summary • Histogram • Box plot…
Quantitative measures When you compare two or more data sets, focus on four features: • Center • Spread • Shape • Unusual features
Measures of Central Tendency Mean • To find the mean of a set of observations, add their values and divide by the number of observations. Mean of a population: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ Mean of a sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Mean example The average age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room. • Does the average age change? • If so, what is the new average age?
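One way to work through this example is to recover the total age from the mean, adjust it, and divide again. The sketch below assumes only the numbers stated on the slide; the variable names are illustrative.

```python
n = 20                # number of people in the room (unchanged by the swap)
old_mean = 25         # average age before the swap

total = n * old_mean              # sum of all ages before the swap
new_total = total - 28 + 30       # a 28-year-old leaves, a 30-year-old enters
new_mean = new_total / n          # the sum grew by 2, so the mean grows by 2/20

print(new_mean)
```

The mean changes because every value contributes to the sum: the total rises by 2, so the average rises by 2/20 = 0.1.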
Measures of Central Tendency Median • The median is the midpoint of a distribution • The number such that half the observations are smaller and the other half are larger. • Also called the 50th percentile or 2nd quartile. • To compute a median • Order the observations. • If the number of observations is odd, the median is the center observation. • If the number of observations is even, the median is the average of the two center observations. Median of a population: Median of a sample:
Median example The median age of 21 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room. • Does the median age change? • If so, what is the new median age? The median age of 20 people in a room is 25. A 28-year-old leaves while a 30-year-old enters the room. • Does the median age change? • If so, what is the new median age?
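The odd/even rule for the median, and the two room scenarios, can be explored with a small sketch. The rooms below are hypothetical age lists constructed to have a median of 25; the slides give no individual ages, so they only illustrate what can happen.

```python
def median(values):
    """Median by the ordering rule: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def swap(ages):
    """Return a copy of the room after a 28-year-old leaves and a 30-year-old enters."""
    new_ages = list(ages)
    new_ages.remove(28)
    new_ages.append(30)
    return new_ages

# Hypothetical room of 21 people: the 11th ordered value is the median.
room_21 = [20] * 10 + [25] + [28] + [31] * 9

# Two hypothetical rooms of 20 people, both with median 25.
room_20_a = [22] * 10 + [28] + [31] * 9   # the 28-year-old is one of the two middle values
room_20_b = [25] * 11 + [28] + [31] * 8   # the 28-year-old is above both middle values

for room in (room_21, room_20_a, room_20_b):
    print(len(room), median(room), median(swap(room)))
```

With 21 people the center observation itself equals 25, and both 28 and 30 lie above it, so the median stays at 25. With 20 people the answer depends on the data: in `room_20_a` the median moves, in `room_20_b` it does not.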
Mean vs. median • When the histogram is symmetric, the mean and median are similar. • The mean and median differ when the histogram is skewed. • Skewed to the right: the mean is larger than the median. • Skewed to the left: the mean is smaller than the median.
Mean vs. median Extreme example • Income in a small town of 6 people: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 The mean is about $31,833 and the median is $32,000. • Bill Gates moves to town. $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000 The mean is $5,741,571 and the median is $35,000. • The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data.
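The figures in this example can be checked with Python's standard `statistics` module. The incomes are the ones listed on the slide; only the variable names are introduced here.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]   # the newcomer's income is added

# Before the newcomer arrives
print(round(mean(incomes)), median(incomes))

# After the newcomer arrives: the mean jumps, the median barely moves
print(round(mean(with_outlier)), median(with_outlier))
```

A single extreme value shifts the mean by millions of dollars but moves the median only from the midpoint of the 3rd and 4th incomes to the 4th income.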
Effect of Changing Units How are measures of central tendency affected when we change units (minutes to hours, feet to meters, etc.)? • If you add a constant to every value, the mean and median increase by the same constant. • If you multiply every value by a constant, the mean and median will also be multiplied by that constant.
Effect of Changing Units - example The average annual temperature in Prague is 10 °C. What is the average annual temperature in Prague in degrees Fahrenheit?
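Because the Celsius-to-Fahrenheit conversion multiplies by 9/5 and then adds 32, the mean and median can be converted directly. The sketch below checks this on hypothetical monthly temperatures chosen so that their mean is 10 °C; these values are not from the slides.

```python
import math
from statistics import mean, median

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Hypothetical monthly averages in °C, chosen so that their mean is 10.
temps_c = [-1, 1, 5, 10, 15, 18, 20, 20, 15, 10, 5, 2]
temps_f = [c_to_f(t) for t in temps_c]

mean_f_direct = c_to_f(mean(temps_c))   # convert the Celsius mean
mean_f_from_data = mean(temps_f)        # mean of the converted values

print(mean_f_direct, mean_f_from_data)
print(c_to_f(median(temps_c)), median(temps_f))
```

Both routes agree (up to floating-point rounding): an average of 10 °C corresponds to 10 · 9/5 + 32 = 50 °F.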
Is a central measure enough? • A warm, stable climate greatly affects some individuals' health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person's health requires a stable climate, in which city would you recommend they live?
Is a central measure enough? [Figure: panels labeled Atlanta and San Diego]
Measures of spread • Range • The difference between the largest and smallest values in a set of values. • Inter-quartile range • The difference between the upper quartile and the lower quartile. • The lower quartile is the "middle" value in the first half of the rank-ordered data set. • The upper quartile is the "middle" value in the second half of the rank-ordered data set.
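The range and the inter-quartile range can be sketched directly from the "middle of each half" description. Two things here are assumptions rather than content from the slides: the data are hypothetical, and for an odd number of observations the overall median is left out of both halves. Statistical software often uses other quartile conventions, so results may differ slightly.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartile as the medians of the lower and upper halves.

    Assumption: for an odd number of observations the overall median
    is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median(lower_half), median(upper_half)

# Hypothetical data set (not from the slides).
data = [1, 3, 4, 6, 7, 9, 11, 15]

data_range = max(data) - min(data)   # largest minus smallest value
q1, q3 = quartiles(data)
iqr = q3 - q1                        # upper quartile minus lower quartile

print(data_range, q1, q3, iqr)
```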
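Measures of spread • Variance • In a population, variance is the average squared deviation from the population mean $\mu$: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. • Sample variance is defined by a slightly different formula and uses a slightly different notation: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean. • Standard deviation • The standard deviation looks at how far observations are from their mean; it is the square root of the variance. Population: $\sigma = \sqrt{\sigma^2}$ Sample: $s = \sqrt{s^2}$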
Measures of spread - example • A population consists of four observations: {1, 3, 5, 7}. What is the variance? • A simple random sample consists of four observations: {1, 3, 5, 7}. Based on these sample observations, what is the best estimate of the standard deviation of the population?
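Both parts of this example can be checked with the standard `statistics` module, where `pvariance` divides by the number of observations and `variance`/`stdev` divide by one less. Taking the sample standard deviation as the "best estimate" is the usual convention and is assumed here.

```python
from statistics import pvariance, variance, stdev

data = [1, 3, 5, 7]

# Treating the four values as the whole population: divide by N.
pop_var = pvariance(data)

# Treating the four values as a sample: divide by n - 1.
sample_var = variance(data)
sample_sd = stdev(data)      # square root of the sample variance

print(pop_var, sample_var, sample_sd)
```

The mean is 4 and the squared deviations sum to 9 + 1 + 1 + 9 = 20, so the population variance is 20/4 = 5, while the sample variance is 20/3 ≈ 6.67 and the sample standard deviation is about 2.58.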
Effect of Changing Units How are measures of spread affected when we change units (minutes to hours, feet to meters, etc.)? • If you add a constant to every value, the distance between values does not change. As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same. • Suppose you multiply every value by a constant. This has the effect of multiplying the range, interquartile range (IQR), and standard deviation by that constant (by its absolute value if the constant is negative). It has an even greater effect on the variance: it multiplies the variance by the square of the constant.
Effect of Changing Units - example The variance of the annual temperature in Prague is 0.25 (°C)². What is the variance of the annual temperature in Prague in square degrees Fahrenheit?
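For the variance, only the multiplicative part of the Celsius-to-Fahrenheit conversion matters: the added 32 shifts every value equally, and the factor 9/5 enters squared. The sketch below applies that rule and then checks it on two hypothetical temperatures whose population variance is 0.25 (°C)²; those two values are invented for the check.

```python
import math
from statistics import pvariance

scale = 9 / 5                 # multiplicative part of the °C -> °F conversion
var_c = 0.25                  # variance in (°C)², as stated in the example

var_f = var_c * scale ** 2    # the variance scales with the square of the constant

# Check on hypothetical data with population variance 0.25 (°C)².
temps_c = [9.5, 10.5]
temps_f = [t * scale + 32 for t in temps_c]

print(var_f, pvariance(temps_c), pvariance(temps_f))
```

The result is 0.25 · (9/5)² = 0.81 (°F)²; the corresponding standard deviation goes from 0.5 °C to 0.9 °F.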
Measures of position • Percentiles • Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles. • Quartiles (lower quartile, median, upper quartile) • Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 4 equal parts are called quartiles. • Standard Scores (z-scores) • A z-score indicates how many standard deviations an element is from the mean. A standard score can be calculated from the formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the value of the element, $\mu$ is the mean, and $\sigma$ is the standard deviation.
How to interpret a z-score? • $z < 0$ … an element less than the mean. • $z > 0$ … an element greater than the mean. • $z = 0$ … an element equal to the mean. • $z = 1$ … an element that is 1 standard deviation greater than the mean; $z = 2$, 2 standard deviations greater than the mean; etc. • $z = -1$ … an element that is 1 standard deviation less than the mean; $z = -2$, 2 standard deviations less than the mean; etc. • If the number of elements in the set is large and the distribution is approximately bell-shaped, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3. • $|z| > 3$ … the element is an outlier.
z-score - Example A national achievement test is administered annually to 3rd graders. The test has a mean score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on the test?
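Solving the z-score definition for the raw value gives score = mean + z · (standard deviation). The sketch below applies that to the numbers in the example; the helper names `z_score` and `value_from_z` are introduced here for illustration.

```python
import math

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score formula to recover the raw value."""
    return mean + z * sd

jane_score = value_from_z(1.20, mean=100, sd=15)
print(jane_score)

# Round trip: converting the score back gives the original z-score.
print(math.isclose(z_score(jane_score, 100, 15), 1.20))
```

Jane scored 1.20 standard deviations above the mean, that is 100 + 1.20 · 15 = 118.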
Graphical Methods of Presenting Quantitative Variables • Histograms: made up of columns plotted on a graph • There is no space between adjacent columns. • The columns are positioned over a label that represents a quantitative variable. • The column label can be a single value or a range of values. • The height of the column indicates the size of the group defined by the column label.
Histograms • Where did the bins come from? • They were chosen rather arbitrarily. • Does choosing other bins change the picture? • Yes! And sometimes dramatically. • What do we do about this? • Some pretty smart people have come up with "optimal" bin widths, and we will rely on their suggestions. Optimal number of bins: $k \approx 1 + \log_2 n$ (Sturges' rule), where $n$ is the number of observations.
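Sturges' rule suggests roughly 1 + log2(n) bins for n observations. The sketch below rounds the result up to a whole number, which is a common convention but an assumption here, and derives an equal bin width from hypothetical minimum and maximum values.

```python
import math

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + math.log2(n))

def bin_width(minimum, maximum, n):
    """Equal bin width covering [minimum, maximum] with the suggested number of bins."""
    return (maximum - minimum) / sturges_bins(n)

for n in (64, 100, 1000):
    print(n, sturges_bins(n))

# Hypothetical example: 100 values ranging from 20 to 60.
print(bin_width(20, 60, 100))
```

The rule grows slowly: going from 100 to 1000 observations adds only a few bins, so it is best treated as a starting point that can be adjusted after looking at the resulting histogram.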
Histogram • The purpose of a graph is to help us understand the data. • After you make a graph, always ask, “What do I see?” • Once you have displayed a distribution you can see the important features.
Histograms We will describe the features of the distribution that the histogram is displaying with four characteristics. • Shape • Center • Spread • Unusual features
Histograms Shape • Symmetry - when it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other.
Histograms Shape • Number of peaks. • Distributions with one clear peak are called unimodal. • Distributions with two clear peaks are called bimodal. • When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.
Histograms Shape • Skewness: when they are displayed graphically, some distributions have many more observations on one side of the graph than the other. • Distributions with most of their observations on the left (toward lower values) are said to be skewed right. • Distributions with most of their observations on the right (toward higher values) are said to be skewed left.