Basic Data Analysis and Graphs I SPSS Training Thomas V. Joshua, MS July 2012 College of Nursing
Lecture Overview • Descriptive Statistics • Explore Plots • Histograms • Box Plot • Scatter Plot
Descriptive Statistics • Any data analysis starts with becoming familiar with the dataset itself. Terms such as exploratory analysis and initial data analysis all refer to this required part of the research process, in which we come to understand the variables involved. • A common first step is to summarize the variables in your dataset, for example with their means and variances. • Several summary (descriptive) statistics are available under the Descriptives option: Analyze -> Descriptive Statistics -> Descriptives • Analysis in SPSS can take several forms. Because this is not designed to be a statistics course, the demonstration is limited to providing a feel for the Analyze and Graphs menus.
Descriptive Statistics • Only the default statistics have been selected (mean, standard deviation, minimum, and maximum); several others could be selected. • A skewness of zero indicates a symmetric distribution. A kurtosis of zero indicates a shape (flatness or peakedness) close to normal: on the scale SPSS reports, the normal distribution has a kurtosis of zero.
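The statistics listed above can also be computed by hand, which helps when checking how skewness and kurtosis respond to the shape of the data. The following is a minimal Python sketch, not SPSS output: the function name `describe` and both small datasets are made up for illustration, and the skewness and kurtosis use the bias-adjusted sample formulas common in statistical packages (whether they match SPSS digit for digit is an assumption).

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a sample of numbers."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values for the kurtosis formula")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    z = [(v - mean) / sd for v in values]
    # Bias-adjusted sample skewness: 0 for a perfectly symmetric sample.
    skewness = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    # Bias-adjusted excess kurtosis: centred so the normal distribution is 0.
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "mean": mean,
        "sd": sd,
        "min": min(values),
        "max": max(values),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical data, not taken from the lecture's dataset.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 4, 10]

print(describe(symmetric))
print(describe(right_skewed))
```

The symmetric sample gives a skewness of zero, while the sample with one large value gives a positive skewness, matching the interpretation on the slide.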
Descriptive Statistics • The more the individual data points differ from the mean, the larger the standard deviation. Conversely, if the data points are very similar to one another, the standard deviation will be small. • Because equality of variances is an assumption of many inferential statistics, this information is important to a data analyst.
Descriptive Statistics - Explore • Frequencies are commonly selected, particularly for categorical data. • Explore allows for a breakdown according to a factor variable without needing to use the Split File option in the Data menu. • Graphical methods visualize the distribution of a random variable and compare it to a theoretical distribution using plots. These methods are either descriptive or theory-driven.
SPSS provides trimmed means and M-estimators, but because the package offers no robust procedures that would actually use such values, one wonders why they were included. The 'Outliers' option is often not very informative: it simply lists the 5 highest and 5 lowest values, none of which may be a true outlier.
Descriptive Statistics - Explore The 'Plots' button brings up the third image, where one can request tests of normality and obtain graphical displays of the distribution of the data. What does power estimation mean? Right-click on Power estimation and the definition will pop up.
Explore – Histogram The distribution of current salary appears to be positively skewed (long tail to the right) and is unlikely to be normally distributed. • Skewness = 2.125 • Kurtosis = 5.378
Explore – Stem-and-Leaf Plot
Current Salary Stem-and-Leaf Plot
 Frequency    Stem &  Leaf
    33.00        1 .  56667789999
   110.00        2 .  00001111111222222222333334444444444
   115.00        2 .  555555556666666667777777778888889999999
    80.00        3 .  000000000001111112233333444
    32.00        3 .  55556677889
    20.00        4 .  0001233&
    12.00        4 .  5678&
    12.00        5 .  0124&
     7.00        5 .  556
    53.00 Extremes    (>=56750)
 Stem width:  10000
 Each leaf:   3 case(s)
 & denotes fractional leaves.
• When N is small, this plot is useful to summarize data. • A stem-and-leaf plot works well for continuous or event-count variables.
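To see how such a display is built, here is a minimal Python sketch of a stem-and-leaf plot. It is a simplification of the SPSS output, not a reproduction of it: every leaf stands for exactly one case, stems are not split into halves, and extremes are not set apart. The function name `stem_and_leaf` and the `ages` data are hypothetical, and the sketch assumes non-negative integers and a stem width that is a multiple of 10.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_width=10):
    """Map each stem to the ordered list of its leaves (one leaf per case)."""
    leaf_unit = stem_width // 10
    rows = defaultdict(list)
    for v in sorted(values):
        stem = v // stem_width                    # leading part of the value
        leaf = (v % stem_width) // leaf_unit      # next digit after the stem
        rows[stem].append(leaf)
    return dict(rows)


# Hypothetical data, not taken from the lecture's dataset.
ages = [23, 25, 31, 31, 38, 42, 47, 47, 49, 55]
plot = stem_and_leaf(ages)

print("Frequency  Stem & Leaf")
for stem in sorted(plot):
    leaves = "".join(str(d) for d in plot[stem])
    print(f"{len(plot[stem]):>9}  {stem:>4} . {leaves}")
```

Each printed row shows the frequency, the stem, and the leaves, so the length of the leaf string doubles as a sideways histogram bar.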
Explore – Box Plot • Variability The length of the box represents the difference between the 25th and 75th percentiles (the interquartile range). The longer the box, the greater the spread of the data.
Explore – Box Plot (cont'd) • Outliers • Case numbers are used to label outliers (o) and extremes (*). Outliers are cases with values between 1.5 and 3 box-lengths beyond the 75th or 25th percentile. Extreme values are cases with values more than 3 box-lengths beyond the 75th or 25th percentile. • Note that multiple outliers are detected in our example. Outliers may arise because 1) there are recoding errors, 2) the sample comes from a skewed population distribution, 3) the cases do not come from the same population, or 4) the sample size is small.
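The 1.5 and 3 box-length rule above can be expressed in a few lines of code. The following Python sketch is only an illustration: the function name `classify_boxplot_points` and the data are made up, and the quartiles come from `statistics.quantiles` with the inclusive method, which may differ slightly from the hinges SPSS uses to draw the box.

```python
import statistics


def classify_boxplot_points(values):
    """Return (outliers, extremes) using the 1.5 and 3 box-length rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box = q3 - q1  # box length (interquartile range)
    outliers, extremes = [], []
    for v in values:
        # Distance beyond the nearest edge of the box (0 if inside the box).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * box:
            extremes.append(v)      # drawn as * in the SPSS box plot
        elif distance > 1.5 * box:
            outliers.append(v)      # drawn as o in the SPSS box plot
    return outliers, extremes


# Hypothetical data, not taken from the lecture's dataset.
data = [10, 12, 13, 14, 15, 16, 18, 30, 60]
outliers, extremes = classify_boxplot_points(data)
print("outliers:", outliers)
print("extremes:", extremes)
```

With these values the quartiles are 13 and 18, so the box is 5 units long; 30 lies between 1.5 and 3 box-lengths above the box and 60 lies more than 3 box-lengths above it.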
Theory-driven Plots: P-P Plot and Q-Q Plot • These can also be obtained by clicking Analyze -> Descriptive Statistics -> P-P Plots (or Q-Q Plots) • The probability-probability plot (P-P plot or percent plot) compares the empirical cumulative distribution function of a variable with a specific theoretical cumulative distribution function (e.g., the standard normal distribution function). • Although visually appealing, graphical methods do not provide objective criteria for determining the normality of variables. Interpretation is thus a matter of judgment.
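The points of a normal P-P plot can be computed directly from the definition above. This Python sketch pairs an empirical cumulative proportion with the theoretical normal cumulative probability for each case. The function name `pp_points` and the data are hypothetical; the normal distribution is fitted with the sample mean and standard deviation, and Blom's formula is used for the empirical proportion as one common choice (treating it as the setting used in the lecture is an assumption). Ties are not handled specially.

```python
from statistics import NormalDist, mean, stdev


def pp_points(values):
    """Return (empirical, theoretical) cumulative probabilities per case."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    points = []
    for rank, v in enumerate(sorted(values), start=1):
        empirical = (rank - 0.375) / (n + 0.25)  # Blom's plotting position
        theoretical = fitted.cdf(v)              # normal CDF at the value
        points.append((empirical, theoretical))
    return points


# Hypothetical data, not taken from the lecture's dataset.
sample = [21, 24, 25, 27, 28, 30, 31, 35, 44, 60]
for empirical, theoretical in pp_points(sample):
    print(f"{empirical:.3f}  {theoretical:.3f}")
```

If the data were normal, the two columns would be close to each other and the plotted points would fall near the diagonal; how close is close enough remains a judgment call.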
Explore with Factors Factor – Categorical variable
Explore with Factors The side-by-side boxplots for the three groups on the dependent variable, Salary.
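Breaking a dependent variable down by a factor amounts to grouping the cases by category and summarizing each group separately. The following Python sketch shows the idea; the group labels "A", "B", and "C", the salary values, and the function name `explore_by_factor` are all made up for illustration.

```python
import statistics
from collections import defaultdict


def explore_by_factor(records):
    """Summarize the dependent variable separately for each factor level."""
    groups = defaultdict(list)
    for level, value in records:
        groups[level].append(value)
    return {
        level: {
            "n": len(vals),
            "mean": statistics.mean(vals),
            "median": statistics.median(vals),
        }
        for level, vals in groups.items()
    }


# Hypothetical (factor level, salary) pairs, not the lecture's data.
records = [
    ("A", 25000), ("A", 27000), ("A", 29000),
    ("B", 30000), ("B", 31000),
    ("C", 55000), ("C", 60000), ("C", 80000),
]

for level, stats in sorted(explore_by_factor(records).items()):
    print(level, stats)
```

Each group's summary corresponds to one box in a side-by-side box plot.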
Graphs • Interactive • Legacy Dialogs
Graphs • Each of the available options provides a visual display of the data. For example, Graphs -> Legacy Dialogs -> Histogram… • If you have continuous data (such as salary), you can use the Histogram option and its suboption, With normal curve, to assess whether your data are normally distributed, which is an assumption of several inferential statistics. • You can also use the Explore procedure, available from the Descriptive Statistics menu, to obtain the Kolmogorov-Smirnov test, a hypothesis test of whether your data are normally distributed.
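The Kolmogorov-Smirnov statistic itself is simple: it is the largest vertical gap between the empirical cumulative distribution of the data and the theoretical normal one. The Python sketch below computes only that statistic, D, against a normal distribution fitted with the sample mean and standard deviation. It does not compute a p-value, which needs a correction when the parameters are estimated from the data, so it is not a replacement for the SPSS test. The function name `ks_statistic_normal` and both datasets are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data, not taken from the lecture's dataset.
roughly_symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]

print(ks_statistic_normal(roughly_symmetric))
print(ks_statistic_normal(right_skewed))
```

The skewed sample gives a clearly larger D than the symmetric one, which is the pattern the test turns into a significance value.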
Graphs - Legacy Dialogs Histogram
Graphs - Interactive Histogram
Graphs - Legacy Dialogs Boxplot Label cases by Gender
Graphs - Interactive Boxplot Paneled by Gender
Scatter Plot Graphs -> Legacy Dialogs -> Scatter/Dot • To create a scatter plot of current salary by education, select the Simple Scatter option and then click the Define button to produce the dialog box. • If one variable can be theoretically conceptualized as causing the other, then the causal variable would typically be placed on the X axis, and the outcome variable on the Y axis.
Scatter Plot • Note that there are five scatterplot options. • The Simple option graphs the relationship between two variables. • The Matrix option is for two or more variables that you want graphed in every combination: each variable is plotted against every other variable. Every combination is plotted twice, so that each variable appears on both the X and the Y axis.
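The statement that every combination is plotted twice can be checked by listing the panels of a scatterplot matrix. This short Python sketch enumerates the ordered pairs for the three variables used in the lecture's matrix example (salary, salbegin, jobtime); reading the first name as the Y axis and the second as the X axis is a convention chosen here for illustration.

```python
from itertools import permutations

variables = ["salary", "salbegin", "jobtime"]

# Each ordered pair is one off-diagonal panel: (Y-axis variable, X-axis variable).
panels = list(permutations(variables, 2))

for y_var, x_var in panels:
    print(f"Y = {y_var:<8}  X = {x_var}")
print(len(panels), "panels")
```

Three variables give six panels, because each of the three unordered pairs appears once with each variable on the X axis.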
Scatter Plot - Matrix • For example, if you specified a Matrix scatterplot with three variables, salary, salbegin, and jobtime, you would receive the following scatterplot matrix:
Scatter Plot - Overlay • The Overlay option allows you to plot two scatterplots on top of each other. • The plots are distinguished by color on the overlaid plot.
Scatter Plot - Overlay In this graph, the green points represent the educ x salbegin plot, whereas the blue points represent the educ x salary plot.
Scatter Plot - 3-D scatterplot The fourth option for scatterplots is the 3-D scatterplot. This is used to plot three variables in three dimensional space. Here is an example of the 3-D option, containing the variables, salbegin (X), salary (Y), and educ (Z).
Scatter Plot • Some of the most useful options for modifying your scatterplot are only available after you have the initial scatterplot created.
Scatter Plot • From this window, you have several options for modifying your chart. We will only deal with scatterplot-specific options here. • To get the scatterplot options, select Options or Elements from the Chart menu: • One of the most useful options for adding information to your scatterplot is Fit Line. This option allows you to plot a regression line over your scatter plot.
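A linear fit line of this kind is an ordinary least-squares regression of the Y variable on the X variable. The Python sketch below shows the calculation behind a straight fit line; the function name `fit_line` and the data are hypothetical, and the sketch covers only the linear case.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The line always passes through the point of means, which is why the intercept is computed from the two means and the slope.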