Chapter 7: Statistical Analysis with Excel

Spreadsheet-Based Decision Support Systems Chapter 7: Statistical Analysis with Excel Prof. Name name@email.com Position (123) 456-7890 University Name

Overview • 7.1 Introduction • 7.2 Understanding Data • 7.3 Relationships in Data • 7.4 Distributions • 7.5 Summary

Introduction • Performing basic statistical analysis of data using Excel functions • Statistical features of the Data Analysis ToolPak • Trend curves for analyzing data patterns • Basic linear regression techniques in Excel • Several different distribution functions in Excel

Understanding Data • Statistical Functions • Descriptive Statistics • Histograms

Statistical Functions • AVERAGE • Finds the mean of a set of data. • =AVERAGE(range or range_name) • MEDIAN • Finds the middle number in a list of sorted data. • =MEDIAN(range or range_name) • STDEV • Finds the standard deviation of a set of data. • This is equal to the square root of the variance, which measures the difference between the mean of the data set and the individual values. =STDEV.P(range or range_name) =STDEV.S(range or range_name)

Figures 7.1 and 7.2

Data Analysis ToolPak • An Excel Add-In which includes several statistical analysis techniques • To ensure that it is an active Add-in: • Display Excel Options dialog box: • Select Options from the list of options in the File tab. • Select the Add-Ins tab on the left side of the dialog box. • Select Analysis ToolPak listed on the Add-ins window.

Descriptive Statistics • Provides a list of statistical information about your data set including • Mean • Median • Standard deviation • Variance • Click on Data > Analysis > Data Analysis command to display the Data Analysis dialog box. • Choose the Descriptive Statistics option and click OK.

Descriptive Statistics (cont’d) • The Input Range refers to the location of the data set. • Check the option button Columns or Rows to indicate how your data is grouped. • If there are labels in the first row of each column of data, then check the Labels in First Row box. • The Output Range refers to where the results of the analysis will be displayed in the current worksheet. • Check the Summary Statistics box to calculate the most commonly used statistics.

Figure 7.7 • Quarterly stock returns for three different companies are recorded. We want to know • Average stock return • Variability of stock returns • Which quarters had the highest and lowest stock returns

Figure 7.10 • Almost all of the data points lie between +2s and –2s from the mean. • Outliers are data that are inconsistent with the main pattern of data.

Figure 7.11 • The standard deviation is used to identify outliers in a data set.

Figure 7.12 • Conditional Formatting with the Formula Is option is used to identify outliers. • Select the column of values in the data set; and fill in the Conditional Formatting dialog box to highlight outlier points.

Figure 7.13 • The cell that holds an outlier is highlighted.

More Descriptive Statistics • Confidence Level for Mean • The mean is calculated using the specified confidence level (for example, 95% or 99%), the standard deviation, and the size of the sample data. • The confidence level and calculated mean are then added to the analysis report. • You can compare the actual mean to this calculated mean based on the specified confidence level. • Kth Largest • Gives the largest ranked data value for a specified value of k. • For k = 1, the maximum data value would be returned. • Kth Smallest • Gives the smallest ranked data value for a specified value of k. • For k = 1, the minimum data value would be returned.

Descriptive Statistics Functions • PERCENTILE.INC • Returns a value for which a desired percentile k of the specified data_set falls below. = PERCENTILE.INC(data_set, k) • For example, for the MSFT data, the value for which 95% of the data falls below is =PERCENTILE.INC(B4:B27,0.95) = 0.108 • PERCENTILE.EXC • Excludes the value of k-th percentile from the calculations = PERCENTILE.EXC (data_set, k) • For the MSFT data, the value for which 95% of the data falls below is =PERCENTILE.EXC(B4:B27,0.95) = 0.135

Descriptive Statistics Functions (cont’d) • PERCENTRANK.INC • Returns the percentile of the data_set which falls below a given value. =PERCENTRANK.INC(data_set, value) • For example, percent of the MSFT data falls below the value 0.108, inclusive of 0.108 is =PERCENTRANK.INC(B4:B27, 0.108) = 0.95, or 95% • PERCENTRANK.EXC • Calculates the same percentile, exclusive of the value of k. =PERCENTRANK.EXC(data_set, value) • For example, percent of the MSFT data falls below the value 0.135, exclusive of 0.135 is =PERCENTRANK.EXC(B4:B27, 0.135) = 0.95, or 95%

Histograms • Histograms calculate the number of occurrences, or frequency, with which values in a data set fall into various intervals. • Choose the Histogram option from the Analysis ToolPak list.

Histograms (cont’d) • The Input Range is the range of the data set. • The Bin Range is used to specify the location of the bin values. • Bins are the intervals into which values can fall; they can be defined by the user or can be evenly distributed among the data by Excel. • The Output Range is the location of the output, or the frequency calculations for each bin. • The chart options include a simple Chart Output (the actual histogram), CumulativePercentage for each bin value, and a Pareto organization of the chart.

Figures 7.17 and 7.18 • To create your own bin values, make a list of upper bounds for each interval.

Figure 7.19

Histograms (cont’d) • To change the format of a Histogram: • Click on the histogram to activate the Chart Tools contextual tabs. • Use the commands listed on these tabs to change the design, layout and format of the histogram.

Histograms (cont’d) • There are four basic shapes to a histogram: • Symmetric: has peaks and dips with equal amplitude • A curve with only one peak is also symmetric; that is, there is a central high part and almost equal lower parts to the left and right of this peak. • Positively skewed: has a peak on the left and many lower points (stretching) to the right. • Negatively skewed: has a peak on the right and many lower points (stretching) to the left. • Multiple peaks: imply that more than one source, or population, of data is being evaluated.

Relationships in Data • Trend Curves • Regression

Data Relationships • Relationships in data are usually identified by comparing two variables: the dependent variableand theindependent variable. • The dependent variable is the variable we are most interested in. By understanding its current behavior we can better predict its future behavior. • The independent variable is the variable we use as the comparison in order to make this prediction.

Trend Curves • Trend curves are used to graph and analyze these relationships between data. • Trend curves graph the data with • The independent variable on the x-axis • The dependent variable on the y-axis • To add a trend curve to your chart: • Click on the data points in an XY Scatter chart to activate Chart Tools contextual tabs. • Click on the Chart Tools Layout > Analysis > Trendline command. • Select a trend curves from the trendlines options listed.

Trend Curves (cont’d) • There are six types of trend curves which Excel can model: • Exponential • Linear • Logarithmic • Polynomial • Power • Moving Average

Trend Curves (cont’d) • Double click on a trendline to activate the Format Trendline dialog box. • We can modify: • The type of the trendline by selecting one of the options listed. • The trendline’s name. • We can specify a period forward or backward for which we want to predict the behavior of our dependent variable.

Linear Trend Curves • Number of Units Produced each month and the corresponding Monthly Plant Cost are recorded. • The company needs to estimate plant costs based on the planned production amounts. • The dependent variable is therefore the Monthly Plant Cost and the independent variable is the Units Produced.

Figure 7.25 • Begin this analysis by making an XY Scatter chart of the data.

Figure 7.26 • Right-click on any of the data points and choose Add Trendline from the short-cut menu. • The Format Trendline dialog box appears. • Select Linear from the Types listed. • Select Display Equation on Chart checkbox.

Figure 7.27 • The trendline and the equation are then added to the chart.

Figure 7.28 • Use the displayed equation to predict future values. • Check the accuracy of the equation by calculating the error from the known data. • Linear trends have the relationship: y = a*x - b

Figure 7.29 • Copy the formula for “Predicted Cost” to the rest of the rows to calculate the predicted monthly costs.

Exponential Trend Curves • Sales data for ten years is recorded. • We want to predict sales for the next few years. • The independent variable is Years and our dependent variable is Sales.

Figure 7.31 • Exponential trends have the following relationship: • y = a*e^(b*x)or • y = a*EXP(b*x) • Build a XY Scatter chart of the data. • Right-click on a data point to add the trendline. • Choose the Exponential curve to fit the data.

Figure 7.34 • We use the formula to predict sales values for future years. • However, the Exponential trend curve has a sharply increasing slope that may not be accurate for many situations.

Power Trend Curves • We are given yearly Production values and yearly Unit Cost for production. • We want to determine the relationship between Unit Cost and Production in order to predict future Unit Costs.

Figure 7.36 • Power trends have the relationship: y = a*x^b • Begin by creating the XY Scatter chart. • Right-click on a data point to add a trendline. • Choose a Power curve to fit the data.

Regression Analysis • We can use some regression analysis parameters to ensure that the relationships we have chosen for our data are “good” fits. • These parameters include • R-Squared value • Standard error • Slope • Intercept

R-Squared Value • The R-Squared value measures the amount of influence the independent variable has on the dependent variable. • The closer the R-Squared value is to 1, the stronger the relationship is between the independent and dependent variables. • If the R-Squared value is closer to 0, then there may not be a relationship between these two variables.

Figure 7.39 • We fit a Linear trendline to the Monthly Plant Cost per Units Produced chart. • The R-Squared value is 0.8137, which is fairly close to 1, implying a good fit.

Figure 7.40 • We fit an Exponential trendline to the Sales per year chart. • The R-Squared value is 0.9828, which is fairly close to 1, implying a sound fit.

Figure 7.41 • We fit a Power trendline to the Unit Cost per Cumulative Production chart. • The R-Squared value is 0.9062, which is fairly close to 1, implying a good fit.

Figure 7.42 • The RSQ Excel function can calculate the R-squared value from a set of data. • =RSQ(y_range, x_range) • Note that this function only works with Linear trend curves.

Chapter 7: Statistical Analysis with Excel

Chapter 7: Statistical Analysis with Excel

Presentation Transcript

Exploring Microsoft Excel

EXCEL Templates

Meta-analysis

Multivariate Data Analysis Using SPSS

Enriching Microsoft PowerPivot for Microsoft Excel Applications Using DAX (Data Analysis Expressions)

Excel Chapter 3

Statistical Evaluation of Data

Statistical Inference and Regression Analysis: GB.3302.30

Statistical Inference and Regression Analysis: GB.3302.30

EXCEL 2003 vs. 2007

Statistical Quality Control

Instructions for Operation of Excel Sensitivity and Scenario Analysis Used in the Course

SPSS

INPO Statistics Workshop: Notes and Exercises using EXCEL and SAS

Multivariate Data Analysis Using SPSS