
Initial Data Analysis

whistler

Presentation Transcript


  1. Initial Data Analysis: Frequency

  2. IDA • Often overlooked or sloughed off as not all that important, but… • Much trouble can be avoided at the beginning stages. Glossing over the data can lead to missed findings, or to results that cannot be replicated because they rest on bad data. • Bad data?

  3. IDA includes: • A healthy inspection of the individual variables’ behaviors • Outlier analysis • Descriptive and graphical output

  4. Describing and Exploring Data • Once a bunch of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative. • Several options are available, including plotting the data or calculating descriptive statistics.

  5. Plotting Data • Often, the first thing one does with a set of raw data is to plot frequency distributions. • Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable, and then plotting the frequencies in the table as a histogram.

  6. Frequency Data • Example: Age as estimated by a questionnaire in a statistics class. • Note: The frequencies in the adjacent table were calculated by simply counting the number of subjects having the specified value for the age variable.
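The counting step described on this slide can be sketched in a few lines of Python. The ages below are hypothetical stand-ins, since the slide's own table did not survive in this transcript; only the counting technique is taken from the text.

```python
from collections import Counter

# Hypothetical ages (assumed for illustration; not the class data from the slide).
ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 24]

# Count the number of subjects having each value of the age variable.
freq = Counter(ages)

# Print the frequency table in order of age.
for age in sorted(freq):
    print(f"{age:>3}  {freq[age]}")
```

Each row of the printed table is one bar of the eventual histogram.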

  7. Grouping data • Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did). • However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.

  8. Grouped Frequency Distribution • Example: Binning our weight variable. • For example, with a variable like weight we might obtain a range from 100 lb. to 200 lb. If we used the previously described technique, we would end up with roughly 100 bars, most of them with a frequency below 2 or 3 (and many with a frequency of zero). • We can get around this problem by grouping our values into bins. Try for around 10 classes (or bins) with natural splits.
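A rough sketch of the binning idea, assuming equal-width bins that start at 100 lb. and are 10 lb. wide (about 10 classes over the 100 to 200 range). The weights and the helper name `grouped_frequencies` are hypothetical; the slides do not give the actual data.

```python
from collections import Counter

def grouped_frequencies(values, start, width):
    """Count values in half-open bins [start + k*width, start + (k+1)*width).

    Assumes every value is >= start (an assumption of this sketch).
    """
    counts = Counter((v - start) // width for v in values)
    n_bins = max(counts) + 1
    return [
        (start + k * width, start + (k + 1) * width, counts.get(k, 0))
        for k in range(n_bins)
    ]

# Hypothetical weights in pounds (assumed for illustration).
weights = [104, 118, 121, 125, 133, 137, 138, 142, 146, 149,
           151, 155, 158, 163, 167, 172, 176, 185, 191, 199]

table = grouped_frequencies(weights, start=100, width=10)
for low, high, count in table:
    print(f"{low}-under {high}: {count}")
```

Ten bars replace what would otherwise be about a hundred mostly empty ones.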

  9. Graphic Depiction of Frequency • Histogram • Similar to a bar chart; the only difference is that a histogram represents non-nominal data. • Age example

  10. Weight example • Check out this demo, which shows how the bin width you select can affect the “look” of the data • Here is another similar demonstration of the effects of bin width

  11. Number of Classes and Class Width • The number of classes should be between 5 and 15. • Fewer than 5 classes cause excessive summarization. • More than 15 classes tend not to add much. • Class Width • Divide the range by the number of classes for an approximate class width • Round up to a convenient number
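The class-width rule above can be worked through as a small calculation. This is only a sketch: the lowest and highest values (23 and 74) are taken from the ungrouped introversion scores that follow in this transcript, while the choice of 6 classes and of "a convenient number" meaning a multiple of 5 are assumptions made here.

```python
import math

def class_width(lowest, highest, n_classes, round_to):
    """Approximate class width: range / number of classes, rounded up
    to the next multiple of round_to (the 'convenient number')."""
    approximate = (highest - lowest) / n_classes
    return math.ceil(approximate / round_to) * round_to

# Range is 74 - 23 = 51; with 6 classes the approximate width is 8.5.
width = class_width(lowest=23, highest=74, n_classes=6, round_to=5)
print(width)  # rounds 8.5 up to 10
```

A width of 10 with a natural starting point of 20 gives classes such as 20-under 30 and 30-under 40.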

  12. Example of Ungrouped Data: Scores on a social introversion inventory

      42 26 32 34 57 30 58 37 50 30
      53 40 30 47 49 50 40 32 31 40
      52 28 23 35 25 30 36 32 26 50
      55 30 58 64 52 49 33 43 46 32
      61 31 30 40 60 74 37 29 43 54

  13. Relative Frequency

      Class Interval   Frequency   Relative Frequency
      20-under 30           6            .12
      30-under 40          18            .36
      40-under 50          11            .22
      50-under 60          11            .22
      60-under 70           3            .06
      70-under 80           1            .02
      Total                50           1.00
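The frequencies and relative frequencies in this table can be reproduced from the 50 ungrouped scores on slide 12. A minimal sketch, assuming classes of width 10 starting at 20 (as in the table); the variable names are illustrative only.

```python
# The 50 introversion scores from slide 12.
scores = [42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54]

LOW, WIDTH, N_CLASSES = 20, 10, 6

# Frequency: count the scores falling in each class interval.
counts = [0] * N_CLASSES
for s in scores:
    counts[(s - LOW) // WIDTH] += 1

# Relative frequency: class frequency divided by the total number of scores.
total = len(scores)
relative = [round(c / total, 2) for c in counts]

for k, (c, r) in enumerate(zip(counts, relative)):
    low = LOW + k * WIDTH
    print(f"{low}-under {low + WIDTH}  {c:>2}  {r:.2f}")
```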

  14. Cumulative Frequency

      Class Interval   Frequency   Cumulative Frequency
      20-under 30           6             6
      30-under 40          18            24
      40-under 50          11            35
      50-under 60          11            46
      60-under 70           3            49
      70-under 80           1            50
      Total                50

  15. Class Midpoints, Relative Frequencies, and Cumulative Frequencies

      Class Interval   Frequency   Midpoint   Relative Frequency   Cumulative Frequency
      20-under 30           6         25            .12                     6
      30-under 40          18         35            .36                    24
      40-under 50          11         45            .22                    35
      50-under 60          11         55            .22                    46
      60-under 70           3         65            .06                    49
      70-under 80           1         75            .02                    50
      Total                50                      1.00

  16. Cumulative Relative Frequencies

      Class Interval   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
      20-under 30           6            .12                     6                        .12
      30-under 40          18            .36                    24                        .48
      40-under 50          11            .22                    35                        .70
      50-under 60          11            .22                    46                        .92
      60-under 70           3            .06                    49                        .98
      70-under 80           1            .02                    50                       1.00
      Total                50           1.00
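The cumulative columns are running totals of the class frequencies. A short sketch using the frequencies from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for 20-under 30 through 70-under 80, from the table.
frequencies = [6, 18, 11, 11, 3, 1]
total = sum(frequencies)

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: running total divided by the overall total.
cumulative_relative = [round(c / total, 2) for c in cumulative]

print(cumulative)
print(cumulative_relative)
```

The last cumulative frequency always equals the total, and the last cumulative relative frequency always equals 1.00, which makes a quick check on the arithmetic.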

  17. Histogram Construction

      Class Interval   Frequency
      20-under 30           6
      30-under 40          18
      40-under 50          11
      50-under 60          11
      60-under 70           3
      70-under 80           1
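The slide's histogram image is not part of this transcript. As a stand-in, here is a minimal text rendering built from the same table, with one symbol per observation; `text_histogram` is a hypothetical helper, not something from the slides.

```python
def text_histogram(classes, symbol="*"):
    """Return one line per class: the interval label followed by a bar
    whose length equals the class frequency."""
    lines = []
    for low, high, frequency in classes:
        label = f"{low}-under {high}"
        lines.append(f"{label:<12}| {symbol * frequency}")
    return lines

# (lower limit, upper limit, frequency) for each class in the table.
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

lines = text_histogram(classes)
print("\n".join(lines))
```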

  18. Frequency Polygon

      Class Interval   Frequency
      20-under 30           6
      30-under 40          18
      40-under 50          11
      50-under 60          11
      60-under 70           3
      70-under 80           1
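A frequency polygon plots each class frequency at the class midpoint and joins the points with straight lines. The sketch below only computes the points. Closing the polygon with zero-frequency points one class width beyond each end is a common convention assumed here, not something the slide states, and `polygon_points` is a hypothetical name.

```python
def polygon_points(classes):
    """Return (midpoint, frequency) pairs, with a zero-frequency point
    added one class width before the first and after the last midpoint."""
    width = classes[0][1] - classes[0][0]
    points = [((low + high) / 2, freq) for low, high, freq in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

# (lower limit, upper limit, frequency) for each class in the table.
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

points = polygon_points(classes)
for midpoint, freq in points:
    print(midpoint, freq)
```

The resulting list can be passed to any line-plotting tool to draw the polygon.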

  19. Advantages/Disadvantages • With the grouped frequency distribution we can take large data sets and make them much more manageable and easier to understand. • However, we also lose information about individual data points.

  20. Stem and Leaf Plots • If values of a variable must be grouped prior to creating a frequency plot, then the information related to the specific values becomes lost in the process (i.e., the resulting graph depicts only the frequency values associated with the grouped values). • However, it is possible to obtain the graphical advantage of grouping and still keep all of the information if stem & leaf plots are used.

  21. Stem and Leaf Plots • These plots are created by splitting a data point into that part associated with the ‘group’ and that associated with the individual point. • For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as: • 18 | 001256779

  22. Raw Data

      86 77 91 60 55 76 92
      47 88 67 23 59 72 75
      83 77 68 82 97 89 81
      75 74 39 67 79 83 70
      78 91 68 49 56 94 81

  23. Construction of Stem and Leaf Plot

      Stem | Leaf
         2 | 3
         3 | 9
         4 | 7 9
         5 | 5 6 9
         6 | 0 7 7 8 8
         7 | 0 2 4 5 5 6 7 7 8 9
         8 | 1 1 2 3 3 6 8 9
         9 | 1 1 2 4 7
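The construction can be sketched in Python by splitting each value into a tens stem and a units leaf. This assumes non-negative integers with a single-digit leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to a string of its sorted leaves (value % 10).

    Stems with no data between the smallest and largest stem are kept
    as empty rows so the shape of the distribution stays visible.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    return {s: "".join(leaves.get(s, []))
            for s in range(min(leaves), max(leaves) + 1)}

# The 35 raw values from the slide.
raw = [86, 77, 91, 60, 55, 76, 92, 47, 88, 67, 23, 59, 72, 75, 83, 77, 68, 82,
       97, 89, 81, 75, 74, 39, 67, 79, 83, 70, 78, 91, 68, 49, 56, 94, 81]

plot = stem_and_leaf(raw)
for stem, leaf_string in plot.items():
    print(f"{stem:>2} | {leaf_string}")
```

Applied to the earlier example (180, 180, 181, 182, 185, 186, 187, 187, 189), the same function yields the single row with stem 18 and leaves 001256779.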

  24. Thus, we could represent our weight data in the following stem & leaf plot:

  25. Stem & leaf plots are especially nice for comparing distributions.

  26. Advantages • Using a stem and leaf offers several advantages • It retains individual data points • Displays large amounts of data well (compared to a normal frequency distribution) • Provides a ‘graphical’ display of the data • Disadvantage • Kind of ugly

  27. Terminology Related to Distributions • Often, frequency histograms have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

  28. Sometimes, the bell shape is not symmetrical. • The term positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew refers to the situation where the “tail” is to the left.
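The direction of skew can also be checked numerically. The slides name no particular statistic; the sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common choice and an assumption here. Its sign is positive when the tail is to the right and negative when the tail is to the left.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# The 50 introversion scores from slide 12; most scores sit in the 30s,
# with a tail stretching out toward 74.
scores = [42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54]

g1 = skewness(scores)
print("positive skew" if g1 > 0 else "negative skew" if g1 < 0 else "symmetric")
```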

  29. Example: Pizza Data

  30. Distribution Shapes
