CSCI N207 Data Analysis Using Spreadsheet 10a. Univariate Analysis Part 1 Lingma Acheson linglu@iupui.edu Department of Computer and Information Science, IUPUI
Univariate Analysis • Deals with a single variable, i.e. one data field. • Applies calculations that describe the data in the field, e.g. central tendency, location, dispersion, etc. • Is also called descriptive statistics. • First learn the concepts, then use Excel as a tool.
The Nature of Measurement • Measurement is the process of assigning a number or a value to observations according to an established set of rules. • Numbers with a quantitative basis are amenable to mathematical analysis. • Need to decide on the scale of measurement. • E.g. Age: shall we use categories such as “<10”, “10 – 20”, “21 – 30”, etc., or the actual value? Rankings: shall we use 1, 2, 3, 4, or A, B, C, D? Salary: shall we use 1000 as the unit so the values are 21, 48, etc., or the actual values such as 21000, 48000, etc.? • The choice depends on the purpose for collecting the values, the type of analysis to be performed, etc.
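The choice of scale above can be sketched in a few lines of Python (used here only for illustration; the course itself uses Excel). The age list and the final ">30" bin are assumptions made for this example, since the slide leaves the remaining categories as "etc.".

```python
def age_category(age):
    """Map an exact age to one of the categories named on the slide."""
    if age < 10:
        return "<10"
    if age <= 20:
        return "10-20"
    if age <= 30:
        return "21-30"
    return ">30"  # assumption: the slide's "etc." is collapsed into one bin


ages = [8, 10, 19, 21, 30, 42]  # hypothetical observations
categories = [age_category(a) for a in ages]

salaries = [21000, 48000]  # actual values taken from the slide
salaries_in_thousands = [s // 1000 for s in salaries]  # 1000 as the unit

print(categories)
print(salaries_in_thousands)
```

Once ages are stored as categories, the exact values cannot be recovered, so calculations such as an exact mean age are no longer possible. Changing the salary unit, by contrast, loses nothing as long as the values are whole thousands.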
The Nature of Measurement • The measurement process is influenced by many factors and environmental conditions. E.g. • Experimental error • Instrumental error • Incompleteness • Need to consider the validity and accuracy of measurement.
Validity of Measurement • What is an appropriate and meaningful way to measure a given property? • E.g. Measure the area of a rectangular table. • What tool to use – ruler or tape? • What system to use – the metric system or the British (imperial) system? • How to come up with the area? Further calculation is needed (indirect measurement). • Measurements in the social and behavioral sciences are mostly indirect. E.g. What is a good way to measure how rich a family is? Are drivers over 65 involved in more fatal accidents than drivers under 17?
Accuracy of Measurement • The quality of measurement. • Inaccuracy may be caused by • Systematic error, e.g. a weighing scale that always reads a certain number of pounds low. • Incompleteness, e.g. a small sample size. • Lower precision than required, e.g. the result is needed in millimeters, but the ruler is marked only in centimeters. • Physical measurement is more straightforward than measurement in the social sciences.
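A minimal sketch of two of these error sources, written in Python for illustration. All numbers are hypothetical: a scale assumed to read 2 pounds low, and a 127 mm length read off a ruler marked only in centimeters.

```python
from statistics import mean

# Systematic error: every reading is shifted by the same amount.
true_weights = [150, 162, 171]  # hypothetical true values, in pounds
offset = 2                      # assumption: the scale reads 2 pounds low
readings = [w - offset for w in true_weights]

# Averaging the readings does not remove the shift.
bias = mean(true_weights) - mean(readings)

# Insufficient precision: a millimeter result is needed,
# but the ruler can only be read to the nearest centimeter.
true_length_mm = 127                           # hypothetical true length
ruler_reading_cm = round(true_length_mm / 10)  # nearest centimeter mark
precision_error_mm = abs(ruler_reading_cm * 10 - true_length_mm)

print(bias)
print(precision_error_mm)
```

The systematic error carries through unchanged into the mean, which is why it has to be corrected at the instrument rather than by collecting more readings.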
Descriptive Statistics • Methodology to observe, describe or summarize your data. • Central tendency: mean, median, mode • Dispersion: min/max, range, variance, standard deviation • Distribution • Univariate analysis summarizes the data in one data field.
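As a rough illustration of the dispersion measures listed above, the sketch below uses Python's standard `statistics` module on the 20 scores that serve as this deck's running example. The slide does not say whether the sample or the population form of variance is intended, so both are shown; that choice is an assumption of this example.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

lowest, highest = min(scores), max(scores)
value_range = highest - lowest  # range = max - min

sample_variance = variance(scores)       # divides by n - 1
population_variance = pvariance(scores)  # divides by n

sample_sd = stdev(scores)       # square root of the sample variance
population_sd = pstdev(scores)  # square root of the population variance

print(lowest, highest, value_range)
print(sample_variance, population_variance)
print(sample_sd, population_sd)
```

In Excel the corresponding functions are MIN(), MAX(), VAR.S() / VAR.P() and STDEV.S() / STDEV.P().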
The Mean (Average) • Sum of all values divided by the number of values in the data set. • One measure of central location in the data set. • Mean = $\frac{x_1 + x_2 + \cdots + x_n}{n}$ • E.g. Mean = (73+66+69+67+49+60+81+71+78+62+53+87+74+65+74+50+85+45+63+100)/20 = 68.6 • Excel function: AVERAGE()
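The calculation on this slide can be checked with a short Python sketch (Python is used only for illustration; in the course the same result comes from AVERAGE()). The variable names are chosen for this example.

```python
from statistics import mean

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

# Mean by the definition: sum of all values divided by the number of values.
manual_mean = sum(scores) / len(scores)

# The standard-library helper should agree with the manual calculation.
library_mean = mean(scores)

print(manual_mean)  # 68.6
```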
The Mean (Average) [Figure: two number lines from 0 to 100, each with the mean marked at 65] • The data may or may not be symmetrical around its average value. The mean by itself does not tell you what your data looks like.
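To make the point concrete, here is a small Python sketch with two invented data sets (they are not the data behind the original figure) that share a mean of 65 but are spread very differently.

```python
from statistics import mean, median

# Hypothetical data, chosen so that both sets average to 65.
symmetric = [55, 60, 65, 70, 75]  # evenly spread around 65
lopsided = [0, 75, 75, 75, 100]   # one very low value pulls the mean down

symmetric_mean = mean(symmetric)
lopsided_mean = mean(lopsided)

# The medians differ even though the means are equal.
symmetric_median = median(symmetric)
lopsided_median = median(lopsided)

print(symmetric_mean, lopsided_mean)
print(symmetric_median, lopsided_median)
```

Both means are 65, yet in the second set most values sit above the mean, which is why the mean is usually reported together with other measures.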
The Median • The middle value in a sorted data set. Half the values are greater and half are less than the median. • Another measure of central location in the data set. • Odd number of items: find the middle number. E.g. (1, 2, 4, 7, 8, 9, 9) Median: 7 • Even number of items: find the middle two and get the average of the two. E.g. (45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100) Median: 68 • Excel function: MEDIAN()
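The odd and even cases described above can be sketched as follows. The helper `median_by_definition` is a name made up for this example; Python's built-in `statistics.median` is shown alongside it as a cross-check.

```python
from statistics import median


def median_by_definition(values):
    """Return the middle value of the sorted data, averaging the middle two if needed."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of the middle two


odd_example = [1, 2, 4, 7, 8, 9, 9]
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
          69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

odd_median = median_by_definition(odd_example)  # middle value is 7
even_median = median_by_definition(scores)      # (67 + 69) / 2

print(odd_median, even_median)
print(median(odd_example), median(scores))
```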
The Mode • Most frequently occurring value. • Another measure of central location in the data set. E.g. (45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100) Mode: 74 • Generally not all that meaningful unless a large percentage of the values are the same number. • Excel function: MODE()
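A short Python sketch of the same example, counting how often each value occurs. It also reports what share of the data the mode accounts for, which is one way to judge whether the mode is meaningful here; the variable names are chosen for this example.

```python
from collections import Counter

scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
          69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

# Count occurrences of each value and take the most frequent one.
counts = Counter(scores)
mode_value, mode_count = counts.most_common(1)[0]

# Share of all values that equal the mode.
mode_share = mode_count / len(scores)

print(mode_value, mode_count, mode_share)
```

The mode is 74, but it occurs only twice in 20 values, so it says little about where the data is centered.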