Statistical Methods in Computer Science

Descriptive Statistics • Data 1: Frequency Distributions • Ido Dagan

Concrete Theory: Relates Variables to Each Other • Examples: • Mathematically accurate • Memory = 2*sizeof(input) + 3 • Runtime = 500 + 30*sizeof(input) + 20 • Asymptotically correct • Memory = O(sizeof(input)) in worst case • Runtime = O(log (sizeof(input))) in best case • Accuracy is proportional to run-time • Qualitative • User performance increases with reduced cognitive load • The number of bugs discovered decreases monotonically (but stays positive) if the same programmer is used; otherwise, it increases

Behavior Parameters/Variables (typical of Computer Science) • Hardware parameters • CPU model and organization, cache organization, latencies in the system • System parameters • Memory availability, usage • CPU running time (sometimes approximated by world-clock time) • Communication bandwidth, usage • Program characteristics • requires floating-point, heavy disk usage, integer math, graphics • large heap, large stack, uses non-local information, ...

Scales of Measurements • Nominal (also called categorical): No order, just labels • e.g., “Algorithm Name” • Ordinal (also called rank): Order, but not numerical • Difference between ranks is not necessarily the same • e.g., ranks in (hierarchical/military) organization • Interval: Difference between values has same meaning everywhere • e.g., temperature in Celsius (rise of 10 degrees is the same everywhere) • But 100C is not twice as hot as 50C, and 0C is not lack of heat • Ratio: Interval + Fixed zero point • e.g., robot position, memory usage, run-time

Scale Hierarchy • Nominal < Ordinal < Interval < Ratio • Interval and ratio scales together are called “numerical” • Propositions that are true for some level are true for the levels above it • But not necessarily the other way around • e.g., we can calculate the mean (average) value for numerical variables • But not for nominal and ordinal • e.g., we can calculate the most frequent value for all variables • See: http://en.wikipedia.org/wiki/Levels_of_measurement
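A small Python sketch of the hierarchy in practice: the most frequent value is available on every scale, while the mean only makes sense for numerical (interval/ratio) data. The algorithm names and run-times are made-up illustrative values, not data from the course.

```python
from collections import Counter
from statistics import mean

# Nominal variable: labels only, no order (hypothetical values).
algorithms = ["quicksort", "mergesort", "quicksort"]

# The most frequent value is defined for every scale of measurement.
most_frequent = Counter(algorithms).most_common(1)[0][0]

# Ratio variable: run-times in seconds (hypothetical values).
runtimes = [1.0, 2.0, 3.0]

# The mean is meaningful only for numerical (interval/ratio) variables.
average_runtime = mean(runtimes)

print(most_frequent, average_runtime)
```

Asking for the mean of `algorithms` would raise an error, which matches the rule that nominal labels have no average.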

Variables • Discrete: • Can take on only certain values: symbols, exact numbers • For ordinal, interval and ratio scales, this means there will be gaps • e.g., user satisfaction surveys, memory usage • Continuous: • Can take on any value within its range: no gaps • e.g., run-time, CPU temperature, robot velocity and position • In practice: limited by measurement accuracy • Up to the researcher to determine the needed accuracy

Data • The collection of values that a variable X took during the measurement

Describing Data • Our task: • Describe the data we have collected • Find ways to characterize it, represent it • Find properties that are true of the data

Data Distribution • The collection of data is called the sample distribution • We will investigate distributions: • Find values that “best” represent a distribution • Measure their dispersion, range, shape • Identify extraordinary values in a distribution • Find visual representations for a distribution • Remember hierarchy: Nominal < Ordinal < Interval < Ratio • Think about how the following techniques apply

Frequency Distribution • Examine the frequency of values • f(x) = # of times variable took on value x. Convention (Ordinal/Numerical): Sort by value
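As a minimal sketch, f(x) can be computed by counting occurrences and then sorting by value, following the convention above. The `scores` list is made-up data, and `frequency_distribution` is a hypothetical helper name.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (x, f(x)) pairs sorted by value x."""
    counts = Counter(values)       # f(x) = number of times x occurs
    return sorted(counts.items())  # convention: sort by value

scores = [3, 1, 2, 3, 3, 2]  # hypothetical measurements
freq = frequency_distribution(scores)
for x, f in freq:
    print(x, f)
```

For a nominal variable the same counting applies, but the sort order carries no meaning.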

Grouped Frequency Distributions • In ordinal/numerical variables, possible to group values together • Create Grouped Frequency Distributions Warning: Loss of Information
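One way to sketch grouping, assuming integer scores and equal-width intervals: each value is mapped to the interval it falls in, and only the count per interval is kept. The data, the width of 5, and the starting point of 65 are illustrative assumptions.

```python
from collections import Counter

def grouped_frequency(values, width, start):
    """Count integer values per interval [start + k*width, start + (k+1)*width - 1].

    Intervals with no values are omitted from the result.
    """
    counts = Counter((v - start) // width for v in values)
    return [
        ((start + k * width, start + (k + 1) * width - 1), f)
        for k, f in sorted(counts.items())
    ]

scores = [65, 67, 72, 85, 86, 89, 90]  # hypothetical integer scores
grouped = grouped_frequency(scores, width=5, start=65)
for (low, high), f in grouped:
    print(f"{low}-{high}: {f}")
```

After grouping, 85, 86 and 89 can no longer be told apart; this is the loss of information the warning refers to.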

Real and Apparent Limits • Continuous values are more difficult to divide into intervals • A score of 95 falls within 95-99, not within 90-94 • But what about a temperature of 94.87? It falls between the apparent limits: 94 < 94.87 < 95 • By convention, the real limits of a score are within ½ the measurement resolution • If our resolution is 0.1, then the limits are within 0.05 • If our resolution is 100, then the limits are within 50 • Note: we break the convention only in exceptional cases • e.g., age: “I am 35” is true of [35.0 .. 36.0)

Real/Apparent Limits • For example: • Resolution of 0.01. Interval 95..99 really covers values 94.995 to 99.005 • Apparent limits: 95..99 • Real limits: 94.995 to 99.005 • Resolution of 10: 740-800 really covers values 735 to 805.
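The half-resolution convention can be sketched as a small function that reproduces the two examples above. `real_limits` is a hypothetical helper name; `Decimal` is used only to keep the arithmetic exact for printing.

```python
from decimal import Decimal

def real_limits(apparent_low, apparent_high, resolution):
    """Extend the apparent limits by half the measurement resolution on each side."""
    half = Decimal(resolution) / 2
    return Decimal(apparent_low) - half, Decimal(apparent_high) + half

# The two examples from the text.
fine = real_limits("95", "99", "0.01")
coarse = real_limits("740", "800", "10")
print(fine)
print(coarse)
```

This sketch does not cover exceptions to the convention, such as age.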

Relative Frequency Distributions • A frequency count can be misleading • Algorithm X was fastest on 60,000 trials: Is this good? • 100,000 people voted for candidate A: Is she the winner? • Relative frequency distributions: translate f into percentage or ratio • rel f (proportion) = f/N • rel f (%) = 100 * f/N • Warning: Can be misleading, if ignoring count magnitude • 50% of all test cases succeeded (with only two cases…)

Relative Frequency Distributions • Example: f/N
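A minimal sketch of rel f = f/N, keeping the raw count next to each proportion so that the count magnitude stays visible. The two-case data set is made up to mirror the warning about small samples.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each value x to (f, f/N), where N is the number of observations."""
    n = len(values)
    counts = Counter(values)
    return {x: (f, f / n) for x, f in sorted(counts.items())}

outcomes = ["pass", "fail"]  # hypothetical: only two test cases
rel = relative_frequencies(outcomes)
for x, (f, proportion) in rel.items():
    print(f"{x}: f={f}, rel f={proportion:.2f} ({100 * proportion:.0f}%)")
```

Here "50% passed" is true but rests on a single case, which is why the count is reported alongside the percentage.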

Cumulative Frequency Distribution • For ordinal/numerical variables • Shows where values stand with respect to others: how many fall below or above a given value

Cumulative Frequency Distribution • Based on the cumulative distribution, we can answer questions such as: • What percentage of scores fall below 80? • How many scores fall below 95?
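A sketch of how such questions might be answered in code, assuming ungrouped scores. The data is made up, and `cumulative_frequency` and `count_below` are hypothetical helper names.

```python
from collections import Counter
from itertools import accumulate

def cumulative_frequency(values):
    """Return (x, cum f) pairs: the number of observations less than or equal to x."""
    counts = sorted(Counter(values).items())
    xs = [x for x, _ in counts]
    cum = accumulate(f for _, f in counts)  # running total of the frequencies
    return list(zip(xs, cum))

def count_below(values, threshold):
    """Number of observations strictly below the threshold."""
    return sum(1 for v in values if v < threshold)

scores = [70, 75, 75, 80, 95, 95]  # hypothetical scores
cum_f = cumulative_frequency(scores)
percent_below_80 = 100 * count_below(scores, 80) / len(scores)
below_95 = count_below(scores, 95)
print(cum_f, percent_below_80, below_95)
```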

Percentiles, Percentile Ranks • (Value of) Percentile X: Value for which X percent of values are lower • e.g. baby height • We use Px to denote the Xth percentile, e.g., P98 is in range 90-94. • Percentile rank of value X: the percent of values that fall below X. • e.g., percentile rank of the interval 65-69 is 12.

Computing Percentiles, P. Ranks • How do we compute percentiles and percentile ranks from grouped data? • What is the score which defines the top 20% of scores? • Is it between 84 and 85?

Computing Percentiles • We want to compute P80. 80% of 50 cases = 40 cases. • We look under the cum f heading: 32 of the 40 scores we need are less than 84.5 (real limit), so we need 8 more. • The interval 85-89 (real limits 84.5 to 89.5) contains 47-32 = 15 cases. • These are spread over a width of 5 (= 89.5-84.5). • Assume scores are evenly distributed within the interval • 8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation) • P80 = 84.5 + 2.67 = 87.17
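The interpolation can be written as a short function, assuming scores are evenly spread within the interval. The parameter names are hypothetical; the numbers passed in are the ones from the worked example.

```python
def percentile_from_group(p, n, lower_real, width, cum_below, f_interval):
    """Linearly interpolate the p-th percentile inside one grouped interval.

    p          -- percentile to compute (e.g. 80)
    n          -- total number of cases
    lower_real -- lower real limit of the interval containing the percentile
    width      -- interval width between real limits
    cum_below  -- cumulative frequency below lower_real
    f_interval -- number of cases inside the interval
    """
    target = p * n / 100          # cases that must lie below the percentile
    needed = target - cum_below   # cases still needed from within the interval
    return lower_real + needed / f_interval * width

p80 = percentile_from_group(80, n=50, lower_real=84.5, width=5,
                            cum_below=32, f_interval=15)
print(round(p80, 2))
```

The caller has to locate the correct interval first, by finding where cum f first reaches the target count.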

Computing Percentile Ranks • We want to compute the percentile rank of 86 • It lies in the interval 85-89, real limits 84.5 to 89.5 • 86 - 84.5 = 1.5 score points • Width of interval = 5. Assuming a uniform spread of scores in the interval: 1.5/5 = 0.3 ==> 30% of the interval's 15 scores lie below 86 (0.3*15 = 4.5) • So we have 32 scores up to 84.5 • Plus 4.5 scores from 84.5 to 86 • Total: 4.5 + 32 = 36.5 scores • 36.5/50 = 73%. This is the percentile rank of 86.
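The same steps as a sketch in Python, again assuming a uniform spread of scores within the interval. The function and parameter names are hypothetical; the inputs are the values from the worked example.

```python
def percentile_rank_in_group(x, n, lower_real, width, cum_below, f_interval):
    """Percent of cases below x, interpolating linearly inside x's interval."""
    fraction = (x - lower_real) / width          # share of the interval below x
    cases_below = cum_below + fraction * f_interval
    return 100 * cases_below / n

rank_86 = percentile_rank_in_group(86, n=50, lower_real=84.5, width=5,
                                   cum_below=32, f_interval=15)
print(rank_86)
```

This is the inverse of the percentile computation: it goes from a score to a percentage instead of from a percentage to a score.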

Frequency Distributions and Scales

Displaying Frequency Distributions: Nominal Data

Displaying Frequency Distributions: Ordinal/Numerical Data • Histogram

Displaying Frequency Distributions: Ordinal/Numerical Data • Histogram: Different Grouping

Lying with Visuals

Characteristics of Distributions • Shape, Central Tendency, Variability • [Figure panels: Different Central Tendency; Different Variability]
