ST3905 Lecturer : Supratik Roy Email : s.roy@ucc.ie (Unix) : supratik@stat.ucc.ie Phone: ext. 3626
What do we want to do? • What is statistics? • Describing information : summarization, visual and non-visual representation • Drawing conclusions from information : managing uncertainty and incompleteness of information
Describing Information • Why summarization of information? • Visual representation (aka graphical Descriptive Statistics) • Non-visual representation (numerical measures) • Classical techniques vs modern IT
Stem and Leaf Plot
Decimal point is 2 places to the right of the colon
0 : 8
1 : 000011122233333333333344444
1 : 55555566666677777778888888899999999999
2 : 0000000111111111111222222233333333444444444
2 : 555556666666666777778889999999999999999
3 : 000000001111112222333333333444
3 : 55555555666667777777888888899999999
4 : 0122234
4 : 55555678888889
5 : 111111134
5 : 555667
6 : 44
6 : 7
Rules for Histograms • Height of Rectangle proportional to frequency of class • No. of classes proportional to sqrt(total no. of observations) [not a hard and fast rule] • In case of categorical data, keep rectangle widths identical, and base of rectangles separate. • Best, if possible, let the software do it.
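The rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software used in the lecture: `histogram_counts` is a hypothetical helper, the class count follows the sqrt(n) rule of thumb from the slide, and the sample is the set of 25 algebra test scores that appears on a later slide.

```python
import math

def histogram_counts(data, n_classes=None):
    """Return (class edges, frequencies) for equal-width classes."""
    n = len(data)
    if n_classes is None:
        # Rule of thumb from the slide: roughly sqrt(n) classes.
        n_classes = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    # Equal-width classes; fall back to width 1 if all values coincide.
    width = (hi - lo) / n_classes or 1.0
    counts = [0] * n_classes
    for x in data:
        # The maximum lies on the last edge, so clamp it into the last class.
        k = min(int((x - lo) / width), n_classes - 1)
        counts[k] += 1
    edges = [lo + i * width for i in range(n_classes + 1)]
    return edges, counts

# The 25 algebra test scores shown later in these slides.
scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
          58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38]

edges, counts = histogram_counts(scores)

# Text histogram: bar length is proportional to the class frequency.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * c} ({c})")
```

With 25 observations the rule gives 5 classes. A plotting package would normally choose the classes itself, which is the point of the last rule on the slide.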
Data
[1] -0.053626486 -0.828128399 0.214910482 0.346570399
[5] -0.849316517 0.001077376 0.736191791 1.417540397
[9] -2.382332275 -2.699019949 -0.111907192 1.384903284
[13] 2.113286699 -1.828108272 -1.108280724 0.131883612
[17] -0.394494473 0.829806888 0.023178033 0.019839537
[21] -0.346280222 -0.251981108 1.159853307 -0.249501904
[25] -1.342704742 -2.012653224 -1.535503208 0.869806233
[29] -1.313495887 -0.244408426 -0.998886998 -1.446769605
[33] 1.224528053 -0.410163230 0.032230907 -0.137297112
[37] -2.717620031 -0.728570438 0.034697116 2.202863874
[41] -0.170794163 0.353651680 -0.673296374 3.136364814
[45] -1.260108638 -0.367334893 -0.652217259 -0.301847039
[49] 0.315180215 0.190766333
Non-Visual (numerical measures) • Pictures vs. quantitative measures • Criteria for selection of a measure – purpose of study • Qualities that a measure should have • We live in an uncertain world – chances of error
Measures of Location • Mean : sum(x)/N • Mode • Median
Location : mean, median
Algebra test scores (first row of each pair: observation number; second row: score)
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 43 50 41 69 52 38 51 54 43 47 54 51 70 58 44 54 52 32 42 70
 21 22 23 24 25
 50 49 56 59 38
Mean = 50.68
10% trimmed mean of scores = 50.33333
Median = 51
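The three location values on this slide can be recomputed from the 25 scores. This is a minimal sketch: `trimmed_mean` is a hypothetical helper, and it assumes that a 10% trimmed mean drops floor(0.10 × n) values from each end of the sorted data, a convention that reproduces the value quoted on the slide.

```python
import math
import statistics

scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
          58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38]

def trimmed_mean(data, trim):
    """Mean after dropping floor(trim * n) values from each end of the sorted data."""
    xs = sorted(data)
    k = math.floor(trim * len(xs))   # assumed trimming convention
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
trimmed_score = trimmed_mean(scores, 0.10)

print("Mean =", mean_score)
print("10% trimmed mean =", round(trimmed_score, 5))
print("Median =", median_score)
```

Here two scores are trimmed from each end (32, 38 and 70, 70), so the trimmed mean averages the remaining 21 values.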
Location : Non-classical • An M-estimate of location is a solution mu of the equation sum(psi((y-mu)/s)) = 0, where psi is a chosen score function and s is a scale estimate. • Data set car.miles : bisquare estimate = 204.5395, Huber's estimate = 204.2571
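To make the estimating equation concrete, here is a rough sketch of a Huber-type M-estimate of location. Everything specific in it is an assumption rather than the lecture's software: the tuning constant c = 1.345, the MAD (scaled by 1.4826) as the fixed scale s, the iteratively reweighted averaging scheme, and the small made-up sample. It does not attempt to reproduce the car.miles figures.

```python
import statistics

def mad_scale(xs):
    """Median absolute deviation, scaled by 1.4826 (assumed convention)."""
    med = statistics.median(xs)
    return 1.4826 * statistics.median([abs(x - med) for x in xs])

def huber_psi(u, c=1.345):
    """Huber's psi: identity in the middle, clipped to [-c, c] in the tails."""
    return max(-c, min(c, u))

def huber_location(ys, c=1.345, tol=1e-10, max_iter=200):
    """Solve sum(psi((y - mu) / s)) = 0 by iteratively reweighted averaging."""
    s = mad_scale(ys)
    mu = statistics.median(ys)        # robust starting value
    if s == 0:
        return mu                     # degenerate scale: keep the median
    for _ in range(max_iter):
        weights = []
        for y in ys:
            u = (y - mu) / s
            # Weight psi(u)/u: 1 for central points, c/|u| for outlying ones.
            weights.append(1.0 if abs(u) <= c else c / abs(u))
        new_mu = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical sample with one gross outlier.
sample = [10, 11, 9, 10, 12, 10, 11, 100]
mu_hat = huber_location(sample)
print("mean   =", statistics.mean(sample))
print("median =", statistics.median(sample))
print("Huber  =", round(mu_hat, 4))
```

On this sample the outlier pulls the mean far to the right, while the M-estimate stays close to the bulk of the data.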
Measures of Scale (aka Dispersion) • Variance (unbiased) : sum((x-mean(x))^2)/(N-1) • Variance (biased) : sum((x-mean(x))^2)/(N) • Standard Deviation : sqrt( variance)
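The three formulas above translate directly into code. A small sketch on a made-up sample; the function name `variance` and the data are illustrative only.

```python
import math
import statistics

def variance(xs, unbiased=True):
    """Sum of squared deviations from the mean, divided by N-1 or N."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    return ss / (n - 1) if unbiased else ss / n

x = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data, mean 5

var_unbiased = variance(x)             # divides by N-1
var_biased = variance(x, unbiased=False)  # divides by N
sd = math.sqrt(var_unbiased)

print(var_unbiased, var_biased, sd)
```

The standard library offers the same quantities as `statistics.variance` (divisor N-1) and `statistics.pvariance` (divisor N).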
Robust measures of scale • The MAD scale estimate generally has very small bias compared with other scale estimators when there is "contamination" in the data. • Tau-estimates and A-estimates also have 50% breakdown, but are more efficient for Gaussian data. • The A-estimate that scale.a computes is redescending, so it is inappropriate if it is necessary that the scale estimate always increase as the size of a datapoint is increased. However, the A-estimate is very good if all of the contamination is far from the "good" data.
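A small experiment illustrates what "contamination" does to a classical and a robust scale estimate. This is a sketch under assumptions: the MAD is scaled by 1.4826 (a common convention that makes it comparable to the standard deviation for Gaussian data), `mad_scale` is a hypothetical helper, and both samples are made up. The tau- and A-estimates are not implemented here.

```python
import statistics

def mad_scale(xs):
    """Median absolute deviation from the median, scaled by 1.4826."""
    med = statistics.median(xs)
    return 1.4826 * statistics.median([abs(x - med) for x in xs])

clean = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
# Same sample with the last value replaced by a gross outlier.
contaminated = clean[:-1] + [100]

for name, xs in (("clean", clean), ("contaminated", contaminated)):
    print(f"{name:13s} sd = {statistics.stdev(xs):7.3f}   MAD = {mad_scale(xs):.4f}")
```

One bad value inflates the standard deviation many times over, while the MAD of these two samples is identical.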
Comparison of scale measures
MAD(corn.yield) = 4.15128
scale.tau(corn.yield) = 4.027753
scale.a(corn.yield) = 4.040902
var(corn.yield) = 19.04191
sqrt(var(corn.yield)) = 4.363703
N.B. A single data set is not enough for a real comparison: the estimators have to be compared across various probability distributions as well as various sample sizes.
Probability • Concept of an experiment on random observables • Sets and events, random variables, probability (a) Set of all basic outcomes = Sample space = S (b) An element of S or a union of elements of S = An event (a singleton event = simple event, otherwise compound) (c) A numerical function that associates each outcome in S with a number (or numbers) = Random variable (d) A map from the set of events into [0,1] obeying certain rules = Probability
Examples of Probability • Consider the toss of a single coin : • A single throw : only two possible outcomes, Head or Tail • Two consecutive throws : four possible outcomes, (Head, Head), (Head, Tail), (Tail, Head), (Tail, Tail) • Unbiased coin : P(Head turns up) = 0.5 • Define the r.v. X by X(Head)=1, X(Tail)=0. Then P(X=1)=0.5, P(X=0)=0.5.
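The coin example can be enumerated directly. A minimal sketch, assuming an unbiased coin so that all outcomes are equally likely; the helper `prob` and the event "at least one head" are illustrative additions, not part of the slide.

```python
from fractions import Fraction
from itertools import product

faces = ("Head", "Tail")

# Sample space for two consecutive throws: all ordered pairs of faces.
two_throws = list(product(faces, repeat=2))

def prob(event, sample_space):
    """P(event) when every outcome in the sample space is equally likely."""
    favourable = [s for s in sample_space if s in event]
    return Fraction(len(favourable), len(sample_space))

# A compound event: at least one head in two throws.
at_least_one_head = {s for s in two_throws if "Head" in s}
p_at_least_one_head = prob(at_least_one_head, two_throws)

# The random variable X from the slide, and its distribution for one throw.
X = {"Head": 1, "Tail": 0}
pmf_X = {}
for face in faces:
    pmf_X[X[face]] = pmf_X.get(X[face], 0) + Fraction(1, len(faces))

print(two_throws)
print("P(at least one head) =", p_at_least_one_head)
print("pmf of X:", pmf_X)
```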
Axioms of Probability • $0 \le P(A) \le 1$ for any event A • $P[A \cup B] = P[A] + P[B]$ if A, B are disjoint sets/events • $P[S] = 1$
Basic Formulae-I • $P[A'] = 1 - P[A]$ • $P[A \cap B] = 0$ if A, B are disjoint • $P[A \cup B] = P[A] + P[B] - P[A \cap B]$ • $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[A \cap B] - P[A \cap C] - P[B \cap C] + P[A \cap B \cap C]$
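These identities can be checked by brute force on a small sample space. The sketch below assumes one roll of a fair die with equally likely outcomes; the events A, B, C are arbitrary choices for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))                  # sample space: one roll of a fair die

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}                         # even number
B = {4, 5, 6}                         # more than 3
C = {2, 3, 5}                         # prime number

# Complement rule.
complement_ok = P(S - A) == 1 - P(A)

# Union of two events.
lhs2 = P(A | B)
rhs2 = P(A) + P(B) - P(A & B)

# Union of three events (inclusion-exclusion).
lhs3 = P(A | B | C)
rhs3 = (P(A) + P(B) + P(C)
        - P(A & B) - P(A & C) - P(B & C)
        + P(A & B & C))

print(complement_ok, lhs2, rhs2, lhs3, rhs3)
```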
Basic Formulae - II • Counting Principle : For an ordered sequence formed by choosing one element from each of N groups $G_1, G_2, \ldots, G_N$ with sizes $k_1, k_2, \ldots, k_N$, the total number of sequences that can be formed is $k_1 \times k_2 \times \cdots \times k_N$. • An ordered sequence of k objects taken from a set of n distinct objects is called a Permutation of size k of the objects; the number of such permutations is denoted by $P_{k,n}$. • For any positive integer m, m! is read as "m-factorial" and defined by $m! = m(m-1)(m-2) \cdots 3 \cdot 2 \cdot 1$. • Any unordered subset of size k from a set of n distinct objects is called a Combination; the number of such combinations is denoted by $C_{k,n}$.
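The definitions above can be checked by listing the sequences and subsets explicitly and counting them. A small sketch with made-up groups and objects; it compares the enumerated counts with the factorial formulas n!/(n-k)! and n!/[k!(n-k)!] given on the next slide and with Python's `math.perm` and `math.comb` (available from Python 3.8).

```python
import math
from itertools import combinations, permutations, product

# Counting principle: one element from each group, sizes 2, 3 and 4.
groups = [["a", "b"], [1, 2, 3], ["w", "x", "y", "z"]]
sequences = list(product(*groups))
n_sequences = len(sequences)          # should equal 2 * 3 * 4

# Permutations and combinations of size k from n distinct objects.
objects = "ABCDE"
n, k = len(objects), 3
n_perms = len(list(permutations(objects, k)))   # ordered
n_combs = len(list(combinations(objects, k)))   # unordered

perm_formula = math.factorial(n) // math.factorial(n - k)
comb_formula = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

print(n_sequences, n_perms, perm_formula, n_combs, comb_formula)
```

Each combination of size 3 corresponds to 3! = 6 permutations, which is why 60 permutations collapse to 10 combinations.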
Basic Formulae-III • $P_{k,n} = n!/(n-k)!$ • $C_{k,n} = n!/[k!(n-k)!]$ • For any two events A and B with P(B) > 0, the Conditional Probability of A given that B has occurred is defined by $P(A|B) = P(A \cap B)/P(B)$ [taken to be 0 if P(B) = 0] • Let A, B be disjoint events with $A \cup B = S$ and P(A), P(B) > 0, and let C be any event. Then $P(C) = P(C|A)P(A) + P(C|B)P(B)$ [Law of Total Probability] • Let A, B be disjoint events with $A \cup B = S$ and P(A), P(B) > 0, and let C be any event with P(C) > 0. Then $P(A|C) = P(C|A)P(A)/[P(C|A)P(A) + P(C|B)P(B)]$ [Bayes Theorem]
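A worked instance of the Law of Total Probability and Bayes Theorem may help. All numbers below are hypothetical: A is "has a condition", B is its complement (so A and B are disjoint and together cover the sample space), and C is "test is positive".

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration.
p_A = Fraction(1, 100)          # P(A): prior probability of the condition
p_B = 1 - p_A                   # P(B), with B the complement of A
p_C_given_A = Fraction(95, 100) # P(C|A): positive test given the condition
p_C_given_B = Fraction(5, 100)  # P(C|B): positive test without the condition

# Law of Total Probability: P(C) = P(C|A)P(A) + P(C|B)P(B)
p_C = p_C_given_A * p_A + p_C_given_B * p_B

# Bayes Theorem: P(A|C) = P(C|A)P(A) / P(C)
p_A_given_C = p_C_given_A * p_A / p_C

print("P(C)   =", p_C, "=", float(p_C))
print("P(A|C) =", p_A_given_C, "=", round(float(p_A_given_C), 4))
```

With these inputs a positive test raises the probability of A from 1% to about 16%, because most positives come from the much larger group B.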
Random Variables - Discrete • A discrete set is a set that is either finite or can be mapped one-to-one into the set of natural numbers (i.e. it is countable). • A discrete random variable is a r.v. which takes values in a discrete set consisting of numbers. • The probability distribution or probability mass function (pmf) of a discrete r.v. X is defined for every number x by $p(x) = P(X = x) = P(\{s \in S : X(s) = x\})$ [P[X=x] is read "the probability that the r.v. X assumes the value x"]. Note: $p(x) \ge 0$, and the sum of p(x) over all possible x is 1.
Cumulative Distribution Function • The cumulative distribution function (cdf) F(x) of a discrete r.v. X with pmf p(x) is defined for every number x by $F(x) = P(X \le x) = \sum_{y : y \le x} p(y)$ • For any number x, F(x) is the probability that the observed value of X will be at most x. • For any two numbers a, b with $a \le b$, $P(a \le X \le b) = F(b) - F(a^-)$, where $a^-$ represents the largest possible X value that is strictly less than a.
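The cdf and the interval formula can be tried out on a small pmf. In this sketch X is assumed to be the number of heads in two tosses of a fair coin; `cdf` and `prob_between` are hypothetical helper names.

```python
from fractions import Fraction

# pmf of X = number of heads in two fair coin tosses (assumed example).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): sum of p(y) over all possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a-), with a- the largest value below a."""
    below_a = [y for y in pmf if y < a]
    f_a_minus = cdf(max(below_a)) if below_a else Fraction(0)
    return cdf(b) - f_a_minus

print("F(1) =", cdf(1))
print("P(1 <= X <= 2) =", prob_between(1, 2))
```

Using F(a) instead of F(a-) would wrongly drop the probability mass sitting at a itself, which matters for discrete variables.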
Operations on RV’s • Expectation of a RV • Expectations of functions of RV’s • Special Cases : Moments, Covariance
Expected Values of Random Variables • Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of X, denoted by E(X) or $\mu_X$, is $E(X) = \mu_X = \sum_{x \in D} x \, p(x)$ • Note that E(X) may not always exist. Consider $p(x) = k/x^2$ on, for example, x = 1, 2, 3, …: then $\sum_x x \, p(x) = k \sum_x 1/x$, which diverges.
Expected Values of functions of Random Variables • Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of f(X), denoted by E(f(X)) or $\mu_{f(X)}$, is $E(f(X)) = \sum_{x \in D} f(x) \, p(x)$ • Example : Variance. $\mathrm{Var}(X) = V(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
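As a quick check of the expectation formula and the two forms of the variance, the sketch below uses a fair six-sided die as an assumed example; `expect` is a hypothetical helper that evaluates the sum of f(x) p(x) over all possible values.

```python
from fractions import Fraction

# pmf of one roll of a fair die (assumed example).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(f, pmf):
    """E(f(X)) = sum over all possible x of f(x) * p(x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)                          # E(X)
var_definition = expect(lambda x: (x - mean) ** 2, pmf)  # E[(X - E(X))^2]
var_shortcut = expect(lambda x: x ** 2, pmf) - mean ** 2 # E(X^2) - [E(X)]^2

print("E(X)   =", mean)
print("Var(X) =", var_definition, "=", var_shortcut)
```

Exact fractions are used so that the two variance expressions agree exactly rather than only up to rounding.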