
ST3905

ST3905. Lecturer : Supratik Roy Email : s.roy@ucc.ie (Unix) : supratik@stat.ucc.ie Phone: ext. 3626. What do we want to do?. What is statistics? Describing Information : Summarization, Visual and non-Visual representation Drawing conclusion from information :


Presentation Transcript


  1. ST3905 Lecturer : Supratik Roy Email : s.roy@ucc.ie (Unix) : supratik@stat.ucc.ie Phone: ext. 3626

  2. What do we want to do? • What is statistics? • Describing information : summarization, visual and non-visual representation • Drawing conclusions from information : managing uncertainty and incompleteness of information

  3. Describing Information • Why summarization of information? • Visual representation (aka graphical Descriptive Statistics) • Non-visual representation (numerical measures) • Classical techniques vs modern IT

  4. Stem and Leaf Plot
     Decimal point is 2 places to the right of the colon
       0 : 8
       1 : 000011122233333333333344444
       1 : 55555566666677777778888888899999999999
       2 : 0000000111111111111222222233333333444444444
       2 : 555556666666666777778889999999999999999
       3 : 000000001111112222333333333444
       3 : 55555555666667777777888888899999999
       4 : 0122234
       4 : 55555678888889
       5 : 111111134
       5 : 555667
       6 : 44
       6 : 7

  5. Pie-Chart

  6. DotChart

  7. Histogram

  8. Histogram-Categorical

  9. Rules for Histograms • Height of Rectangle proportional to frequency of class • No. of classes proportional to sqrt(total no. of observations) [not a hard and fast rule] • In case of categorical data, keep rectangle widths identical, and base of rectangles separate. • Best, if possible, let the software do it.
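
The square-root rule of thumb above can be sketched in a few lines of Python (not the course software). The function names and the sample values are hypothetical, chosen only for illustration, and equal-width classes are an assumption of this sketch.

```python
import math

def sqrt_rule_classes(n):
    """Number of classes suggested by the square-root rule of thumb."""
    return max(1, math.ceil(math.sqrt(n)))

def histogram_counts(data, k):
    """Count observations in k equal-width classes spanning [min, max].

    Assumes max(data) > min(data), so the class width is positive.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum lies on the right edge, so it goes into the last class.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    return counts

# Hypothetical sample, for illustration only.
sample = [2, 3, 3, 5, 7, 8, 8, 9, 10, 12, 13, 13, 14, 15, 18, 20]
k = sqrt_rule_classes(len(sample))
counts = histogram_counts(sample, k)
print(k, counts)
```

The rectangle heights of the histogram would then be proportional to the entries of `counts`. In practice a plotting package chooses the classes, as the last bullet suggests.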

  10. Data
       [1] -0.053626486 -0.828128399  0.214910482  0.346570399
       [5] -0.849316517  0.001077376  0.736191791  1.417540397
       [9] -2.382332275 -2.699019949 -0.111907192  1.384903284
      [13]  2.113286699 -1.828108272 -1.108280724  0.131883612
      [17] -0.394494473  0.829806888  0.023178033  0.019839537
      [21] -0.346280222 -0.251981108  1.159853307 -0.249501904
      [25] -1.342704742 -2.012653224 -1.535503208  0.869806233
      [29] -1.313495887 -0.244408426 -0.998886998 -1.446769605
      [33]  1.224528053 -0.410163230  0.032230907 -0.137297112
      [37] -2.717620031 -0.728570438  0.034697116  2.202863874
      [41] -0.170794163  0.353651680 -0.673296374  3.136364814
      [45] -1.260108638 -0.367334893 -0.652217259 -0.301847039
      [49]  0.315180215  0.190766333

  11. Tabulation

  12. Box-Plot - I

  13. Box Plot – II

  14. Box Plot – III

  15. Non-Visual (numerical measures) • Pictures vs. quantitative measures • Criteria for selection of a measure – purpose of study • Qualities that a measure should have • We live in an uncertain world – chances of error

  16. Measures of Location • Mean : • Mode • Median

  17. Location : mean, median
      Algebra test scores
      No.     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
      Score  43 50 41 69 52 38 51 54 43 47 54 51 70 58 44 54 52 32 42 70
      No.    21 22 23 24 25
      Score  50 49 56 59 38
      Mean = 50.68
      10% trimmed mean of scores = 50.33333
      Median = 51
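
The three location measures on this slide can be recomputed from the 25 scores. This is a sketch in Python rather than the course software. The helper `trimmed_mean` is a hypothetical name, and it assumes the trimming convention of dropping floor(trim x n) observations from each end of the sorted data.

```python
import math
import statistics

# The 25 algebra test scores from the slide above.
scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
          58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38]

def trimmed_mean(data, trim):
    """Mean after dropping floor(trim * n) values from each end of the sorted data."""
    ordered = sorted(data)
    g = math.floor(trim * len(ordered))
    kept = ordered[g:len(ordered) - g]
    return sum(kept) / len(kept)

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
trimmed_score = trimmed_mean(scores, 0.10)
print(mean_score, trimmed_score, median_score)
```

With 25 observations, 10% trimming drops the two smallest and the two largest scores, so the trimmed mean is an average of the middle 21 values.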

  18. Location : Non-classical • An M-estimate of location is a solution mu of the equation sum(psi((y-mu)/s)) = 0. • Data set : car.miles • Bisquare estimate : 204.5395 • Huber's estimate : 204.2571
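
A minimal sketch of how the estimating equation sum(psi((y-mu)/s)) = 0 might be solved numerically with Huber's psi. Several things here are assumptions and not taken from the slide: the tuning constant c = 1.345, holding the scale s fixed at the normalized MAD, the simple fixed-point iteration, and the data (the car.miles data set is not reproduced in this transcript, so a hypothetical sample is used). It is not the routine that produced the figures above.

```python
import statistics

def huber_psi(u, c=1.345):
    """Huber's psi: the identity near zero, clipped to +/-c in the tails."""
    return max(-c, min(c, u))

def huber_location(data, c=1.345, tol=1e-9, max_iter=200):
    """Solve sum(psi((y - mu)/s)) = 0 for mu by fixed-point iteration."""
    med = statistics.median(data)
    # Assumption of this sketch: s is held fixed at the normalized MAD.
    s = 1.4826 * statistics.median([abs(y - med) for y in data])
    mu = med
    for _ in range(max_iter):
        # Step: mu <- mu + s * mean(psi((y - mu)/s)); zero at a solution.
        step = s * sum(huber_psi((y - mu) / s, c) for y in data) / len(data)
        mu += step
        if abs(step) < tol:
            break
    return mu

# Hypothetical sample with one gross outlier.
sample = [10, 11, 12, 13, 14, 100]
mu_hat = huber_location(sample)
print(mu_hat, statistics.mean(sample), statistics.median(sample))
```

On this sample the estimate stays close to the median, while the ordinary mean is pulled far towards the outlier.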

  19. Tabular method of computing

  20. Tabular method of computing

  21. Measures of Scale (aka Dispersion) • Variance (unbiased) : sum((x-mean(x))^2)/(N-1) • Variance (biased) : sum((x-mean(x))^2)/(N) • Standard Deviation : sqrt( variance)
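
The two variance formulas differ only in the divisor. A short Python sketch on a hypothetical sample (the values are made up for illustration), cross-checked against the standard library:

```python
import math
import statistics

# Hypothetical sample, for illustration only.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
xbar = sum(x) / n
ss = sum((xi - xbar) ** 2 for xi in x)  # sum((x-mean(x))^2)

var_unbiased = ss / (n - 1)  # divisor N-1
var_biased = ss / n          # divisor N
sd = math.sqrt(var_unbiased)

print(var_unbiased, var_biased, sd)
# statistics.variance uses N-1, statistics.pvariance uses N.
print(statistics.variance(x), statistics.pvariance(x))
```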

  22. Tabular method of computing

  23. Robust measures of scale • The MAD scale estimate generally has very small bias compared with other scale estimators when there is "contamination" in the data. • Tau-estimates and A-estimates also have 50% breakdown, but are more efficient for Gaussian data. • The A-estimate that scale.a computes is redescending, so it is inappropriate if it is necessary that the scale estimate always increase as the size of a data point is increased. However, the A-estimate is very good if all of the contamination is far from the "good" data.
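
To illustrate the first bullet, the sketch below compares the MAD with the standard deviation before and after a single observation is replaced by a gross outlier. The function `mad_scale`, the data, and the constant 1.4826 (the usual factor that makes the MAD comparable to the standard deviation for Gaussian data) are assumptions of this sketch. Tau- and A-estimates are not implemented here.

```python
import statistics

def mad_scale(data, constant=1.4826):
    """Median absolute deviation about the median, rescaled by `constant`."""
    med = statistics.median(data)
    return constant * statistics.median([abs(v - med) for v in data])

# Hypothetical "good" data, and the same data with one value replaced by an outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean[:-1] + [100.0]

print(mad_scale(clean), statistics.stdev(clean))
print(mad_scale(contaminated), statistics.stdev(contaminated))
```

The MAD is unchanged by the contamination in this example, whereas the standard deviation grows by more than two orders of magnitude.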

  24. Comparison of scale measures
      MAD(corn.yield) = 4.15128
      scale.tau(corn.yield) = 4.027753
      scale.a(corn.yield) = 4.040902
      var(corn.yield) = 19.04191
      sqrt(var(corn.yield)) = 4.363703
      N.B. To really compare you have to compare for various probability distributions as well as various sample sizes.

  25. Probability • Concept of an Experiment on Random observables • Sets and Events, Random variables, Probability (a) Set of all basic outcomes = Sample space = S (b) An element of S or a union of elements of S = An event (a singleton event = simple event, otherwise compound) (c) A numerical function that associates each outcome in S with a number (or numbers) = Random Variable (d) A map from the set of events into [0,1] obeying certain rules = Probability

  26. Examples of Probability • Consider toss of single coin : • A single throw : Only two possible outcomes – Head or Tail • Two consecutive throws : Four possible outcomes – (Head, Head), (Head, Tail), (Tail, Head), (Tail, Tail) • Unbiased coin : P(Head turns up) = 0.5 • Define R.V. X to be X(Head)=1, X(Tail)=0. P(X=1)=0.5, P(X=0)=0.5.
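
The coin example can be written out explicitly. In this Python sketch the probabilities for two throws assume that the throws are independent, which the slide does not state; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = ("Head", "Tail")
p_face = {"Head": Fraction(1, 2), "Tail": Fraction(1, 2)}  # unbiased coin
X = {"Head": 1, "Tail": 0}  # the random variable X from the slide

# pmf of X for a single throw: P(X = 1) and P(X = 0).
pmf_X = {}
for face in faces:
    pmf_X[X[face]] = pmf_X.get(X[face], Fraction(0)) + p_face[face]

# Sample space for two consecutive throws: four ordered pairs.
two_throws = list(product(faces, repeat=2))
# Assumption: the two throws are independent, so probabilities multiply.
p_two = {pair: p_face[pair[0]] * p_face[pair[1]] for pair in two_throws}

print(pmf_X)
print(two_throws)
```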

  27. Axioms of Probability • $0 \le P(A) \le 1$ for any event A • $P[A \cup B] = P[A] + P[B]$ if A, B are disjoint sets/events • $P[S] = 1$

  28. Basic Formulae-I • $P[A'] = 1 - P[A]$ • $P[A \cap B] = 0$ if A, B are disjoint • $P[A \cup B] = P[A] + P[B] - P[A \cap B]$ • $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[A \cap B] - P[A \cap C] - P[B \cap C] + P[A \cap B \cap C]$
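
These identities can be checked on a small finite sample space. The sketch below uses one roll of a fair die with equally likely outcomes; the events A, B and C are hypothetical choices made only for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # one roll of a fair die

def P(event):
    """Probability of an event under equally likely outcomes in S."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})  # even number
B = frozenset({4, 5, 6})  # greater than 3
C = frozenset({3, 6})     # multiple of 3

# Complement rule.
complement_ok = P(S - A) == 1 - P(A)

# Union of two events.
lhs2 = P(A | B)
rhs2 = P(A) + P(B) - P(A & B)

# Union of three events (inclusion-exclusion).
lhs3 = P(A | B | C)
rhs3 = (P(A) + P(B) + P(C)
        - P(A & B) - P(A & C) - P(B & C)
        + P(A & B & C))

print(complement_ok, lhs2, rhs2, lhs3, rhs3)
```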

  29. Basic Formulae - II • Counting Principle : For an ordered sequence to be formed by taking one element from each of N groups G1, G2, …, GN with sizes k1, k2, …, kN, the total no. of sequences that can be formed is k1 x k2 x … x kN. • An ordered sequence of k objects taken from a set of n distinct objects is called a Permutation of size k of the objects; the number of such permutations is denoted by Pk,n. • For any positive integer m, m! is read as “m-factorial” and defined by m! = m(m-1)(m-2)…(3)(2)(1) • Any unordered subset of size k from a set of n distinct objects is called a Combination; the number of such combinations is denoted Ck,n.
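
A brute-force check of these counting rules, sketched in Python. The groups and the choice n = 5, k = 3 are hypothetical. `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math
from itertools import combinations, permutations, product

# Counting principle: one element from each of three hypothetical groups.
groups = [["a", "b"], [1, 2, 3], ["w", "x", "y", "z"]]
n_sequences = len(list(product(*groups)))  # 2 x 3 x 4

# Permutations and combinations of size k from n distinct objects.
n, k = 5, 3
objects = range(n)
n_perm = len(list(permutations(objects, k)))  # ordered selections
n_comb = len(list(combinations(objects, k)))  # unordered selections

print(n_sequences, n_perm, n_comb)
print(math.perm(n, k), math.comb(n, k))
```

Each unordered subset of size k can be ordered in k! ways, which is why the permutation count is k! times the combination count.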

  30. Basic Formulae-III • Pk,n = n!/(n-k)! • Ck,n = n!/[k!(n-k)!] • For any two events A and B with P(B) > 0, the Conditional Probability of A given that B has occurred is defined by $P(A|B) = P(A \cap B)/P(B)$ [by convention, taken as 0 if P(B) = 0] • Let A, B be disjoint events with $A \cup B = S$ and P(A), P(B) > 0, and let C be any event. Then P(C) = P(C|A)P(A) + P(C|B)P(B) [Law of Total Probability] • Under the same conditions, if P(C) > 0, then P(A|C) = P(C|A)P(A)/[P(C|A)P(A) + P(C|B)P(B)] [Bayes Theorem]
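
A worked instance of the Law of Total Probability and Bayes Theorem, as a sketch with made-up numbers. It assumes that A and B are disjoint and together make up the whole sample space (here B is the complement of A), so that the two conditional terms account for all of C. The screening-test interpretation is hypothetical.

```python
from fractions import Fraction

# Hypothetical inputs: A = "has the condition", B = "does not" (complement of A),
# C = "test is positive".
p_A = Fraction(1, 100)
p_B = 1 - p_A
p_C_given_A = Fraction(19, 20)
p_C_given_B = Fraction(1, 20)

# Law of Total Probability.
p_C = p_C_given_A * p_A + p_C_given_B * p_B

# Bayes Theorem.
p_A_given_C = p_C_given_A * p_A / p_C

print(p_C, p_A_given_C, float(p_A_given_C))
```

Even with a fairly accurate test, the posterior probability P(A|C) stays small here because the prior P(A) is small.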

  31. Random Variables - Discrete • A discrete set is a set that is either finite or countably infinite, i.e. there exists a one-to-one map from the set into the set of natural numbers. • A discrete random variable is a r.v. which takes values in a discrete set consisting of numbers. • The probability distribution or probability mass function (pmf) of a discrete r.v. X is defined for every number x by $p(x) = P(X = x) = P(\{s \in S : X(s) = x\})$ [P(X = x) is read “the probability that the r.v. X assumes the value x”]. Note: $p(x) \ge 0$, and the sum of p(x) over all possible x is 1.

  32. Cumulative Distribution Function • The cumulative distribution function (cdf) F(x) of a discrete r.v. X with pmf p(x) is defined for every number x by $F(x) = P(X \le x) = \sum_{y : y \le x} p(y)$ • For any number x, F(x) is the probability that the observed value of X will be at most x. • For any two numbers a, b with $a \le b$, $P(a \le X \le b) = F(b) - F(a^-)$, where $a^-$ represents the largest possible X value that is strictly less than a.
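
A small sketch of the cdf and of the interval formula for a discrete r.v. The pmf used is a hypothetical example (the number of heads in three throws of a fair coin), and the function names are illustrative.

```python
from fractions import Fraction

# Hypothetical pmf: number of heads in three throws of a fair coin.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cdf(x):
    """F(x) = P(X <= x): sum of p(y) over possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a-), with a- the largest possible value below a."""
    below = [y for y in pmf if y < a]
    f_a_minus = cdf(max(below)) if below else Fraction(0)
    return cdf(b) - f_a_minus

print(cdf(1), cdf(3), prob_between(1, 2))
```

Using F(a-) rather than F(a) keeps the probability at a itself inside the interval, which matters for discrete variables.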

  33. Operations on RV’s • Expectation of a RV • Expectations of functions of RV’s • Special Cases : Moments, Covariance

  34. Expected Values of Random Variables • Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of X, denoted by E(X) or $\mu_X$, is $E(X) = \mu_X = \sum_{x \in D} x \cdot p(x)$ • Note that E(X) may not always exist. Consider $p(x) = k/x^2$
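
One way to see why the expectation fails to exist in this example, assuming (the slide does not say so) that X takes the values x = 1, 2, 3, … and that k is the normalising constant:

```latex
\sum_{x=1}^{\infty} \frac{k}{x^{2}} = k \cdot \frac{\pi^{2}}{6} = 1
\;\Longrightarrow\; k = \frac{6}{\pi^{2}},
\qquad
E(X) = \sum_{x=1}^{\infty} x \cdot \frac{k}{x^{2}}
     = k \sum_{x=1}^{\infty} \frac{1}{x} = \infty .
```

So p(x) is a valid pmf, but the series defining E(X) is a multiple of the harmonic series and diverges.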

  35. Expected Values of functions of Random Variables • Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of f(X), denoted by E(f(X)) or $\mu_{f(X)}$, is $E(f(X)) = \sum_{x \in D} f(x) \cdot p(x)$ • Example : Variance. $Var(X) = V(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
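
The expectation of a function of X, and the two equivalent expressions for the variance, can be checked numerically. The pmf below is a hypothetical example (number of heads in three throws of a fair coin), and `expectation` is an illustrative helper name.

```python
from fractions import Fraction

# Hypothetical pmf: number of heads in three throws of a fair coin.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(pmf, f=lambda x: x):
    """E(f(X)) = sum over x in D of f(x) * p(x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean_x = expectation(pmf)
# Variance from the definition E[(X - E(X))^2] ...
var_def = expectation(pmf, lambda x: (x - mean_x) ** 2)
# ... and from the shortcut E(X^2) - [E(X)]^2.
var_short = expectation(pmf, lambda x: x ** 2) - mean_x ** 2

print(mean_x, var_def, var_short)
```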

  36. Random Variables - Continuous

  37. Joint distribution of >1 RV’s

  38. Gaussian or Normal Distribution

  39. Sample as Random Observables

  40. Parametric Inference

  41. Tests of Hypothesis

  42. Hypothesis Tests for Normal Population
