
Random Variables and Distributions



Presentation Transcript


  1. Random Variables and Distributions COMP5318 Knowledge Discovery and Data Mining

  2. Examples

  3. Examples • We have heard statements like “Height is Normally Distributed.” [Figure: Normal curve annotated with its mean and standard deviation]

  4. Why distributions are important • A distribution captures the essence of the data associated with one or more variables (e.g., height). • If we know that height is Normally distributed, then a small random sample is enough to estimate its mean and standard deviation, which together describe the general population well. • We can then answer questions like: what is the probability of finding a 2-meter-tall Australian? • To do this, we need to understand the concept of a random variable.

  5. Random Variable • Let S be the sample space. • A random variable X is a function $X: S \to \mathbb{R}$ that assigns a real number to each outcome in S. • Example: suppose we toss a coin twice, and let X be the number of heads.

  6. Random Variable (Number of Heads in two coin tosses) • We also associate a probability with X attaining each value.

  7. Random Variable (Number of Heads in two coin tosses)
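
The two-toss example can be sketched in Python as follows. This is a minimal illustration, not the slide's own table: it assumes a fair coin (the slides do not state the bias), and the names `sample_space`, `number_of_heads`, and `pmf` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space S for two coin tosses: every ordered pair of H/T.
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]


def number_of_heads(outcome: str) -> int:
    """The random variable X: maps an outcome in S to a real number."""
    return outcome.count("H")


# Assumption: a fair coin, so each of the 4 outcomes has probability 1/4.
outcome_probability = Fraction(1, len(sample_space))

# Probability that X attains each value: sum over the outcomes mapped to it.
pmf = {}
for outcome in sample_space:
    x = number_of_heads(outcome)
    pmf[x] = pmf.get(x, Fraction(0)) + outcome_probability

for x in sorted(pmf):
    print(f"P(X = {x}) = {pmf[x]}")
```

Under the fair-coin assumption, the value 1 collects two outcomes (HT and TH), which is why it is twice as likely as 0 or 2.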

  8. Random Variables follow a Distribution • The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm. • The frequency of words in a text is a random variable which follows a Zipf distribution. • The speed of a hurricane is a random variable which follows a Cauchy distribution. • The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution. • The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution. • The number of web hits in a given time period is a r.v. which follows a Pareto distribution. • Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!

  9. Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function

  10. Discrete Distribution • A r.v. X is discrete if it takes countably many values $\{x_1, x_2, \ldots\}$. • The probability function, or probability mass function, for X is given by $f_X(x) = P(X = x)$. • In the previous example (number of heads in two coin tosses), $f_X$ is nonzero only at x = 0, 1, 2.

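  11. Continuous Distributions • A r.v. X is continuous if there exists a function $f_X$ such that $f_X(x) \ge 0$ for all x, $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, and, for every $a \le b$, $P(a < X < b) = \int_a^b f_X(x)\,dx$. • $f_X$ is called the probability density function (pdf).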

  12. Example: Continuous Distribution • Suppose X has the pdf $f_X(x) = 1$ for $0 \le x \le 1$ and $f_X(x) = 0$ otherwise. • This is the Uniform(0,1) distribution.
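
As a rough numerical check of the Uniform(0,1) example, the sketch below integrates the pdf with a simple midpoint rule. The helper names `uniform_pdf` and `integrate` are hypothetical, and the step count is an arbitrary choice.

```python
def uniform_pdf(x: float) -> float:
    """pdf of Uniform(0, 1): 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# The pdf should integrate to (approximately) 1 over the whole real line;
# integrating over [-1, 2] is enough because the pdf is 0 outside [0, 1].
total = integrate(uniform_pdf, -1.0, 2.0)

# P(0.2 < X < 0.5) is the area under the pdf between 0.2 and 0.5, about 0.3.
p_interval = integrate(uniform_pdf, 0.2, 0.5)

print(total, p_interval)
```

For a continuous r.v., probabilities are areas under the pdf, so for Uniform(0,1) the probability of an interval inside [0, 1] is just its length.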

  13. Binomial Distribution • A coin lands Heads with probability p. Flip it n times and let X be the number of Heads. Assume the flips are independent. • Let f(x) = P(X = x); then $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for x = 0, 1, ..., n, and f(x) = 0 otherwise.

  14. Binomial Example • Let p = 0.5 and n = 5; then $P(X = 4) = \binom{5}{4} (0.5)^4 (0.5)^1 = 5/32 \approx 0.1563$. • In Matlab: >> binopdf(4,5,0.5)
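
For readers without Matlab, a Python sketch of the same calculation follows. The function name `binomial_pmf` is hypothetical, and its argument order (x, n, p) is chosen to mirror `binopdf(4,5,0.5)`.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p); zero outside 0..n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Same arguments as the Matlab call binopdf(4, 5, 0.5).
print(binomial_pmf(4, 5, 0.5))  # 5 * 0.5**5 = 0.15625

# Sanity check: the pmf sums to 1 over x = 0..n.
pmf_total = sum(binomial_pmf(x, 5, 0.5) for x in range(6))
```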

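  15. Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if its pdf is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. • X is standard Normal if μ = 0 and σ = 1; a standard Normal r.v. is denoted Z. • If $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$.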
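
The sketch below evaluates the Normal pdf by hand and checks the standardization property numerically, using Python's standard-library `statistics.NormalDist`. The values μ = 180 and σ = 15 are borrowed from the height example on slide 8; the query point x = 200 and the function name `normal_pdf` are arbitrary illustrative choices.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


mu, sigma, x = 180.0, 15.0, 200.0  # illustrative values (heights in cm)

# Standardize: z measures how many standard deviations x is from the mean.
z = (x - mu) / sigma

# Standardizing leaves probabilities unchanged: P(X <= x) = P(Z <= z).
p_x = NormalDist(mu, sigma).cdf(x)
p_z = NormalDist().cdf(z)

print(normal_pdf(x, mu, sigma), p_x, p_z)
```

This is why tables and functions for the standard Normal Z are enough to answer questions about any Normal r.v.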

  16. Normal Example • The number of spam emails received by an email server in a day follows a Normal distribution with mean 1000 and standard deviation 500. What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000). • The answer is P(X = 2000) = 0, because a continuous r.v. takes any single value with probability zero. • It is more meaningful to ask for P(X ≥ 2000).

  17. Normal Example • This is $P(X \ge 2000) = 1 - P(X < 2000)$. • In Matlab: >> 1 - normcdf(2000,1000,500) • The answer is 1 - 0.9772 = 0.0228, or 2.28%. • This type of calculation is so common that the function involved has a special name: the cumulative distribution function, $F(x) = P(X \le x)$.
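
A Python equivalent of the Matlab call, sketched with the standard-library `statistics.NormalDist`. As in `normcdf(2000,1000,500)`, the value 500 is treated as the standard deviation; the variable names are hypothetical.

```python
from statistics import NormalDist

# Mean 1000 and standard deviation 500, matching normcdf(2000, 1000, 500).
spam = NormalDist(mu=1000, sigma=500)

p_below = spam.cdf(2000)   # F(2000) = P(X <= 2000)
p_at_least = 1 - p_below   # P(X >= 2000)

print(round(p_below, 4), round(p_at_least, 4))  # 0.9772 0.0228
```

Because 2000 is exactly two standard deviations above the mean, the same answer comes from the standard Normal: 1 - F_Z(2).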

  18. Outliers • In data mining we are often interested in outliers, • especially in high-dimensional data, which we cannot easily visualize. • Knowledge of distributions can be very useful in this context. • Let's see how.

  19. Outliers in Normal Distribution • Conventionally, a point is considered an outlier if it is at least three standard deviations away from the mean. • Let's assume we have a standard Normal distribution, N(0,1). • We want P(X < -3) + P(X > 3) • = normcdf(-3,0,1) + 1 - normcdf(3,0,1) = 0.0027
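
The same two-tailed probability can be reproduced in Python with the standard-library `statistics.NormalDist`; this mirrors the Matlab expression term by term, and the variable names are hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal, N(0, 1)

# P(X < -3) + P(X > 3): mass more than three standard deviations out.
p_outlier = z.cdf(-3) + (1 - z.cdf(3))

print(round(p_outlier, 4))  # 0.0027
```

In other words, under a Normal model only about 27 points in 10,000 are expected to fall outside three standard deviations.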

  20. Outliers using Univariate Normal Distribution • Typically we are given data and want to find the outliers in it, if any. • Here are the steps: (1) Assume that the data come from a Normal distribution. (2) Estimate the parameters (mean and standard deviation) of that Normal distribution. (3) Flag all data points that are more than three standard deviations away from the mean.
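
The steps above can be sketched in Python as follows. The function name `three_sigma_outliers` and the `heights` data are made up for illustration, and the sample standard deviation (N - 1 denominator) is an assumption about how the parameters are estimated.

```python
from statistics import mean, stdev


def three_sigma_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu = mean(data)      # step 2: estimate the mean
    sigma = stdev(data)  # step 2: estimate the (sample) standard deviation
    # Step 3: keep the points that lie too far from the mean.
    return [x for x in data if abs(x - mu) > threshold * sigma]


# Hypothetical heights in cm; the last value is deliberately extreme.
heights = [178, 182, 175, 169, 190, 185, 172, 181, 177, 183,
           179, 186, 174, 180, 176, 184, 171, 188, 173, 250]

print(three_sigma_outliers(heights))  # [250]
```

One caveat of this simple rule is that the outlier itself inflates the estimated standard deviation, so in very small samples no point can ever exceed the three-sigma threshold.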

  21. Outliers in Multidimensional Data • Recall that the Iris data has four attributes and one class label. • This is an example of a multidimensional data set. • Look at the exponent of the Normal distribution: $-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$. • The term $\left(\frac{x-\mu}{\sigma}\right)^2$ is the square of the distance from a point x to the mean μ, in units of the standard deviation σ.

  22. Outliers in Multidimensional Data • In multidimensional data this can be generalized to $(x-\mu)^T \Sigma^{-1} (x-\mu)$. • This is called the (squared) Mahalanobis distance. • Σ is a d × d matrix called the variance-covariance matrix.

  23. Variance-Covariance Matrix • If the data set is an N × d matrix, then Σ is the d × d matrix whose (j, k) entry is the covariance between attributes j and k.

  24. In Matlab • Suppose we generate random 100x5 data: >> data = rand(100,5); • The covariance matrix is: >> cv = cov(data)
       0.0998  -0.0022   0.0006  -0.0080  -0.0025
      -0.0022   0.0933  -0.0051  -0.0100  -0.0010
       0.0006  -0.0051   0.0810  -0.0085   0.0083
      -0.0080  -0.0100  -0.0085   0.0820   0.0071
      -0.0025  -0.0010   0.0083   0.0071   0.0859
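
A pure-Python sketch of the same computation is given below. It normalizes by N - 1, which is Matlab's default for `cov`; the function name `covariance_matrix` and the random seed are arbitrary, so the numbers will not match the Matlab output above.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix (N - 1 normalization) of an N x d data set."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j, d):
            s = sum((row[j] - means[j]) * (row[k] - means[k]) for row in rows)
            cov[j][k] = cov[k][j] = s / (n - 1)  # the matrix is symmetric
    return cov


random.seed(0)  # arbitrary seed, only for repeatability
data = [[random.random() for _ in range(5)] for _ in range(100)]
cv = covariance_matrix(data)

for row in cv:
    print("  ".join(f"{value:8.4f}" for value in row))
```

The diagonal entries are the variances of the individual columns. For uniform random numbers on (0, 1) they should be near 1/12 ≈ 0.083, as in the Matlab output, while the off-diagonal covariances should be near zero because the columns are generated independently.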

  25. Intuitive: Mahalanobis Distance

  26. Distribution of Mahalanobis Distance • It turns out that if an N x d data set A is from a multivariate Normal distribution, then the squared Mahalanobis distance follows a Chi-Square distribution with d degrees of freedom.

  27. Chi-Square Distribution • Curse of dimensionality

  28. Algorithm for Finding Outliers • In Matlab: >> chi2inv(0.975, d)
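
One way the chi-square cutoff could be used is sketched below. The assumed procedure is: estimate the mean and covariance, compute each point's squared Mahalanobis distance, and flag the points above the 0.975 chi-square quantile. This is an assumption, not necessarily the slide's exact algorithm. To stay within the Python standard library the sketch is restricted to d = 2, where the chi-square quantile has the closed form -2 ln(1 - p); for general d a routine such as Matlab's `chi2inv` is needed. All function names and the synthetic data are hypothetical.

```python
import math
import random


def chi2_inv_2dof(p: float) -> float:
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # (x - mu)^T Sigma^{-1} (x - mu), with the 2x2 inverse written out.
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in points
    ]


random.seed(1)  # arbitrary seed, only for repeatability
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points.append((10.0, 10.0))  # synthetic point placed far from the rest

cutoff = chi2_inv_2dof(0.975)  # plays the role of chi2inv(0.975, 2)
distances = mahalanobis_sq_2d(points)
flagged = [i for i, d2 in enumerate(distances) if d2 > cutoff]

print(round(cutoff, 4), flagged)
```

With a 0.975 cutoff, roughly 2.5% of genuinely Normal points are expected to be flagged as well, so the flagged list is a set of candidates rather than a verdict.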

  29. Homework • Define the first, second, and third quartiles in terms of the cumulative distribution function. • Use that to understand the previous algorithm. • Start looking up the Matlab help files for the Statistics toolbox. • Also, figure out what “estimating the parameters of a distribution from data” means.
