290 likes | 396 Views
Random Variables and Distributions. COMP5318 Knowledge Discovery and Data Mining. Examples. Examples. We have heard of statements like “ Height is Normally Distributed ”. Standard deviation. mean. Why distributions are important.
E N D
Random Variables and Distributions COMP5318 Knowledge Discovery and Data Mining
Examples • We have heard of statements like “Height is Normally Distributed” Standard deviation mean
Why distributions are important • Distribution capture the essence of data associated with a particular variable(s) (e.g., height). • If we know height is Normally distributed then a small random sample is enough to provide a very good idea about the general population. • Can answer questions like: what is the probability of finding a 2 meter tall Australian? • Need to understand the concept of random variable.
Random Variable • Let S be the sample space. • A random variable X is a function X: SReal Suppose we toss a coin twice. Let X be the random variable number of heads
Random Variable(Number of Heads in two coin tosses) We also associate a probability with X attaining that value.
Random Variables follow a Distribution • The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm. • The frequency of words in a text is a random variable which follows a Zipf distribution. • The speed of a hurricane is a random variable which follows a Cauchy distribution. • The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution. • The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution. • The number of web hits in a given time period is a r.v. which follows a Pareto distribution. • Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!
Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function
Discrete Distribution • A r.v. X is discrete if it takes countably many values {x1,x2,….} • The probability function or probability mass function for X is given by • fX(x)= P(X=x) • From previous example
Continuous Distributions • A r.v. X is continuous if there exists a function fX such that
Example: Continuous Distribution • Suppose X has the pdf • This is the Uniform (0,1) distribution
Binomial Distribution • A coin flips Heads with probability p. Flip it n times and let X be the number of Heads. Assume flips are independent. • Let f(x) =P(X=x), then
Binomial Example • Let p =0.5; n = 5 then • In Matlab >>binopdf(4,5,0.5)
Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if • X is standard Normal if μ =0 and σ =1. It is denoted as Z. • If X ~ N(μ, σ2) then
Normal Example • The number of spam emails received by a email server in a day follows a Normal Distribution N(1000,500). What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000)? • The answer is P(X=2000) = 0; • It is more meaningful to ask P(X >= 2000);
Normal Example • This is • In Matlab: >> 1 –normcdf(2000,1000,500) • The answer is 1 – 0.9772 = 0.0228 or 2.28% • This type of analysis is so common that there is a special name for it: cumulative distribution function F.
Outliers • In data mining we are often interested in outliers • especially in high dimensional data which we cannot easily visualize • A knowledge of distributions can be very useful in this context. • Lets see how?
Outliers in Normal Distribution • Conventionally something is considered an outlier if it is at least three standard deviations away from the mean: • Lets assume we have a standard Normal Distribution: N(0,1) • We want P(X < -3) + P(X >3) • = normcdf(-3,0,1) + 1 – normcdf(3,0,1)=0.0027
Outliers using Univariate Normal Distribution • Typically we are given data and we want to find outliers in the data –if any. • Here are the steps: • Make the assumption that the data come from a Normal distribution. • Estimate the parameters of the Normal distribution. • Find all data points which are more than three standard deviations away from the mean.
Outliers in Multidimensional Data • Recall, in the Iris data, we have four attributes and one class label. • This is an example of multidimensional data set. • Look at the exponent of the Normal distribution. • This is the square of the distance from a point x to the mean μ in units of standard deviation σ
Outliers in Multidimensional Data • In multidimensional data this can be generalized to: • This is called the Mahalanobis Distance (squared) • Σ is d x d matrix called the variance-covariance matrix
Variance-Covariance Matrix If the Data set is an N x d matrix then
In Matlab • Suppose we generate a random 100x5 data >> data = rand(100,5); • The covariance matrix is >>cv =cov(data) 0.0998 -0.0022 0.0006 -0.0080 -0.0025 -0.0022 0.0933 -0.0051 -0.0100 -0.0010 0.0006 -0.0051 0.0810 -0.0085 0.0083 -0.0080 -0.0100 -0.0085 0.0820 0.0071 -0.0025 -0.0010 0.0083 0.0071 0.0859
Distribution of Mahalanobis Distance • It turns out that if an N x d data set A if from a multivariate Normal Distribution then the Mahalanobis distance follows a a Chi-Square distribution with d degrees of freedom.
Chi-Square Distribution Curse of dimensionality
Algorithm for Finding Outliers >>chi2inv(.975,d)
Homework • Define first, second, third quantile in terms of cumulative distribution function? • Use that to understand the previous algorithm. • Start looking up Matlab help files in the Statistics toolbox. • Also, figure out what is the meaning of “estimating the parameter of a distribution from data”.