Random Variables and Distributions

Random Variables and Distributions COMP5318 Knowledge Discovery and Data Mining

Examples

Examples • We have heard of statements like “Height is Normally Distributed” Standard deviation mean

Why distributions are important • Distribution capture the essence of data associated with a particular variable(s) (e.g., height). • If we know height is Normally distributed then a small random sample is enough to provide a very good idea about the general population. • Can answer questions like: what is the probability of finding a 2 meter tall Australian? • Need to understand the concept of random variable.

Random Variable • Let S be the sample space. • A random variable X is a function X: SReal Suppose we toss a coin twice. Let X be the random variable number of heads

Random Variable(Number of Heads in two coin tosses) We also associate a probability with X attaining that value.

Random Variable(Number of Heads in two coin tosses)

Random Variables follow a Distribution • The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm. • The frequency of words in a text is a random variable which follows a Zipf distribution. • The speed of a hurricane is a random variable which follows a Cauchy distribution. • The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution. • The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution. • The number of web hits in a given time period is a r.v. which follows a Pareto distribution. • Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!

Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function

Discrete Distribution • A r.v. X is discrete if it takes countably many values {x1,x2,….} • The probability function or probability mass function for X is given by • fX(x)= P(X=x) • From previous example

Continuous Distributions • A r.v. X is continuous if there exists a function fX such that

Example: Continuous Distribution • Suppose X has the pdf • This is the Uniform (0,1) distribution

Binomial Distribution • A coin flips Heads with probability p. Flip it n times and let X be the number of Heads. Assume flips are independent. • Let f(x) =P(X=x), then

Binomial Example • Let p =0.5; n = 5 then • In Matlab >>binopdf(4,5,0.5)

Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if • X is standard Normal if μ =0 and σ =1. It is denoted as Z. • If X ~ N(μ, σ2) then

Normal Example • The number of spam emails received by a email server in a day follows a Normal Distribution N(1000,500). What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000)? • The answer is P(X=2000) = 0; • It is more meaningful to ask P(X >= 2000);

Normal Example • This is • In Matlab: >> 1 –normcdf(2000,1000,500) • The answer is 1 – 0.9772 = 0.0228 or 2.28% • This type of analysis is so common that there is a special name for it: cumulative distribution function F.

Outliers • In data mining we are often interested in outliers • especially in high dimensional data which we cannot easily visualize • A knowledge of distributions can be very useful in this context. • Lets see how?

Outliers in Normal Distribution • Conventionally something is considered an outlier if it is at least three standard deviations away from the mean: • Lets assume we have a standard Normal Distribution: N(0,1) • We want P(X < -3) + P(X >3) • = normcdf(-3,0,1) + 1 – normcdf(3,0,1)=0.0027

Outliers using Univariate Normal Distribution • Typically we are given data and we want to find outliers in the data –if any. • Here are the steps: • Make the assumption that the data come from a Normal distribution. • Estimate the parameters of the Normal distribution. • Find all data points which are more than three standard deviations away from the mean.

Outliers in Multidimensional Data • Recall, in the Iris data, we have four attributes and one class label. • This is an example of multidimensional data set. • Look at the exponent of the Normal distribution. • This is the square of the distance from a point x to the mean μ in units of standard deviation σ

Outliers in Multidimensional Data • In multidimensional data this can be generalized to: • This is called the Mahalanobis Distance (squared) • Σ is d x d matrix called the variance-covariance matrix

Variance-Covariance Matrix If the Data set is an N x d matrix then

In Matlab • Suppose we generate a random 100x5 data >> data = rand(100,5); • The covariance matrix is >>cv =cov(data) 0.0998 -0.0022 0.0006 -0.0080 -0.0025 -0.0022 0.0933 -0.0051 -0.0100 -0.0010 0.0006 -0.0051 0.0810 -0.0085 0.0083 -0.0080 -0.0100 -0.0085 0.0820 0.0071 -0.0025 -0.0010 0.0083 0.0071 0.0859

Intuitive: Mahalanobis Distance

Distribution of Mahalanobis Distance • It turns out that if an N x d data set A if from a multivariate Normal Distribution then the Mahalanobis distance follows a a Chi-Square distribution with d degrees of freedom.

Chi-Square Distribution Curse of dimensionality

Algorithm for Finding Outliers >>chi2inv(.975,d)

Homework • Define first, second, third quantile in terms of cumulative distribution function? • Use that to understand the previous algorithm. • Start looking up Matlab help files in the Statistics toolbox. • Also, figure out what is the meaning of “estimating the parameter of a distribution from data”.

Random Variables and Distributions

Random Variables and Distributions

Presentation Transcript

Discrete Random Variables and Probability Distributions

Random Variables and Discrete probability Distributions

Random Variables and Discrete probability Distributions

Random Variables and Probability Distributions

Random Variables and Probability Distributions

Random Variables and Probability Distributions

Discrete Random Variables and Probability Distributions

Random variables, distributions and limit theorems

Random Variables and Probability Distributions

Probability Theory Random Variables and Distributions

Discrete Random Variables and Probability Distributions

STATISTICS Random Variables and Probability Distributions

Random Variables and Probability Distributions

Random Variables and Probability Distributions

2. Discrete Random Variables and Distributions

Random Variables and Probability Distributions

Random Variables and Distributions

Continuous Random Variables and Probability Distributions

Random Variables and Probability Distributions

Random Variables and Probability Distributions