1. Probability and distributions
2. Outline Random sampling
Probability calculations and combinatorics
Discrete distributions
Continuous distributions
The Central limit theorem
Exploratory data analysis
3. Random sampling – The sample function Simulating a random experiment with the sample function.
Choose 5 numbers at random from the set {1,2,…, 40}
> sample(1:40,5)
[1] 7 24 35 5 1
Note: the default behaviour of sample is sampling without replacement;
use the argument replace = TRUE if you want sampling with replacement.
Simulating 10 coin tosses
> sample(c("H","T"),10,replace=T)
[1] "T" "T" "T" "T" "H" "H" "H" "T" "H" "H"
Simulating with specified (unequal) probabilities
> sample(c("H","T"),10,replace=T,prob = c(0.9,0.1))
[1] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
4. Random sampling – The sample function # Roll a die
> sample(1:6,10,replace=TRUE)
[1] 5 1 5 3 3 4 5 4 2 1 # no sixes!
## pick 6 of 54 (a lottery)
> sample(1:54,6) # no replacement
[1] 6 39 23 35 25 26
## pick a card. (Fancy! Uses paste, rep)
> cards = paste(rep(c("A",2:10,"J","Q","K"),4),c("H","D","S","C"))
> sample(cards,5) # a pair of jacks, no replacement
[1] "J D" "5 C" "A S" "2 D" "J H"
## roll 2 dice. Even fancier
> dice = as.vector(outer(1:6,1:6,paste))
> sample(dice,5,replace=TRUE) # replace when rolling dice
[1] "1 1" "4 1" "6 3" "4 4" "2 6"
5. Probability calculations and combinatorics Factorials and permutations: prod(x), factorial(x)
Combinations: choose(n,x)
Example: a printed circuit board has eight different locations in which a component can be loaded. If four different components are to be placed on the board, how many different designs are possible?
> factorial(8)/factorial(8-4)
[1] 1680
If five identical components are to be placed on the board, how many different designs are possible?
> choose(8,5)
[1] 56
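The prod function listed above can reproduce the same counts directly; a quick check of the two answers:
> prod(8:5) # 8*7*6*5 arrangements of 4 different components
[1] 1680
> factorial(8)/(factorial(5)*factorial(3)) # the same count as choose(8,5)
[1] 56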
6. Discrete distributions For a discrete random variable X with possible values x1, x2, …, xn, a probability mass function is a function f with f(xi) = P(X = xi), f(xi) ≥ 0, and Σ f(xi) = 1.
The cumulative distribution function of a discrete random variable X, denoted F(x), is F(x) = P(X ≤ x), the sum of f(xi) over all xi ≤ x.
For a discrete random variable X, F(x) satisfies the following properties
0 ≤ F(x) ≤ 1
If x < y, then F(x) ≤ F(y)
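As a small illustration (a fair six-sided die, not from the slides), the probabilities are non-negative, sum to 1, and cumsum gives the non-decreasing CDF:
> f = rep(1/6,6) # pmf of a fair die: P(X = x) = 1/6 for x = 1,...,6
> sum(f) # the probabilities sum to 1
[1] 1
> cumsum(f) # F(x) = P(X <= x) is non-decreasing and ends at 1
[1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000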
7. Discrete distributions Bernoulli
Binomial
Poisson
Geometric
Hypergeometric
Multinomial
Negative binomial
8. Discrete distributions – R functions R has a wide range of built-in probability distributions, for each of which four functions are available:
Probability density function: d prefix.
Cumulative probability function: p prefix.
Quantiles of the distribution: q prefix.
Random numbers generated from the distribution: r prefix.
Each letter (d, p, q or r) is combined with the distribution's root name in R (e.g. binom, pois, norm) to give the function name.
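For example, with the binomial distribution (arbitrary parameters n = 10, p = 0.4) the four prefixed functions look like this; the last call is random, so its output will differ from run to run:
> dbinom(5,10,0.4) # P(X = 5): d prefix
[1] 0.2006581
> pbinom(5,10,0.4) # P(X <= 5): p prefix
[1] 0.8337614
> qbinom(0.5,10,0.4) # the 0.5 quantile (median): q prefix
[1] 4
> rbinom(3,10,0.4) # three random draws: r prefix (output varies)
[1] 5 3 4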
9. Discrete distributions – summary R functions
10. Discrete distributions – Binomial distribution Let X have a binomial distribution with n = 10 and p = 0.4. Calculate:
P(X = 5)
P(X ≥ 3)
Find the smallest x such that P(X ≤ x) ≥ 0.975
> dbinom(5,10,0.4)
[1] 0.2006581
> 1-pbinom(2,10,0.4)
[1] 0.8327102
> sum(dbinom(3:10,10,0.4))
[1] 0.8327102
> qbinom(0.975,10,0.4)
[1] 7
11. Discrete distributions – Binomial distribution > rbinom(25,10,0.4)
[1] 7 3 4 5 4 4 6 5 5 2 4 5 4 5 3 4 3 2 6 1 6 2 5 1 5
> barplot(dbinom(0:10,10,0.4))
> plot(0:10,dbinom(0:10,10,0.4),type='h')
> points(0:10,dbinom(0:10,10,0.4))
12. Discrete distributions – Poisson distribution Simulating 600 values from a Poisson distribution with mean 0.90
> count <- rpois(600,0.9)
> table(count)
count
0 1 2 3 4 5
250 204 101 36 5 4
> hist(count,breaks=-0.5:6,main="")
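The simulated table can be compared with the expected counts 600 * P(X = k); a quick check (expected counts, rounded):
> round(600*dpois(0:5,0.9)) # expected counts for k = 0,1,...,5
[1] 244 220  99  30   7   1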
13. Continuous distributions For a continuous random variable X, a probability density function is a function f with f(x) ≥ 0 whose integral over the whole real line is 1.
The cumulative distribution function of a continuous random variable X, denoted F(x), is F(x) = P(X ≤ x), the integral of f from −∞ to x, for −∞ < x < ∞.
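A quick numerical check of this relationship for the standard normal density (the cut-off 1.96 is arbitrary):
> integrate(dnorm,-Inf,1.96)$value # F(1.96) as the integral of the density
[1] 0.9750021
> pnorm(1.96) # the same value from the built-in CDF
[1] 0.9750021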
14. Continuous distributions Normal
Exponential
Uniform
Gamma
Student’s t
Fisher’s F
Chi-squared
15. Continuous distributions – R functions
16. Continuous distributions – summary R functions
17. Continuous distributions – Normal distribution Suppose the heights of people in a city have a normal distribution with the mean 170 cm and the standard deviation 8 cm. What is the probability that a randomly selected individual will be:
shorter than 165 cm
taller than 185 cm
between 160 and 170 cm
Generate 100 values of height from the above distribution.
18. Continuous distributions – Normal distribution > mu = 170; sd = 8
> pnorm(165,mu,sd)
[1] 0.2659855
> 1 - pnorm(185,mu,sd)
[1] 0.03039636
> pnorm(170,mu,sd) - pnorm(160,mu,sd)
[1] 0.3943502
> height <- rnorm(100,mu,sd)
> hist(height)
> curve(dnorm(x,mu,sd),140,200)
19. Continuous distributions – Standard normal distribution From a normal distribution X ~ N(μ, σ²), we can transform X to a standard normal variable Z by Z = (X − μ)/σ.
Z has mean 0 and variance 1.
For the height example, z = (165 − 170)/8 = −0.625, so we have
> pnorm(-0.625)
[1] 0.2659855
20. Continuous distributions – Standard normal distribution > rnorm(1,100,16) #an IQ score
[1] 115.2561
> x <- rnorm(100)
> hist(x,prob=T,col='gray',main="normal mu=0,sigma=1")
> curve(dnorm,add=T,col='red')
21. Continuous distributions – Exponential distribution > x=rexp(100,1/2500)
> hist(x,probability=TRUE,col=gray(.9),main="exponential mean=2500")
> curve(dexp(x,1/2500),add=T)
22. The central limit theorem If the Xi are drawn independently from a population where μ and σ are known, then the standardized average Z = (X̄ − μ)/(σ/√n)
is asymptotically normal with mean 0 and variance 1. That is, if n is large enough, the average X̄ is approximately normal with mean μ and standard deviation σ/√n.
How can we check this? Simulation is an excellent way.
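A minimal simulation sketch, assuming samples of size n = 30 from a Uniform(0,1) population (so μ = 1/2 and σ = 1/√12); the histogram will vary from run to run:
> n = 30; mu = 1/2; sigma = 1/sqrt(12)
> xbar = replicate(1000,mean(runif(n))) # 1000 sample means
> Z = (xbar - mu)/(sigma/sqrt(n)) # standardized averages
> hist(Z,prob=T) # roughly standard normal
> curve(dnorm,add=T)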
23. The central limit theorem – Binomial dist. If Sn has a binomial distribution with parameters n and p, then Z = (Sn − np)/√(np(1 − p))
is approximately normal(0,1).
> n=10;p=.25;S= rbinom(1,n,p)
> (S - n*p)/sqrt(n*p*(1-p))
[1] 0.3651484
> n = 10;p = .25;S = rbinom(100,n,p)
> X = (S - n*p)/sqrt(n*p*(1-p)) # has 100 random numbers
> hist(X,prob=T)
> curve(dnorm,add=T)
24. The central limit theorem – Binomial dist. > hist(rbinom(10000,20,0.5),xlim=c(0,20), probability=T,breaks=seq(-0.5,20.5,1))
> lines(seq(0,20,0.1),dnorm(seq(0,20,0.1), 10,sqrt(5)))
#Non symmetric binomial
> hist(rbinom(10000,20,0.3),xlim=c(0,20), probability=T, breaks=seq(-0.5,20.5,1))
> lines(seq(0,20,0.1),dnorm(seq(0,20,0.1), 6,sqrt(4.2)))
25. Normal plots A better plot than the histogram for deciding whether random data is approximately normal is the so-called "normal probability" plot.
The basic idea is to graph the quantiles of your data against the corresponding quantiles of the normal distribution.
Essentially, if the graph looks like a straight line then the data is approximately normal. Curvature indicates that the distribution has short or long tails. The line is drawn through points formed by the first and third quartiles.
Using qqnorm, qqplot and qqline.
26. Normal plots > x <- rnorm(100)
> y <- rnorm(100,10,15)
> z <- rexp(100,1/10)
> t <- runif(100,0,1)
> par(mfrow=c(2,2))
> qqnorm(x,main='normal(0,1)');qqline(x)
> qqnorm(y,main='normal(10,15)');qqline(y)
> qqnorm(z,main='exponential mu=10');qqline(z)
> qqnorm(t,main='unif(0,1)');qqline(t)
27. Exploratory Data Analysis
28. Example – Home data The dataset homedata contains assessed home values for Maplewood, NJ for the year 1970 and the year 2000. What is the shape of the distribution?
data(homedata) # from simple package
attach(homedata)
hist(y1970);hist(y2000) # make two histograms
detach(homedata) # clean up
On first appearance, the 1970 data looks more normal; the year 2000 data has a heavier tail.
Neither looks particularly normal - both are heavy tailed and skewed. Any analysis will want to consider the medians or a transformation.
29. Example – CEO salaries The data set exec.pay gives the total direct compensation for CEOs at 200 large publicly traded companies in the U.S. for the year 2000 (in units of $100,000). What can we say about this distribution, besides that it looks like good work if you can get it? Use simple.eda.
data(exec.pay) # or read in from file
simple.eda(exec.pay)
From the figures, we see a heavily skewed distribution, as we might expect. A transformation is called for; let's try the logarithmic transformation (base 10). Since some values are 0 (these CEOs are directly compensated less than $100,000, or perhaps were forced to return all profits in a plea arrangement to stay out of jail), we exclude these.
30. Example – CEO salaries log.exec.pay = log(exec.pay[exec.pay >0])/log(10) # 0 is a problem
simple.eda(log.exec.pay)
After transformation, we see that the data is now very symmetric and gives good insight into the actual distribution. (Almost log normal, which says that after taking a logarithm, it looks like a normal.) Any analysis will want to use resistant measures such as the median or a transform prior to analysis.
31. Example – Taxi time at EWR The dataset ewr contains taxi in and taxi out times at Newark airport (EWR). Let's see what the trends are.
#Taxi time at EWR
data(ewr)
names(ewr) # only 3-10 are raw data
airnames = names(ewr) # store them for later
ewr.actual = ewr[,3:10] # get the important columns
boxplot(ewr.actual)
32. Example – Taxi time at EWR From the figures, we see that all of them look skewed. Let's see if there is a difference between taxi-in and taxi-out times.
par(mfrow=c(2,4)) # 2 rows 4 columns
attach(ewr)
for(i in 3:10) boxplot(ewr[,i] ~ as.factor(inorout),main=airnames[i])
detach(ewr)
par(mfrow=c(1,1)) # return graphics as is (or close window)
Notice the taxi-in times are more or less symmetric with little variation (except for HP, America West, with a 10-minute-plus average). The taxi-out times have a heavy tail. At EWR, when the airport is busy, the planes can really back up, and a 30-minute wait is not unusual. The times for Northwest (NW) seem to be lower. We can compare this using statistical tests. Since the distributions are skewed, we may wish to compare the medians. (In general, be careful when applying statistical tests to summarized data.)
33. Example – Symmetric or skewed, Long or short? For unimodal data, there are 6 basic possibilities: the distribution is symmetric or skewed, and the tails are short, regular, or long. Here are some examples with random data from known distributions.
## symmetric: short, regular then long
X=runif(100);boxplot(X,horizontal=T,bty="n")
X=rnorm(100);boxplot(X,horizontal=T,bty="n")
X=rt(100,2);boxplot(X,horizontal=T,bty="n")
## skewed: short, regular then long
# triangle distribution
X=sample(1:6,100,p=7-(1:6),replace=T);boxplot(X,horizontal=T,bty="n")
X=abs(rnorm(200));boxplot(X,horizontal=T,bty="n")
X=rexp(200);boxplot(X,horizontal=T,bty="n")