240 likes | 362 Views
3. Basics of R – cont. Summarising Statistical Data Graphical Displays 4. Basic distributions with R. CA200 (based on the book by Prof. Jane M. Horgan ). Discrete distributions: Binomial distribution. Def : Conditions: An experiment consists of n repeated trials
E N D
3. Basics of R – cont. Summarising Statistical DataGraphical Displays4. Basic distributions with R CA200 (based on the book by Prof. Jane M. Horgan)
Discrete distributions: Binomial distribution Def: Conditions: • An experiment consists of n repeated trials 2. Each trial has two possible outcomes a) a successwith probability p is constant from trial to trial b) a failurewith probability q = 1 − p 3. Repeated trials are independent X = number of successes in ntrials. X is a BINOMIAL RANDOM VARIABLE. The probability of getting exactly x successes in n trials is: In R use function: prefix “binom” with prefix “d” for pdf, “p” for cdf. Function are dbinom(x, n, p) pbinom(x, n, p)
Binomial distribution – cont. Example 1.Terminals on an on-line computer system are attached to a communication line to the central computer system. The probability that any terminal is ready to transmit is 0.95. p = 0.95 Let X = number of ready terminals Example 2.A fair coin is tossed 10 times. Success and failure are “heads” and “tails”, respectively, each with probability, 0.5. Lets denote: X = number of heads (successes)obtained Example 3.It is known that 20% of integrated circuit chips on a production line are defective. To maintain and monitor the quality of the chips, a sample of 20 chips is selected at regular intervals for inspection. Lets denote: X = number of defectives found in the sample p = 0.20 n = 20 P(X=0) = P(0 defective in a sample of size 20); P(X=1) = P(1 defectives in a sample of size 20) P(X=2) = P(2 defectives in a sample of size 20)
Binomial distribution – cont. Example 4. It is known that 1% of bits transmitted through a digital transmission are received in error. One hundred bits are transmitted each day. Lets denote: X = number of bits found in error each day p = 0.01 n = 100 P(X=0) = P(0 errors in the 100 transmission) P(X=1) = P(1 error in the 100 transmission) P(X=2) = P(2 errors in the 100 transmission)
Binomial distribution – cont. Calculating binomial pdfs in R: Use prefix “binom” with “d” for pdf. Function is dbinom with arguments n and p. Example 1: For all cumulative probabilities: x <- 0:5 dbinom(x, size = 5, prob = 0.95) [1] 0.0000003125 0.0000296875 0.0011281250 0.0214343750 0.2036265625 0.7737809375 The probability of getting exactly 3 ready terminals from five: dbinom(x = 3, size =5, prob = .95) [1] 0.02143438 Example 2: with round function for 4 decimals! x <- 0:10 round(dbinom(x, 10, 0.5), 4) [1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098 0.0010 Example 3: dbinom(x, 20, 0.2) Example 4: dbinom(x, 100, 0.01) : Example4:
Binomial distribution – cont. Plotting binomial pdfs in R: par(mfrow = c(2, 2)) # multiframe x <- 0:5 #Example 1 plot(x, dbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals", ylab = "P(X = x)", type = "h", main = "Ready Terminals, n = 5, p = 0.95") x <- 0:10 #Example 2 plot(x, dbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads", ylab = "P(X = x)", type = "h", main = "Heads, n = 10, p = 0.5") x <- 0:20 #Example 3 plot(x, dbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives", ylab = "P(X = x)", type = "h", main = "Defectives, n = 20, p = 0.2") x <- 0:20 #Example 4. No need for full range here plot(x, dbinom(x, size = 100, prob = 0.01), xlab = "Number of Bits in Error", ylab = "P(X = x)", type = "h", main = "Bits in Error, n = 100, p = 0.01")
Binomial distribution – cont. Binomial pdfs
Binomial distribution – cont. Calculating binomial cdfs in R: Use prefix “binom” with “p” for cdf. Function is pbinom with arguments n and p. Example 1: n = 5, p = 0.95 P(X ≤ 3): pbinom(3, 5, 0.95) #[1] 0.0225925 P(X > 3): 1 - pbinom(3, 5, 0.95) #[1] 0. 9774075 Example 3: n = 20, p = 0.02 P(X ≤ 4): pbinom (4, size = 20, prob = 0.2) #[1] 0.6296483 P(X > 4) = 1 - P(X ≤ 4 ): 1- pbinom(4, size = 20, prob = 0.2) #[1] 0.3703517 For all cumulative probabilities x <- 0:20 prob <- pbinom(x, size = 20, prob= 0.2) round to 4 decimal places: round(prob, 4) # [1] 0.0115 0.0692 0.2061 0.4114 0.6296
Binomial distribution – cont. Plotting binomial cdfs in R: par(mfrow = c(2,2)) #multiframe x<-0:5 plot(x, pbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 5, p = 0.95") x<-0:10 plot(x, pbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads", ylab = "P(X< = x)", ylim = c(0, 1), type = "s", main= "n = 10, p = 0.5") x<-0:20 plot(x, pbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 20, p = 0.2") x<-0:100 plot(x, pbinom(x, size = 100, prob = 0.01), xlab = "Number of Error Bits", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 100, p = 0.01")
Binomial distribution – cont. Binomial cdf
Normal distribution – continuous distribution • in practice, measures tend to cluster around a central values • the largest proportion of measurement occurring near central value and decreasing as the measurements extend at either side of this central value • bell-shaped curve symmetrical about the central value is called “normal distribution”
Normal distribution – cont. Example 5. Heights of females and males. We can accept that: • on average, males are taller than females but maybe spread around their respective averages could be the same • most of female and male heights cluster around the same average • the heights of females and males might be modelled by normal distribution: • bell-shaped curves • centredat differentpoints • variation at either side of central points is the same Height of females and males: X ~ N(1.65, 0.25) X ~ N(1.85, 0.25) Two normal distributions with different mean and the same spread
Normal distribution – cont. Example 6. Arises in product manufacturing. Components manufactured by two machines are set to a diameter 10 cm. One machine produces components with error 1cm: X ~ N(10, 1) The second machine has an error of 2cm: X ~ N(10, √2) • bell-shaped curves • centredat the same point, 10cm • variation at either side of central points is different → one high and narrow, the other low and wide Two normal distributions with the same mean and different standard deviations
Normal distribution – cont. The Probability Density Function for Normal distribution Def: The PDF of a normal (Gaussian) distribution, f(x), is a function defined as: It can be shown that E(X) = µ Var(X) = σ2 X is said to be normally distributed with a mean of μand a variance of σ2 i.e. X ~ N (μ, σ2) Increasing μ moves the distribution along the x-axis (example 1), and increasing σ2widens the distribution (example 2)
Normal distribution – cont. Generating Normal Curves To plot a normal curve use the function curve with dnorm as the argument: curve (dnorm(x, 2, 0.25), from = 1, to = 3) # plots a normal curve with a mean of 2 and a standard deviation of 0.25 in the range 1 to 3 curve (dnorm(x, 2, 0.25), from = 1, to = 3, labels = FALSE, tick = FALSE) # to suppress the labels and axis ticks
Normal distribution – cont. Generating Normal Curves To generate normal curves on same axis use (example 1, example 2) cbindwhich combines R objects with columns and matplot which plot columns as matrices x <- seq(0.5, 3, len = 101) #101 equidistant values between 0.5 and 3 y <- cbind(dnorm(x, 1.65, 0.25), dnorm(x, 1.85, 0.25)) #two normal curves matplot(x, y, type = "l", xlab = "height(metres)", ylab = "f(x)") #multiple plotting text(1.15, 0.5, "Females") text(2.35, 0.5, "Males")
Normal distribution – cont. The Cumulative Distribution Function Def: The CDF of normal distribution in R is obtained with pnormfunction: P(X ≤ x) = pnorm(x, μ,σ) Example 7.From previous records it is known that examination results are normally distributed with mean μ = 45 and the standard deviation σ= 4: X ~ N(45, 16) What percentage of students obtain a mark • less than 45; • larger than 45; • larger than 50; • between 40 and 50; • greater than 37?
Conditional Probabilities in Normal distribution Example 8.An analogue signal received at a detector (measured in microvolts), is normally distributed with a mean of 100 and a variance of 256. What is the probability that the signal will be less than 120 microvolts, given that it is larger than 110 microvolts? Solution 8. Recall Conditional probability of B given A: For P(110 < X < 120) pnorm(120, 100, 16) - pnorm(110, 100, 16) [1] 0.1603358 P(X > 110) 1 - pnorm(110, 100, 16) [1] 0.2659855 X ~ N(100, 16)
Example 9.From previous records scores on an aptitude are normally distributed with mean μ = 500 and standard deviation σ= 100. What is the probability that an individual’s score exceeds 650, given that it exceeds 500? Solution 9. For P(X > 650) 1 - pnorm(650, 500, 100) [1] 0.0668072 P(X > 110) 1 - pnorm(500, 500, 100) [1] 0.5 X ~ N(500, 100)
Example 10. Summary for basic statistics with R: • (Based on the Excel file “CA200_Questionnaire_Summary”which has been previously circulated) • Read the survey answers into a suitable data frame in R. [It was necessary to save the file, from within Excel, in comma separated (csv) format. But this has already been done for you that is the circulated file is of .csv type] • To read the file in R use “read.csv”. For example, if the file is on the C drive then • results<-read.csv(“C:/CA200_Questionnaire_Summary.csv”, header=T) • this will read the data from the file into the data frame called “results”. • As a check that the data have been read correctly, one can see that the command: • results$Q1 • gives results from column Q1 only: • [1] 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 1 1 2 2 1 2 2 NA 1 2 1 NA 12 2 1 1 2 2 1 1 1 1 1 1 2 • Use R to find out how many males and females there are (check the questionnaire for the questions!) For number of females and males use summary function. This will give you all important information: • - Hint: First convert categorical variable Q1 into a factor using: • results$Q1<-factor(results$Q1) • Then use summary(results) to find out how many females/males are out there • Use R to and to separate data for females and for males into two different data sets: • Use subset function: • m_results<-subset(results, results$Q1==1) • To check if this is properly entered use function to print results in the work space: • m_results#This will filter only data for males • repeat same for separating results for females by using: • f_results<-subset(results, results$Q1==2) #This will filter only data for females
Example 10. Summary - cont. Identify any other categorical variables (questions) and similarly specify these as factors. Then, repeat the summary(results) command. Calculate mean and standard deviation of the salary for summer job for females Q13b: > f_results$Q13b [1] 200 150 NA 240 NA 150 NA 300 200 200 300 NA NANA 330 > summary(f_results$Q13b) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 150 200 200 230 300 330 6 > mean(f_results$Q13b) #don’t forget to remove all NAs [1] NA > mean(f_results$Q13b, na.rm=T) [1] 230 > sd(f_results$Q13b, na.rm=T) [1] 66.52067 For question e) use R to calculate skewness and kurtosis: Note: you will have to install additional package which contains these functions. The name of a package is “e1071”. For installation use: > install.packages(“e1071”) > library("e1071") Loading required package: class Warning message: package 'e1071' was built under R version 2.13.2 > skewness(f_results$Q13b, na.rm=T) [1] 0.2196901 #what can you conclude from this number? > kurtosis(f_results$Q13b, na.rm=T) [1] -1.68135 #what can you conclude from this number?
Example 10. Summary - cont. Note: Categorical data are generally non-numerical data which are placed into exclusive categories and then counted rather than measured. In fact, there are two types of categorical data, namely Nominal: Simple names or labels which cannot be ordered (e.g. car registration number or a PIN number) Ordinal: Can be placed in a meaningful order (though without any measurements). For example, “small”, “medium” or “large” vests. A factor is a vector object used to specify a discrete classification (grouping) of the components. of other vectors of the same length. Plotthe data from e) in a histogram and compare to Normal curve by e.g. superposition: > m <-mean(f_results$Q13b, na.rm=T) > s <-sd(f_results$Q13b, na.rm=T) > mn <-min(f_results$Q13b, na.rm=T) > mx <-max(f_results$Q13b, na.rm=T) >c<-curve(dnorm(x, m, s), from=mn, to=mx) >hist(f_results$Q13b, freq=FALSE, xlab=“put adequate name”, ylab=“density”, main=“put adequate name”) > lines(c)