310 likes | 322 Views
This lecture covers population structure analysis in statistical genomics, including methods like GWAS with GLM, inflation of P values, and principal component analysis. Learn how to build an R package for GWAS and interpret results.
E N D
Statistical Genomics Lecture 12: Population structure Zhiwu Zhang Washington State University
Administration Homework 3 due Mar 2, Wednesday, 3:10PM Midterm exam: February 26, Friday, 50 minutes (3:35-4:25PM), 25 questions. Homework4: Team with maximum of three peoples to build a R package of GWAS with GLM method.
Outline Inflation of P value population subdivision Population structure Principal component
QTNs 0n CHR 1-5, leave 6-10 empty myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T) myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T) setwd("~/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo") source("G2P.R") source("GWASbyCor.R") X=myGD[,-1] index1to5=myGM[,2]<6 X1to5 = X[,index1to5] set.seed(99164) mySim=G2P(X= X1to5,h2=.75,alpha=1,NQTN=10,distribution="norm") p= GWASbyCor(X=X,y=mySim$y)
The top five associations color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10) m=nrow(myGM) plot(t(-log10(p))~seq(1:m),col=color.vector[myGM[,2]]) abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
Cutoff from null distribution of P values: CHR 6-10 index.null=!index1to5 & !is.na(p) p.null=p[index.null] m.null=length(p.null) index.sort=order(p.null) p.null.sort=p.null[index.sort] head(p.null.sort) tail(p.null.sort) seq=seq(1:m.null) table=cbind(seq, p.null.sort, seq/m.null) head(table,15) tail(table) 1% of observed p values are below 0.0000328 P value of 3.28E-5 is equivalent to 1% type 1 error
Null distribution of P values hist(p[!index1to5])
QQ plot p.obs=p[!index1to5] m2=length(p.obs) p.uni=runif(m2,0,1) order.obs=order(p.obs) order.uni=order(p.uni) plot( (p.uni[order.uni]), (p.obs[order.obs])) abline(a = 0, b = 1, col = "red")
QQ plot in Log scale plot(-log10(p.uni[order.uni]),-log10(p.obs[order.obs])) abline(a = 0, b = 1, col = "red")
Cutoff (Graph approach) plot(ecdf(-log10(p.obs))) 5% 10E-3
Cutoff (Exact)) type1=c(0.01, 0.05, 0.1, 0.2) cutoff=quantile(p.obs,type1,na.rm=T) cutoff plot(type1,cutoff,type="b") P value of 0.002 is equivalent to type 1 error of 5%
Linkage equilibrium A G Disease T G T C Random mating AG TG AC TC Control Case
Association study X2=4(2*2/4)=4, df=1, P=4.5%
Linkage disequilibrium (LD) A G Disease T G T C Random mating Geography Breeding and family All A as control and half T as case AG TG TC Control Case
CM37 R4 K148 Mo46 Ky228 Hi27 OH7B Mo47 NC344 K4 DE-3 Yu796-NS NC360 Mo45 MO17 B97 CMV3 CO106 A682 Mt42 W401 CI91B NC362 NC262 NC222 A556 DE811 NC258 B103 Tzi25 B105 NC342 NC364 CI187-2 CI3A B77 W117HT Tzi16 MS153 DE1 SD40 A641 NC290A A214N NC250 STIFF STALK B164 NC236 CM7 N7A N28HT H100 DE-2 B57 H84 I205 B64 C123 H105W A635 CO109 ND246 A632 NON STIFF STALK C103 B68 CO125 B79 H91 A634 B84 B14A Hy B76 Ky21 CM174 B104 A661 WD CM105 A554 B75 CI21E 38-11 B37 MS71 Os420 NC260 NC328 R229 Mo44 A679 Mo1W A680 R168 B73 B73Htrhm NC294 NC326 B109 NC368 N192 NC324 NC292 NC314 NC322 NC330 W64A Pa875 NC308 NC372 NC306 NC312 CH9 H49 NC268 NC310 A619 B10 WF9 B46 SD44 A239 Pa880 T8 A188 OH43 Pa762 C49A C49 VA26 Va102 Ky226 Oh43E A654 W153R Va35 Va14 Va59 A659 CI-7 Oh40B Va17 R177 Va22 W22 H95 W182B Va99 PA91 H99 M14 CI90C 33-16 Va85 CH701-30 NC33 VaW6 4226 NC232 L317 B115 R109B MoG I137TN K55 CI66 CI44 NC230 81-1 CI31A MEF 156-55-2 M162W CI64 IL677A K64 Ia5125 E2558W N6 SWEET IA2132 P39 IL14H CML52 T234 SC357 L578 IL101 CML69 CML14 CML38 B52 CML287 EP1 Tzi11 F2 NC366 CML103 CML108 F7 SC213R CO255 CML9 GT112 CML61 NC238 CML254 CML5 T232 GA209 CML314 CML264 Mp339 CI28A CML258 Q6199 CML10 B2 U267Y CML341 CML332 CML11 CML45 MS1334 CML261 CML331 Mo24W D940Y Sg1533 SG18 IDS28 F2834T M37W HP301 IDS69 SA24 IDS91 CML277 CML238 CML322 CML321 A6 F44 Ki14 CML247 Ki11 4722 Ki2021 F6 I-29 TROPICAL-SUBTROPICAL POPCORN CML157Q Ki44 Oh603 Ki43 CML328 NC340 Ki21 CML323 Ki2007 CML228 NC300 CML92 Tx303 A272 CML218 NC320 NC356 NC302 CML77 NC318 NC332 SC55 A441-5 Ki3 NC338 NC358 NC334 CML154Q NC354 TZI18 NC370 TZI10 NC264 Ab28A CML220 Mo18W Tzi9 TX601 CML349 NC350 CML333 CML158Q NC304 MIXED CML91 TZI8 CML311 Based on 89 SSR loci 0.1 CML281 NC346 NC296A parvi-03 NC336 NC352 NC296 ssp. parviglumis Flint-Garcia et al. (2005) Plant J. 44: 1054 NC348 NC298 parvi-14 parvi-30 parvi-49 parvi-36
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly. Inference of Population Structure Using Multilocus Genotype Data. Genetics, 2000. Population structure
Principal Component Analysis (PCA) Y PC2 PC1 X • X-Y: Correlated • PCs • Uncorrelated • Var(PC1)>Var(PC2)
Eigen value and eigen vector AV=λV Covariance matrix (symmetric ) eigen vector eigen value)
Eigen value and eigen vector Covariance or correlation p by p data n individual by p features X→A → λV Y=XV Principal Component
PCA in R pca=prcomp(X[,1:10]) str(pca) List of 5 $ sdev : num [1:10] 1.394 1.114 0.922 0.828 0.793 ... $ rotation: num [1:10, 1:10] -0.0574 0.058 -0.5834 0.1601 0.5196 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:10] "PZB00859.1" "PZA01271.1" "PZA03613.2" "PZA03613.1" ... .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ... $ center : Named num [1:10] 1.52 1.02 1.42 1.5 1.06 ... ..- attr(*, "names")= chr [1:10] "PZB00859.1" "PZA01271.1" "PZA03613.2" "PZA03613.1" ... $ scale : logi FALSE $ x : num [1:281, 1:10] 1.45 2.05 1.98 1.78 2.05 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ... - attr(*, "class")= chr "prcomp"
PCA in R PCA=prcomp(X) str(PCA) Listof5 $ sdev : num [1:281] 10.287.766.275.735.65... $ rotation: num [1:3093, 1:281] 0.001030.029620.002960.03495-0.01134... ..- attr(*, "dimnames")=Listof2 ....$ : chr [1:3093] "PZB00859.1""PZA01271.1""PZA03613.2""PZA03613.1"... ....$ : chr [1:281] "PC1""PC2""PC3""PC4"... $ center : Namednum [1:3093] 1.521.021.421.51.06... ..- attr(*, "names")= chr [1:3093] "PZB00859.1""PZA01271.1""PZA03613.2""PZA03613.1"... $ scale : logiFALSE $ x : num [1:281, 1:281] 1.68-1.6-0.92.130.63... ..- attr(*, "dimnames")=Listof2 ....$ : NULL ....$ : chr [1:281] "PC1""PC2""PC3""PC4"... - attr(*, "class")= chr"prcomp"
Extraction PCA$x[1:10,1:5] Eigen value: $sdevsquaed Eigen vector: $rotation Principal component: $x
Contribution pcavar=PCA$sdev^2 proportion=pcavar/sum(pcavar) par(mfrow=c(1,3),mar = c(3,4,1,1)) barplot(PCA$sdev[1:10]) barplot(pcavar[1:10]) plot(proportion[1:10],type="b")
Visualization plot(PCA$x[,1],PCA$x[,2],col="red")
Association with phenotypes PC1 r=0.2 par(mfrow=c(2,1),mar = c(3,4,1,1)) plot(mySim$y,PCA$x[,1]) cor(mySim$y,PCA$x[,1]) plot(mySim$y,PCA$x[,2]) cor(mySim$y,PCA$x[,2]) r=-0.32 PC2
With QTNs Without QTNs Association r=-0.16 r=-0.21 PC1 pca1to5=prcomp(X[,index1to5]) pca6to10=prcomp(X[,!index1to5]) par(mfrow=c(2,2), mar = c(3,4,1,1)) plot(mySim$y, pca1to5$x[,1]) cor(mySim$y, pca1to5$x[,1]) plot(mySim$y, pca6to10$x[,1]) cor(mySim$y, pca6to10$x[,1]) plot(mySim$y, pca1to5$x[,2]) cor(mySim$y, pca1to5$x[,2]) plot(mySim$y, pca6to10$x[,2]) cor(mySim$y, pca6to10$x[,2]) PC2 r=-0.33 r=0.31
This partially explains the inflation plot(-log10(p.uni[order.uni]),-log10(p.obs[order.obs])) abline(a = 0, b = 1, col = "red")
Highlight Inflation of P value population subdivision Population structure Principal component