
A simple statistical model for deciphering the cdc15-synchronized yeast cell cycle-regulated genes expression data Ker-Chau Li , Robert Yuan Statistics, UCLA Ming Yan Biochemistry , UCLA.


Presentation Transcript


  1. A simple statistical model for deciphering the cdc15-synchronized yeast cell cycle-regulated genes expression data. Ker-Chau Li, Robert Yuan (Statistics, UCLA); Ming Yan (Biochemistry, UCLA)

  2. The goal of this study is to demonstrate how simple statistical models can help organize and explain complex gene expression patterns.

  3. Outline • Introduction: microarrays and the cell cycle • Data: cdc15 experiment • A statistical model • Phase determination • Comparison with Spellman et al. (1998) • Regularly oscillating genes • Further discussion

  4. MicroArray • Allows measuring the mRNA level of thousands of genes in one experiment -- system level response • The data generation can be fully automated by robots • Common experimental themes: • Time Course • Mutation/Knockout Response

  5. Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)

  6. The data set is available at http://cellcycle-www.stanford.edu. We focus on one experiment in which a strain of yeast (cdc15-2) was incubated at a high temperature (35 degrees C) for a long time, causing cdc15 arrest. Cells were then shifted back to a low temperature (23 degrees C), and gene expression was monitored every 10 min for 300 min.

  7. Data from some chips are not available.
     • 24 time points (min): 10, 30, 50, 70, 80, ..., 240, 250, 270, 290
     • We concentrate on the 19 consecutive time points from 70 min to 250 min, which are 10 min apart. Use of the full data will be discussed later.
     • Genes with missing values are also deleted. There are 4530 genes remaining.
     • The data can be represented by a 4530 by 19 matrix.
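
The selection described on this slide can be sketched as follows. This is a minimal illustration only: it assumes the raw data are held as a mapping from gene name to a list of log-ratios with `None` marking a missing value, and the function name `select_complete_genes` and the toy values are hypothetical, not taken from the study.

```python
# Time points (min) of the 24 available chips, as listed on the slide:
# 20 min apart at both ends, 10 min apart from 70 to 250.
ALL_TIMES = [10, 30, 50] + list(range(70, 251, 10)) + [270, 290]


def select_complete_genes(rows, times=ALL_TIMES, start=70, stop=250):
    """Keep the columns between start and stop; drop genes with missing values.

    rows maps a gene name to a list of log-ratios (None = missing),
    with one entry per element of times.
    """
    keep = [j for j, t in enumerate(times) if start <= t <= stop]
    selected = {}
    for gene, values in rows.items():
        window = [values[j] for j in keep]
        if all(v is not None for v in window):
            selected[gene] = window
    return selected


# Toy input (hypothetical values): geneB is missing its 100 min measurement.
toy = {
    "geneA": [0.1] * 24,
    "geneB": [0.2] * 6 + [None] + [0.2] * 17,
}
kept = select_complete_genes(toy)
print(len(ALL_TIMES), sorted(kept), len(kept["geneA"]))
```

Applied to the real data, each retained gene would contribute one row of 19 values to the 4530 by 19 matrix.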

  8. Example of a time curve. Histone gene HHT2 (ORF: YNL031C). Time course:

  9. YKL164C YNL082W

  10. Preliminary study with two-way ANOVA. This investigates the constancy of the average expression level over time for each gene, and the constancy of the average expression level over all genes at each time point.

      > cdc15
      Factor   |    df |        SS |        MS |         F
      gene     |  4529 | 5.2408E+2 | 1.1572E-1 | 6.4169E-1
      time     |    18 | 2.9745E+2 | 1.6525E+1 | 9.1638E+1
      residual | 81522 | 1.4701E+4 | 1.8033E-1 |
      total    | 86069 | 1.5522E+4 |           |

      The gene effect is insignificant. The time effect appears statistically significant, but ... (next slide)
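
For readers who want to reproduce a table of this shape, here is a minimal sketch of a two-way ANOVA without replication on a genes-by-times matrix. It assumes NumPy is available; the function name `two_way_anova` is hypothetical and the input is synthetic noise, so only the degrees of freedom (not the sums of squares) will match the slide.

```python
import numpy as np


def two_way_anova(x):
    """Two-way ANOVA without replication for a genes-by-times matrix x."""
    n_genes, n_times = x.shape
    grand = x.mean()
    gene_means = x.mean(axis=1)   # average over time for each gene
    time_means = x.mean(axis=0)   # average over genes at each time point
    ss_gene = n_times * np.sum((gene_means - grand) ** 2)
    ss_time = n_genes * np.sum((time_means - grand) ** 2)
    ss_total = np.sum((x - grand) ** 2)
    ss_resid = ss_total - ss_gene - ss_time
    df_gene, df_time = n_genes - 1, n_times - 1
    df_resid = df_gene * df_time
    ms_resid = ss_resid / df_resid
    return {
        "df": (df_gene, df_time, df_resid),
        "F_gene": (ss_gene / df_gene) / ms_resid,
        "F_time": (ss_time / df_time) / ms_resid,
    }


# Synthetic stand-in for the 4530 by 19 data matrix (pure noise).
rng = np.random.default_rng(0)
x = rng.normal(size=(4530, 19))
result = two_way_anova(x)
print(result["df"])
```

Each F statistic is the factor's mean square divided by the residual mean square, which is how the F column of the table relates to its MS column.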

  11. Column means (time) from the ANOVA result. The values are small. The expression level is log_2 of the red/green ratio. Red = light intensity of the red channel - “noise”; Green = light intensity of the green channel - “noise”. Red channel = mRNA from cells at one time point; green channel = mRNA from unsynchronized cells. A .5 fold increase corresponds to log_2 1.5 = .585; 2^.15 = 1.11, i.e. a .11 fold increase.
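
The arithmetic on this slide can be checked with a few lines of Python. The helper names `log_ratio` and `fold_increase` are hypothetical, and the intensities passed in are assumed to be already background-corrected.

```python
import math


def log_ratio(red, green):
    """log2 of the red/green intensity ratio (both background-corrected)."""
    return math.log2(red / green)


def fold_increase(log2_value):
    """Fractional increase over the reference implied by a log2 ratio."""
    return 2 ** log2_value - 1


print(round(log_ratio(1.5, 1.0), 3))   # 0.585
print(round(fold_increase(0.15), 2))   # 0.11
```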

  12. A statistical model • Motivation: modeling each curve with simple functions such as linear, quadratic, sine, or cosine appears reasonable but inflexible • Parsimony and accuracy can be gained if the basis curves are chosen by the data themselves • The model: each gene expression curve =

  13. The model (continued). The errors have mean zero, are uncorrelated, and have the same variance across time, but the variance may depend on the gene (this is important). It turns out that we can find the basis functions from an application of PCA (see pdf file for PCA).
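
The model equation itself is not preserved in this transcript. One form consistent with the stated error assumptions, and with the constant plus three coefficients reported in the regression output on later slides, might be written as below. The symbols $x_i(t)$, $a_i$, $b_{ik}$, $\phi_k$ and $\sigma_i^2$ are notation assumed here for illustration, and the use of exactly three basis curves is taken from the later PCA slide.

```latex
x_i(t) = a_i + \sum_{k=1}^{3} b_{ik}\,\phi_k(t) + \varepsilon_i(t),
\qquad
\mathrm{E}\,\varepsilon_i(t) = 0,\quad
\mathrm{Var}\,\varepsilon_i(t) = \sigma_i^2,\quad
\mathrm{Cov}\bigl(\varepsilon_i(s), \varepsilon_i(t)\bigr) = 0 \ \ (s \neq t)
```

Here $x_i(t)$ is the log-ratio of gene $i$ at time $t$, the $\phi_k$ are basis curves shared by all genes, and $\sigma_i^2$ is allowed to differ from gene to gene.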

  14. Enhanced PCA for curve fitting • Choose the number of basis curves by eigenvalues • Assess the goodness of each curve fit by R-squared and by the residual sum of squares • Identify genes that comply well with the model • Interactive plotting helps with resetting user-specified parameters

  15. PCA: For a list of vectors, PCA can be used to find a common basis based on the scaling matrix. Covariance matrix: the directions found are those along which the variance is highest. The directions are found by eigenvalue decomposition, and the curves are then modeled by the PCA directions. Here, we chose the first three PCA directions as our basis.
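
A minimal sketch of this procedure, assuming NumPy, is given below. It is not the authors' implementation: the function names are hypothetical, the data are synthetic, and centering each curve over time is an assumption (it would be consistent with the near-zero constants in the regression output on later slides).

```python
import numpy as np


def pca_basis(x, k=3):
    """Top-k eigenvalues and orthonormal directions of the scatter matrix.

    Assumption: each curve (row) is centered over time before the
    time-by-time scatter matrix is formed.
    """
    centered = x - x.mean(axis=1, keepdims=True)
    scatter = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return eigvals[order], eigvecs[:, order]


def fit_curves(x, basis):
    """Fit every centered curve on the basis; return coefficients, RSS, R-squared.

    Because the basis columns are orthonormal, the least-squares
    coefficients are simply the projections onto them.
    """
    centered = x - x.mean(axis=1, keepdims=True)
    coef = centered @ basis
    fitted = coef @ basis.T
    rss = np.sum((centered - fitted) ** 2, axis=1)
    tss = np.sum(centered ** 2, axis=1)
    return coef, rss, 1.0 - rss / tss


# Synthetic curves: two periodic components plus noise (hypothetical data).
rng = np.random.default_rng(1)
t = np.arange(19)
x = (np.outer(rng.normal(size=200), np.sin(2 * np.pi * t / 9))
     + np.outer(rng.normal(size=200), np.cos(2 * np.pi * t / 9))
     + 0.1 * rng.normal(size=(200, 19)))
eigvals, basis = pca_basis(x)
coef, rss, r2 = fit_curves(x, basis)
print(basis.shape, coef.shape)
```

The eigenvalues indicate how many basis curves are worth keeping, and the per-gene RSS and R-squared are the quantities used for the goodness-of-fit assessment.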

  16. Plots: 1st PCA direction; 2nd PCA direction; 3rd PCA direction; eigenvalues

  17. 1. Compliance check: reject if the correlation coefficient between fit and observed is < .75 and the error s.d. is bigger than .70 (which is equivalent to a .5 fold increase). 2. Cycle component check: reject if 3. Smoothness check: reject if
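
The compliance check can be sketched as below. The thresholds .75 and .70 come from the slide. Two things are assumptions: the error s.d. is taken as sqrt(RSS / 15), using the 15 degrees of freedom reported in the regression output on a later slide, and the correlation between fit and observed is taken as the square root of R-squared, which holds for a least-squares fit with an intercept. The function names are hypothetical.

```python
import math


def error_sd(rss, n_times=19, n_params=4):
    """Residual standard deviation: sqrt(RSS / degrees of freedom)."""
    return math.sqrt(rss / (n_times - n_params))


def is_noncompliant(r_squared, rss, min_corr=0.75, max_sd=0.70):
    """Reject when the fit correlates poorly AND the error s.d. is large."""
    corr = math.sqrt(r_squared)
    return corr < min_corr and error_sd(rss) > max_sd


# YJL159W, using the R2 and RSS values reported on a later slide.
print(is_noncompliant(0.36273, 14.15322))
```

With these inputs the error s.d. comes out close to the 0.971364 standard error shown for that gene's coefficients, which supports the degrees-of-freedom assumption.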

  18. Noncompliance genes (41) • High overall expression levels • May or may not show cycle patterns • Recommendation: inspect each gene separately

  19. Phase determination • The second and the third basis curves show clear cycle patterns. The third basis appears to be a 40 min-delayed version of the second basis, with an R-squared value of .78 • Linear combinations of these two basis curves show a variety of expression patterns.
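
One way to quantify a statement such as "a 40 min-delayed version, with an R-squared value of .78" is to regress one curve on a lagged copy of the other and scan over lags. The sketch below does this on synthetic sine curves; the 120 min period, the curves themselves and the function name `shifted_r_squared` are hypothetical, and the slide does not say how its R-squared was computed.

```python
import math


def shifted_r_squared(a, b, lag):
    """R-squared of the simple regression of b[t] on a[t - lag] over the overlap."""
    xs, ys = a[:len(a) - lag], b[lag:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Synthetic stand-ins sampled every 10 min from 70 to 250 min.
times = list(range(70, 251, 10))
second = [math.sin(2 * math.pi * t / 120) for t in times]
third = [math.sin(2 * math.pi * (t - 40) / 120) for t in times]

# Scan lags of 0 to 80 min (0 to 8 samples) and keep the best one.
best = max(range(0, 9), key=lambda lag: shifted_r_squared(second, third, lag))
print(best * 10, "min")
```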

  20. Construction of a compass plot • Use of known cycle-regulated genes • Compliance checking with RSS/R^2 plot • Cycle-exhibition checking with projection angles • Coherent pattern checking by ANOVA • (A list of 104 known genes in 6 groups)

  21. Phases of genes. Identify the phases of genes using prior knowledge: there were 104 known genes whose phases had been determined by traditional experimental methods. The known genes fall into 6 groups: • SCB (G1 phase) • MCB (G1 phase) • Histone (S phase) • S/G2 phase • G2/M phase • M/G1 phase. Noncompliance genes and genes without significant cycle components are excluded. The SCB group is also excluded because of inconsistent patterns within its expression vectors.

  22. 82 non-missing known-phase genes. Remove genes with an insignificant cycle component. Points are obtained by normalizing the loading coefficients for the 2nd and 3rd bases to unit length.
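
The normalization to unit length can be sketched as below. The function names are hypothetical. The axis order and angle convention used for the compass plot are not stated on the slides, so the `atan2` convention here is an assumption and need not reproduce the Angle values reported later.

```python
import math


def compass_point(b2, b3):
    """Scale the (2nd basis, 3rd basis) coefficient pair to unit length."""
    norm = math.hypot(b2, b3)
    return b2 / norm, b3 / norm


def compass_angle(b2, b3):
    """Angle of the point in radians, in (-pi, pi] (assumed convention)."""
    return math.atan2(b3, b2)


x, y = compass_point(3.0, 4.0)
print(x, y, compass_angle(0.0, 1.0))
```

Genes with similar angles have similar mixtures of the two cyclic basis curves, which is what allows sectors of the plot to be read as phases.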

  23. Late G1, SCB regulated genes:

  24. Compass plot for phase assignment (sectors: G1, S, S/G2, G2/M, M/G1; histone genes marked)

  25. Phase assignment (two compass plots, for the smooth and non-smooth groups; sectors: G1, S, S/G2, G2/M, M/G1). Gene counts shown on the plots: 31, 27, 108, 103, 352, 255, 90, 295, 165, 239, 90

  26. Comparison • We re-classified the 800 cell cycle-regulated genes identified by Spellman et al. with our method. If a gene does not comply with our model, or does not have a significant second or third regression coefficient, we do not assign a phase. • Contingency tables of mismatched and unclassified cases.

  27. A non-compliance gene: YJL159W
      Spellman et al.'s score: 10.86; R2: 0.36273 (M/G1); RSS: 14.15322; Angle: -2.43803
      Locus_info:
        Other_name: PIR2, YJL159W, CCW7, ORE1
        Gene_class: HSP
        Gene_Info: HSP150
        Gene_product: Heat shock protein, secretory glycoprotein
        Function: cell wall structural protein
        Cellular_Component: cell wall
        Process: cell wall organization and biogenesis
        Phenotype: Null mutant is viable
        Locus_notes: 14 HSP150 has also been called gp400
      Position_info:
        Chromosome: X
        ORF_name: YJL159W
      Least Squares Estimates:
        Constant    -4.794002E-16 (0.222846)
        Variable 0   1.28464      (0.971364)
        Variable 1  -2.04016      (0.971364)
        Variable 2  -1.49779      (0.971364)
      Plot legend: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)

  28. An example of our non-compliance gene: YDR055W
      Spellman et al.'s score: 7.266; R2: 0.30136 (M/G1); RSS: 7.94018; Angle: -2.81396 (insig. coef.)
      Locus_info:
        Other_name: YDR055W
        Gene_class: PST
        Gene_Info: PST1
        Description: Protoplasts-secreted
        Gene_product: The gene product has been detected among the proteins secreted by regenerating protoplasts
        Phenotype: Viable
      Position_info:
        Chromosome: IV
        ORF_name: YDR055W
      Least Squares Estimates:
        Constant    -5.428720E-16 (0.166914)
        Variable 0   1.47329      (0.727561)
        Variable 1  -1.07451      (0.727561)
        Variable 2  -0.316032     (0.727561)
      Plot legend: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)

  29. An example of a non-compliance gene: YNL082W
      Spellman et al.'s score: 4.843; R2: 0.229191 (G1); RSS: 18.247480537500003
      Least Squares Estimates:
        Constant    -6.087129E-16 (0.253035)
        Variable 0   1.51725      (1.10295)
        Variable 1  -1.74757      (1.10295)
        Variable 2   0.263945     (1.10295)
      Plot legend: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)

  30. Top 10 scores and gene names from the insignificant cycle component group

      Gene    | Score
      YOR263C | 3.69
      YOR320C | 3.85
      YGR035C | 3.874
      YCR042C | 4.022
      YPR019W | 4.048
      YJL194W | 4.13
      YJR010W | 4.41
      YEL068C | 5.047
      YGR124W | 6.28
      YKL172W | 6.716

      78 genes score higher than 6.716; 188 genes score higher than 4.022; 213 genes score higher than 3.69. Yet these genes appear very bumpy; see next slide.

  31. An example of an insignificant cycle component gene: YGR124W
      Spellman et al.'s score: 6.28 (S/G2); R2: 0.364945 (small); RSS: 0.812496 (small); Angle: 3.13118
      Locus_info:
        Other_name: YGR124W
        Gene_class: ASN
        Gene_Info: ASN2
        Description: Asn1p and Asn2p are isozymes
        Gene_product: asparagine synthetase
        Phenotype: Null mutant is viable; L-asparagine auxotrophy occurs upon mutation of both ASN1 and ASN2
      Position_info:
        Chromosome: VII
        ORF_name: YGR124W
      Plot: CDC15 time course, 70 mins to 250 mins

  32. Plots: EBP2 (YKL172W); TSM1 (YCR042C); YOR263C

  33. Non-smooth group from 800 genes

      Our\their | G1 |  S | S/G2 | G2/M | M/G1 | Total
      G1        | 59 |  6 |    0 |    0 |    0 |    65
      S         |  4 |  3 |    0 |    0 |    0 |     7
      S/G2      |  1 |  7 |   31 |   17 |    0 |    56
      G2/M      |  0 |  0 |    3 |   47 |    1 |    51
      M/G1      | 18 |  0 |    0 |    4 |   21 |    43
      Total     | 82 | 16 |   34 |   68 |   22 |   222

      Smooth group from 800 genes
      Low overall expression level

      Our\their |  G1 |  S | S/G2 | G2/M | M/G1 | Total
      G1        |  74 |  8 |    0 |    0 |    1 |    83
      S         |   7 | 10 |    1 |    0 |    0 |    18
      S/G2      |   5 | 11 |   43 |   17 |    1 |    77
      G2/M      |   0 |  0 |    1 |   39 |    1 |    41
      M/G1      |  43 |  0 |    0 |    3 |   28 |    74
      Total     | 129 | 29 |   45 |   59 |   31 |   293
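
A simple summary of these two tables is the share of genes on the diagonal, i.e. genes given the same phase by both methods. The slides do not report this figure; the sketch below just computes it from the counts above (rows are our phases, columns are theirs), with the function name `agreement` chosen for illustration.

```python
PHASES = ["G1", "S", "S/G2", "G2/M", "M/G1"]

# Counts copied from the slide, without the Total row and column.
NON_SMOOTH = [
    [59, 6, 0, 0, 0],
    [4, 3, 0, 0, 0],
    [1, 7, 31, 17, 0],
    [0, 0, 3, 47, 1],
    [18, 0, 0, 4, 21],
]
SMOOTH = [
    [74, 8, 0, 0, 1],
    [7, 10, 1, 0, 0],
    [5, 11, 43, 17, 1],
    [0, 0, 1, 39, 1],
    [43, 0, 0, 3, 28],
]


def agreement(table):
    """Return (matching genes, all genes, matching fraction) for a square table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return diagonal, total, diagonal / total


for name, table in (("non-smooth", NON_SMOOTH), ("smooth", SMOOTH)):
    same, total, frac = agreement(table)
    print(name, same, total, round(frac, 3))
```

The largest off-diagonal cells in both tables are our M/G1 against their G1 (18 and 43 genes), which are adjacent phases in the cycle.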

  34. Plots: HTA1 (YDR225W), S; CLN2 (YPL256C), G1; CLB4 (YLR210W), S/G2; YJL091C, phase ??

  35. Plots: CLN2 (YPL256C), G1; HTA1 (YDR225W), S; FKS1 (YLR342W), phase ??; CLB4 (YLR210W), S/G2. From 5 cell

  36. From 1, total SS small: YOR264W
      Least Squares Estimates:
        Constant    -5.706461E-16 (4.704328E-2)
        Variable 0  -0.170979     (0.205057)
        Variable 1   0.479678     (0.205057)
        Variable 2   0.762583     (0.205057)
      R Squared: 0.571396
      Sigma hat: 0.205057
      Number of cases: 19
      Degrees of freedom: 15

  37. Oscillating genes • The first basis curve oscillates in an extremely regular way • There are over 200 genes with such regular oscillating patterns • Role unknown: systematic error? Common upstream promoter region?

  38. DIM1 (YPL266W)
      Locus_info:
        Other_name: YPL266W
        Gene_class: DIM
        Gene_Info: DIM1
        Description: Dimethyladenosine transferase (rRNA (adenine-N6,N6-)-dimethyltransferase), responsible for m6[2]Am6[2]A dimethylation in 3'-terminal loop of 18S rRNA
        Gene_product: dimethyladenosine transferase
        Function: rRNA (adenine-N6,N6-)-dimethyltransferase
        Cellular_Component: nucleolus
        Process: 35S primary transcript processing; rRNA modification
        Phenotype: Null mutant is inviable
      Position_info:
        Chromosome: XVI
        ORF_name: YPL266W

  39. RPS1A (YLR441C)
      Locus_info:
        Other_name: YLR441C, RP10A
        Gene_class: RPS
        Gene_Info: RPS1A
        Description: Homologous to rat S3A
        Gene_product: Ribosomal protein S1A (rp10A)
        Function: structural protein of ribosome
        Cellular_Component: cytosolic small ribosomal (40S)-subunit
        Process: 0006416 protein biosynthesis
        Locus_notes: 13 RP10A (RPS1A) and RP10B (RPS1B) are nearly identical; this gene has also been called PLC1, but should not be confused with PLC1 on chromosome XVI encoding a phosphoinositide-specific phospholipase
      Position_info:
        Chromosome: XII
        ORF_name: YLR441C

  40. One gene from the non-smooth group, not in Spellman et al.'s list. GLN1: YPR035W
      Least Squares Estimates:
        Constant    -6.276471E-16 (4.762055E-2)
        Variable 0  -2.47649      (0.207573)
        Variable 1   3.958405E-2  (0.207573)
        Variable 2   1.01860      (0.207573)
      R Squared: 0.917337
      Sigma hat: 0.207573

  41. Further discussion • Others who use PCA • Clustering • Other data set • Use of SIR/PHD • Without a time scale ? B-cell lymphoma data • Pathway study

  42. Could genes with small overall expression levels have been removed from the beginning? One gene from the smooth group, not in Spellman et al.'s list: YGR231C
      Least Squares Estimates:
        Constant    -5.803153E-16 (4.131369E-2)
        Variable 0  -0.156478     (0.180082)
        Variable 1  -1.59995      (0.180082)
        Variable 2  -0.623201     (0.180082)
      R Squared: 0.859375
      Sigma hat: 0.180082
      The total sum of squares is 3.4591, which is at about the 71.6th percentile among all genes. The median of the total sum of squares is 2.27735.
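
The total sum of squares quoted on this slide can be recovered from the reported R Squared and Sigma hat. The sketch below assumes the 15 residual degrees of freedom shown for the same model on an earlier slide; the function name `total_ss` is hypothetical.

```python
def total_ss(r_squared, sigma_hat, df=15):
    """Total sum of squares implied by R-squared and the residual s.d.

    RSS = df * sigma_hat**2 and R-squared = 1 - RSS / TSS,
    so TSS = RSS / (1 - R-squared).
    """
    rss = df * sigma_hat ** 2
    return rss / (1.0 - r_squared)


# YGR231C, using the values reported on the slide.
print(total_ss(0.859375, 0.180082))
```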

  43. THE END
