From Correlation to Liquid Association
• Why is correlation so popularly used?
• Positive correlation and negative correlation
• What is the limitation of similarity analysis?
Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)
Many examples are based on exploratory multivariate data analysis
Review
• Linkage: Eisen et al., Alon et al.
• K-means: Tavazoie et al.
• Self-organizing map: Tamayo et al.
• SVD: Holter et al.; Alter, Brown, Botstein
• Support vector machine (classification)
Finding gene clusters
• K-means versus self-organizing map
• How to fine-tune user-specified parameters: some theoretical guidance is needed
• What is a cluster? Criteria are needed
• Normal mixture, (hidden) indicator
• PLAID model (Statistica Sinica 2002, Lazzeroni, Owen)
• PCA plot, projection pursuit, grand tour
• MDS (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures)
• GAP (generalized association plot; Chun-Houh Chen, cchen@stat.sinica.edu.tw)
Seriation and row-column sorting
• Hierarchical clustering
• Others
• Generalized association plot (Chen 2001)
• Sharp boundaries may be artifacts due to “clever” permutation
Gene profiles and correlation
Rationale behind massive gene expression analysis: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors.
Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity.
1. Cluster analysis: average linkage, self-organizing map, K-means, ...
2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, …
3. Dimension reduction methods: PCA (SVD)
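As a minimal sketch of the similarity measure named above, the snippet below computes Pearson's correlation coefficient for pairs of expression profiles using only the Python standard library. The gene names and expression values are made up for illustration and are not taken from the yeast data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression profiles at seven time points (made-up numbers).
gene_a = [0.1, 1.2, 1.0, 1.1, 0.9, 1.3, 0.0]
gene_b = [0.0, 0.9, 1.1, 1.0, 1.2, 1.1, 0.2]  # rises and falls with gene_a
gene_c = [1.0, 0.1, 0.2, 0.0, 0.1, 0.0, 1.1]  # high when gene_a is low

r_ab = pearson(gene_a, gene_b)  # co-expressed pair: positive
r_ac = pearson(gene_a, gene_c)  # contra-expressed pair: negative
print(f"r(A,B) = {r_ab:.2f}, r(A,C) = {r_ac:.2f}")
```

Clustering and classification methods such as those listed above typically take a matrix of such pairwise values (or a distance derived from it) as input.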
The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth, …
Its sweeping impact in data analysis is due to Galton (1822-1911), “Typical laws of heredity in man”
Karl Pearson modified and popularized its use.
A building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes
As a statistician, how can you ignore the time order? (Isn’t it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)
Gene profiles and correlation
• Not studying cell-cycle regulation: time order is not considered here; but ...
• Synchronized cells enhance cellular signals:
Time: Start, 10 min, 20 min, 30 min, 40 min, 50 min, 60 min
Task: task 1, task 2, task 2, task 2, task 2, task 3, task 4
• Suppose gene A and gene B participate in tasks 2 and 3, but not in tasks 1 or 4. A positive correlation is expected.
[Figure: scatterplot with axes Gene A and Gene B]
Example: scatterplot matrix of MCM1, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7
The tighter association among the six genes MCM2, ..., MCM7 is in sharp contrast to the association between each of them and MCM1.
It turns out that the gene products of MCM2, ..., MCM7 form a hexameric complex that binds chromatin. It is part of the pre-replicative complex, an assembly of proteins that forms at origins of DNA replication between late M phase and the G1/S transition and includes other proteins believed to act in DNA replication initiation.
MCM1 is a transcription factor of the MADS (Mcm1p, Agamous, Deficiens, SRF) box family, which recruits coregulatory proteins for both gene activation and repression at a variety of loci.
Note that MCM2-MCM7 fall into the "MCM" cluster of 34 cell-cycle-regulated genes identified by Spellman et al. (1998).
Negative correlation:
• Gene A (transcription factor) inactivates gene B
• Gene A and gene B are involved in conjugate tasks
“Correlation is enough!” ??? Till I read the following from page 462 of Lodish et al. (1995)
The thyroid hormone receptor differs functionally from the glucocorticoid receptor in two important respects: it binds to its DNA response elements in the absence of hormone, and the bound protein represses transcription rather than activating it. When thyroid hormone binds to the thyroid hormone receptor, the receptor is converted from a repressor to an activator.
Gene A = gene that produces THR (thyroid hormone receptor)
Gene B = gene regulated by THR
THR alone represses B
THR + HM (thyroid hormone) activates B
Expression levels of A and B can be either positively correlated or negatively correlated, depending on the thyroid hormone level.
If, during an experiment, the hormone level fluctuated as organisms tried to accomplish different tasks, and if we cannot tell what the tasks are, then .....
Of course, the book is not talking about yeast there. However,
Pairwise similarity is not enough!
THR alone represses B
THR + HM activates B
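The point can be illustrated with a small simulation. This is a toy sketch, not the thyroid system itself: it assumes gene B tracks gene A with a plus sign when the hormone is present and a minus sign when it is absent, with Gaussian noise, and that the hormone is present in about half of the samples. All names and parameters are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is reproducible

a_all, b_all = [], []
a_low, b_low = [], []    # hormone absent: receptor represses B
a_high, b_high = [], []  # hormone present: receptor activates B

for _ in range(2000):
    hormone_present = random.random() < 0.5   # hidden state, assumed 50/50
    a = random.gauss(0.0, 1.0)                # expression of gene A
    noise = random.gauss(0.0, 0.3)
    b = (a if hormone_present else -a) + noise
    a_all.append(a)
    b_all.append(b)
    if hormone_present:
        a_high.append(a)
        b_high.append(b)
    else:
        a_low.append(a)
        b_low.append(b)

r_all = pearson(a_all, b_all)     # pooled samples: close to zero
r_low = pearson(a_low, b_low)     # hormone absent: strongly negative
r_high = pearson(a_high, b_high)  # hormone present: strongly positive
print(f"pooled {r_all:.2f}, low {r_low:.2f}, high {r_high:.2f}")
```

When the hidden state is not observed, only the pooled correlation is available, and it masks two strong but opposite associations.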
From CC to liquid association (LA)
• Two examples
• A challenge
• (I will give a talk on this subject at the Taipei Conference, July 7-July 11)
Example 1. Positive to negative
• X = ARP4, Y = LAS17, Z = MCM1
• Corr = 0 in each plot
• (A) For low Z (marked points in A), X and Y are co-expressed
• (B) For high Z (marked points in B), X and Y are contra-expressed
Arp4: Protein that interacts with core histones, member of the NuA4 histone acetyltransferase complex; actin-related protein
Las17: Component of the cortical actin cytoskeleton
Example 2. Negative to positive
• X = QCR9, Y = ROX1, Z = MCM1
• Corr = 0 in each plot
• (A) For low Z (marked points in A), X and Y are contra-expressed
• (B) For high Z (marked points in B), X and Y are co-expressed
Rox1: Heme-dependent transcriptional repressor of hypoxic genes including CYC7 (iso-2-cytochrome c) and ANB1 (translation initiation, ribosome)
Qcr9: Ubiquinol cytochrome c reductase subunit 9
A Challenge
• What genes behave like that?
• Can we identify all of them?
• N = 5878 ORFs
• N choose 3 = 33.8 billion triplets to inspect
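The triplet count can be checked directly. The short sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
import math

n_orfs = 5878  # number of yeast ORFs quoted on the slide

pairs = math.comb(n_orfs, 2)     # unordered gene pairs
triplets = math.comb(n_orfs, 3)  # unordered gene triplets

print(f"{pairs:,} pairs")        # 17,272,503 pairs
print(f"{triplets:,} triplets")  # 33,831,075,876 triplets
```

So there are about 17 million gene pairs and about 33.8 billion triplets, which is why an exhaustive visual inspection of scatterplots is out of the question.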
Statistical theory for LA
• X, Y, Z: random variables with mean 0 and variance 1
• Corr(X,Y) = E(XY) = E(E(XY|Z)) = E g(Z), where g(z) = E(XY | Z = z)
• g(z): an ideal summary of the association pattern between X and Y when Z = z
• g'(z) = derivative of g(z)
• Definition. The LA of X and Y with respect to Z is LA(X,Y|Z) = E g'(Z)
Statistical theory: LA
• Theorem. If Z is standard normal, then LA(X,Y|Z) = E(XYZ)
• Proof. By Stein’s Lemma: E g'(Z) = E(g(Z) Z) = E(E(XY|Z) Z) = E(XYZ)
• Additional mathematical properties:
• bounded by third moment
• = 0 if jointly normal
• transformation
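The theorem suggests a simple sample estimate: standardize X and Y, make Z approximately standard normal, and average the triple product. The sketch below is one way to do this under stated assumptions, not necessarily the procedure used for the results in these slides. The rank-based normal-score transform of Z, the function names, and the simulated data are all choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

def standardize(v):
    """Rescale a sample to mean 0 and variance 1."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / sd for a in v]

def normal_scores(v):
    """Replace each value by the standard normal quantile of rank/(n+1)."""
    n = len(v)
    order = sorted(range(n), key=lambda i: v[i])
    scores = [0.0] * n
    nd = NormalDist()
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores

def liquid_association(x, y, z):
    """Sample version of E(XYZ) with X, Y standardized and Z normal-scored."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)

# Simulated data (an assumption for illustration): the X-Y association
# changes from negative to positive as Z increases.
random.seed(1)
n = 5000
z = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [zi * xi + 0.5 * random.gauss(0.0, 1.0) for zi, xi in zip(z, x)]

corr_xy = sum(a * b for a, b in zip(standardize(x), standardize(y))) / n
la = liquid_association(x, y, z)
print(f"corr(X,Y) = {corr_xy:.2f}, LA(X,Y|Z) = {la:.2f}")
```

In this simulation the ordinary correlation of X and Y is near zero while the LA estimate is clearly positive, matching the negative-to-positive pattern of Example 2.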
Expression masters: Glycolysis/gluconeogenesis
• Glycolysis: converts glucose to pyruvate in 10 enzymatic reactions, generating 2 mol of ATP per mol of glucose
• Gluconeogenesis: biosynthesis of glucose from noncarbohydrate precursors
• A total of 35 yeast genes related to either pathway (MIPS)
Longevity: an episode
• A recent Science article reveals the role of a yeast gene, SCH9, in aging (Fabrizio et al. 2001)
• Part of an intensified effort to establish that “common processes control life-span of most organisms, including worms, flies, and possibly mammals” (Strass, 2001)
Use SCH9 as Z, find LAPs in LA-ID database
ARG1, ARG2: a LAP of SCH9
• ARG1 encodes argininosuccinate synthetase: L-citrulline + aspartate + ATP = AMP + pyrophosphate + N(omega)-(L-arginino)succinate
• ARG2 encodes acetylglutamate synthase: L-glutamate + N2-acetyl-L-ornithine = L-ornithine + N-acetyl-L-glutamate
• Only 40 LAPs are saved in LA-ID database
• 40 in 17 million
• Pathway of arginine biosynthesis/urea cycle
• Figure of ARG1, ARG2, SCH9 (labels: citrulline, glutamate)
SCH9-deleted strain has a longer life span than wild type (Fabrizio et al.)
• This suggests to us that longevity may be associated with the efficient use of the arginine biosynthesis pathway. Lack of coordination in expressing related enzymes is a likely reason for aging
ARG2 or ECM40 (ARG7)
• SGD, KEGG, NiceEnzyme ????
• ARG2 encodes acetylglutamate synthase
• ECM40 encodes ….
• From Voet and Voet
• From page 18 of
• Liver: urea cycle, deficiency
• Kidney
• Brain
Blue: low SCH9 • Red: high SCH9
Convergence of molecular basis of aging
• Yeast SCH9 is homologous to the serine/threonine kinase Akt/PKB of the insulin signaling pathway in C. elegans (dauer arrest)
• Caloric restriction; oxidative damage to neurons
• Carbon source (most studies on yeast aging); nitrogen source (this study)
• Glutamate is the principal excitatory transmitter in the CNS, implicated in long-term potentiation, a phenomenon underlying learning, memory, and habituation
• GABA, one of two primary inhibitory transmitters, is formed from glutamate by loss of a carboxyl group
• For the fruit fly Drosophila and the sea slug Aplysia, mutations in certain types of adenylate cyclase encoding genes can disrupt memory
• Lodish et al., Molecular Cell Biology, 1995
Modeling: a nightmare
[Diagram labels: Current, Next; mRNA; FITNESS; Observed: mRNA; hidden: protein kinase, ATP, GTP, cAMP, etc.; Cytoplasm, Nucleus, Mitochondria, Vacuolar localization; FUNCTION; DNA methylation, chromatin structure; Nutrients (carbon, nitrogen sources); Temperature; Water]