From Correlation to Liquid Association. Why is correlation so widely used? Positive correlation and negative correlation. What is the limitation of similarity analysis? The concept of liquid association (LA). Two examples. A challenge.
From Correlation to Liquid Association • Why is correlation so widely used? • Positive correlation and negative correlation • What is the limitation of similarity analysis? • The concept of liquid association
Liquid Association (LA) • Two examples • A challenge
Liquid Association (LA) • LA is a generalized notion of association for describing a certain kind of ternary relationship between variables in a system (Li 2002, PNAS). • Green points represent four conditions for cellular state 1. • Red points represent four conditions for cellular state 2. • Blue points represent the transitional state between cellular states 1 and 2. • (X,Y) forms an LA. The profiles of genes X and Y are displayed in the scatter plot. • Important: the overall correlation between X and Y is 0.
Correlation coefficient • Used by Gauss, Bravais, Edgeworth, … • Its sweeping impact on data analysis is due to Galton (1822-1911), “Typical laws of heredity in man” • Karl Pearson modified it and popularized its use. • A building block of multivariate analysis, in which clustering, classification, and dimension reduction are recurrent themes
A note on correlation • Given two variables, X and Y, the concept of correlation between them has several variations. • Correlation is intended to measure the degree of coordinated change. • Galton's original definition used the median as the center.
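To make "degree of coordinated change" concrete, here is a minimal sketch of the sample Pearson correlation in plain Python. It uses the mean as the center (Pearson's version, not Galton's median-based one), and the function name `pearson_corr` is only an illustrative choice, not something from the slides.

```python
import math

def pearson_corr(x, y):
    """Sample Pearson correlation between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations from the mean (the "center" in Pearson's definition).
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)

# Profiles that rise together give a value near +1,
# profiles that move in opposite directions give a value near -1.
print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_corr([1, 2, 3, 4], [8, 6, 4, 2]))
```

A value near 0 only rules out a linear trend over all conditions taken together, which is the limitation the later slides build on.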
Regression, correlation • Sir Francis Galton (1822-1911), half-cousin of Charles Darwin, was an English Victorian polymath, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician. He was knighted in 1909. • Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas. • Bivariate normal
Provide an introduction to regression and inverse regression. • Galton’s data. • Positive: height vs. speed • Negative: weight vs. speed
Low-level analysis • Convert an image into a number representing the ratio of the levels of expression between red and green channels • Color bias • Spatial, tip, spot effects • Background noise • cDNA, oligonucleotide arrays (Affymetrix) (see review article ) • Raw data processing packages: RMA (Robust Multi-array Average), dChip, MAS5 • Chip-to-chip normalization (set the “average expression level” in each chip to a common value in a certain sense) • Recommendation: try more than one method.
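The chip-to-chip normalization idea (bringing the "average expression level" of every chip to a common value) can be sketched with a simple global scaling. This is only an illustration under the assumption that each chip is a list of positive expression levels. It is not the procedure implemented by RMA, dChip, or MAS5, and the name `scale_normalize` is hypothetical.

```python
def scale_normalize(chips, target=None):
    """Rescale every chip so that its mean expression equals a common target.

    chips  : list of chips, each a list of positive expression levels
    target : common mean to scale to (defaults to the average of the chip means)
    """
    means = [sum(chip) / len(chip) for chip in chips]
    if any(m <= 0 for m in means):
        raise ValueError("each chip needs a positive mean expression level")
    if target is None:
        target = sum(means) / len(means)
    # Multiplying by target / mean keeps within-chip ratios unchanged.
    return [[v * target / m for v in chip] for chip, m in zip(chips, means)]

raw = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # toy data: second chip is 10x brighter
normalized = scale_normalize(raw)
print(normalized)
```

Because different choices of "average" (mean, median, trimmed mean) can give different results, running more than one method, as the slide recommends, is a reasonable check.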
Gene-expression data
          cond 1   cond 2   …   cond p
gene 1    x11      x12      …   x1p
gene 2    x21      x22      …   x2p
…         …        …
gene n
Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)
Many examples are based on exploratory multivariate data analysis
Review • Linkage: Eisen et al., Alon et al. • K-means: Tavazoie et al. • Self-organizing map: Tamayo et al. • SVD: Holter et al.; Alter, Brown, Botstein • Support vector machine (classification)
Finding gene clusters • K-means versus self-organizing map • How to fine-tune user-specified parameters: some theoretical guidance is needed • What is a cluster? Criteria are needed • Normal mixture, (hidden) indicator • PLAID model (Lazzeroni and Owen, Statistica Sinica 2002) • PCA plot, projection pursuit, grand tour • MDS (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures) • GAP (generalized association plot; Chun-Houh Chen, cchen@stat.sinica.edu.tw)
Seriation and row-column sorting • Hierarchical clustering • Others • Generalized association plot (Chen 2001) • Sharp boundaries may be artifacts due to “clever” permutation
Gene profiles and correlation • Rationale behind massive gene expression analysis: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors. • Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity. • 1. Cluster analysis: average linkage, self-organizing map, K-means, ... • 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, … • 3. Dimension reduction methods: PCA (SVD)
Correlation coefficient (CC) • Used by Gauss, Bravais, Edgeworth, … • Its sweeping impact on data analysis is due to Galton (1822-1911), “Typical laws of heredity in man” • Karl Pearson modified it and popularized its use. • A building block of multivariate analysis, in which clustering, classification, and dimension reduction are recurrent themes • As a statistician, how can you ignore the time order? (Isn’t it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)
Gene profiles and correlation • Not studying cell-cycle regulation: time order is not considered here; but ... • Synchronized cells enhance cellular signals. • Time points: start, 10 min, 20 min, 30 min, 40 min, 50 min, 60 min • Tasks at those time points: task 1, task 2, task 2, task 2, task 2, task 3, task 4 • Suppose gene A and gene B participate in tasks 2 and 3, but not in tasks 1 or 4. A positive correlation is expected. • [Figure: scatter plot of gene A and gene B]
Example: scatterplot matrix of MCM1, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7 • The tight association among the six genes MCM2, ..., MCM7 is in sharp contrast to the association between each of them and MCM1. • The gene products of MCM2, ..., MCM7 form a hexameric complex that binds chromatin. It is part of the pre-replicative complex, an assembly of proteins that forms at origins of DNA replication between late M phase and the G1/S transition and includes other proteins believed to act in DNA replication initiation. • MCM1 is a transcription factor of the MADS (Mcm1p, Agamous, Deficiens, SRF) box family, which recruits coregulatory proteins for both gene activation and repression at a variety of loci. • Note that MCM2-MCM7 fall into the "MCM" cluster of 34 cell-cycle-regulated genes identified by Spellman et al. (1998).
Negative correlation • Gene A (transcription factor) inactivates gene B • Gene A and gene B are involved in conjugate tasks • “Correlation is enough!” ??? • Till I read the following from page 462 of Lodish et al. (1995)
The thyroid hormone receptor differs functionally from the glucocorticoid receptor in two important respects: it binds to its DNA response elements in the absence of hormone, and the bound protein represses transcription rather than activating it. When thyroid hormone binds to the thyroid hormone receptor, the receptor is converted from a repressor to an activator. • Gene A = gene that produces THR (thyroid hormone receptor) • Gene B = gene regulated by THR • THR alone represses B • THR + HM (hormone) activates B
Expression levels of A and B can be either positively correlated or negatively correlated, depending on the thyroid hormone level. If, during an experiment, the hormone level fluctuated as organisms tried to accomplish different tasks, and if we cannot tell what the tasks are, then ..... Of course, the book is not talking about yeast there. However, pairwise similarity is not enough! • THR alone represses B • THR + HM activates B
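The argument on this slide can be checked with a toy simulation. The sketch below is purely illustrative and assumes an idealized setting: centered expression levels, a hormone that is present in about half of the conditions, and a relationship between A and B that simply flips sign with the hormone. All names and numbers (`simulate_thr`, the noise level 0.3, the seed) are made up for the example.

```python
import math
import random

def pearson_corr(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_thr(n=2000, seed=0):
    """Toy model: B follows A when hormone is present, opposes A when absent."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        hormone_present = rng.random() < 0.5      # unobserved cellular state
        a = rng.gauss(0.0, 1.0)                   # centered level of gene A (THR)
        sign = 1.0 if hormone_present else -1.0   # activator vs. repressor
        b = sign * a + rng.gauss(0.0, 0.3)        # centered level of gene B
        samples.append((a, b, hormone_present))
    return samples

def corr_of(pairs):
    a_vals, b_vals = zip(*pairs)
    return pearson_corr(a_vals, b_vals)

samples = simulate_thr()
corr_all = corr_of([(a, b) for a, b, _ in samples])
corr_with_hormone = corr_of([(a, b) for a, b, h in samples if h])
corr_without_hormone = corr_of([(a, b) for a, b, h in samples if not h])
print(corr_all, corr_with_hormone, corr_without_hormone)
```

In this toy setting the pooled correlation is close to zero while the two hormone states show strong correlations of opposite sign, which is the situation that pairwise similarity analysis cannot detect.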
Transcription factors: proteins that bind to DNA • Activators; repressors
Why does clustering make sense biologically? • The rationale: genes with a high degree of expression similarity are likely to be functionally related. • They may form a structural complex. • They may participate in common pathways. • They may be co-regulated by common upstream regulatory elements. • Simply put, profile similarity implies functional association.
However, the converse is not true • The expression profiles of the majority of functionally associated genes are in fact uncorrelated • Microarray data are too noisy • Biology is complex
Why no correlation? • A protein rarely works alone • A protein has multiple functions • Different biological processes or pathways have to be synchronized • Competing use of finite resources: metabolites, hormones, … • Protein modification: phosphorylation, proteolysis, shuttling, … • Transcription factors serving both as activators and repressors
Going subtle: protein modification • Histone inhibits transcription • To activate transcription, the lysine side chain must be acetylated. • Weaver (2001)
Corepressor: histone deacetylase • Thyroid hormone • Coactivator: histone acetyltransferase
Math. modeling: a nightmare • [Diagram labels: current, next; mRNA; fitness; observed; hidden; protein, kinase; ATP, GTP, cAMP, etc.; cytoplasm, nucleus, mitochondria, vacuolar localization; function; DNA methylation, chromatin structure; nutrients (carbon, nitrogen sources); temperature; water] • Statistical methods become useful
What is LA? Concept of “mediator”
From CC to liquid association (LA) • Two examples • A challenge
Example 1. Positive to negative • X=ARP4, Y=LAS17, Z=MCM1 • Corr=0 in each plot • (A) For low Z (marked points in A), X and Y are co-expressed • (B) For high Z (marked points in B), X and Y are contra-expressed • Arp4: protein that interacts with core histones, member of the NuA4 histone acetyltransferase complex; actin-related protein • Las17: component of the cortical actin cytoskeleton
Example 2. Negative to positive • X=QCR9, Y=ROX1, Z=MCM1 • Corr=0 in each plot • (A) For low Z (marked points in A), X and Y are contra-expressed • (B) For high Z (marked points in B), X and Y are co-expressed • Rox1: heme-dependent transcriptional repressor of hypoxic genes including CYC7 (iso-2-cytochrome c) and ANB1 (translation initiation, ribosome) • Qcr9: ubiquinol cytochrome c reductase subunit 9
A Challenge • What genes behave like that ? • Can we identify all of them ? • N=5878 ORFs • N choose 3 = 33.8 billion triplets to inspect
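The size of the search space quoted on this slide can be reproduced directly. The snippet below is a small check using Python's standard `math.comb`; the variable names are illustrative only.

```python
import math

n_orfs = 5878                        # number of ORFs quoted on the slide

n_pairs = math.comb(n_orfs, 2)       # unordered gene pairs (X, Y)
n_triplets = math.comb(n_orfs, 3)    # unordered gene triplets {X, Y, Z}

print(f"{n_pairs:,} pairs")
print(f"{n_triplets:,} triplets, about {n_triplets / 1e9:.1f} billion")
```

The number of pairs, roughly 17 million, is the search space for a fixed Z, while the triplet count of about 33.8 billion shows why inspecting every scatter plot by eye is out of the question and a fast numerical score is needed.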
Statistical theory for LA • X, Y, Z: random variables with mean 0 and variance 1 • Corr(X,Y) = E(XY) = E(E(XY|Z)) = E g(Z), where g(z) = E(XY|Z=z) • g(z) is an ideal summary of the association pattern between X and Y when Z = z • g’(z) = derivative of g(z) • Definition. The LA of X and Y with respect to Z is LA(X,Y|Z) = E g’(Z)
Statistical theory for LA • Theorem. If Z is standard normal, then LA(X,Y|Z) = E(XYZ) • Proof. By Stein’s lemma, E g’(Z) = E(g(Z)Z) = E(E(XY|Z)Z) = E(XYZ) • Additional mathematical properties: • bounded by third moment • = 0 if jointly normal • transformation
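For readers who want the step behind Stein's lemma spelled out, the derivation can be sketched as follows. It takes g(z) = E(XY | Z = z), as implied by the identity E(XY) = E(E(XY|Z)) = E g(Z), and assumes additionally that g is differentiable and that g(z)φ(z) vanishes as z goes to ±∞, where φ is the standard normal density. These regularity conditions are assumptions of this sketch and are not stated on the slide.

```latex
\begin{align*}
\mathrm{E}\,g'(Z)
  &= \int_{-\infty}^{\infty} g'(z)\,\varphi(z)\,dz,
     \qquad \varphi(z) = \tfrac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2} \\
  &= \bigl[g(z)\,\varphi(z)\bigr]_{-\infty}^{\infty}
     - \int_{-\infty}^{\infty} g(z)\,\varphi'(z)\,dz
     && \text{(integration by parts)} \\
  &= \int_{-\infty}^{\infty} g(z)\,z\,\varphi(z)\,dz
     && \text{(boundary term vanishes, } \varphi'(z) = -z\,\varphi(z)\text{)} \\
  &= \mathrm{E}\bigl[g(Z)\,Z\bigr]
   = \mathrm{E}\bigl[\mathrm{E}(XY \mid Z)\,Z\bigr]
   = \mathrm{E}(XYZ).
\end{align*}
```

The last equality uses the fact that Z is fixed once we condition on it, so E(XY|Z)Z = E(XYZ|Z), and taking the expectation over Z gives E(XYZ).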
Normality? • Convert each gene expression profile by taking the normal score transformation • LA(X,Y|Z) is then estimated by the average of the triplet products of the three gene profiles: (x1y1z1 + x2y2z2 + …) / n
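The two steps on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the author's implementation: it assumes the normal score of a value is the standard normal quantile of rank / (n + 1), that ties are broken by position, and that profiles are plain lists. The function names `normal_scores` and `la_score` and the toy data are hypothetical.

```python
import random
from statistics import NormalDist

def normal_scores(profile):
    """Replace each value by the standard normal quantile of rank / (n + 1)."""
    n = len(profile)
    order = sorted(range(n), key=lambda i: profile[i])  # ties keep input order
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores

def la_score(x, y, z):
    """Average triplet product of the normal-score transformed profiles."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("the three profiles must have the same length")
    xs, ys, zs = normal_scores(x), normal_scores(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)

# Toy data: X and Y move together when Z is high and oppositely when Z is low.
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(500)]
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [(xi if zi > 0 else -xi) + rng.gauss(0.0, 0.3) for xi, zi in zip(x, z)]

print(la_score(x, y, z))                   # clearly positive
print(la_score(x, [-v for v in y], z))     # clearly negative
```

A positive score indicates a change from contra-expression to co-expression as Z increases, and a negative score the reverse. Since each score is just one pass over n conditions, it is cheap enough to evaluate over very large numbers of gene triplets.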
Figure 3. Organization chart for incorporating LA with similarity based methods. Co-expressed genes found by profile similarity analysis can be pooled together to obtain a consensus profile for LA-scouting. Likewise, the genes identified through LA system can be further analyzed for patterns of clustering. For some applications, the scouting variable may come from external sources related to the expression profiles. SVD: singular value decomposition; PCA: principal component analysis.
How does LA work in yeast? Urea cycle/arginine biosynthesis
Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.) • Most visible event
Expression masters: glycolysis/gluconeogenesis • Glycolysis: converts glucose to pyruvate in 10 enzymatic reactions, generating 2 mol of ATP per mol of glucose • Gluconeogenesis: biosynthesis of glucose from noncarbohydrate precursors • A total of 35 yeast genes are related to either pathway (MIPS)
Longevity: an episode • A recent Science article reveals the role of a yeast gene, SCH9, in aging (Fabrizio et al. 2001) • Part of an intensified effort to establish that “common processes control life-span of most organisms, including worms, flies, and possibly mammals” (Strass, 2001) • Use SCH9 as Z, and find LAPs (liquid association pairs) in the LA-ID database
ARG1, ARG2: a LAP of SCH9 • ARG1 encodes argininosuccinate synthetase: L-citrulline + aspartate + ATP = AMP + pyrophosphate + N(omega)-(L-arginino)succinate • ARG2 encodes acetylglutamate synthase: L-glutamate + N2-acetyl-L-ornithine = L-ornithine + N-acetyl-L-glutamate • Only 40 LAPs are saved in the LA-ID database • 40 in 17 million • Pathway of arginine biosynthesis/urea cycle • Figure of ARG1, ARG2, SCH9 (pathway labels: citrulline, glutamate)