Guilt-by-Association: A non-metric tool for predicting gene relationships from expression data
Rebecca Shafee, Michael van Dam, Jim Brody*, Stephen Quake
California Institute of Technology
*Current address: University of California Irvine
Existing Algorithms
• Numerous methods exist for “reverse engineering” gene relationships from expression array data:
  • Clustering: Genes are grouped into clusters based on their “proximity” (e.g. Euclidean distance, linear correlation, non-linear correlation) in a multi-dimensional expression space. Result is a hierarchical structure containing all genes.
  • Dendrograms: Genes are compared against one another using a metric, and then added to a binary tree in order of decreasing correlation, such that pairs with the highest correlation are closest in the tree.
  • Relevance networks [1,2,3]: Use a probability function (e.g. based on mutual information or combinatorics) to estimate the probability that genes are independent based on their discretized gene expression vectors. Only genes with a “significant” relation are connected. Result is distinct gene networks with varying numbers of elements.
• We developed a relevance network algorithm which uses a combinatoric probability function to compare pairs of genes. It is a 3×3 extension of a 2×2 algorithm developed by a group at Incyte Pharmaceuticals [4,5,6].
Advantages [3,6]
Our algorithm can handle...
• Missing data: Methods based on Euclidean distance cannot handle genes with missing data, because it is not clear how their incomplete expression vectors should be oriented in the multi-dimensional space.
• Negative regulation: Dendrograms and methods based on Euclidean distance cannot cluster negatively correlated genes together, and thus ignore some important biological relationships.
• Multiple regulation: Many clustering methods do not allow genes to belong to multiple clusters, and thus sometimes cannot accurately describe genes which are under the control of two or more regulatory factors. Dendrograms have the same problem.
• Multifunction transcription factors: Methods based on a metric distance function cannot properly describe the relationship between two independent metabolic pathways which are regulated by a common transcription factor.
2×2 Co-Expression Algorithm [4,5,6]
• Determine presence or absence of genes in cDNA libraries
• Tabulate co-expression of each gene pair in a 2×2 table (gene A present/absent vs. gene B present/absent)
• Compute probability that table could arise by random chance
• Low p-value, log[P(A,B)], indicates expression profiles are related
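The formula on this slide was not preserved, so the following is only a minimal sketch of the steps listed above. It assumes the "random chance" probability is the hypergeometric probability of the 2×2 table with its margins held fixed, summed one-sidedly as in Fisher's exact test. The function names and the boolean-list input format are hypothetical and chosen for illustration.

```python
from math import comb

def cooccurrence_table(a, b):
    """Count libraries by presence/absence of two genes.

    a, b: equal-length sequences of booleans (True = gene detected in that library).
    Returns (n11, n10, n01, n00): both present, only A, only B, neither.
    """
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    return n11, n10, n01, n00

def table_probability(n11, n10, n01, n00):
    """Hypergeometric probability of this exact table given its margins (assumed model)."""
    n = n11 + n10 + n01 + n00
    row1 = n11 + n10          # libraries containing gene A
    col1 = n11 + n01          # libraries containing gene B
    return comb(col1, n11) * comb(n - col1, row1 - n11) / comb(n, row1)

def cooccurrence_p_value(n11, n10, n01, n00):
    """One-sided p-value: chance of at least n11 co-occurrences with margins fixed."""
    n = n11 + n10 + n01 + n00
    row1 = n11 + n10
    col1 = n11 + n01
    total = 0.0
    for k in range(n11, min(row1, col1) + 1):
        total += table_probability(k, row1 - k, col1 - k, n - row1 - col1 + k)
    return total

# Two genes detected in exactly the same 3 of 6 libraries.
gene_a = [True, True, True, False, False, False]
gene_b = [True, True, True, False, False, False]
print(cooccurrence_p_value(*cooccurrence_table(gene_a, gene_b)))
```

With only six libraries, even a perfect match gives a modest p-value, which suggests why many libraries are needed before a gene pair can be called significant.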
3×3 Co-Expression Algorithm - Part I
• Expression ratio is calculated for each gene for each experiment
• Expression ratios are discretized into three states using a “noise” threshold T; we used T = 0.5
• Example notation: X means “no data”
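The discretization rule itself did not survive on this slide. The sketch below assumes the expression ratios are log-transformed (so that 0 means no change) and maps each value to +1 (up), -1 (down) or 0 (within noise), with `None` standing in for the slide's "X" (no data). The function name, the symmetric thresholding and the encoding are assumptions made for illustration.

```python
def discretize(ratios, threshold=0.5):
    """Map log expression ratios to three states (assumed rule, not from the slide).

    ratios: sequence of floats, or None where no measurement exists.
    Returns a list with 1 (above +T), -1 (below -T), 0 (within noise), None (no data).
    """
    states = []
    for r in ratios:
        if r is None:
            states.append(None)      # "X" on the slide: no data
        elif r > threshold:
            states.append(1)         # induced beyond the noise threshold
        elif r < -threshold:
            states.append(-1)        # repressed beyond the noise threshold
        else:
            states.append(0)         # change is indistinguishable from noise
    return states

print(discretize([0.9, -0.7, 0.2, None, 0.5]))
```

Values exactly at the threshold are treated as noise here; the original convention for that boundary case is unknown.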
3×3 Co-Expression Algorithm - Part II
• Tabulate co-expression in a 3×3 table (ignore experiments without data for both genes)
• Compute p-value for each gene pair analogous to the 2×2 case
• Low p-value indicates expression profiles are related
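The tabulation and probability steps above can be sketched as follows. The formula is an assumption: it is the probability of the observed 3×3 table with row and column totals held fixed, the direct generalization of the 2×2 hypergeometric probability. Whether the original p-value also sums over more extreme tables is not stated on the slide. Function names and the -1/0/+1/`None` encoding are hypothetical.

```python
from math import factorial

STATES = (-1, 0, 1)  # down, unchanged, up

def coexpression_table(a, b):
    """Build a 3x3 count table from two discretized expression vectors.

    Experiments where either gene has no data (None) are skipped.
    """
    table = [[0] * 3 for _ in range(3)]
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        table[STATES.index(x)][STATES.index(y)] += 1
    return table

def table_probability(table):
    """Probability of this exact table given fixed row and column totals (assumed model).

    P = (prod of row totals! * prod of column totals!) / (N! * prod of cell counts!)
    """
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    numerator = 1
    for total in rows + cols:
        numerator *= factorial(total)
    denominator = factorial(sum(rows))
    for row in table:
        for count in row:
            denominator *= factorial(count)
    return numerator / denominator

positive = table_probability(coexpression_table([1, 1, -1, -1], [1, 1, -1, -1]))
negative = table_probability(coexpression_table([1, 1, -1, -1], [-1, -1, 1, 1]))
print(positive, negative)
```

Under this model a perfectly anti-correlated pair scores exactly as low as a perfectly correlated one, because the probability depends only on how concentrated the table is and not on which cells are filled. This matches the negative-regulation advantage claimed for the method.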
P-value Database
• Collected Homo sapiens expression data from public sources:
  (1) Alizadeh, AA et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503-511.
  (2) Detweiler, CS et al. (2001) Host microarray analysis reveals a role for the Salmonella response regulator phoP in human macrophage cell death. Proc Natl Acad Sci USA 98(10):5850-5855.
  (3) Iyer, V et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283(5398):83-87.
  (4) Perou, CM et al. (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 96(16):9212-9217.
  (5) Perou, CM et al. (2000) Molecular portraits of human breast tumors. Nature 406:747-752.
  (6) Ross, D et al. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24:227-235.
• Computed p-values, and stored “significant” (low p-value) gene pairs in database
• Web interface: http://perspiration.caltech.edu/~mvandam/gba/human.php
• FUTURE WORK:
  • Continue to add expression data as it becomes available
  • Support other organisms
  • Investigate whether data should be segregated by tissue type / cell line / drug treatment
  • Group cDNA clones according to UniGene clusters, and calculate p-values between clusters
Histogram of P-values
[Figure: comparison of relative frequency of p-values]
• Actual: p-values computed from expression data for gene pairs
• Random: p-values expected if all genes are unrelated and all experiments are unrelated
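How the "Random" curve was obtained is not described on the slide. One possible way to approximate such a baseline, shown here purely as an assumption, is to shuffle one gene's discretized vector across experiments and recompute the table probability many times. The scoring formula (probability of the 3×3 table with fixed margins), the function names and the -1/0/+1 encoding are all hypothetical.

```python
import random
from math import factorial, log10

STATES = (-1, 0, 1)  # down, unchanged, up

def table_log_probability(a, b):
    """log10 probability of the 3x3 co-expression table of a and b, margins fixed."""
    table = [[0] * 3 for _ in range(3)]
    for x, y in zip(a, b):
        table[STATES.index(x)][STATES.index(y)] += 1
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    numerator = 1
    for total in rows + cols:
        numerator *= factorial(total)
    denominator = factorial(sum(rows))
    for row in table:
        for count in row:
            denominator *= factorial(count)
    # Subtract logs of the exact integers to avoid floating-point underflow.
    return log10(numerator) - log10(denominator)

def shuffled_null(a, b, n_shuffles=1000, seed=0):
    """Scores obtained after randomly permuting b, breaking any real relationship."""
    rng = random.Random(seed)
    shuffled = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        scores.append(table_log_probability(a, shuffled))
    return scores

gene_a = [1, 1, 1, -1, -1, -1, 0, 0]
gene_b = [1, 1, 1, -1, -1, -1, 0, 0]
observed = table_log_probability(gene_a, gene_b)
null_scores = shuffled_null(gene_a, gene_b, n_shuffles=200, seed=1)
```

Histogramming `null_scores` next to the scores of real gene pairs would give a comparison in the spirit of the slide. A real pair is interesting when its score lies well below the bulk of the shuffled scores.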
Example I
• SSA1 Sjogren syndrome antigen A1
• STAT1 Signal transducer and activator of transcription 1
• EEF1A1 Eukaryotic translation elongation factor 1 alpha 1
• HLA-C Major histocompatibility complex, class I, C
• MST1 Macrophage stimulating 1 (hepatocyte growth factor-like)
• FCGR2B Fc fragment of IgG, low affinity IIb, receptor for (CD32)
• TYROBP TYRO protein tyrosine kinase binding protein
• CD36 CD36 antigen (collagen type I receptor, thrombospondin receptor)
Example II
• SS18 Synovial sarcoma translocation, chromosome 18
• NLP_1 Nucleoporin-like protein 1
• KIAA0036 KIAA0036 gene product
• RFXAP Regulatory factor X-associated protein
• WRN Werner syndrome
• WASPIP Wiskott-Aldrich syndrome protein interacting protein
• IDE Insulin-degrading enzyme
• ARP3 Actin-related protein 3, yeast homolog
• ARP2/3 Actin-related protein 2/3 complex, subunit 2 (34 kD)
• HHLP Human helicase-like protein (HLP) mRNA
Example II… continued
Observed relationships and known facts:
1. Soft tissue sarcomas occur more frequently in patients with Werner syndrome. [SS18 and WRN]
   Brennan MF, Casper ES, Harrison LB: Soft tissue sarcoma. In: DeVita VT Jr, Hellman S, Rosenberg SA, eds.: Cancer: Principles and Practice of Oncology. Philadelphia, Pa: Lippincott-Raven Publishers, 5th ed., 1997, pp 1738-1788.
2. “WASP proteins serve as a common platform, bringing together components of signal transduction pathways with cellular machinery that promotes actin polymerization and microfilament reorganization.” [WASPIP and ARP3, ARP2/3]
   http://sdb.bio.purdue.edu/fly/cytoskel/wasp1.htm
3. The Werner syndrome gene encodes a protein whose central domain is homologous to members of the RecQ family of DNA helicases. [WRN and HHLP]
   Gray, MD et al. “The Werner syndrome protein is a DNA helicase.” Nature Genetics 17(1).
References
[1] Butte, A.J. and Kohane, I.S. Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. In: Pacific Symposium on Biocomputing 2000: p418-429. Hawaii: World Scientific, 2000.
[2] Butte, A.J., Tamayo, P., Slonim, D. et al. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proceedings of the National Academy of Sciences USA 97(22):12182-12186, 2000.
[3] Butte, A. Advantages of relevance networks over other bioinformatics analysis tools in functional genomics. [http://www.xpogen.com/Xpogen_Relevance_Networks.pdf, accessed 2001/10/25]
[4] Walker, M.G., Volkmuth, W., Sprinzak, E., Hodgson, D. and Klingler, T.M. Prediction of gene function by genome-scale expression analysis: Prostate cancer-associated genes. Genome Research 9(12):1198-1203, 1999.
[5] Walker, M.G., Volkmuth, W., Klingler, T.M. Pharmaceutical target discovery using Guilt-by-Association: schizophrenia and Parkinson’s disease genes. Proceedings of the International Conference on Intelligent Systems for Molecular Biology 146: 282-286, 1999.
[6] Walker, M.G. Drug target discovery by gene expression analysis: Cell cycle genes. Current Cancer Drug Targets 1(1), 2001.