Detecting Phenotype-Specific Interactions Between Biological Processes Nadeem A. Ansari Department of Computer Science Wayne State University Detroit, MI 48202
Outline • Biological background • Motivation and problem description • Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results • Summary
Cells, proteins, and DNA • Cells: fundamental units of life that contain all the working machinery necessary for their functioning • Proteins: the main contributors of this working machinery • Deoxyribonucleic acid (DNA): contains the blueprint for making the working machinery • Gene expression: the process of making the working machinery
DNA • Linear molecule of two strands, each composed of subunits called nucleotides • Nucleotide types: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
DNA • Base pairing: A pairs with T, and C pairs with G • Strand 1: … A A C G G A T … • Strand 2: … T T G C C T A …
Transcription • Information stored in DNA letters is transcribed into ribonucleic acid (RNA) • RNA: a chain of nucleotides A, C, G, and U (uracil, in place of thymine) • DNA: … G T G C A T … • RNA: … C A C G U A …
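The DNA-to-RNA example on this slide can be reproduced with a short sketch. The function name `transcribe` and the lookup table are illustrative assumptions; the sketch only applies the base-by-base complement rule and ignores strand direction and everything else the cell's transcription machinery does.

```python
# Complement rule used when a DNA template strand is copied into RNA:
# each DNA base is paired with its RNA partner (uracil pairs with adenine).
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template: str) -> str:
    """Return the RNA bases complementary to a DNA template (hypothetical helper)."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


# The slide's example: DNA ... G T G C A T ... gives RNA ... C A C G U A ...
rna = transcribe("GTGCAT")
print(rna)
```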
Translation • Information stored in RNA is translated into chains of amino acids - proteins
Gene expression • The process of making the working machinery of a cell.
Regions of DNA that are transcribed into functional RNA, and in many cases translated into proteins, are known as genes • An observable characteristic (or trait) of an organism that results from gene expression is known as a phenotype.
Gene expression measurement – why? • All cells of an organism contain the same DNA but express genes selectively • Various stimuli cause changes in gene expression • A change in expression level results in under- or overproduction of the working machinery, which can lead to diseases / phenotypes • Measuring gene expression can help us understand the underlying biological phenomena
Gene expression measurements • Typically, researchers measure gene expression in two different tissues or cell samples • Example: cells treated with a drug vs. untreated cells • Genes expressed differently than in a control sample are called differentially expressed (DE) genes • High-throughput technologies such as DNA microarrays measure the expression levels of thousands of genes
Genes and annotations • Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology • Gene Ontology (GO): a controlled and structured vocabulary • Covers molecular functions, biological processes, and cellular components • Structured as directed acyclic graphs (DAGs) • Nodes represent terms • Edges represent relationships • Parent-child relations (a term may have more than one parent) • Relationship types: is-a, part-of, and regulates (negatively or positively)
Biological processes – GO subset • GO is a set of terms and their definitions organized in a structure that reflects their relationships • GO also provides a set of annotations, describing what is known about each gene (products)
Motivation and problem description • Various stimuli cause differential gene expression, which results in the over- and underproduction of proteins • Over- and underproduction of proteins can result in a disease and a disease-specific phenotype • Understanding gene behavior can help us understand diseases in new ways, e.g. by identifying drug targets for curing diseases
Motivation and problem description • Current approaches look for the biological functions that are under- or overrepresented in the phenotype-specific gene expression patterns • However, life is complex and biological functions also interact • These interactions change in a phenotype • Understanding the changed interactions between biological functions is important for understanding the underlying biological mechanism that resulted in the phenotype
Goals • Our goal is to detect the interactions between biological functions that have changed significantly in a given phenotype • We detect these interactions among the GO biological processes that are annotated with the genes differentially expressed in the phenotype
Challenges and limitations • There is no simple way to establish which biological functions are important • No universally accepted statistical model exists • Finding relationships between biological processes using mathematical models is challenging • No known statistical model detects changed interactions in a given phenotype • Using GO annotations presents its own challenges
Challenges and limitations • GO is incomplete and continuously updated • Information regarding gene annotations is missing • GO contains inconsistencies • New research may make previous annotations obsolete • The GO hierarchy introduces dependencies between terms • Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
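The dependency in the last bullet can be made concrete with a minimal sketch. The term names, the `parents` mapping, and both helper functions are hypothetical and only illustrate how a direct annotation implies annotations to every ancestor in a DAG; real GO data would be loaded from ontology and annotation files.

```python
# Hypothetical toy DAG: each term maps to the set of its parent terms.
# "lipid_signaling" has two parents, as GO terms may.
parents = {
    "root": set(),
    "metabolism": {"root"},
    "signaling": {"root"},
    "lipid_metabolism": {"metabolism"},
    "lipid_signaling": {"lipid_metabolism", "signaling"},
}


def ancestors(term, parents):
    """Collect every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct_terms, parents):
    """Extend a gene's direct annotations with all ancestors of each term."""
    full = set(direct_terms)
    for term in direct_terms:
        full |= ancestors(term, parents)
    return full


# A gene annotated only with the most specific term ends up annotated
# with every term above it, which makes the term columns dependent.
print(sorted(propagate({"lipid_signaling"}, parents)))
```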
Information retrieval (IR) • Problem: given a query, find the relevant documents in a collection • Vector space model (VSM) • Represent documents and keywords in a matrix • Documents are columns with keywords as components, so the columns are document vectors • Represent the query as a (column) vector • Find the document vectors closest to the query vector • Those documents are considered relevant to the query
Example – document retrieval • Example taken from Berry et al., SIAM Review 41(2), 1999
Example – document retrieval Term by document matrix
Example (IR VSM) • Document vector: • User searching for documents related to “baking bread” • Query vector:
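A minimal sketch of the vector space model follows. The term list, the three-document matrix, and the query are made-up toy data, not the Berry et al. example, and cosine similarity is assumed as the closeness measure between the query vector and each document vector.

```python
import numpy as np

# Hypothetical toy term-by-document matrix: rows are terms, columns are documents.
terms = ["baking", "bread", "cake", "engine"]
term_by_doc = np.array([
    [1.0, 0.0, 1.0],  # baking
    [1.0, 0.0, 0.0],  # bread
    [0.0, 0.0, 1.0],  # cake
    [0.0, 1.0, 0.0],  # engine
])

# Query "baking bread" expressed in the same term space.
query = np.array([1.0, 1.0, 0.0, 0.0])


def cosine_scores(matrix, query):
    """Cosine of the angle between the query and every column of `matrix`."""
    column_norms = np.linalg.norm(matrix, axis=0)
    return (matrix.T @ query) / (column_norms * np.linalg.norm(query))


scores = cosine_scores(term_by_doc, query)
ranking = np.argsort(-scores)  # document indices, most similar first
print(scores, ranking)
```

Documents whose score exceeds a chosen threshold would be returned as relevant; the threshold itself is a design choice and is not specified here.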
Correlation • Measures whether two random variables vary together • Linear correlation between X and Y (Pearson correlation coefficient): $r_{XY} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$ • Positive correlation: Y tends to increase as X increases • Negative correlation: Y tends to decrease as X increases • No linear correlation: no linear relationship
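The three cases listed on this slide can be checked numerically with a small sketch. The helper `pearson` is an illustrative implementation of the standard Pearson coefficient, and the input vectors are made-up values chosen to give one case each.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


positive = pearson([1, 2, 3], [2, 4, 6])         # Y rises with X
negative = pearson([1, 2, 3], [3, 2, 1])         # Y falls as X rises
no_linear = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # related, but not linearly
print(positive, negative, no_linear)
```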
Detecting interactions that have changed significantly in the phenotype • Represent the genes differentially expressed in a phenotype and their biological functions as a matrix: a vector space model with biological processes as column vectors • Find associations between pairs of biological processes • Compare these associations with the corresponding associations in the absence of the phenotype • Detect associations that are significantly different in the phenotype
Data inputs - genes and functions • Reference genes and functions set (R) • M genes on a microarray • N GO terms annotated with M genes • In a biological condition under study (E) • m < M differentially expressed (DE) genes • n <= N GO terms annotated with m DE genes
Gene function matrix – reference data Example gene-function matrix
Gene function matrix – experiment data Example gene-function matrix
Gene function matrix – reference and experiment data • The experiment gene-function matrix is a submatrix of the reference gene-function matrix
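The relationship between the two matrices can be sketched as follows. The gene names, term names, and the helper `gene_function_matrix` are hypothetical; the sketch assumes a binary encoding with genes as rows and functions as columns, where an entry is 1 if the gene is annotated with the function.

```python
import numpy as np

# Hypothetical toy annotations: gene -> set of GO biological-process terms.
annotations = {
    "g1": {"f1", "f2"},
    "g2": {"f2", "f3"},
    "g3": {"f1"},
    "g4": {"f3", "f4"},
}
de_genes = ["g1", "g2"]  # hypothetical differentially expressed genes (m < M)


def gene_function_matrix(genes, annotations):
    """Binary matrix with one row per gene and one column per annotated term."""
    functions = sorted(set().union(*(annotations[g] for g in genes)))
    matrix = np.array(
        [[1.0 if f in annotations[g] else 0.0 for f in functions] for g in genes]
    )
    return matrix, functions


all_genes = sorted(annotations)
ref_matrix, ref_functions = gene_function_matrix(all_genes, annotations)  # M x N
exp_matrix, exp_functions = gene_function_matrix(de_genes, annotations)   # m x n

# The experiment matrix is the reference matrix restricted to the DE genes
# and to the terms annotated with them.
rows = [all_genes.index(g) for g in de_genes]
cols = [ref_functions.index(f) for f in exp_functions]
print(np.array_equal(ref_matrix[np.ix_(rows, cols)], exp_matrix))
```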
Our approach to addressing the challenges • Use singular value decomposition (SVD) • SVD can recover missing relationships between genes and annotations in the latent semantic space and also remove noise from the data • Noise: e.g. multiple words describing the same concept • SVD factorizes a matrix into three matrices consisting of the singular vectors and singular values of the original matrix
Singular value decomposition (SVD) • SVD of a gene-function matrix GF: $GF = G\,S\,F^{T}$ • Columns of G are the left singular vectors of GF, and columns of F are the right singular vectors • S is a diagonal matrix of singular values $s_i$ • The values on the main diagonal are in non-increasing order and represent variability in the data
Matrix approximation – dimensionality reduction • An approximated matrix can be computed by keeping only the first k largest singular values • We select k that retains the desired data variance (say x%) using the equation:
Approximated matrix – column view • We approximate both the reference and the experiment matrices • The approximated experiment gene-function matrix is not a submatrix of the approximated reference gene-function matrix
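The decomposition and rank-k approximation described above can be sketched with NumPy. The function names and the toy matrix are illustrative, and the selection rule is an assumption: k is taken as the smallest rank whose squared singular values account for at least the requested fraction of the total, which is one common way to define retained variance and may differ from the equation on the original slide.

```python
import numpy as np


def choose_rank(singular_values, retained=0.9):
    """Smallest k whose squared singular values reach the `retained` fraction.

    Assumption: variance is measured by squared singular values.
    """
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    k = int(np.searchsorted(energy, retained)) + 1
    return min(k, len(singular_values))


def low_rank_approximation(matrix, retained=0.9):
    """Rank-k approximation of a nonzero matrix from its truncated SVD."""
    G, s, Ft = np.linalg.svd(matrix, full_matrices=False)
    k = choose_rank(s, retained)
    # Scale the first k left singular vectors by the singular values,
    # then combine them with the first k right singular vectors.
    return (G[:, :k] * s[:k]) @ Ft[:k, :], k


# Toy matrix with singular values 3, 1, and 0.1.
toy = np.diag([3.0, 1.0, 0.1])
approx, k = low_rank_approximation(toy, retained=0.9)
print(k)
print(approx)
```

Because the reference and experiment matrices are decomposed separately, their approximations use different singular vectors, which is consistent with the slide's remark that one approximated matrix is no longer a submatrix of the other.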
Correlation Between Functions • Indicates the strength and direction of a linear relationship between two biological processes • The Pearson correlation coefficient $r_{f_i,f_j}$ between a pair of functions $f_i$ and $f_j$ is computed as: $r_{f_i,f_j} = \frac{\mathrm{cov}(f_i, f_j)}{\sigma_{f_i}\,\sigma_{f_j}}$ • Matrices of correlation coefficients, $RR_{N \times N}$ and $RE_{n \times n}$, are computed for the reference and experiment data, respectively
Pair-wise Correlation Coefficients for Reference and Experiment data • $RR_{n \times n}$ contains the pair-wise correlation coefficients between the first n functions in the absence of the phenotype
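A pair-wise correlation matrix between function columns can be obtained in one call, as in the sketch below. The toy matrix is made up, and the use of `np.corrcoef` with `rowvar=False` (columns as variables) is an implementation choice for illustration rather than the authors' stated code.

```python
import numpy as np

# Hypothetical approximated gene-function matrix: 4 genes (rows), 3 functions (columns).
toy_matrix = np.array([
    [1.0, 2.0, -1.0],
    [2.0, 4.0, -2.0],
    [3.0, 6.0, -3.0],
    [4.0, 8.0, -4.0],
])

# rowvar=False treats each column (function) as a variable and each row (gene)
# as an observation, giving a functions-by-functions correlation matrix.
corr = np.corrcoef(toy_matrix, rowvar=False)
print(corr)
```

A column with no variation across genes has zero standard deviation, so its correlations come out as NaN and would need to be handled before any further step.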
Fisher Z Transform – Correlation Coefficient To Z-values • Correlation coefficients from samples of a large population can be mapped to z-values using the Fisher z-transform; the transformed values are approximately normally distributed • For a correlation coefficient r, the Fisher z-transform $Z_r$ is computed as: $Z_r = \frac{1}{2}\ln\frac{1 + r}{1 - r}$ • Compute $ZR_r$ from $RR_{N \times N}$ and $ZE_r$ from $RE_{n \times n}$
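The transform is the inverse hyperbolic tangent, so it can be sketched with `np.arctanh`. The function name `fisher_z` and the clipping tolerance `eps` are assumptions added here so that coefficients of exactly 1 or -1, such as the diagonal of a correlation matrix, do not map to infinity.

```python
import numpy as np


def fisher_z(r, eps=1e-12):
    """Fisher z-transform, 0.5 * ln((1 + r) / (1 - r)), applied element-wise.

    Values are clipped just inside (-1, 1) to keep the result finite.
    """
    r = np.clip(np.asarray(r, dtype=float), -1.0 + eps, 1.0 - eps)
    return np.arctanh(r)


z_values = fisher_z(np.array([0.0, 0.5, -0.5]))
print(z_values)
```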
Detecting Changes Between Functional Interactions • Hypothesis: the correlation between two biological processes in the given phenotype differs from their correlation in the reference data • Hypothesis • Test statistic
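The test statistic itself is not preserved in this transcript. As a hedged illustration only, the sketch below uses the standard two-sample z-test for the difference between two Fisher-transformed correlations, with standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)). The function name, the use of gene counts as sample sizes, and the treatment of the two samples as independent are all assumptions, not the authors' confirmed method.

```python
import math


def correlation_change_test(r_ref, r_exp, n_ref, n_exp):
    """Two-sided z-test for a difference between two correlations.

    Assumptions: |r| < 1, sample sizes above 3, and independent samples.
    Returns the z statistic and its two-sided p-value.
    """
    z_ref = math.atanh(r_ref)  # Fisher z of the reference correlation
    z_exp = math.atanh(r_exp)  # Fisher z of the experiment correlation
    standard_error = math.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_exp - 3))
    z = (z_exp - z_ref) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical numbers: correlation 0.0 in the reference, 0.5 in the phenotype.
z_stat, p_value = correlation_change_test(0.0, 0.5, n_ref=103, n_exp=103)
print(z_stat, p_value)
```

Since every pair of functions is tested, some form of multiple-testing correction would normally be applied to the resulting p-values; which one is a separate choice not covered here.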
Improvements • The dependencies between GO terms can be partly reduced by using weights in the gene-function matrix.
Scheme 1-1 • This is a binary scheme and was discussed while describing our main method
Scheme 1-e • $e_i$ is the normalized log-transformed fold-change measured for gene $g_i$ in the given condition
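One possible reading of this scheme is that each gene's row in the experiment matrix carries the weight $e_i$ instead of 1. The sketch below follows that reading; the function name, the base-2 logarithm, the use of absolute values, and the normalization by the largest value are all assumptions, since the slide does not specify them.

```python
import numpy as np


def weighted_experiment_matrix(binary_matrix, fold_changes):
    """Replace each 1 in row i with e_i, a normalized log fold-change.

    Assumptions: fold changes are positive ratios, at least one differs from 1,
    and e_i = |log2(fold change)| divided by the largest such value.
    """
    log_fc = np.abs(np.log2(np.asarray(fold_changes, dtype=float)))
    e = log_fc / log_fc.max()
    return np.asarray(binary_matrix, dtype=float) * e[:, np.newaxis]


# Hypothetical data: two DE genes, two functions, fold changes of 4x and 2x.
binary = np.array([[1.0, 0.0], [1.0, 1.0]])
weighted = weighted_experiment_matrix(binary, fold_changes=[4.0, 2.0])
print(weighted)
```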