
Detecting Phenotype-Specific Interactions Between Biological Processes




  1. Detecting Phenotype-Specific Interactions Between Biological Processes Nadeem A. Ansari Department of Computer Science Wayne State University Detroit, MI 48202

  2. Outline • Biological background • Motivation and problem description • Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results • Summary

  3. Outline • Biological background • Motivation and problem description • Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  4. Cells, proteins, and DNA • Cells: fundamental units of life that contain all the working machinery necessary for their functioning • Proteins: the main contributors of this working machinery • Deoxyribonucleic acid (DNA): contains the blueprint for making the working machinery • Gene expression: the process of making the working machinery

  5. DNA • Linear molecule of two strands; each composed of subunits called Nucleotides • Nucleotide types: Adenine – A Cytosine – C Guanine – G Thymine – T

  6. DNA • Base pairing: A pairs with T and C pairs with G, so the two strands are complementary • Example: … A A C G G A T … pairs with … T T G C C T A …

  7. Transcription • Information stored in DNA letters is transcribed into ribonucleic acid (RNA) • RNA: a chain of nucleotides - A, C, G, U (uracil) • Example: DNA … G T G C A T … is transcribed into RNA … C A C G U A …
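
The base-pairing and transcription examples on slides 6 and 7 can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up here, and strand direction (5'/3') is ignored, as it is on the slides.

```python
# Position-by-position pairing rules, as used in the slide examples.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}  # RNA uses U instead of T

def dna_complement(strand):
    """Return the complementary DNA strand (hypothetical helper)."""
    return "".join(DNA_PAIR[base] for base in strand)

def transcribe(template):
    """Return the RNA transcribed from a DNA template strand (hypothetical helper)."""
    return "".join(RNA_PAIR[base] for base in template)

paired = dna_complement("AACGGAT")  # sequence from slide 6
rna = transcribe("GTGCAT")          # sequence from slide 7
```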

  8. Translation • Information stored in RNA is translated into chains of amino acids - proteins

  9. Gene expression • The process of making the working machinery of a cell.

  10. Regions of DNA that are expressed as functional RNA and proteins are known as genes • An observable characteristic (or trait) of an organism caused by gene expression is known as a phenotype.

  11. Gene expression measurement – why? • All cells contain the same DNA but express genes selectively • Various stimuli cause changes in gene expression • A change in expression level results in under- or overproduction of working machinery, which can lead to diseases / phenotypes • Measuring gene expression can help us understand the underlying biological phenomena

  12. Gene expression measurements • Typically, researchers measure gene expression in two different tissues or cell samples • Example: cells treated with a drug vs. untreated cells • Genes expressed differently than in a control sample are called differentially expressed (DE) genes • High-throughput technologies such as DNA microarrays measure the expression levels of thousands of genes

  13. Genes and annotations • Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology • Gene Ontology (GO): a controlled and structured vocabulary • Covers molecular functions, biological processes, and cellular components • Structured as directed acyclic graphs (DAGs) • Nodes represent terms • Edges represent relationships • Parent-child relations (a term can have more than one parent) • Relationship types: is-a, part-of, and regulates (negatively or positively)

  14. Biological processes – GO subset • GO is a set of terms and their definitions organized in a structure that reflects their relationships • GO also provides a set of annotations describing what is known about each gene product

  15. Outline • Biological background • Motivation and problem description • Goals, Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  16. Motivation and problem description • Various stimuli cause differential gene expression, which results in the over- and underproduction of proteins • Over- and underproduction of proteins can result in a disease and a disease-specific phenotype • Understanding gene behavior can help us understand diseases in new ways, e.g. by identifying drug targets for curing diseases

  17. Motivation and problem description • Current approaches look for the biological functions that are under or over represented in the phenotype-specific gene expression patterns • However, life is complex and biological functions also interact • These interactions change in a phenotype • Understanding changed interactions between biological functions is important in understanding the underlying biological mechanism that resulted in the phenotype

  18. Outline • Biological background • Motivation and problem description • Goals, Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  19. Goals • Our goal is to detect the interactions between biological functions that have changed significantly in a given phenotype • We detect these interactions among the GO biological processes that annotate the genes differentially expressed in the phenotype

  20. Challenges and limitations • There is no simple way to establish which biological functions are important • No universally accepted statistical model exists • Finding relationships between biological processes using mathematical models is challenging • No known statistical model detects changed interactions in a given phenotype • Using GO annotations presents its own challenges

  21. Challenges and limitations • GO is incomplete and updated on a continuous basis • Information regarding gene annotations is missing • GO contains inconsistencies • New research may make previous annotations obsolete • The GO hierarchy poses the challenge of dependencies • Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
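
The dependency issue in the last bullet can be made concrete with a small sketch. The term names, the mini-DAG, and the helper functions below are hypothetical and are not taken from GO or from the presentation; the point is only that a single direct annotation implies annotations to every ancestor term.

```python
# Hypothetical mini-DAG: each term maps to its list of parent terms.
parents = {
    "T_leaf": ["T_mid1", "T_mid2"],  # a term can have more than one parent
    "T_mid1": ["T_root"],
    "T_mid2": ["T_root"],
    "T_root": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate(direct, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        closure = set(terms)
        for t in terms:
            closure |= ancestors(t, parents)
        full[gene] = closure
    return full

direct = {"g1": {"T_leaf"}, "g2": {"T_mid1"}}  # hypothetical direct annotations
full = propagate(direct, parents)
```

After propagation, g1 and g2 share T_mid1 and T_root even though their direct annotations differ, which is the kind of dependency between terms that the slide refers to.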

  22. Outline • Biological background • Motivation and problem description • Goals, Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  23. Information retrieval (IR) • Problem: Given a query, find relevant documents from a collection • Vector space model (VSM) • Represent documents and keywords in a matrix • Documents are columns with keywords as components, so columns are document vectors • Represent the query as a (column) vector • Find the document vectors closest to the query vector • Those documents are relevant to the query

  24. Example – document retrieval • Example taken from Berry et al., SIAM Review 41(2), 1999

  25. Example – document retrieval

  26. Example – document retrieval Term by document matrix

  27. Example (IR VSM) • Document vector: • User searching for documents related to “baking bread” • Query vector:

  28. Finding relevant (similar) documents
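
A common way to find document vectors close to a query in the vector space model is cosine similarity. The sketch below assumes that measure and uses a made-up 4-term by 3-document matrix; it is not the matrix from the Berry et al. example, which is not reproduced in this transcript.

```python
import numpy as np

# Hypothetical term-by-document matrix (rows: terms, columns: documents).
terms = ["bake", "bread", "recipe", "pastry"]
A = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Query "baking bread": 1 for every term that occurs in the query.
q = np.array([1.0, 1.0, 0.0, 0.0])

def cosine_scores(A, q):
    """Cosine of the angle between q and each column (document) of A."""
    col_norms = np.linalg.norm(A, axis=0)
    return (A.T @ q) / (col_norms * np.linalg.norm(q))

scores = cosine_scores(A, q)
ranking = np.argsort(-scores)  # document indices, most similar first
```

A score near 1 means the document vector points in almost the same direction as the query; a score of 0 means the two share no terms.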

  29. Correlation • Determines if two random variables vary together • Linear correlation between X and Y (Pearson correlation coefficient): • Positive correlation - X increases as Y increases • Negative correlation - X decreases as Y increases • No linear correlation - no linear relationship

  30. Pearson correlation coefficient – geometric interpretation
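
The standard geometric reading of the Pearson coefficient is that it equals the cosine of the angle between the two mean-centered data vectors. The sketch below checks this numerically on made-up data; the function name is illustrative only.

```python
import numpy as np

def pearson_as_cosine(x, y):
    """Pearson r computed as the cosine between mean-centered vectors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Hypothetical data, used only to compare the two computations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

r_geometric = pearson_as_cosine(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])  # NumPy's Pearson coefficient
```

This is the link between the information-retrieval view (cosine between vectors) and the correlation view used later for pairs of biological processes.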

  31. Outline • Biological background • Motivation and problem description • Goals, Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  32. Detecting interactions that have changed significantly in the phenotype • Represent the genes differentially expressed in a phenotype and their biological functions as a matrix: a vector space model with biological processes as column vectors • Find associations between pairs of biological processes • Compare these associations with the corresponding associations in the absence of the phenotype • Detect associations that are significantly different in the phenotype

  33. Data inputs - genes and functions • Reference genes and functions set (R) • M genes on a microarray • N GO terms annotated with M genes • In a biological condition under study (E) • m < M differentially expressed (DE) genes • n <= N GO terms annotated with m DE genes

  34. Gene function matrix – reference data

  35. Gene function matrix – reference data Example gene-function matrix

  36. Gene function matrix – experiment data Example gene-function matrix

  37. Gene function matrix – reference and experiment data • The experiment gene-function matrix is a sub-part of the reference gene-function matrix
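
The example matrices from slides 34 to 36 are not reproduced in this transcript. As a minimal sketch, assuming a binary encoding (1 if a gene is annotated with a function, 0 otherwise) and made-up gene and function names, the reference and experiment matrices could be built like this:

```python
import numpy as np

# Hypothetical annotations for M = 4 reference genes and N = 3 functions.
ref_annotations = {
    "g1": {"f1", "f2"},
    "g2": {"f2"},
    "g3": {"f1", "f3"},
    "g4": {"f3"},
}
de_genes = ["g1", "g2"]  # hypothetical m = 2 differentially expressed genes

def gene_function_matrix(genes, functions, annotations):
    """Binary matrix: rows are genes, columns are functions."""
    gf = np.zeros((len(genes), len(functions)))
    for i, g in enumerate(genes):
        for j, f in enumerate(functions):
            if f in annotations[g]:
                gf[i, j] = 1.0
    return gf

ref_genes = sorted(ref_annotations)
ref_funcs = sorted(set().union(*ref_annotations.values()))
GF_ref = gene_function_matrix(ref_genes, ref_funcs, ref_annotations)

# Experiment matrix: only DE genes and the functions annotating them.
exp_funcs = sorted(set().union(*(ref_annotations[g] for g in de_genes)))
GF_exp = gene_function_matrix(de_genes, exp_funcs, ref_annotations)
```

With this toy data GF_ref is 4 x 3 and GF_exp is 2 x 2, and GF_exp equals the block of GF_ref formed by the rows of the DE genes and the columns of their functions.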

  38. Challenges and limitations • GO is incomplete and updated on a continuous basis • Information regarding gene annotations is missing • GO contains inconsistencies • New research may make previous annotations obsolete • The GO hierarchy poses the challenge of dependencies • Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term

  39. Our approach to addressing the challenges • Use singular value decomposition (SVD) • SVD can find missing relationships between genes and annotations in the latent semantic space and also remove noise from the data • Noise: multiple words describing the same concept • SVD is a factorization of a matrix into three matrices consisting of the singular vectors and singular values of the original matrix

  40. Singular value decomposition (SVD) • SVD of a GF matrix: GF = G S F^T • Columns of matrix G (F) are the left (right) singular vectors of GF • S is a diagonal matrix of singular values s_i • The values on the main diagonal are in non-increasing order and represent variability in the data
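
A short NumPy sketch of this factorization, assuming the usual form GF = G S F^T and using a made-up binary gene-function matrix. The variable names G, s, and Ft mirror the slide's notation but are otherwise arbitrary.

```python
import numpy as np

# Hypothetical 4-gene x 3-function binary matrix.
GF = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Thin SVD: columns of G are left singular vectors, rows of Ft are the
# right singular vectors (i.e. Ft is F transposed), and s holds the
# singular values in non-increasing order.
G, s, Ft = np.linalg.svd(GF, full_matrices=False)
S = np.diag(s)

reconstructed = G @ S @ Ft  # multiplying the factors recovers GF
```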

  41. Matrix approximation – dimensionality reduction • An approximated matrix can be computed by keeping only the first k largest singular values • We select k that retains the desired data variance (say x%) using the equation:
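
The selection equation itself is not reproduced in this transcript. One common criterion, assumed here and not necessarily the one on the slide, picks the smallest k whose squared singular values account for at least the desired fraction of the total. The function names and the `retain` parameter are illustrative.

```python
import numpy as np

def choose_k(s, retain=0.9):
    """Smallest k whose squared singular values reach the `retain` fraction.

    Assumption: variance is measured by squared singular values.
    """
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, retain)) + 1
    return min(k, len(s))  # guard against floating-point round-off

def rank_k_approximation(M, retain=0.9):
    """Rank-k approximation of M that keeps only the k largest singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = choose_k(s, retain)
    return (U[:, :k] * s[:k]) @ Vt[:k, :], k

# Hypothetical matrix with known singular values 3, 1, and 0.1.
M = np.diag([3.0, 1.0, 0.1])
M_k, k = rank_k_approximation(M, retain=0.9)
```

Here the first singular value alone explains just under 90% of the total, so k = 2 and the smallest singular value is dropped.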

  42. Approximated matrix – column view • We approximate both reference and experiment matrices • The approximated experiment gene-function matrix is not a sub-part of the approximated reference gene-function matrix

  43. Correlation Between Functions • Indicates the strength and direction of a linear relationship between two biological processes • The Pearson correlation coefficient r(f_i, f_j) between a pair of functions f_i and f_j is computed as: • Matrices of correlation coefficients, RR (N×N) and RE (n×n), are computed for the reference and experiment data, respectively

  44. Pair-wise Correlation Coefficients for Reference and Experiment data • RR (n×n) contains the pair-wise correlation coefficients between the first n functions in the absence of the phenotype
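
As a sketch of this step, the pair-wise Pearson coefficients between function columns can be computed with `np.corrcoef`. The two small matrices below stand in for approximated reference and experiment gene-function matrices and are made up; the sketch also assumes the n experiment functions are ordered first among the N reference functions.

```python
import numpy as np

def function_correlations(gf):
    """Pearson correlations between all pairs of columns (functions)."""
    return np.corrcoef(gf, rowvar=False)

# Hypothetical approximated matrices (rows: genes, columns: functions).
GF_ref_approx = np.array([
    [0.9, 0.8, 0.1],
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.9],
    [0.0, 0.1, 1.0],
])
GF_exp_approx = np.array([
    [1.0, 0.7],
    [0.2, 0.9],
    [0.9, 0.1],
])

RR = function_correlations(GF_ref_approx)  # N x N, reference
RE = function_correlations(GF_exp_approx)  # n x n, experiment

n = RE.shape[0]
RR_n = RR[:n, :n]  # reference correlations for the same n functions
```

RR_n and RE then hold, entry by entry, the reference and phenotype correlations for the same pair of functions.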

  45. Fisher Z Transform – Correlation Coefficient To Z-values • Correlation coefficients from samples of a large population can be mapped to z-values using the Fisher z-transform; the resulting values are approximately normally distributed • For a correlation coefficient r, the Fisher z-transform is Z_r = 0.5 ln((1 + r) / (1 - r)) • Compute ZR from RR (N×N) and ZE from RE (n×n)
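
The Fisher z-transform is z = 0.5 ln((1 + r) / (1 - r)), which is the inverse hyperbolic tangent of r. A minimal sketch that applies it element-wise to a correlation matrix follows; the clipping threshold `eps` is an added assumption to keep entries with |r| = 1 (such as the diagonal) finite.

```python
import numpy as np

def fisher_z(r, eps=1e-12):
    """Element-wise Fisher z-transform of correlation coefficients."""
    r = np.clip(np.asarray(r, dtype=float), -1.0 + eps, 1.0 - eps)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

# Hypothetical 2 x 2 correlation matrix.
R = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])
Z = fisher_z(R)
```

Only the off-diagonal entries are meaningful for comparing pairs of functions; the diagonal is a correlation of a function with itself.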

  46. Detecting Changes Between Functional Interactions • Hypothesis: the correlation between two biological processes in the given phenotype differs from the correlation in the reference data • (Slide shows the formal hypothesis and the test statistic)
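
The test statistic is not reproduced in this transcript. A standard choice for comparing two correlation coefficients, assumed here and not confirmed by the slides, is the difference of Fisher z-values divided by its standard error, sqrt(1/(n1 - 3) + 1/(n2 - 3)), with a two-sided p-value from the standard normal distribution. The sample sizes are taken to be the numbers of genes in the reference and experiment data, which is also an assumption, and the textbook form treats the two samples as independent.

```python
import math

def correlation_difference_test(r_ref, n_ref, r_exp, n_exp):
    """Two-sided z-test for a difference between two correlations.

    Assumed standard form; the samples are treated as independent.
    """
    z_diff = math.atanh(r_exp) - math.atanh(r_ref)  # Fisher z-values
    se = math.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_exp - 3))
    z = z_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p

# Hypothetical inputs: same correlation, then a clearly different one.
z_same, p_same = correlation_difference_test(0.3, 1000, 0.3, 100)
z_diff, p_diff = correlation_difference_test(0.1, 1000, 0.8, 100)
```

A small p-value for a pair of processes would flag their interaction as changed in the phenotype. Because many pairs are tested at once, some form of multiple-testing correction would normally be applied as well.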

  47. Outline • Biological background • Motivation and problem description • Goals, Challenges and limitations • Mathematical background • Detecting changed interactions between biological processes in a phenotype • Improvements • Results

  48. Improvements • The dependencies between GO terms can be partially removed by using weights in our matrix.

  49. Scheme 1-1 • This is a binary scheme and was discussed while describing our main method

  50. Scheme 1-e • e_i is the normalized log-transformed fold-change measured for gene g_i in the given condition
