
Data mining with the Gene Ontology

Grup de Recerca en Estadística i Bioinformàtica. Data mining with the Gene Ontology: GOing into Biological Meaning. Josep Lluís Mosquera, April 2005. Motivation: high-throughput methodologies pose different challenges: the experiment itself, statistical analysis of results


Presentation Transcript


  1. Grup de Recerca en Estadística i Bioinformàtica • Data mining with the Gene Ontology: GOing into Biological Meaning • Josep Lluís Mosquera • April 2005

  2. Motivation • High-throughput methodologies pose different challenges: • The experiment itself • Statistical analysis of results • Biological interpretation • In gene-expression microarray studies, independently of the technology or analysis methods used, one generally obtains long lists of genes. QUESTION: What does this mean?

  3. Rationale • Bioinformatics resources hold data, often in the form of sequences, which are annotated in scientific natural language. • Annotation in this form is human-readable and understandable, but it is not easy to interpret computationally. PROBLEM: The lack of a set of terms and descriptions that is common to all organisms.

  4. What can we do? • An ontology provides a set of vocabulary terms covering a conceptual domain. These terms: • Must: • have a definition • be placed within a structure of relationships • May have one or more parents. • May be linked by two kinds of relationships: • ‘is-a’ between parent and child • ‘part-of’ between part and whole • In this context, the Gene Ontology (GO) is a very useful resource for the initial interpretation of gene lists.

  5. Gene Ontology Consortium

  6. But... what’s the GO? • It is an ontology with clear definitions of its terms and of the relationships between them. The top level (GO) has three children, which are independent ontologies: • Molecular Functions (MF) • Biological Processes (BP) • Cellular Components (CC)

  7. Graphical Overview • There are more than 16K nodes in GO

  8. GO database • It consists of two essential parts: • The current ontologies: • Vocabulary • Structure • The current annotations: • They create a link between the known genes and the associated GO terms that define their function. THE CHALLENGE: Use the annotations and the structure of the GO to understand the biological meaning of a large dataset of genes.

  9. Genes and GO terms • Each gene can have several associated GO terms. • Each GO term can be connected to several other GO terms higher in the hierarchy; these are associated with the gene too. • We call: • path: the list of GO terms between the root and the annotated GO term. • split: each GO term in the path.
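The path and split definitions above can be illustrated with a small sketch. The term names and the `PARENTS` mapping are hypothetical toy data, not real GO identifiers, and the function names are introduced here only for illustration.

```python
# Toy ontology (hypothetical terms): each term maps to its list of parents.
PARENTS = {
    "GO": [],
    "BP": ["GO"],
    "metabolism": ["BP"],
    "cell_death": ["BP"],
    "apoptosis": ["cell_death"],
    "protein_metabolism": ["metabolism"],
    "proteolysis_in_apoptosis": ["apoptosis", "protein_metabolism"],
}


def paths_to_root(term, parents):
    """Return every path (root first) from the root down to `term`."""
    if not parents[term]:
        return [[term]]
    return [
        path + [term]
        for parent in parents[term]
        for path in paths_to_root(parent, parents)
    ]


def splits(term, parents):
    """Return the set of all terms lying on any path to `term`."""
    return {t for path in paths_to_root(term, parents) for t in path}


for path in paths_to_root("proteolysis_in_apoptosis", PARENTS):
    print(" > ".join(path))
print(len(splits("proteolysis_in_apoptosis", PARENTS)), "splits")
```

Because a term may have more than one parent, a single annotation can yield several paths, which is why the number of splits grows much faster than the number of genes.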

  10. Our context • A list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits. OBJECTIVE: How to give biological meaning to lists of differentially expressed genes through the Gene Ontology (GO)

  11. Statistical Methods • Let us consider: • N genes on a microarray: M belong to a given GO term category (A) and N − M do not belong to it (category Ac) • K of the N genes are selected and assigned to a given class (e.g. regulated genes) • x of these K genes are in A (see Example) STATISTICAL HYPOTHESIS: H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes • H1: GO category A is over- (or under-) represented in the class of differentially regulated genes relative to the microarray

  12. Hypergeometric Distribution (1/2) We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A? • The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K).
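Written out with the parameters (N, M, K) used above, the hypergeometric probability takes the standard form below. The symbol X, for the number of category-A genes among the K selected, is a label introduced here.

```latex
P(X = x) \;=\; \frac{\binom{M}{x}\,\binom{N-M}{K-x}}{\binom{N}{K}},
\qquad \max\bigl(0,\; K-(N-M)\bigr) \;\le\; x \;\le\; \min(K, M)
```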

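  13. Hypergeometric Distribution (2/2) • So, under the null hypothesis, the p_value of having x or more genes in A will be: p_value = P(X ≥ x) = 1 − Σ_{i=0}^{x−1} C(M, i) C(N−M, K−i) / C(N, K) • This corresponds to a one-sided test in which small p_values relate to over-represented GO terms. • For under-represented categories, the p_value can be calculated as 1 − p_value (strictly, this gives P(X < x); the term P(X = x) must be added to obtain P(X ≤ x)).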
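A minimal sketch of the two one-sided p_values, using only the Python standard library. The function names and the toy numbers (N=50, M=10, K=8, x=4) are assumptions made for illustration; the upper tail is summed directly rather than computed as one minus a cumulative sum, to avoid cancellation when the p_value is tiny.

```python
from math import comb


def hypergeom_pmf(i, N, M, K):
    """P(X = i): i of the K selected genes belong to a category of size M out of N."""
    return comb(M, i) * comb(N - M, K - i) / comb(N, K)


def support(N, M, K):
    """Smallest and largest possible number of category genes among the K selected."""
    return max(0, K - (N - M)), min(K, M)


def upper_tail(x, N, M, K):
    """P(X >= x): p_value for over-representation."""
    lo, hi = support(N, M, K)
    return sum(hypergeom_pmf(i, N, M, K) for i in range(max(x, lo), hi + 1))


def lower_tail(x, N, M, K):
    """P(X <= x): p_value for under-representation."""
    lo, hi = support(N, M, K)
    return sum(hypergeom_pmf(i, N, M, K) for i in range(lo, min(x, hi) + 1))


# Toy numbers chosen for illustration only.
N, M, K, x = 50, 10, 8, 4
p_over = upper_tail(x, N, M, K)
p_under = lower_tail(x, N, M, K)
print("over-representation p_value: ", p_over)
print("under-representation p_value:", p_under)
print("1 - p_over differs from p_under by P(X = x):", p_under - (1 - p_over))
```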

  14. Disadvantages • The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large. • We can prove an approximation to it. • Using this approximation, the p_value for over-represented GO terms can be calculated.
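One common large-N approximation, assumed here rather than taken from the slide, is the binomial distribution with success probability M/N: when K is small relative to N, sampling without replacement behaves almost like sampling with replacement. The sketch below compares both tails on toy numbers chosen for illustration.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, K):
    """Exact P(X >= x) under the hypergeometric distribution (N, M, K)."""
    lo, hi = max(0, K - (N - M)), min(K, M)
    return sum(
        comb(M, i) * comb(N - M, K - i) / comb(N, K)
        for i in range(max(x, lo), hi + 1)
    )


def binomial_upper_tail(x, K, p):
    """Approximate P(X >= x) with K independent draws of success probability p."""
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x, K + 1))


# Toy numbers (assumed): large array, small selected list.
N, M, K, x = 10_000, 500, 20, 3
exact = hypergeom_upper_tail(x, N, M, K)
approx = binomial_upper_tail(x, K, M / N)
print("hypergeometric:", exact)
print("binomial:      ", approx)
```

The agreement degrades as K becomes a sizeable fraction of N, so the exact sum remains the safer choice whenever it is affordable.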

  15. Alternative approaches • Let us assume a 2×2 contingency table with cells n11, n12, n21, n22, where N = N.., M = N1., K = N.1 and x = n11 • Using this notation, alternatives include: • the test for equality of two proportions • Fisher’s Exact Test

  16. Chi-square Test (2) • statistic can be calculated as • PROBLEMS:It cannot: • Distinguish between under- and over-represented gene categories. • Be used for small samples, i.e. when

  17. Fisher’s Exact Test • This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table. • One can calculate a table containing all possible combinations of n11, n12, n21, n22. • The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
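The procedure above can be sketched directly: enumerate every table compatible with the fixed margins, compute each probability, and sum those no larger than the observed one. Exact fractions are used so that the "lower than or equal to" comparison is not affected by floating-point rounding. The function names and the small example table are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def table_probability(n11, row1, col1, total):
    """Hypergeometric probability of a 2x2 table with fixed margins and top-left cell n11."""
    return Fraction(comb(row1, n11) * comb(total - row1, col1 - n11), comb(total, col1))


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Sum the probabilities of all tables that are no more likely than the observed one."""
    row1, col1 = n11 + n12, n11 + n21
    total = n11 + n12 + n21 + n22
    lo, hi = max(0, col1 - (total - row1)), min(row1, col1)
    p_observed = table_probability(n11, row1, col1, total)
    probabilities = [table_probability(k, row1, col1, total) for k in range(lo, hi + 1)]
    return float(sum(p for p in probabilities if p <= p_observed))


# Small illustrative table: 3 of 4 category genes selected, 1 of 4 others selected.
print(fisher_exact_two_sided(3, 1, 1, 3))
```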

  18. Correction for Multiple Tests • As the number of GO terms tested for significance is large, p_values must be corrected for multiple testing. For instance: • Methods controlling the False Discovery Rate (FDR): • Benjamini and Hochberg (assuming independence) • Benjamini and Yekutieli (dropping independence) • Methods controlling the Family-Wise Error Rate (FWER): • Holm correction • Westfall and Young
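Two of the corrections listed above are simple enough to sketch in a few lines: the Benjamini and Hochberg step-up adjustment (FDR) and the Holm step-down adjustment (FWER). The function names and the four p_values are illustrative assumptions; both functions return adjusted p_values in the original order.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p_values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p_value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def holm(pvalues):
    """Holm adjusted p_values (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p_value up, enforcing monotonicity.
    for rank, idx in enumerate(order, start=1):
        running_max = max(running_max, min(1.0, pvalues[idx] * (m - rank + 1)))
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # illustrative p_values, one per GO term
print("BH:  ", benjamini_hochberg(raw))
print("Holm:", holm(raw))
```

Holm is never less conservative than Benjamini and Hochberg on the same input, which reflects the stricter error rate it controls.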

  19. Example • N = 9177 genes on the microarray • M = 467 in GO category A • N − M = 8710 in Ac • K = 173 genes picked randomly • x = 51 genes of category A
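With these numbers, the expected count of category-A genes among the K selected (K·M/N) and the over-representation p_value can be computed as below. This is a sketch using only the standard library; the function name is an assumption, and Python's arbitrary-precision integers keep the large binomial coefficients exact until the final division.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, K):
    """P(X >= x) for a hypergeometric distribution with parameters (N, M, K)."""
    lo, hi = max(0, K - (N - M)), min(K, M)
    denominator = comb(N, K)
    return sum(
        comb(M, i) * comb(N - M, K - i) / denominator
        for i in range(max(x, lo), hi + 1)
    )


N, M, K, x = 9177, 467, 173, 51
expected = K * M / N  # count expected if the K genes were picked at random
p_value = hypergeom_upper_tail(x, N, M, K)
print(f"expected in A: {expected:.2f}, observed: {x}")
print(f"over-representation p_value: {p_value:.3g}")
```

The observed count is several times the expected one, so the p_value is extremely small and category A would be reported as over-represented.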

  20. Miguel.... GO!
