J.L. Mosquera and Alex Sanchez Introduction to Functional Analysis
Motivation • The rise of the genomic era, and especially the deciphering of the whole genome sequences of several organisms, has produced huge quantities of information. • New technologies such as DNA microarrays (but not only these!) allow the simultaneous study of hundreds, even thousands, of genes in a single experiment.
Motivation • This poses several challenges: • The experiment in itself • Statistical analysis of results • Biological interpretation • Very often the results are large lists of genes which have been selected according to some specific criteria. PROBLEM: How could a researcher give these sets a biological interpretation?
Rationale • A reasonable thing to do is to rely on existing annotations which help to relate the selected sequences with biological knowledge. • Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language. • The annotation in this form is human readable and understandable, but difficult to interpret computationally.
What’s in a name? QUESTION: What’s a cell? • The same name can be used to describe different concepts • A concept can be described using different names • Comparison is difficult, especially across species or across databases Image from http://microscopy.fsu.edu
Functional annotation • Probably the most important thing you want to know is what the genes or their products do, i.e. their function. • Function annotation is difficult: • Different people use different words for the same function, or may mean different things by the same word. • The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function. • Inference of function from sequence alone is error-prone. • The best function annotation systems use human curators who read the literature before assigning a function to a gene.
What can we do? To overcome some of the problems, an annotation system has been created: The Gene Ontology (GO).
What is an ontology? • An ontology provides a set of vocabulary terms covering a conceptual domain. • These terms must • have an exhaustive and rigorous definition, • be placed within a structure of relationships, usually a hierarchical data structure. • The terms may be linked by two kinds of relationships: • “is-a” between parent and child. • “part-of” between part and whole. • A term may have one or more parents.
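The structure described above (terms linked to one or more parents by "is-a" or "part-of") can be sketched as a small directed graph. The term names and links below are illustrative assumptions, not the actual GO structure, and `ancestors` is a hypothetical helper name.

```python
# Toy ontology: each term maps to a list of (relationship, parent) pairs.
# Terms and links are made up for illustration; they are not real GO content.
TOY_ONTOLOGY = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "organelle": [("is_a", "cellular_component")],
    # A term may have more than one parent, with different relationships.
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "nucleolus": [("part_of", "nucleus")],
}


def ancestors(term, ontology):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in ontology[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nucleolus", TOY_ONTOLOGY)))
```

Because a term can have several parents, the structure is a graph rather than a strict tree, which is why the traversal keeps a `seen` set.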
What’s the GO? • The GO is a cooperative project, developed and maintained by the Gene Ontology Consortium. • It is an annotation database created to provide a controlled vocabulary to describe gene and gene product attributes in any organism. • It is organized around three basic ontologies (as of May 2005).
The GO ontologies and the GO graph • Biological Processes (BP) • Cellular Components (CC) • Molecular Functions (MF)
Genes and GO terms A given gene product may • represent one or more molecular functions, • be used in one or more biological processes and • appear in one or more cellular components.
GO database • Consists of two essential parts: 1) The current ontologies: • Vocabulary • Structure 2) The current annotations: • Create a link between the known genes and the associated GO terms that define their function. • The GO database exists independently from other annotation databases • It does not depend on the organism • It does not depend on other databases, but • Most important databases have cross-references with the GO databases • It is possible to link and relate other annotations with those contained in GO
Two types of GO Annotations • Electronic Annotation • Manual Annotation • All annotations must 1) be attributed to a source, 2) indicate what evidence was found to support the GO term-gene/protein association
Evidence Codes • IEA: Inferred from Electronic Annotation • ISS: Inferred from Sequence Similarity • IEP: Inferred from Expression Pattern • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IPI: Inferred from Physical Interaction • IDA: Inferred from Direct Assay • RCA: Inferred from Reviewed Computational Analysis • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • IC: Inferred by Curator • ND: No biological Data available
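Since every annotation carries a source and an evidence code, annotations can be filtered by code. The sketch below assumes IEA marks the electronic annotations and treats every other code as manual; gene names, term labels and sources are hypothetical placeholders.

```python
from collections import namedtuple

# One record per annotation: gene, GO term, evidence code, attributed source.
Annotation = namedtuple("Annotation", "gene term evidence source")

# Hypothetical records for illustration only.
ANNOTATIONS = [
    Annotation("geneA", "TERM_X", "IDA", "source_1"),
    Annotation("geneA", "TERM_Y", "IEA", "source_2"),
    Annotation("geneB", "TERM_X", "TAS", "source_3"),
    Annotation("geneC", "TERM_Z", "IEA", "source_2"),
]


def split_by_type(annotations):
    """Split annotations into (electronic, manual), assuming IEA = electronic."""
    electronic = [a for a in annotations if a.evidence == "IEA"]
    manual = [a for a in annotations if a.evidence != "IEA"]
    return electronic, manual


electronic, manual = split_by_type(ANNOTATIONS)
print(len(electronic), "electronic;", len(manual), "manual")
```

Whether to keep or drop electronic annotations before an analysis is a judgment call that depends on how much coverage one is willing to trade for reliability.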
Enrichment Analysis • Unbiased method to ask question, “What’s so special about my set of genes?” • Many tools follow similar steps • Obtain GO annotation (most specific term(s)) for genes in your set • Climb an ontology to get all “parents” (more general terms) • Look at occurrence of each term in your set compared to terms in population (all genes or all genes on your chip) • Are some terms over-represented?
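The steps above can be sketched as follows: take the most specific terms of each gene, climb to all parents, and count how often each term occurs in the selected set and in the population. The ontology, gene names and annotations are hypothetical toy data.

```python
from collections import Counter

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "lipid metabolism": ["metabolism"],
    "signaling": ["root"],
}

# Hypothetical most-specific annotations: gene -> list of terms.
DIRECT = {
    "g1": ["lipid metabolism"],
    "g2": ["lipid metabolism"],
    "g3": ["signaling"],
    "g4": ["metabolism"],
    "g5": ["signaling"],
    "g6": ["signaling"],
}


def with_ancestors(terms, parents):
    """Return the given terms plus all their more general parent terms."""
    result = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in result:
            result.add(term)
            stack.extend(parents[term])
    return result


def term_counts(genes, direct, parents):
    """Count, for each term, how many of the genes are annotated to it."""
    counts = Counter()
    for gene in genes:
        counts.update(with_ancestors(direct[gene], parents))
    return counts


population = sorted(DIRECT)        # all genes on the (toy) chip
selected = ["g1", "g2", "g4"]      # the gene set of interest
pop_counts = term_counts(population, DIRECT, PARENTS)
sel_counts = term_counts(selected, DIRECT, PARENTS)

for term in sorted(sel_counts):
    print(f"{term}: {sel_counts[term]}/{len(selected)} selected, "
          f"{pop_counts[term]}/{len(population)} on chip")
```

These per-term counts are the inputs to the statistical test: a term is a candidate for over-representation when its share in the selected set is clearly larger than its share in the population.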
Statistical Methods for enrichment analysis • Let us consider: • N genes on a microarray: M belong to a given GO term category (A) and N-M do not belong to it (category Ac) • K of the N genes are selected and assigned to a given class (e.g. regulated genes) • x genes of these K will be in A (EXAMPLE) STATISTICAL HYPOTHESIS: H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes H1: GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray
Hypergeometric Distribution (1/2) We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A? • The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K): $P(X = x) = \binom{M}{x}\binom{N-M}{K-x} / \binom{N}{K}$
Hypergeometric Distribution (2/2) • So, under the null hypothesis, the p_value of having x or more genes in A will be: $p = \sum_{i=x}^{\min(K,M)} \binom{M}{i}\binom{N-M}{K-i} / \binom{N}{K}$ • This corresponds to a one-sided test in which small p_values relate to over-represented GO terms. • For under-represented categories, the p_value can be approximated as 1 - p_value; the exact lower tail P(X ≤ x) also includes the probability of exactly x genes.
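A minimal sketch of this probability, using the symbols N, M, K and x defined earlier. The function name `hypergeom_pmf` is an assumption made for illustration; the small numbers in the usage lines are arbitrary.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(exactly x of the K selected genes are in A), sampling without replacement.

    N: genes on the array, M: genes in category A, K: selected genes.
    """
    # Values of x outside the feasible range have probability zero.
    if x < max(0, K - (N - M)) or x > min(K, M):
        return 0.0
    # Ways to pick x genes from A and K - x from Ac, over all ways to pick K.
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


# Arbitrary small example: 10 genes, 4 in A, 3 selected.
for x in range(4):
    print(x, hypergeom_pmf(x, N=10, M=4, K=3))
```

The binomial coefficients are computed as exact integers, so only the final division is done in floating point.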
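The one-sided p_value is the upper tail of the hypergeometric distribution. The sketch below sums the tail with exact integers; the function names are assumptions, and the lower-tail helper shows the exact version of the under-representation test, P(X ≤ x).

```python
from math import comb


def enrichment_p_value(x, N, M, K):
    """P(X >= x): probability of x or more genes of category A among the K selected."""
    lower = max(x, 0, K - (N - M))   # smallest feasible count not below x
    upper = min(K, M)                # largest feasible count
    tail = sum(comb(M, i) * comb(N - M, K - i) for i in range(lower, upper + 1))
    return tail / comb(N, K)


def depletion_p_value(x, N, M, K):
    """P(X <= x), computed as 1 - P(X >= x + 1)."""
    return 1.0 - enrichment_p_value(x + 1, N, M, K)


# Arbitrary small example: 10 genes, 4 in A, 3 selected, 2 of them in A.
print(enrichment_p_value(2, N=10, M=4, K=3))
print(depletion_p_value(2, N=10, M=4, K=3))
```

Note that the two p_values do not add up to 1, because both tails include the probability of exactly x genes.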
Disadvantages • The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large. • We can prove, • Using this approximation the p_value for over-represented GO terms can be calculated as
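The approximation itself is not shown here. One common choice for large N, assumed in this sketch and not confirmed by the slide, is the binomial distribution with success probability M/N. The code compares the exact hypergeometric tail with that binomial tail; all function names and the numbers in the usage lines are illustrative.

```python
from math import comb


def hypergeom_tail(x, N, M, K):
    """Exact P(X >= x) under the hypergeometric model."""
    lower = max(x, 0, K - (N - M))
    upper = min(K, M)
    tail = sum(comb(M, i) * comb(N - M, K - i) for i in range(lower, upper + 1))
    return tail / comb(N, K)


def binomial_tail(x, K, p):
    """P(X >= x) for K independent draws with success probability p.

    Assumed approximation: p = M / N, i.e. sampling with replacement.
    """
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(max(x, 0), K + 1))


# Arbitrary illustration with a large population and a small selection.
N, M, K, x = 100_000, 5_000, 20, 3
exact = hypergeom_tail(x, N, M, K)
approx = binomial_tail(x, K, M / N)
print("exact:", exact, "binomial approximation:", approx)
```

The two values agree closely when K is small relative to N, since removing a few genes barely changes the proportion of A in what remains.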
Alternative approaches • Let us assume a 2×2 contingency table with cells $n_{11}, n_{12}, n_{21}, n_{22}$, with rows for category membership (A, Ac) and columns for selection (selected, not selected), where $N = N_{..}$, $M = N_{1.}$, $K = N_{.1}$ and $x = n_{11}$ • Using this notation, alternatives include: • test for equality of two proportions • Fisher’s Exact Test
Fisher’s Exact Test • This test considers the marginal totals fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table. • One can calculate a table containing all possible combinations of $n_{11}, n_{12}, n_{21}, n_{22}$. • The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
Correction for Multiple Tests • As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance: • Methods controlling the False Discovery Rate (FDR): • Benjamini and Hochberg (assuming independence) • Benjamini and Yekutieli (dropping the independence assumption) • Methods controlling the Family-Wise Error Rate (FWER): • Holm correction • Westfall and Young
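The procedure described above can be sketched directly: fix the margins, enumerate every possible value of the top-left cell, and sum the probabilities of all tables that are no more likely than the observed one. The function name is an assumption; the table in the usage line is an arbitrary small example.

```python
from math import comb


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact test p_value for a 2x2 table with fixed margins."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    total = row1 + row2

    def weight(a):
        # Unnormalised hypergeometric probability of the table with top-left cell a.
        return comb(row1, a) * comb(row2, col1 - a)

    lowest = max(0, col1 - row2)     # smallest feasible top-left cell
    highest = min(row1, col1)        # largest feasible top-left cell
    observed = weight(n11)
    # Integer weights make the "lower than or equal" comparison exact.
    extreme = sum(weight(a) for a in range(lowest, highest + 1) if weight(a) <= observed)
    return extreme / comb(total, col1)


print(fisher_exact_two_sided(3, 1, 1, 3))
```

Comparing the integer weights instead of floating-point probabilities avoids ties being misclassified by rounding.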
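Two of the corrections listed above can be sketched as adjusted p_values: Holm (FWER) and Benjamini and Hochberg (FDR). The function names and the input p_values are illustrative assumptions; Westfall and Young and Benjamini and Yekutieli are not covered here.

```python
def holm(p_values):
    """Holm step-down adjusted p_values (controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p_value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, value)   # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjusted p_values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):           # from the largest p_value down
        idx = order[rank]
        value = min(1.0, p_values[idx] * m / (rank + 1))
        running_min = min(running_min, value)   # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical p_values, one per GO term
print("Holm:", holm(raw))
print("BH:  ", benjamini_hochberg(raw))
```

A GO term is then called significant when its adjusted p_value falls below the chosen threshold.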
Example • N = 9177 genes on the microarray • M = 467 in GO category A • N-M = 8710 in Ac • K = 173 genes picked randomly • x = 51 genes of category A
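With these numbers, about 8.8 genes of category A would be expected among the 173 by chance (K·M/N), against 51 observed. The sketch below computes the one-sided hypergeometric p_value for this example; the function name is an assumption.

```python
from math import comb

N = 9177   # genes on the microarray
M = 467    # genes in GO category A
K = 173    # genes picked
x = 51     # picked genes that fall in category A


def enrichment_p_value(x, N, M, K):
    """P(X >= x) under the hypergeometric model with parameters (N, M, K)."""
    lower = max(x, 0, K - (N - M))
    upper = min(K, M)
    tail = sum(comb(M, i) * comb(N - M, K - i) for i in range(lower, upper + 1))
    return tail / comb(N, K)


expected = K * M / N
p_value = enrichment_p_value(x, N, M, K)
print(f"expected by chance: {expected:.2f}, observed: {x}")
print(f"one-sided p_value: {p_value:.3g}")
```

The exact integer arithmetic keeps the computation feasible even though the binomial coefficients involved have several hundred digits.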