Gene Ontology Analysis. Dr. Lars Eijssen. Contents. Gene ontology annotations The gene ontology tree Gene ontology based analysis. Part 1:. Gene ontology annotations. Gene Ontology.
Gene Ontology Analysis Dr. Lars Eijssen
Contents • Gene ontology annotations • The gene ontology tree • Gene ontology-based analysis
Part 1: Gene ontology annotations
Gene Ontology • The Gene Ontology (GO) project gives a consistent description of gene products from different databases • GO consortium: http://www.geneontology.org
Protein annotation with GO terms • Example protein: human DNA topoisomerase IIA (P11388) • Cellular component: nucleus, chromosome, DNA topoisomerase complex • Molecular function: chromatin binding, DNA topoisomerase activity, DNA-dependent ATPase activity • Biological process: DNA replication, DNA topological change, DNA ligation, DNA repair
. . . Entrez Gene
Part 2: The gene ontology tree
Part 3: Gene ontology-based analysis
Basic principle • The principle is the same as with biological pathway analysis • Find the terms that contain the relatively highest number of significantly changed genes
What’s different • In Gene Ontology analysis there is a high redundancy of terms • Also, the terms are organised in a tree structure • Both must be taken into account…
Gene Ontology analysis tools • GO-Elite • topGO • DAVID/EASE • Note: there are many more, but these illustrate several approaches. The European Nutrigenomics Organisation
How to deal with redundant nodes in a tree? • Only keep the ‘best’ node in each branch in the results • How to determine the best? • Several ways…
topGO analysis • topGO (Bioconductor) • Integrates the knowledge about the relationships between GO terms (BP, MF, CC) into the calculation of statistical significance (Alexa et al., 2006) • Test statistics • Fisher's exact test (requires a threshold to define the significant genes, e.g. FDR < 0.05) • Kolmogorov-Smirnov (KS) test (looks at the distribution of P values) • GO scoring algorithms (classic, elim, weight) • classic scores each node independently • elim scores nodes bottom-up, scoring parent nodes after eliminating the genes present in significant child nodes • weight scores nodes bottom-up, assigning weights to genes based on the P values obtained for each node Slide from: Caroline Reiff, RRI, Aberdeen
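The Fisher's exact test named above can be sketched as a one-sided over-representation test on a single GO term. This is a minimal illustration using only the Python standard library, not topGO's implementation; the function name `enrichment_p_value` and its parameter names are hypothetical, and the example counts (7 of 10 genes significant in a term, 20 of 100 significant overall) are borrowed from the example tree later in the slides.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the background
    K: significantly changed genes in the background
    n: genes annotated to the GO term
    k: significantly changed genes annotated to the GO term

    Returns P(X >= k), where X is hypergeometric: the number of
    significant genes among n genes drawn from the background.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Term with 7 of 10 genes significant, against 20 of 100 overall.
p = enrichment_p_value(7, 10, 20, 100)
print(f"p = {p:.2e}")
```

A term is then called enriched when this p-value, usually after multiple-testing correction across all tested terms, falls below the chosen cut-off.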
• Load limma table • Enter threshold (P value or FDR) • Enter cdfname • topGO Slide from: Caroline Reiff, RRI, Aberdeen
Scoring the tree (I) • Notation in the tree: x/y = this node; (x/y) = this node plus subtree • Classic: the "this node plus subtree" values are used to score (because the genes in fact belong to that term as well) • Tree values: 2/20 (20/100), 5/10 (7/30), 3/25 (11/50), 2/20, 1/15, 7/10 • Suppose all the bold values are significant • The classic algorithm would return all these processes!
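The "this node plus subtree" values can be reproduced by summing counts up the tree. The sketch below assumes one tree layout that is arithmetically consistent with the slide's numbers (the original figure is not available, so the node names and parent-child links are assumptions), and it assumes every gene is directly annotated to exactly one node, so counts can simply be added.

```python
# Hypothetical layout consistent with the slide's numbers.
# "own" is (significant genes, annotated genes) for the node itself.
tree = {
    "root": {"own": (2, 20), "children": ["A", "B"]},
    "A": {"own": (5, 10), "children": ["A1"]},
    "A1": {"own": (2, 20), "children": []},
    "B": {"own": (3, 25), "children": ["B1", "B2"]},
    "B1": {"own": (1, 15), "children": []},
    "B2": {"own": (7, 10), "children": []},
}


def subtree_counts(tree, node):
    """Return (significant, annotated) for a node plus its whole subtree."""
    sig, total = tree[node]["own"]
    for child in tree[node]["children"]:
        child_sig, child_total = subtree_counts(tree, child)
        sig += child_sig
        total += child_total
    return sig, total


for name in ("root", "A", "B"):
    print(name, subtree_counts(tree, name))
# root (20, 100)
# A (7, 30)
# B (11, 50)
```

In the classic approach each of these aggregated counts is tested on its own, which is why a strongly enriched leaf tends to make its ancestors look significant too.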
Scoring the tree (II) • However, it would be better to return only the best term in every branch • Best could mean: the most specific significant one • This can be achieved by removing the genes that are present in significant child leaves from the parent's score • Elim does this • Tree values: 2/20 (20/100), 5/10 (7/30), 3/25 (11/50) (4/40), 2/20, 1/15, 7/10
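The extra (4/40) value is consistent with removing the genes of the 7/10 leaf from its parent's 11/50. The sketch below illustrates that elimination step on counts only. It is not topGO's implementation: the tree layout and node names are assumptions, and which nodes are significant is passed in by hand rather than decided by a test during the bottom-up pass.

```python
# Hypothetical layout consistent with the slide's numbers.
# "own" is (significant genes, annotated genes) for the node itself.
tree = {
    "root": {"own": (2, 20), "children": ["A", "B"]},
    "A": {"own": (5, 10), "children": ["A1"]},
    "A1": {"own": (2, 20), "children": []},
    "B": {"own": (3, 25), "children": ["B1", "B2"]},
    "B1": {"own": (1, 15), "children": []},
    "B2": {"own": (7, 10), "children": []},
}


def elim_counts(tree, node, significant):
    """Counts for a node plus subtree after elimination.

    Genes belonging to any descendant listed in `significant`
    (including that descendant's own subtree) are left out.
    """
    sig, total = tree[node]["own"]
    for child in tree[node]["children"]:
        if child in significant:
            continue  # this child's genes are eliminated from all ancestors
        child_sig, child_total = elim_counts(tree, child, significant)
        sig += child_sig
        total += child_total
    return sig, total


# Assume only the 7/10 leaf ("B2") was found significant.
print(elim_counts(tree, "B", {"B2"}))     # (4, 40)
print(elim_counts(tree, "root", {"B2"}))  # (13, 90)
```

With the leaf's genes removed, the parent has to be enriched in its remaining genes to be reported, which is what pushes the results towards the most specific significant term.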
topGO analysis output • Example results table for the elim Fisher test (top 15 GO biological processes) Slide from: Caroline Reiff, RRI, Aberdeen
A GO graph (squares = 15 most significant GO IDs) Slide from: Caroline Reiff, RRI, Aberdeen
Scoring the tree (III) • Another option to score branches would be to compute the significance of each leaf just as the classic algorithm does • Afterwards, for every branch the most significant leaf is the one that is reported back
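One possible reading of this option, sketched below: score every node as in the classic approach, then walk each root-to-leaf path and keep only the node with the smallest p-value on that path. The tree layout, the node names, and the p-values are all made up for illustration; treating a "branch" as a root-to-leaf path is also an assumption.

```python
def best_per_branch(children, p_values, root):
    """Return the set of nodes that have the smallest p-value
    on at least one root-to-leaf path."""
    best = set()

    def walk(node, path):
        path = path + [node]
        if not children[node]:  # reached a leaf: the path is complete
            best.add(min(path, key=lambda n: p_values[n]))
        for child in children[node]:
            walk(child, path)

    walk(root, [])
    return best


# Hypothetical tree and classic p-values.
children = {
    "root": ["A", "B"],
    "A": ["A1"],
    "A1": [],
    "B": ["B1", "B2"],
    "B1": [],
    "B2": [],
}
p_values = {"root": 0.04, "A": 0.01, "A1": 0.6, "B": 0.03, "B1": 0.8, "B2": 0.0004}

print(sorted(best_per_branch(children, p_values, "root")))
# ['A', 'B', 'B2']
```

Note that with this definition a parent such as "B" can still be reported through a path whose leaf is not significant, so a real tool would need an additional rule for overlapping paths.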
GO-Elite • Smart algorithm • Produces full and pruned results • Runs on Windows and Linux • Under development
GO-Elite • Searches relationships in a hierarchical manner • Identifies the most significant scoring GO term: the one with a higher score than all sibling terms • For sibling terms, if one sibling branch scores higher than the parent and another branch does not, the highest scoring term from the latter sibling branch is also selected for the GO-Elite output, but the parent term is not
GO analysis versus pathway analysis • Biological pathways contain more information; GO classes are just sets of genes that share an annotation • Pathways also allow better data visualisation • Pathways are generally more curated • GO classes are, however, organised in a tree; biological pathways are (in practice) not • GO classes also cover the space of biological processes more uniformly; pathway analysis depends heavily on the pathways that have been contributed/added • GO also covers cellular localisation and biochemical function
Thanks! • Questions?