
GOSt a Gene Ontology mining tool Jüri Reimand



Presentation Transcript


  1. GOSt a Gene Ontology mining tool Jüri Reimand

  2. Overview • Introduction, bioinformatics • Gene Ontology (GO) • GOSt, a Gene Ontology mining tool • Statistics and thresholds • Ordered gene lists • Extending GO

  3. Introduction • Bioinformatics • Analysis of experimental data • Genes encode proteins • Proteins: building blocks of living organisms • Gene expression: protein production from genetic code • Microarray experiments measure gene expression • Thousands of genes simultaneously • Expression levels over time • Different biological conditions • Comparison of healthy and diseased cells (Figure labels: measures over time; cluster similar profiles)

  4. Introduction • Biological experiments give large amounts of data • Groups of similar genes: • top “most active” genes • similar expression profiles over time • Many genes have some available annotations • Previous knowledge from databases • How to describe the group as a whole? • What are the common features? • Which features are significantly overrepresented? (Figure labels: “steroid metabolism”, “biosynthesis”, “iron ion binding”)

  5. Gene Ontology (GO) • GO - Directed Acyclic Graph (DAG) • Vertices: terms • Edges: relations between general and specific terms • Hierarchically structured vocabulary • 3 DAGs: processes, components, functions • Annotations to vocabulary terms • Association between a gene g and a property t (GO term t) • Based on biological discoveries • Genes of many genomes are annotated to GO • Annotation sets : for a fixed organism • All genes associated with GO term t

  6. GO example • Graph fragment with some terms related to organ development • Vocabulary is general to living organisms • Gene annotations are organism-specific • True Path Rule: hierarchical annotations (Figure labels: ENSG00000163217, ENSG00000161202)
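
The True Path Rule on this slide can be illustrated with a minimal sketch: a gene annotated to a specific term is also counted in the annotation set of every more general term above it in the DAG. The term names, gene names, and the `PARENTS` mapping below are hypothetical toy data, not taken from GO or from GOSt itself.

```python
from collections import defaultdict

# Hypothetical toy DAG: each term maps to its more general parent terms.
PARENTS = {
    "heart development": {"organ development"},
    "kidney development": {"organ development"},
    "organ development": {"development"},
    "development": set(),
}

def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents):
    """Build annotation sets: each gene counts for its direct terms and all their ancestors."""
    annotation_sets = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            annotation_sets[term].add(gene)
            for ancestor in ancestors(term, parents):
                annotation_sets[ancestor].add(gene)
    return dict(annotation_sets)

# Hypothetical direct annotations for two genes.
direct = {"geneA": {"heart development"}, "geneB": {"kidney development"}}
annotation_sets = propagate(direct, PARENTS)
```

After propagation, both toy genes appear in the annotation set of "organ development" even though neither was annotated to it directly.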


  8. GOSt – Gene Ontology Statistics • GO annotations to groups of genes • Statistical significance of results • Thresholds for distinguishing significant results • Analysing ordered lists of genes • Visualisation methods, WWW interface • Command line toolset for large-scale analysis

  9. GOSt example

  10. 45 mouse genes 338 GO

  11. (Figure labels: Evidence codes, Genes, GO terms, P-value)

  12. Annotations to gene groups • Result: term t matches query Q (Figure labels: GO Term, e.g. heart development; Query; Gt; Gq)

  13. Statistical significance • Is intersection Q∩T significant? • Fisher's one-tailed test • Cumulative hypergeometric probability • Get observed or more genes in intersection Q∩T • P(pick k white balls out of K white and N-K black balls) • Multiple testing • Every query results in a number of p-values • Matching GO terms are not independent • Increased rate of false positive matches • Which p-values are significant?
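
A minimal sketch of the cumulative hypergeometric probability named on this slide, assuming a genome of `n_genes` genes, an annotation set T of `term_size` genes, a query Q of `query_size` genes, and `overlap` genes in Q∩T. The function name and parameter names are illustrative choices, not GOSt's own interface.

```python
from math import comb

def hypergeom_tail(n_genes, term_size, query_size, overlap):
    """P(X >= overlap), where X is the number of annotated genes among
    query_size genes drawn without replacement from n_genes genes,
    term_size of which are annotated to the term."""
    lower = max(overlap, 0)
    upper = min(term_size, query_size)
    total = comb(n_genes, query_size)
    favourable = sum(
        comb(term_size, i) * comb(n_genes - term_size, query_size - i)
        for i in range(lower, upper + 1)
    )
    return favourable / total

# Toy numbers: 10 genes in total, 3 annotated to the term, query of 4 genes,
# all 3 annotated genes found in the query.
p = hypergeom_tail(10, 3, 4, 3)
```

A small p-value here means that an overlap this large would rarely arise from a random query of the same size; it does not yet account for the multiple testing issue raised on the slide.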

  14. Experimental thresholds • Simulation experiment • Fix some gene query size k • Repeat 1000 times: • Generate synthetic query Q with k elements: a random subset of the organism's genes • Observe best p-value p for query Q • Store p in the set P • Choose p', the 50th smallest p-value from P • Threshold p': top 5% of p-values for random queries of size k • Calculate for query lengths k = [1,1000] • Compare with standard multiple testing corrections • Bonferroni (1936), Benjamini-Hochberg (1995)
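
The simulation procedure above can be sketched as follows for a single query size k. This is a toy illustration under stated assumptions: the gene universe and annotation sets are randomly generated placeholders, and the helper names (`best_p_value`, `simulated_threshold`) are hypothetical rather than part of GOSt.

```python
import random
from math import comb

def hypergeom_tail(n_genes, term_size, query_size, overlap):
    """Cumulative hypergeometric probability of seeing `overlap` or more annotated genes."""
    upper = min(term_size, query_size)
    total = comb(n_genes, query_size)
    favourable = sum(
        comb(term_size, i) * comb(n_genes - term_size, query_size - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return favourable / total

def best_p_value(query, annotation_sets, n_genes):
    """Smallest p-value over all annotation sets that intersect the query."""
    best = 1.0
    for genes in annotation_sets.values():
        overlap = len(query & genes)
        if overlap:
            best = min(best, hypergeom_tail(n_genes, len(genes), len(query), overlap))
    return best

def simulated_threshold(all_genes, annotation_sets, query_size,
                        repeats=1000, quantile=0.05, seed=0):
    """Best p-values of random queries, sorted; return the one at the given quantile."""
    rng = random.Random(seed)
    best = sorted(
        best_p_value(set(rng.sample(all_genes, query_size)), annotation_sets, len(all_genes))
        for _ in range(repeats)
    )
    rank = max(1, int(repeats * quantile))  # 50th smallest when repeats=1000
    return best[rank - 1]

# Hypothetical toy universe: 200 genes and 20 random annotation sets.
toy_rng = random.Random(1)
genes = [f"g{i}" for i in range(200)]
annotation_sets = {
    f"term{j}": set(toy_rng.sample(genes, toy_rng.randint(5, 40))) for j in range(20)
}
threshold = simulated_threshold(genes, annotation_sets, query_size=10)
```

Repeating the call for each query size of interest would give one threshold per k, which is the curve the slide proposes to compare against Bonferroni and Benjamini-Hochberg corrections.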

  15. Analytical thresholds • Analytical approach to simulated thresholds • Fix gene query size k • Observe all sizes and frequencies of GO annotation sets T • Presume events with different T independent • Observe possible p-values p with query of k elements • Always correct p by constant c=0.97 (set dependencies!) • Find such threshold p', that gives p ~= 0.95 • Repeat for query lengths k = [1,1000]

  16. Significance thresholds

  17. Significance thresholds

  18. Significance thresholds

  19. Significance thresholds

  20. Ordered lists of genes • Gene groups may be ordered • Interesting gene and few most similar genes • Top “most active” genes • Increasing distance from cluster centre • Top of the list, but how many? • Compare list with GO term • Which portion gives best p-value? • Peak significance of ordered query

  21. GOSt algorithms • Unordered query • Intersections with all annotation sets T • Exhaustive algorithm for ordered queries: • intersections with all Qi and annotation sets T • Approximate algorithm for ordered queries: • for every annotation set T, view only list portions that give local p-value extremes • local best p : list ends with matching gene • local worst p : list ends just before matching gene
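
A minimal sketch of the two ordered-query strategies for a single annotation set T, assuming the hypergeometric p-value from the earlier slides. The exhaustive version scores every list portion Qi; the second version scores only portions that end with a matching gene (the "local best p" case). Function names and the toy data are hypothetical, and this is not GOSt's actual implementation.

```python
from math import comb

def hypergeom_tail(n_genes, term_size, query_size, overlap):
    """Cumulative hypergeometric probability of seeing `overlap` or more annotated genes."""
    upper = min(term_size, query_size)
    total = comb(n_genes, query_size)
    favourable = sum(
        comb(term_size, i) * comb(n_genes - term_size, query_size - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return favourable / total

def exhaustive_peak(ordered_query, term_genes, n_genes):
    """Score every prefix Q_i; return (best p-value, prefix length)."""
    best_p, best_len, overlap = 1.0, 0, 0
    for i, gene in enumerate(ordered_query, start=1):
        if gene in term_genes:
            overlap += 1
        p = hypergeom_tail(n_genes, len(term_genes), i, overlap)
        if p < best_p:
            best_p, best_len = p, i
    return best_p, best_len

def match_end_peak(ordered_query, term_genes, n_genes):
    """Score only prefixes that end with a gene annotated to the term."""
    best_p, best_len, overlap = 1.0, 0, 0
    for i, gene in enumerate(ordered_query, start=1):
        if gene not in term_genes:
            continue  # skipped: extending the list with a non-match cannot lower p
        overlap += 1
        p = hypergeom_tail(n_genes, len(term_genes), i, overlap)
        if p < best_p:
            best_p, best_len = p, i
    return best_p, best_len

# Hypothetical toy data: 100 genes in total, 4 annotated to the term.
term_genes = {"g1", "g3", "g4", "g20"}
ordered_query = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
full = exhaustive_peak(ordered_query, term_genes, 100)
fast = match_end_peak(ordered_query, term_genes, 100)
```

On this toy list both functions report the peak at the first four genes. The second function does fewer p-value evaluations because appending a non-matching gene cannot lower the hypergeometric tail probability for the same overlap.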

  22. Example: Ordered list analysis (Figure: p-value against query length; peak significance at ordered list of 28 genes; list of genes, and matches for “Biosynthesis of steroids”)

  23. Ordered list query (Figure labels: Evidence codes, Genes, GO categories, P-value)

  24. Algorithm speed comparison (Figure labels: 24 sec, 2.8 sec)

  25. GOSt features • Command line interface (C/C++ and Perl) • Graphical user interface in web http://bioinf.ebc.ee/GOST • SWOG (Graphics language, Jaanus Hansen 2005) • Data for multiple organisms • yeast, chicken, cow, mouse, rat, human... • Wrappers for parallel applications (GRID, MPI) • Pipelines for gene expression data analysis

  26. Extending GO (i) • Pathway – a network of interacting genes and proteins • metabolism pathways, disease pathways, ... • Include pathway data in GO vocabulary • KEGG Pathway database • pathways as vocabulary terms • related genes as annotations to terms • KEGG terms independent of GO vocabulary (Figure labels: GO; GO:0003674 molecular_function; GO:0005575 cellular_component; GO:0008150 biological_process; KEGG:00000 KEGG pathways)
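
One way to picture "pathways as vocabulary terms" is to store each pathway as one more annotation set next to the GO terms. The sketch below assumes annotation sets are plain mappings from term ID to a set of genes, and that the root term KEGG:00000 collects the genes of all pathways; both are illustrative assumptions, and the gene names are hypothetical.

```python
def add_kegg_terms(go_sets, kegg_pathways, root_id="KEGG:00000"):
    """Return a new vocabulary with GO terms plus KEGG pathway terms.

    Assumption: the root term gathers the genes of every pathway,
    in the spirit of hierarchical annotation.
    """
    merged = {term: set(genes) for term, genes in go_sets.items()}
    root_genes = set()
    for pathway_id, genes in kegg_pathways.items():
        merged[pathway_id] = set(genes)
        root_genes |= set(genes)
    merged[root_id] = root_genes
    return merged

# Hypothetical gene memberships; only the term IDs appear in the slides.
go_sets = {"GO:0008150": {"geneA", "geneB", "geneC"}}
kegg_pathways = {"KEGG:05010": {"geneB", "geneD"}}
vocabulary = add_kegg_terms(go_sets, kegg_pathways)
```

With this layout, the same significance test can be run over GO terms and pathway terms without special cases, since both are just named gene sets.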

  27. KEGG:05010 - Alzheimer's disease

  28. Extending GO (ii) • Gene expression started by transcription factors (TF) • TFs bind to certain patterns in DNA • Transcription Factor Binding Sites (TFBS) • Often found in regions close to gene (1k bp) • Include TFBS data from TRANSFAC • Patterns (putative TFBS) as vocabulary terms • annotations to genes near patterns (Figure labels: Transcription factor; TF binding site; gene; sequence ATATAATAAAGATGAGGCGAATATAAATATACCGGCCCTTAGCGCGAAGCAATTCATCATATAAGCGAGAGAGGCCAATATGCAATCTTCGACAGCAT)

  29. Motifs added in a hierarchy according to PWM score • 5 levels: near_threshold ... near_MAX_score • Work in progress: Hedi Peterson (Figure labels: depth in hierarchy; GO with GO:0003674 molecular_function, GO:0005575 cellular_component, GO:0008150 biological_process; KEGG:00000 KEGG pathways; TF:M00000 TRANSFAC motifs; TF:M00431_4 TTTSGCGS:4; TF:M00431_3 TTTSGCGS:3; TF:M00431_2 TTTSGCGS:2; TF:M00431_1 TTTSGCGS:1; TF:M00431_0 TTTSGCGS:0; TF:M00328_4 NCNNTNNTGCRTGANNNN:4; TF:M00328_3 NCNNTNNTGCRTGANNNN:3; TF:M00328_2 NCNNTNNTGCRTGANNNN:2)

  30. Summary • We investigated means for finding GO annotations to groups of genes, and statistical methods for determining significance of results. • We combined GO vocabulary with various types of biological data, such as KEGG pathways and TRANSFAC regulatory elements. • We proposed analytical thresholds for distinguishing significant results from structured and partly dependent GO annotations, and verified thresholds with simulation experiments. • We proposed a novel concept of analyzing GO annotations for ordered lists of genes, and implemented fast algorithms for the purpose. • The practical result of our work is GOSt, a GO mining tool. Command line interface is suitable for large-scale automatic analysis, while graphical web interface enables highly visualized and interactive analysis.

  31. Sneak preview • GO analysis of hierarchical clustering tree • Cluster genes according to expression similarity and .. • .. “Wrap up” nodes that show no significant annotations in GO • Work in progress • Meelis Kull • Darja Krushevskaja

  32. Acknowledgments • Jaak Vilo • BIIT group • Hedi Peterson • Raivo Kolde • Meelis Kull • Konstantin Tretjakov • Jaanus Hansen • Pavlos Pavlidis • Priit Adler • Asko Tiidumaa • Ilja Livenson • Darja Krushevskaja • FunGenES Consortium
