Exercise 5: Looking for overreprsented GO terms in a gene set using Onto-Express

Exercise 5: Looking for overreprsented GO terms in a gene set using Onto-Express GO annotations can be used to obtain functional data about gene sets e.g. from gene expression experiments. There are several tools available to perform this sort of analysis, all developed by groups outside of the GO Consortium. They work in a similar way: a full gene set is uploaded, with a subset of all ‘interesting’ genes, usually those that have been up- or down-regulated in an expression experiment. The tool then determines which GO categories have been enriched for ‘interesting’ genes, and provides some sort of statistical measure to guard against GO categories that appear by chance alone. For a full list of these tools, see: http://www.geneontology.org/GO.tools.annotation.shtml. This tutorial will be using one such tool, Onto-Express, one of a package of microarray tools, Onto-Tools, developed at the Intelligent Systems and Bioinformatics Laboratory, Wayne State University (http://vortex.cs.wayne.edu/). First you need to obtain an Onto-Express login. Go to: http://vortex.cs.wayne.edu/ontoexpress/servlet/UserInfo and fill in the details. Your password will be emailed straight to you. Once your password has arrived, go to to the login page: http://vortex.cs.wayne.edu/ontoexpress/ fill in your login details and click ‘Submit’. You will see a security pop-up, choose ‘Grant Always’. A second pop-up will appear: Choose Onto-Express and click Run.

Note: You will need to leave the original browser window open for the whole of your session. We have provided some test microarray data for this tutorial; to download it, go to: http://www.ebi.ac.uk/~jane/pombe/ and download both files (sp_changed.txt and sp_total.txt). Tip: to download the files, right-click and choose ‘Save Link As’. Remember where you saved it to! The input file format is a simple text file with either accession numbers, cluster identifiers or probe identifiers (but not a combination), each listed on a separate line. Return to your input window, which should look like this: Click the ‘Input File’ button and browse for the file sp_changed.txt that you’ve just downloaded. This file contains a list of genes that were under- or over-expressed in the experiment. Now choose ‘schizosaccharomyces pombe’ from the Organism menu, and for Input Type choose ‘sanger genedb’; this chooses the format of the gene list. From the Reference Array menu choose ‘My own array’ and then click the Reference file button to browse for the file sp_total.txt. This file contains a list of all of the genes on your chip. Leave all the other setting as they are and click ‘Submit’. If you used a commercial chip in your experiment, you can choose this from the Reference Array list rather than uploading your own reference file.

After a minute or so, a results page will appear: From the Display options in the top right, choose Display ‘Biological Process’ and Sort by ‘Total’. Total is the number of genes associated with the GO term. Q: Which GO biological process term has the most genes associated with it? Is the over-representation of this term statistically significant? Q: Which molecular function and cellular component GO categories have the most genes associated with them? Hint: To switch between which ontology is shown in the flat view, use the Display options. Click the ‘Tree View’ tab. The number in bold following the term is the number of genesfrom the ‘interesting’ subset that are annotated to this term, and its child terms, in the same way as AmiGO. Open the nodes ‘biological process’ -> ‘development’ -> ‘morphogenesis’ -> ‘cellular morphogenesis’ -> ‘extablishment and/or maintenance of cell polarity’. You will see a list of genes annotated directly to ‘extablishment and/or maintenance of cell polarity’. Clicking on the gene names gives you the option of viewing the gene details. Now click the ‘Syncronized View’ tab. This shows you a flat view of all open nodes in the tree view. Q: Which is the only GO category that has a significantly different number of genes associated than you would expect? Why do you think this is? Q: What processes, functions and cellular components seem to be associated with these microarray data?

Exercise 5: Looking for overreprsented GO terms in a gene set using Onto-Express