Automatic methods for functional annotation of sequences

Petri Törönen

What, Why, How??? • Functional annotation of a sequence (seq.) • Definition of the description line • Mapping sequences to functional categories • Simple solutions are error-prone • Review some available tools in the exercises

Old, simple way • Do a sequence search (SS), such as BLAST, with your sequence • Find the best match • Transfer all the information from the best match to your sequence • Everything done? Finished?
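The best-match transfer described above can be sketched in a few lines of Python. This is a minimal sketch, not a recommended pipeline: it assumes the search results are in the BLAST tabular format (`-outfmt 6`, where column 2 is the subject ID, column 11 the E-value and column 12 the bit score), and the function names, the `descriptions` lookup table and the E-value cutoff are hypothetical choices made for illustration.

```python
def best_hit(tabular_lines, max_evalue=1e-5):
    """Return the subject ID with the highest bit score, or None.

    Assumes BLAST tabular output (-outfmt 6): tab-separated columns,
    subject ID in column 2, E-value in column 11, bit score in column 12.
    The E-value cutoff is an arbitrary illustrative default.
    """
    best = None  # (subject_id, bitscore)
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue  # not a significant match
        if best is None or bitscore > best[1]:
            best = (subject, bitscore)
    return best[0] if best else None


def transfer_annotation(tabular_lines, descriptions, max_evalue=1e-5):
    """Copy the description of the best hit to the query (the 'old, simple way').

    `descriptions` is a hypothetical dict mapping subject IDs to description lines.
    """
    hit = best_hit(tabular_lines, max_evalue)
    if hit is None:
        return None  # no significant match: nothing to transfer
    return descriptions.get(hit)
```

Note that the sketch blindly trusts whatever description the best hit carries, which is exactly the weakness of this approach.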

Problems • First hit is an unknown sequence • First hit is a misannotated sequence (an increasing problem!) • No significant matches found • Strong but only local matches => impurities in the search • Impurities in the query sequence

Why is manual analysis hard? • Large size of gene lists (SS result list) • False positives among the observed results

Why is manual analysis hard? • Each gene can have multiple functions, so the important common theme among the genes can easily go unnoticed • Requires detailed knowledge of the genes • Varying representations of the same function in description lines • Objectivity

Gene Ontology (GO) www.geneontology.org • A controlled vocabulary of gene product roles in cells and the role associations • The roles can be applied to all organisms • Three main hierarchies (biological process, cellular component and molecular function) currently include about 19,000 classes (= roles) • Usually only a small portion of these classes is in use with one organism (example: chloroplast-related functions are important only in plants)

Structure of GO • GO graph: a hierarchical structure of linked nodes, where each node represents one class that is part of its parental class • Directed acyclic graph (DAG): a tree-like structure where branches can also merge when going from parental nodes to child nodes • Genes can be linked to many classes in the GO structure • [Figure: GO graph running from the starting node (root of the hierarchical structure) through less detailed classes to more detailed classes]
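A small sketch may help to see why the DAG structure matters: a gene linked to one class is implicitly linked to all of its parental classes, and because branches can merge, a class can have more than one parent. The graph below is a toy example with made-up class names, not real GO identifiers, and the `ancestors` helper is hypothetical.

```python
# Toy DAG: each class maps to its parental classes (child -> parents).
# The class names are invented placeholders, not real GO terms.
PARENTS = {
    "root": [],
    "process_A": ["root"],
    "process_B": ["root"],
    "process_C": ["process_A", "process_B"],  # branches merge: two parents
    "process_D": ["process_C"],
}


def ancestors(go_class, parents):
    """Return every class reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(go_class, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_classes, parents):
    """Expand a gene's direct class links to include all parental classes."""
    expanded = set(direct_classes)
    for go_class in direct_classes:
        expanded |= ancestors(go_class, parents)
    return expanded
```

For example, `propagate({"process_D"}, PARENTS)` yields the detailed class together with every less detailed class above it, up to the root.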

How GO helps • GO provides a terminology for presenting the known information about a gene • GO classifies genes according to their known/predicted functions • Classes represent varying levels of detail • Classifications can be used to find over-represented functions in the results

How GO helps • Look for over-represented GO classes in the gene list • We would like to ask: what is the probability of observing, purely by chance, as many class members as we have in the cluster? • The solution from statistics is sampling without replacement (the hypergeometric distribution) • Sampling without replacement answers questions such as: in how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?
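The ball-drawing question can be worked through directly with binomial coefficients. The sketch below is illustrative only: the slide gives the sample size (8) and the number of white balls drawn (2), while the totals used in the usage note (20 balls, 5 of them white) are assumptions added to make the arithmetic concrete. White balls stand for genes in the GO class, the drawn balls for the gene list.

```python
from math import comb


def ways(total, white_total, drawn, white_drawn):
    """Number of ways to draw `drawn` balls with exactly `white_drawn` white ones."""
    black_total = total - white_total
    return comb(white_total, white_drawn) * comb(black_total, drawn - white_drawn)


def hypergeom_pmf(total, white_total, drawn, white_drawn):
    """Probability of exactly `white_drawn` white balls when sampling without replacement."""
    return ways(total, white_total, drawn, white_drawn) / comb(total, drawn)


def overrepresentation_pvalue(total, white_total, drawn, white_drawn):
    """Probability of seeing at least `white_drawn` white balls by chance (upper tail)."""
    upper = min(drawn, white_total)
    return sum(
        hypergeom_pmf(total, white_total, drawn, k)
        for k in range(white_drawn, upper + 1)
    )
```

With the assumed totals, `ways(20, 5, 8, 2)` counts the selections with exactly two white balls, and `overrepresentation_pvalue(20, 5, 8, 2)` gives the chance of drawing two or more. A small tail probability for a GO class suggests that the class is over-represented in the gene list.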

Methods that predict protein function • Methods that summarize the SS result list • Methods that use profile searches • Methods that use sequence features • Methods based on sequence patterns • Methods based on sequence phylogeny

SS list summarization • Consensus analysis of the SS list • Do the SS • Look for repeatedly occurring descriptions / GO classes • Over-representation of GO classes (BLAST2GO) • Tools performing this: • Our method PANNZER (Koskinen et al. unpubl.) • BLAST2GO (http://www.blast2go.org/start_blast2go) • ConFunc
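The consensus idea (look for GO classes that occur repeatedly in the result list) can be illustrated with a simple count. This is a rough sketch of the general principle only; it is not how PANNZER, BLAST2GO or ConFunc score their results. The function name, the input layout and the 50% threshold are all assumptions made for the example.

```python
from collections import Counter


def go_class_support(hit_annotations, min_fraction=0.5):
    """Return GO classes carried by at least `min_fraction` of the hits.

    `hit_annotations` is a hypothetical dict: hit ID -> iterable of GO class IDs.
    The result maps each retained GO class to the fraction of hits supporting it.
    """
    n_hits = len(hit_annotations)
    if n_hits == 0:
        return {}
    counts = Counter()
    for go_classes in hit_annotations.values():
        counts.update(set(go_classes))  # count each class once per hit
    return {
        go_class: count / n_hits
        for go_class, count in counts.items()
        if count / n_hits >= min_fraction
    }
```

Unlike the single best match, a class has to be supported by several hits before it is transferred, so one misannotated hit has less influence on the outcome.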

Profile search methods • Use profile searches instead of SS • Some positions are more conserved in the sequence • PFAM http://pfam.sanger.ac.uk/ • ConFunc http://www.sbg.bio.ic.ac.uk/~confunc/

ConFunc in detail • BLAST search with the query sequence • Obtain a result list • Sequences in the result list are clustered into groups with similar function (same GO classes) • Each cluster is used as a seed for a profile search • Test how well the query sequence matches each profile • Use link: http://www.sbg.bio.ic.ac.uk/confunc/indextemp.cgi
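The clustering step above (grouping the result-list sequences by shared GO class) can be sketched as an inverted index. This only illustrates the grouping idea under simplifying assumptions; it is not ConFunc's implementation, and the profile construction and profile search that follow are not shown. The function name, the input layout and the minimum cluster size are hypothetical.

```python
from collections import defaultdict


def cluster_hits_by_go_class(hit_annotations, min_size=2):
    """Group hit IDs by GO class; each group could seed one profile.

    `hit_annotations` is a hypothetical dict: hit ID -> iterable of GO class IDs.
    Groups smaller than `min_size` are dropped (an arbitrary illustrative rule).
    """
    clusters = defaultdict(list)
    for hit_id, go_classes in hit_annotations.items():
        for go_class in set(go_classes):
            clusters[go_class].append(hit_id)
    return {
        go_class: sorted(members)
        for go_class, members in clusters.items()
        if len(members) >= min_size
    }
```

Each returned group would then be aligned and turned into a profile by an external tool, and the query scored against every profile.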

Sequence feature methods • Look for sequence features • Features: secondary structure, protein domains • Compare sequences by looking at which features they have in common • Methods that do this: FACT http://www.cibiv.at/FACT/ • Limited search possibilities with FACT

Sequence pattern methods • Pattern => frequently observed short motif from a sequence DB • InterProScan • BioDictionary from IBM Computational Biology (http://cbcsrv.watson.ibm.com/Tpa.html) • Extraction of most of the patterns from Swiss-Prot • Linking of each pattern to keywords seen in the sequences where the pattern was found • Query sequence is linked to keywords via the patterns it has
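The last two steps (patterns linked to keywords, and the query linked to keywords via its patterns) can be sketched with regular expressions. The patterns and keywords below are made-up placeholders for illustration; they are not taken from BioDictionary, InterProScan or Swiss-Prot, and the function name is hypothetical.

```python
import re

# Hypothetical pattern dictionary: regular expression -> keywords.
# In a real system the patterns and keywords are mined from a sequence DB.
PATTERN_KEYWORDS = {
    r"[AG].{4}GK[ST]": {"nucleotide-binding"},
    r"C..C.{10,20}C..C": {"metal-binding"},
}


def keywords_for(query_seq, pattern_keywords):
    """Collect the keywords of every pattern found in the query sequence."""
    found = set()
    for pattern, keywords in pattern_keywords.items():
        if re.search(pattern, query_seq):
            found |= keywords
    return found
```

A query that contains none of the patterns receives no keywords at all, so the coverage of the pattern dictionary limits what can be annotated.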

Phylogeny-based methods • In short: include the species tree in the annotation of the sequences • Evolutionary distance is taken into account • Compara from ENSEMBL • http://www.ebi.ac.uk/GOA/compara_go_annotations.html

Tip for testing the tools • For testing with a purely random sequence • http://www.bioinformatics.org/sms2/random_protein.html • For testing with a partially random sequence • http://www.bioinformatics.org/sms2/mutate_protein.html
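If the web tools are not at hand, similar test sequences can be produced locally. The sketch below is an approximation written for illustration, not a reimplementation of the linked tools: it assumes uniform frequencies over the 20 standard amino acids, and the function names and the `fraction` parameter are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_protein(length, rng):
    """Purely random sequence; uniform residue frequencies are an assumption."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate_protein(seq, fraction, rng):
    """Partially random sequence: replace a fraction of positions with a different residue."""
    n_mutations = round(len(seq) * fraction)
    positions = rng.sample(range(len(seq)), n_mutations)
    residues = list(seq)
    for pos in positions:
        alternatives = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(alternatives)
    return "".join(residues)


# Usage: a seeded generator makes the test sequences reproducible.
rng = random.Random(42)
decoy = random_protein(200, rng)
noisy = mutate_protein(decoy, 0.25, rng)
```

A good annotation tool should return nothing confident for the purely random sequence, and its predictions should degrade gradually as the mutated fraction of a real protein grows.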
