Blast2GO presentation @ StatSeq COST workshop, 21st-23rd April 2013, Helsinki, Finland. Friday 25th January 2013, Royal Melbourne Hospital. Why Blast2GO. Functional characterization of novel sequence data. Adapted to the high-throughput needs of biological laboratories.
Blast2GO presentation @ StatSeq COST workshop, 21st-23rd April 2013, Helsinki, Finland. Friday 25th January 2013, Royal Melbourne Hospital
Why Blast2GO
• Functional characterization of novel sequence data
• Adapted to the high-throughput needs of biological laboratories
• Extracting knowledge about the functioning of genomes
Outline • Concepts on Functional Annotation • The Blast2GO annotation framework • Visualization of functional data • Pathway analysis with Blast2GO
Concepts of Functional Annotation What is functional annotation? How to annotate a large dataset?
The Gene Ontology
• Three branches:
  • Biological Process
  • Molecular Function
  • Cellular Component
• Annotations are given at the most specific (lowest) level
• True path rule: annotation at a given term implies annotation to all its parent terms
• Annotation is given with an Evidence Code:
  • IDA: inferred from direct assay
  • TAS: traceable author statement
  • ISS: inferred from sequence or structural similarity
  • IEA: inferred from electronic annotation
  • …
(Figure: GO hierarchy running from more general to more specific terms)
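The true path rule can be illustrated with a small sketch. The term identifiers and the `PARENTS` table below are invented for illustration and are not real GO terms; the function simply collects every ancestor of the directly annotated terms.

```python
# Toy ontology fragment: child term -> list of parent terms.
# All identifiers are hypothetical placeholders, not real GO IDs.
PARENTS = {
    "GO:c": ["GO:b"],
    "GO:b": ["GO:a"],
    "GO:d": ["GO:a", "GO:b"],  # a term may have several parents (DAG)
    "GO:a": [],
}

def with_ancestors(terms, parents=PARENTS):
    """Return the given terms plus every ancestor reachable via `parents`."""
    implied = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in implied:
            continue  # already visited through another path
        implied.add(term)
        stack.extend(parents.get(term, []))
    return implied

# A direct annotation at the most specific term implies all its parents.
print(sorted(with_ancestors({"GO:c"})))
```

Annotating at the most specific level therefore loses nothing: the more general terms can always be recovered by walking up the graph.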
Functional assignment
• Annotation is either empirical or obtained by transference
• Sources shown in the diagram: literature reference, phylogeny, molecular interactions, biochemical assay, sequence analysis, structure comparison, sequence homology, gene/protein expression, identification of folds, motif identification
Annotation by similarity: concerns
(Figure: the GO terms GO1, GO2, GO3, GO4 of the HIT are transferred to the QUERY)
• Level of homology (from ~40-60% is possible)
• The overlap between hit and query; the association between function and structure
• The paralog problem: genes with similar sequences might have different functional specifications
• The evidence for the original annotation
• Balance between quality and quantity: depends on the use
Application scheme cellular component biological process Fasta
Basic annotation procedure
(Figure: sequences Sq1-Sq4 pass through three steps: Blast, Mapping, Annotation. Blast returns hits Hit1-Hit4 for each sequence; Mapping attaches candidate GO terms to the hits, e.g. go1, go2, go3; Annotation assigns a selected set of GO terms, e.g. go1, go3, go4, to each sequence.)
Annotation Rule
• Let GO1…n be the candidate annotations for sequence S1, obtained from hits Hi…k
• We compute an annotation score AS for each GOi that depends on:
  • The similarity between sequence S1 and Hj
  • The evidence code of GOi
  • The existence of other neighboring GO candidates
  • The structure of the Gene Ontology
• We define an arbitrary annotation threshold (AT)
• S1 is annotated with GOi if AS(GOi) > AT
Annotation Rule
(Figure labels)
• Similarity requirement
• Quality of source annotation: IEA = 0.7, IDA = 1, NR = 0.0, ...
• Possibility of abstraction (True Path Rule) among the candidates GO1, GO2, GO3, GO4
• Annotation Score against a cut-off value: selectivity vs. specificity
• New annotation
Blast2GO annotation rule
- When I have a GO term with ECw = 1 and I do not allow abstraction (GOw = 0), then the Annotation Score = % similarity
- If ECw < 1, my similarity requirement is higher to obtain the same Annotation Score
- If I allow abstraction (GOw > 0), then with less similarity I can obtain the required Annotation Score at a parent node
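The three statements of the rule can be made concrete with a minimal sketch. The formula below is only one form that is consistent with them and is an assumption, not a statement of Blast2GO's implementation: a direct term (best similarity weighted by the evidence-code weight ECw) plus an abstraction term that is assumed to grow linearly with the number of candidate child terms merged at the node. The function names and the `merged_children` parameter are hypothetical.

```python
# Evidence-code weights as quoted on the slides; other codes are omitted here.
EC_WEIGHTS = {"IDA": 1.0, "IEA": 0.7, "NR": 0.0}

def annotation_score(hits, merged_children=0, go_weight=0.0):
    """Score one candidate GO term (illustrative formula, see assumptions).

    hits: list of (similarity_percent, evidence_code) pairs supporting the term.
    merged_children: candidate child terms unified at this node (assumption).
    go_weight: GOw; a value of 0 disables abstraction.
    """
    # Direct term: best hit similarity weighted by its evidence-code weight.
    direct = max(sim * EC_WEIGHTS[code] for sim, code in hits)
    # Abstraction term: assumed linear in the number of merged child candidates.
    abstraction = go_weight * merged_children
    return direct + abstraction

def is_annotated(score, threshold):
    """The sequence receives the term if the score exceeds the threshold AT."""
    return score > threshold

# ECw = 1 and GOw = 0: the score equals the % similarity.
print(annotation_score([(80, "IDA")]))
# ECw < 1: the same similarity yields a lower score.
print(annotation_score([(80, "IEA")]))
# GOw > 0: a parent node with merged children can pass with less similarity.
print(is_annotated(annotation_score([(60, "IEA")], merged_children=3, go_weight=5), 55))
```

The threshold of 55 in the last call is an illustrative value chosen for this sketch.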
Blast2GO Application
• Workflow: (1) Blast, (2) Mapping, (3) Annotation
• Main Sequence Table
• Any operation will only affect the selected sequences!
• Application statistics
• Blast results
• Application messages
• Graph visualisation
Input data (in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggttttataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccctaaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagctacactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagcagatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacaccgtttattacacactcgaaaggccgttag
>my_favourite_species_seq2 | no clue
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaagagatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgtacaaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaaagctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctgacgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggtttttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaagatgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgacaacaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggtagtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaattttagtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacggggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtcccaaaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt
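Before loading sequences it can help to check that a file really is valid FASTA and whether it holds nucleotide (nt) or amino-acid (AA) data. The sketch below is a minimal, standard-library reader; the function names, the shortened example records and the 90% cut-off in the nt/AA heuristic are all assumptions made for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def looks_like_nucleotide(seq, min_fraction=0.9):
    """Heuristic: call a sequence nt if most letters are A/C/G/T/U/N."""
    letters = [c for c in seq.lower() if c.isalpha()]
    if not letters:
        return False
    matches = sum(c in "acgtun" for c in letters)
    return matches / len(letters) >= min_fraction

# Shortened, made-up records in the same layout as the slide's input.
example = """>seq1 | still unknown
gtgatggaaaag
aaaagttttg
>seq2 | no clue
MKTAYIAKQR
"""
records = list(parse_fasta(example))
for header, seq in records:
    kind = "nt" if looks_like_nucleotide(seq) else "AA"
    print(header, len(seq), kind)
```

The nt/AA distinction matters for the next step, since it determines which BLAST program is appropriate.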
BLAST
• Your email address
• BLAST program (normally blastx)
• BLAST database (many options)
• E-value (depends on the DB)
• Number of hits (use <= 20)
• Recommended to save as XML
• Human-readable sequence descriptions via BDA
Additional BLAST params
• Set word size and filter
• Use your own server
• Minimum HSP length
• Filter by description
• Parsing options for own databases
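The effect of the E-value, hit-number and minimum-HSP-length settings can be sketched on a saved BLAST XML result. The example below reads the element names used by NCBI BLAST XML output (`Hit`, `Hit_def`, `Hit_hsps`, `Hsp`, `Hsp_evalue`, `Hsp_align-len`) with Python's standard library. The embedded sample data, the function name and the default cut-offs other than the 20-hit limit are assumptions for illustration; this is not how Blast2GO itself applies the filters.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the BLAST XML layout; the hits are made up.
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_def>made-up protein A</Hit_def><Hit_hsps><Hsp>
  <Hsp_evalue>1e-40</Hsp_evalue><Hsp_align-len>250</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
<Hit><Hit_def>made-up protein B</Hit_def><Hit_hsps><Hsp>
  <Hsp_evalue>0.5</Hsp_evalue><Hsp_align-len>20</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

def filter_hits(xml_text, max_evalue=1e-3, max_hits=20, min_hsp_len=33):
    """Return (description, evalue, hsp_length) for hits passing the filters."""
    root = ET.fromstring(xml_text)
    kept = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP listed for this hit
        if hsp is None:
            continue
        evalue = float(hsp.findtext("Hsp_evalue"))
        length = int(hsp.findtext("Hsp_align-len"))
        if evalue <= max_evalue and length >= min_hsp_len:
            kept.append((hit.findtext("Hit_def"), evalue, length))
    return kept[:max_hits]  # keep at most max_hits, in file order

for description, evalue, length in filter_hits(SAMPLE_XML):
    print(description, evalue, length)
```

Saving results as XML, as recommended above, keeps this kind of re-filtering possible without repeating the search.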
BLAST Results: sequences with BLAST results turn RED
Blast Distribution Charts Evaluate the similarity of your sequences with public DBs
Single Sequence Menu
Mapping Results: mapped sequences turn GREEN
Annotation Menu
• BLAST based annotation
• Other Annotation modes
• Validation and Annex
Annotation
• Allows setting a minimum percentage of the HIT sequence that must be spanned by the QUERY sequence
• This helps to avoid the problem of cis-annotation
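The coverage requirement amounts to a simple ratio. The sketch below assumes that coverage is measured as the aligned (HSP) length divided by the full hit length, and that cis-annotation refers to transferring functions from parts of a hit that the query does not align to; both the function names and the 30% cut-off used in the example are hypothetical.

```python
def hit_coverage_percent(hsp_length, hit_length):
    """Percentage of the hit sequence spanned by the aligned query region."""
    if hit_length <= 0:
        raise ValueError("hit_length must be positive")
    return 100.0 * hsp_length / hit_length

def passes_coverage(hsp_length, hit_length, min_percent):
    """True if the alignment covers at least min_percent of the hit."""
    return hit_coverage_percent(hsp_length, hit_length) >= min_percent

# A query aligning to 50 residues of a 500-residue hit covers only 10%,
# so the hit's GO terms would not be transferred under a 30% requirement.
print(passes_coverage(50, 500, min_percent=30))
print(passes_coverage(450, 500, min_percent=30))
```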
Annotation Result: annotated sequences turn BLUE
Annotation Charts Commonly, level 5 is the most abundant specificity level in the Gene Ontology
Additional Annotation: ANNEX
• Recovers implicit biological process and cellular component GO terms based on molecular function annotations
• Molecular Function "is involved in" Biological Process
• Molecular Function "acts in" Cellular Component
(Myhre et al., Bioinformatics 2006)
Additional Annotation: InterProScan
• Runs InterProScan searches at the EBI through Blast2GO
• Results are stored on your computer as XML files; you can upload them later
• Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the BLAST annotation
InterProScan Results Column with InterProScan results
Additional Annotation: GOSlim
• GOSlim is a reduction of the Gene Ontology to a smaller vocabulary → helps to summarize information
• After GOSlim transformation, sequences turn YELLOW
• Different GOSlims are available in Blast2GO
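One way to picture the reduction is as a walk up the ontology until a term from the slim vocabulary is reached. The sketch below works under that assumption and uses invented term names; it is not a description of how Blast2GO performs the transformation.

```python
# Toy ontology: child term -> parents. All names are hypothetical.
PARENTS = {
    "leaf": ["mid"],
    "mid": ["slimA"],
    "other": ["slimB"],
    "slimA": ["root"],
    "slimB": ["root"],
    "root": [],
}
SLIM = {"slimA", "slimB"}  # the reduced vocabulary

def to_slim(term, parents=PARENTS, slim=SLIM):
    """Map a term to the closest slim terms found when walking upwards."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing once a slim term is reached
        else:
            stack.extend(parents.get(current, []))
    return found

def slim_annotation(terms):
    """Replace a set of detailed terms by the union of their slim terms."""
    result = set()
    for term in terms:
        result |= to_slim(term)
    return result

print(sorted(slim_annotation({"leaf", "other"})))
```

Many detailed terms collapse onto a few slim terms, which is what makes the result easier to summarize in charts.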
Enzyme annotation and KEGG maps: GO → Enzyme Codes → KEGG maps
Manual Curation
• You can manually modify the annotation of particular sequences
• If you tick this box, curated sequences turn purple
Export Results Saves the complete B2G project (heavy) Export annotation results in different formats
Export formats Also for import! .annot GeneSpring Format GoStat By Seq
More export formats Export Sequence Table Export BestHit Data
Sequence Selection Sequence Selection tool to obtain a selection based on annotation status
Sequence Selection By Function By Name/Description
View Menu Functions to switch between displaying IDs or descriptions for GO annotation or InterPro results
Hands-on I: annotating 10 sequences with Blast2GO
Visualization: how to understand the functional context of an annotated dataset
Combined Graph
• Each term has a number of sequences associated
• Nodes can be coloured to indicate relevance
• Each term is displayed within its biological context
• Node shape differentiates between direct and indirect annotation
Combined Graph
• Different GO branches
• Reduces nodes by number of annotated sequences
• Node data to be displayed
• Criterion for highlighting and filtering nodes
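Node information content
• Accumulated by GO term (SequenceCount)
• Incoming information (Node Score): $\mathrm{score}(g') = \sum_{g \in \mathrm{desc}(g')} \mathrm{seq}(g) \cdot \alpha^{\mathrm{dist}(g, g')}$
(Figure: example graph showing sequence counts and node scores per node)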
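The node score formula can be tried out on a toy graph. The sketch below makes three assumptions that the slide does not state: desc(g') includes g' itself at distance 0, dist is the shortest number of edges between two terms, and the graph, counts and function name are invented for illustration.

```python
from collections import deque

# Toy graph: parent -> children, with direct sequence counts per term.
CHILDREN = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
SEQ_COUNT = {"root": 0, "a": 1, "b": 2, "c": 4}

def node_score(node, alpha, children=CHILDREN, seq_count=SEQ_COUNT):
    """Sum seq(g) * alpha**dist(g, node) over node and its descendants."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest edge distances
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return sum(seq_count.get(g, 0) * alpha ** d for g, d in dist.items())

for term in CHILDREN:
    print(term, node_score(term, alpha=0.5))
```

With alpha below 1, sequences annotated far below a node contribute less to its score than sequences annotated at or just beneath it, so the score favours nodes close to where the annotation actually sits.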