EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk
Sequence Searching and Alignments Andrew Cowley External Services, EMBL-EBI
External Services
• Andrew Cowley: Bioinformatics Trainer
• Hamish McWilliam: Software engineer
• Rodrigo Lopez: Head of External Services
• + many others!
Sequence searching and alignments - Andrew Cowley
Contents
• Sequence databases
• Database browsing tools
• Similarity searching and alignments
• Alignment basics
• Similarity searching tools
• More advanced tools
• Alignment tools
• Guidelines
• (slightly) More advanced tools
• Problem sequences
Materials
• Presentations and tutorials can be found on the roadshow course page at the EBI.
• Data files for exercises can be found at: www.ebi.ac.uk/~watson/africa
Data
Simplistically, much of the data at the EBI can be thought of as a container with two parts:
• the raw data (e.g. sequence)
• the annotation on this data
Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtcarttaattcacagagccctgaggtctctcgctcctttctgcgtcactctct        60
     cttaccgtcatcatgtacaagtggttgctttgcctggctctgattggctgcgtcagcggc       120
     tgcaacatcctccagaggctgaaggtcaagaaccagtggcaggaggctttcggctatgct       180
     gacgacaggacatcccycggtaccgcattgtggagatccatcatcatgcagaagcccgag       240
//
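The ID line of the example entry packs several pieces of annotation into one record, including the data class (STD) and taxonomic division (INV) covered later in this talk. As a minimal sketch, assuming the seven-field, semicolon-separated layout seen in the example above, it could be split like this. The `IdLine` type and `parse_id_line` function are hypothetical names for illustration, not part of any EBI tool.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    """Fields of an EMBL-style ID line (names chosen for this sketch)."""
    accession: str
    version: int
    topology: str
    molecule_type: str
    data_class: str
    taxonomic_division: str
    length_bp: int


def parse_id_line(line: str) -> IdLine:
    """Split an ID line into its seven semicolon-separated fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the two-letter line code, surrounding spaces and the final full stop.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    accession, sv, topology, mol_type, data_class, division, length = fields
    return IdLine(
        accession=accession,
        version=int(sv.split()[1]),        # "SV 1" -> 1
        topology=topology,
        molecule_type=mol_type,
        data_class=data_class,
        taxonomic_division=division,
        length_bp=int(length.split()[0]),  # "919 BP" -> 919
    )


record = parse_id_line("ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.")
print(record.accession, record.data_class, record.taxonomic_division, record.length_bp)
```

Reading the class and division straight from the ID line is enough to tell which slice of the database an entry belongs to before looking at the rest of the record.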
Data - Nucleotide
• ENA/EMBL-Bank:
  • Release and updates
  • Divided into classes and divisions
  • Supplementary sets: EMBL-CDS, EMBL-MGA
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  • Alternative splicing: ASD, ASTD, etc.
  • Completed genomes: Ensembl, Integr8, etc.
  • Variation: HGVBase, dbSNP, etc.
Individual sequencing: what sequence data is submitted?
[Diagram] Individual scientists → sequence an individual gene → add annotation → submission
High throughput sequencing
[Diagram] Large-scale sequencing projects (e.g. whole genome shotgun): chromosome → fragment → sequencing library → sequence reads → assemble sequence → annotation (example gene labels: cyp30, cyp309, insv, cg343). Submissions are made at several points along this pipeline.
What are primary sequence databases?
• Submitters: individual scientists, large-scale sequencing projects, patent offices
• Primary sequence data:
  • Original sequence data
  • Experimental data
  • Patent data
  • Submitter-defined
• Primary sequence data is stored in a primary sequence database
How do primary and derived databases differ?
• Submitters: individual scientists, large-scale sequencing projects, patent offices
• Primary sequence data is stored in the primary sequence database
• Derived data (e.g. protein sequence) is stored in a derived database
Primary v. derived data
• Submitted DNA sequence: ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT
• Transcribe → derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
• Translate → derived protein sequence: MRSDECCCAMSC
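The derivation steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and treating transcription as a simple T-to-U substitution on the coding strand; the function names are hypothetical. Translating the slide's mRNA codon by codon gives MRSDECCCAMSC, because the fourth codon, GAU, codes for aspartate (D).

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Derive an mRNA string from the coding strand (simplified: T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA string in frame 1, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(mrna))  # MRSDECCCAMSC
```

Because the protein can always be recomputed from the nucleotide entry in this way, it counts as derived data rather than primary data.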
How do primary and derived databases differ? Redundancy
• Primary sequence database: redundant. If anything in a submission varies (e.g. source, submitter or sequence), a new entry is generated.
• Derived database (derived data, e.g. protein sequence): may be non-redundant.
How do primary and derived databases differ? Data loss
[Diagram] The same flow from submitters (individual scientists, large-scale sequencing projects, patent offices) through primary sequence data and the primary sequence database to derived data (e.g. protein sequence) in the derived database, annotated with the labels "regenerate data" and "data lost".
Primary nucleotide sequence databases
INSDC: International Nucleotide Sequence Database Collaboration
• GenBank (U.S.A.)
• DDBJ (Japan)
• ENA (Europe)
• Daily exchange of data
• Submission can be made to any INSDC database
Sequence information: how is sequence data processed?
• Reads
  • Sequence machine output (reads)
  • Quality scores
• Assembly
  • Fragmented sequence reads assembled into contigs
  • Mapped onto chromosomes
• Annotation
  • Functional information assigned to assembled regions
Sequence information: what type of sequence data is submitted?
• Raw data (reads)
  • Input information: sample, set-up, machine configuration
  • Output machine data: sequence traces, reads, quality scores
  • Metagenomic data: where originated
• Assembled sequences (assembly) and annotated sequence (annotation)
  • Interpreted information: assembly, mapping, functional annotation, sample information
European Nucleotide Archive: how does ENA store the data?
• Sources: large-scale sequencing projects, individual scientists, patent offices
• Raw data
  • Trace Archive: trace sequence reads from capillary sequencing instruments
  • Sequence Read Archive (SRA): intensity reads from next-generation sequencing instruments
• Assembled sequences and annotated sequence
  • ENA-Annotation (formerly EMBL-Bank)
INSDC Sequencing Projects: can data be traced to an institute?
• A project pulls information together and is used to track projects
• Submitters: institute or consortium
• Database records linked to a project: genomic, ESTs, shotgun, ...
• Results: assembly & annotation, complete genome / metagenome, comparative analysis
• Scope: single organism / metagenomic study
Nucleotides: European Nucleotide Archive (ENA)
The ENA has a three-tiered data architecture. It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).
Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).
Data Quality: is the data cleaned up?
Validation of submitted data:
• Automatic quality checks
• Some manual inspection and curation
Errors can still exist in sequence and annotation.
Database Structure: how is the data organized?
Data in ENA-Annotation is divided in 2 ways:
1) Data classes
  • Type of data, or
  • Methodology used to obtain data
  • Each entry belongs to one data class
2) Taxonomic divisions
  • Each entry belongs to one taxonomic division
Data Classes
CON   Constructed from sequence assemblies
EST   Expressed Sequence Tag (cDNA)
GSS   Genome Survey Sequence (high-throughput short sequence)
HTC   High-Throughput cDNA (unfinished)
HTG   High-Throughput Genome sequencing (unfinished)
MGA   Mass Genome Annotation
PAT   Patent sequences
STS   Sequence Tagged Site (short unique genomic sequences)
STD   Standard (high quality annotated sequence)
TPA   Third Party Annotation (re-annotated and re-assembled)
TSA   Transcriptome Shotgun Assembly (computational assembly)
WGS   Whole Genome Shotgun
Data Classes: EST (Expressed Sequence Tag)
• Single-pass reads of variable quality
• Need to search both EST and RNA data
Data Classes: PAT (Patent sequences)
• Often copies of existing entries
• Records not clean, even for taxonomy
Data Classes: STD (Standard)
• Bulk of entries
• Highest level of tracked information
Data Classes: TPA (Third Party Annotation)
• Derived data entries
• e.g. patch genomic and RNA data to construct complete coverage
• Must have publication
• Must show which entries data is derived from
Data Classes: TSA (Transcriptome Shotgun Assembly)
• Also derived data entries
• ESTs assembled to construct RNA
• Must show which EST/HTC entries data is derived from
Data Classes: WGS (Whole Genome Shotgun)
• Entries change over time (completely replaced)
• Raw WGS entries are assembled into contigs (CON entries)
Data Classes: how stable is the data?
Data is always changing:
• Assembly of sequences into larger fragments
• Deletion of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc.
Data Classes: how does assembly affect entries?
Example:
• WGS (Whole Genome Shotgun)
  • Fragments in separate entries
• CON (Constructed)
  • Join to make new CON entries
  • Old WGS entries archived
• STD (Standard)
  • Join into large STD entry (e.g. completed genome)
  • Add annotation
  • Old CON entries archived
Taxonomy
HUM   Human
MUS   Mouse
ROD   Rodent
MAM   Mammal
VRT   Vertebrate
FUN   Fungi
Other:
INV   Invertebrate
PLN   Plant
ENV   Environmental
PRO   Prokaryote
SYN   Synthetic
PHG   Phage
TGN   Transgenic
VIR   Viral
UNC   Unclassified
Taxonomy: ENV (Environmental)
• CAUTION: organism never isolated
• May blast sequence to assign putative organism
Taxonomy: caution
• CAUTION: not consistently handled, variable quality
• Transgenics may be from multiple organisms
Taxonomy
• Division primarily used by GenBank for PAT (patent) sequences
Taxonomy exclusion
Some species are excluded from certain taxonomic ranges:
• ROD (Rodent) excludes mouse
• MAM (Mammal) excludes human, mouse, rodent
• VRT (Vertebrate) excludes human, mouse, rodent, mammal
Applies to:
• ftp files and
• sequence search tools
But not:
• ENA Browser
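One way to picture the exclusion rule is as a "most specific division wins" lookup. The sketch below is an illustration only, using simplified lineage labels invented for this example rather than real NCBI taxonomy identifiers; `DIVISION_ORDER` and `assign_division` are hypothetical names.

```python
from typing import Optional, Set

# Most specific division first; each pair is (division code, lineage label).
# The labels are simplified stand-ins for real taxonomy nodes.
DIVISION_ORDER = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def assign_division(lineage: Set[str]) -> Optional[str]:
    """Return the first (most specific) division whose label is in the lineage."""
    for code, label in DIVISION_ORDER:
        if label in lineage:
            return code
    return None  # would fall into one of the other divisions (INV, PLN, ...)


mouse = {"vertebrate", "mammal", "rodent", "mouse"}
rat = {"vertebrate", "mammal", "rodent"}
human = {"vertebrate", "mammal", "human"}
zebrafish = {"vertebrate"}

for name, lineage in [("mouse", mouse), ("rat", rat), ("human", human), ("zebrafish", zebrafish)]:
    print(name, assign_division(lineage))
```

Under this model a mouse entry lands only in MUS, so a search restricted to ROD or MAM will not return it, which is the behaviour the slide warns about for ftp files and sequence search tools.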
Taxonomy Database: which taxonomy database does ENA use?
• All INSDC databases use the NCBI Taxonomy Browser
• Only organisms with sequence are represented
EBI Taxonomy Portal
• EBI-wide service maps resources into taxonomy service
• Culture collection – physical data, e.g. sample or stored version
• Biomaterial
• Specimen voucher – taxon representation, e.g. picture
Database Structure: how does data organization differ from GenBank?
ENA-Annotation
• Data classes (con, est, gss, htc, htg, pat, std, sts, ...) are crossed with taxonomic divisions (hum, mus, rod, mam, vrt, fun, pln, inv, ...)
• Data split into intersecting slices
• Reduces search set
• Ensures complete result set
• Example:
  • 'EST' set: large data set, includes all EST entries
  • 'Mouse' set: large data set, includes all mouse entries
  • 'Mouse' + 'EST' intersection: small data set, ensured complete set of mouse ESTs
GenBank
• Data classes and taxonomic divisions sit side by side in a single list
• Data split into parallel slices
• Large search sets
• Classes incomplete for taxonomy
• Taxonomy incomplete for classes
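The intersecting-slices idea can be illustrated with a small filter over toy records. This is a sketch under stated assumptions: the accessions are made up, and the `Entry` type and `select` function are hypothetical names, not an ENA interface.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    accession: str   # made-up identifiers for this illustration
    data_class: str  # e.g. EST, STD, WGS
    division: str    # e.g. MUS, HUM, INV


ENTRIES = [
    Entry("TOY0001", "EST", "MUS"),
    Entry("TOY0002", "EST", "HUM"),
    Entry("TOY0003", "STD", "MUS"),
    Entry("TOY0004", "EST", "MUS"),
    Entry("TOY0005", "WGS", "INV"),
]


def select(entries: List[Entry],
           data_class: Optional[str] = None,
           division: Optional[str] = None) -> List[Entry]:
    """Keep entries matching the given class and/or division (None = no filter)."""
    return [
        e for e in entries
        if (data_class is None or e.data_class == data_class)
        and (division is None or e.division == division)
    ]


est_set = select(ENTRIES, data_class="EST")                   # all ESTs
mouse_set = select(ENTRIES, division="MUS")                   # all mouse entries
mouse_ests = select(ENTRIES, data_class="EST", division="MUS")  # the intersection

print(len(est_set), len(mouse_set), len(mouse_ests))
```

Because every entry carries exactly one class and one division, the intersection is both smaller than either slice and guaranteed to contain every mouse EST.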
Data – Protein Sequence
• UniProt databases:
  • UniProtKB: human curated and automatic translation sections
  • UniRef: non-redundant sequence clusters
  • UniParc: non-identical sequence archive
• Sequence from structures:
  • PDB
  • SGT
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA
  • Alternative splicing: ASD, ASTD
  • Completed proteomes: Ensembl, Integr8
  • Protein Interactions: IntAct
  • Patent Proteins: EPO, JPO, KIPO and USPTO
Sequence Databases: GenBank
Protein sequence: UniProt
Annotation
• Manual curation
  • Literature-based annotation
  • Sequence analysis
• Automated annotation
  • Signal prediction
  • Transmembrane prediction
  • Other predictions
Some data sources for annotation
• GO: functional info
• InterPro: protein families and domains (InterPro classification)
• PRIDE: protein identification data
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
• Protein classification
UniProt data sources and data flow
• Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, Sub/Peptide Data, FlyBase, WormBase, Patent Data
• UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences.
• UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium's activities. It provides a richly curated protein database.
• UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins.
• UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only).
• UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
• Proteome Sets, IPI
The Two Sides of UniProtKB
• UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)
• UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
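In UniProt FASTA downloads the two sides can be told apart by the header prefix: `sp|` marks a Swiss-Prot (reviewed) entry and `tr|` a TrEMBL (unreviewed) one. A minimal sketch follows; the `review_status` function is a hypothetical helper and the two headers are made-up placeholders, not real entries.

```python
def review_status(fasta_header: str) -> str:
    """Classify a UniProt-style FASTA header by its database prefix."""
    prefix = fasta_header.lstrip(">").split("|", 1)[0]
    if prefix == "sp":
        return "reviewed (UniProtKB/Swiss-Prot)"
    if prefix == "tr":
        return "unreviewed (UniProtKB/TrEMBL)"
    return "unknown"


# Placeholder headers invented for this example.
headers = [
    ">sp|ACC0001|NAME1_EXAMPLE Example reviewed protein",
    ">tr|ACC0002|ACC0002_EXAMPLE Example unreviewed protein",
]
for header in headers:
    print(review_status(header))
```

A check like this is a quick way to restrict a result list to the manually annotated section before reading the annotation.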
Databases
• There are many databases and they are getting bigger
• Efficient searching involves knowing what is stored in them
• Don't assume that everything in the databases is correct
• Nothing is constant but change:
  • Deletions, sequence modifications
  • Daily updates, identifier changes, etc.
Searching databases
• What is the difference between a primary and secondary database?
• What methods of searching databases do you know of?
• What is the best protein sequence database to search (specific part)?