
Databases indexation

Databases indexation. Laurent Falquet, EPFL, March 2005. Swiss Institute of Bioinformatics, Swiss EMBnet node. Overview: data access concept (sequential, direct); indexing (EMBOSS, Fetch, other); BLAST (why indexing?, formatdb, parsing output); Excel import/export (tab delimited, comma delimited).


Presentation Transcript


  1. Databases indexation. Laurent Falquet, EPFL, March 2005. Swiss Institute of Bioinformatics, Swiss EMBnet node.

  2. Overview. Data access concept: sequential, direct. Indexing: EMBOSS, Fetch, other. BLAST: why indexing?, formatdb, parsing output. Excel import/export: tab delimited, comma delimited.

  3. Data access: sequential vs direct. Sequential access: access times vary from very short to very long. Direct access (track, sector, head): access times show very small variations.

  4. Flat files = sequential access. Indexing = simulated direct access. Similar concept for databases. Example flat file:

          >seq1
          cgatgtcatgtg
          >seq2
          cgatcgtagctgtagctgtag
          >seq3
          catgtgcatgcgacgt
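
The idea of "simulated direct access" can be sketched in a few lines of Python. This is only an illustration, not one of the tools presented in the deck: the function names `build_fasta_index` and `fetch_sequence` are hypothetical, and the flat file is held in memory so the example is self-contained. The index stores the byte offset of each header line, so a lookup becomes one `seek` instead of a scan from the top of the file.

```python
import io

def build_fasta_index(handle):
    """Scan the flat file once and map each sequence ID to the byte offset of its header."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index

def fetch_sequence(handle, index, seq_id):
    """Jump straight to one record (direct access) and return its sequence."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)

# The three-sequence flat file shown on the slide, kept in memory for the example.
flat_file = io.BytesIO(
    b">seq1\ncgatgtcatgtg\n"
    b">seq2\ncgatcgtagctgtagctgtag\n"
    b">seq3\ncatgtgcatgcgacgt\n"
)
index = build_fasta_index(flat_file)
print(index)
print(fetch_sequence(flat_file, index, "seq3"))
```

With a real file the same functions work on a handle opened in binary mode (`open(path, "rb")`); the index would normally be saved to disk so that the scan is paid only once.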

  5. Tools. EMBOSS: dbiflat, dbifasta, dbiblast (index creation); seqret, seqretsplit, entret (retrieval). Other examples: SRS (Icarus language), http://srs.ebi.ac.uk, http://www.lionbioscience.com/; indexer & fetch (warning: local SIB tool); relational databases (MySQL, Oracle…).

  6. EMBOSS: how to index? Where is your file? What is the format? Where should the indices be? Where is the emboss.default file (or .embossrc)? Other EMBOSS tools: textsearch, whichdb.

  7. EMBOSS example. Input file and directory: ~/embossidx/ECOLI.dat

          cd embossidx

      Index creation:

          dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC

      Generates 4 files: acnum.hit, acnum.trg, division.lkp, entrynam.idx. Don't forget to modify ~/.embossrc.

  8. Example of queries:

          seqret ecoli:thio_ecoli
          seqret ecoli:P00274
          entret ecoli:thio_ecoli

      and even:

          seqret 'ecoli:*_ECOLI'

      .embossrc:

          set emboss_filter 1
          # Ecoli
          DB ecoli [
            type: P
            comment: "E.coli proteome"
            method: emblcd
            format: swiss
            dir: "~/embossidx"
            file: "ECOLI.dat"
            release: "1.0"
            indexdir: "~/embossidx"
          ]

  9. Indexer & fetch. Warning: this is a local SIB tool! Input file and directory: ~/embossidx/ECOLI.dat

          cd embossidx

      Index creation:

          indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx

      Generates 1 file: ecoli.idx. Don't forget to modify the config file.
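
The three regular expressions on the indexer command line (entry header `^ID`, entry terminator `^//`, key capture `^ID\s+(\S+)`) suggest how a pattern-driven indexer might work. The Python sketch below is a guess at the general technique, not the SIB tool itself: `build_index` and `fetch_entry` are hypothetical names, the two entries are toy data, and lowercasing the keys is an assumption made so that a lowercase query finds an uppercase ID.

```python
import io
import re

HEADER = re.compile(rb"^ID")        # a line that starts an entry
TERMINATOR = re.compile(rb"^//")    # a line that ends an entry
KEY = re.compile(rb"^ID\s+(\S+)")   # captures the entry name used as index key

def build_index(handle):
    """Return {key: (start offset, length in bytes)} for every entry in the flat file."""
    index = {}
    key = start = None
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if key is None and HEADER.match(line):
            found = KEY.match(line)
            if found:
                key = found.group(1).decode("ascii").lower()  # assumption: case-insensitive keys
                start = offset
        elif key is not None and TERMINATOR.match(line):
            index[key] = (start, handle.tell() - start)
            key = None
    return index

def fetch_entry(handle, index, key):
    """Read exactly one entry using its stored offset and length."""
    start, length = index[key.lower()]
    handle.seek(start)
    return handle.read(length)

# Toy flat file with two minimal entries (not real Swiss-Prot records).
flat_file = io.BytesIO(
    b"ID   THIO_ECOLI\nDE   toy entry one\n//\n"
    b"ID   ACCA_ECOLI\nDE   toy entry two\n//\n"
)
index = build_index(flat_file)
entry = fetch_entry(flat_file, index, "acca_ecoli")
print(index)
print(entry.decode("ascii"))
```

Storing the length next to the offset lets the fetch step read one entry with a single `read` call, without looking for the terminator again.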

  10. Config file: fetch.conf. Example of queries:

          fetch -c fetch.conf ecoli:thio_ecoli
          fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

      fetch.conf:

          #dbkey format indexfile datafile
          ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat
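
To make the config layout and the query syntax concrete, here is a small Python sketch that reads a four-column config like the one on the slide and splits a `db:entry[start..end]` query into its parts. It does not reproduce the fetch tool: the function names are hypothetical, and treating the range as 1-based and inclusive is an assumption.

```python
import re

# db key, entry name, and an optional [start..end] range
QUERY = re.compile(r"^(\w+):([^\[\]\s]+)(?:\[(\d+)\.\.(\d+)\])?$")

def parse_conf(text):
    """Parse lines of 'dbkey format indexfile datafile'; '#' starts a comment line."""
    databases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dbkey, fmt, indexfile, datafile = line.split()
        databases[dbkey] = {"format": fmt, "indexfile": indexfile, "datafile": datafile}
    return databases

def parse_query(query):
    """Split 'db:entry' or 'db:entry[start..end]' into (db, entry, range or None)."""
    found = QUERY.match(query)
    if found is None:
        raise ValueError(f"unrecognised query: {query!r}")
    db, entry, start, end = found.groups()
    span = (int(start), int(end)) if start is not None else None
    return db, entry, span

def subsequence(sequence, span):
    """Assumption: the range is 1-based and inclusive on both ends."""
    start, end = span
    return sequence[start - 1:end]

conf = parse_conf(
    "#dbkey format indexfile datafile\n"
    "ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat\n"
)
print(conf["ecoli"]["indexfile"])
print(parse_query("ecoli:thio_ecoli[20..50]"))
print(subsequence("ABCDEFGHIJ", (2, 4)))
```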

  11. BLAST. Maintained at NCBI. Source distributed freely with several accessory tools: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz. Requires compilation to install on your local computer. blastall contains: blastp, blastn, blastx, tblastn, tblastx. Other tools: blastpgp, megablast, formatdb.

  12. Available Blast programs (program: query VS database). blastp: protein VS protein. blastn: nucleotide VS nucleotide. blastx: nucleotide (translated to protein) VS protein. tblastn: protein VS nucleotide (translated to protein). tblastx: nucleotide (translated to protein) VS nucleotide (translated to protein).
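
The pairing of query type and database type with a program can be written as a small lookup table. The Python sketch below is only a convenience illustration; `PROGRAMS` and `candidate_programs` are hypothetical names, not part of the BLAST distribution.

```python
# (query type, database type) -> programs that accept this combination
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("nucleotide", "protein"): ["blastx"],                # query is translated
    ("protein", "nucleotide"): ["tblastn"],               # database is translated
}

def candidate_programs(query_type, database_type):
    """Return the program names usable for a query/database type pair."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type} vs {database_type}") from None

print(candidate_programs("nucleotide", "protein"))
print(candidate_programs("nucleotide", "nucleotide"))
```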

  13. What makes BLAST so fast? Indexing all words of 3 aa or 11 bp in the sequence database. Searching the query for all words with a score > T. Searching the indexed database for all perfect matches. Trying to align matches that are on the same diagonal.

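  14. Indexing for Blast (1). A substitution matrix is used to compute the word scores. Each query word (e.g. REL, LKP) is compared with the list of all possible words with 3 amino acid residues (8000: AAA, AAC, AAD, ..., YYY). Words with a score < T are discarded; words with a score > T (e.g. RSL for the query word REL) are kept. Result: the list of words matching the query with a score > T (ACT, RSL, TVF, ...).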
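
The word-scoring step can be sketched in Python as follows. Loud assumption: the scoring function here is a toy (5 for identity, 2 for a handful of hand-picked similar pairs, -2 otherwise), not a real substitution matrix such as BLOSUM62, and the threshold value 11 is chosen only to make the toy example interesting. The names `residue_score`, `neighborhood` and `query_word_list` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy similarity table (assumption, NOT a real substitution matrix).
SIMILAR = {frozenset(pair) for pair in ("ED", "EQ", "RK", "LI", "LV", "LM")}

def residue_score(a, b):
    """Toy score: identity 5, listed similar pair 2, anything else -2."""
    if a == b:
        return 5
    if frozenset((a, b)) in SIMILAR:
        return 2
    return -2

def word_score(word_a, word_b):
    return sum(residue_score(a, b) for a, b in zip(word_a, word_b))

def neighborhood(query_word, threshold):
    """All words of the same length scoring above the threshold against the query word."""
    candidates = ("".join(w) for w in product(AMINO_ACIDS, repeat=len(query_word)))
    return sorted(w for w in candidates if word_score(query_word, w) > threshold)

def query_word_list(query, k=3, threshold=11):
    """Map every accepted word to the query positions it was derived from."""
    words = {}
    for pos in range(len(query) - k + 1):
        for word in neighborhood(query[pos:pos + k], threshold):
            words.setdefault(word, []).append(pos)
    return words

print(len(AMINO_ACIDS) ** 3)       # number of possible 3-residue words
print(neighborhood("REL", 11))
```

With this toy scoring, only the word itself and its single conservative substitutions pass the threshold; a real matrix gives a graded list, but the enumeration over all 8000 words is the same.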

  15. Indexing for Blast (2). The database sequences are searched for exact matches to the list of words matching the query with a score > T (ACT, RSL, TVF, ...). Result: the list of sequences containing words similar to the query (hits).
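
The exact-match search against an indexed database can be sketched as a dictionary lookup. This Python example is a minimal illustration with made-up toy sequences; `index_database` and `find_hits` are hypothetical names, and the query word list is written out by hand.

```python
from collections import defaultdict

def index_database(sequences, k=3):
    """Map every k-letter word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def find_hits(index, query_words):
    """query_words maps word -> query positions; returns (seq id, query pos, db pos) tuples."""
    hits = []
    for word, query_positions in query_words.items():
        for seq_id, db_pos in index.get(word, []):
            for query_pos in query_positions:
                hits.append((seq_id, query_pos, db_pos))
    return sorted(hits)

# Toy database and a hand-written list of query words (illustrative data only).
database = {"db1": "MACTRSLKK", "db2": "GGTVFGG", "db3": "PPPPPP"}
query_words = {"ACT": [0], "RSL": [3], "TVF": [6]}

index = index_database(database)
hits = find_hits(index, query_words)
print(hits)
print(sorted({seq_id for seq_id, _, _ in hits}))
```

The database is indexed once, and every later query costs only one dictionary lookup per word, which is where the speed-up comes from.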

  16. Indexing for Blast (3). Two diagrams, each plotting the query against a database sequence, with the distance A marked. Ungapped extension if 2 "hits" are on the same diagonal but at a distance less than A. Extension using dynamic programming: limited to a restricted region, and limited through a score drop-off threshold.
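
The two-hit rule and the drop-off idea can be sketched in Python. Assumptions, stated loudly: the scoring is a toy match/mismatch scheme rather than a substitution matrix, the drop-off is applied here to a simple ungapped extension (the dynamic-programming extension is not implemented), and the names `two_hit_seeds` and `extend_right` are hypothetical.

```python
from collections import defaultdict

def two_hit_seeds(hits, window):
    """hits: (query pos, db pos) pairs for one database sequence.
    Return (diagonal, first query pos, second query pos) for consecutive hits
    that lie on the same diagonal and are closer than `window`."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first < window:
                seeds.append((diagonal, first, second))
    return sorted(seeds)

def extend_right(query, subject, q_start, s_start, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension to the right; stop once the running score has fallen
    x_drop below the best score seen. Returns (best score, length at best score)."""
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break
    return best, best_len

print(two_hit_seeds([(0, 5), (4, 9), (10, 2), (30, 35)], window=10))
print(extend_right("ACGTACGTTT", "ACGTACGAAA", 0, 0))
```

Requiring two nearby hits on one diagonal before extending discards most isolated chance matches, and the drop-off threshold stops an extension soon after it leaves the similar region.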

  17. BLAST indexing with formatdb. mydb.seq must contain sequences in FASTA format.

          formatdb -i mydb.seq -p T -n mydb

      Generates 3 files: mydb.psq, mydb.pin, mydb.phr. Then start a Blast:

          blastall -p blastp -d mydb -i myseq (-optional parameters)

  18. Blast local vs remote. blastall: executed locally; slow; no need to transfer the db. blastall.remote: executed remotely; fast; requires special privileges and a db transfer.

  19. Multiple Blasts? 1 seq vs db seq: 1 FASTA seq as input. db seq vs db seq: several single-sequence FASTA files as input, or 1 multiple-FASTA file as input. Possibility to export results as XML. Use Perl to automate the queries and parse the output.
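
Automating many queries mostly comes down to splitting a multiple-FASTA input into single records and building one command line per record. The slide recommends Perl for this; the sketch below uses Python purely for illustration. The function names and the output file naming scheme are hypothetical, the commands are only built (not executed), and the only blastall options used are `-p`, `-d`, `-i` and `-o`.

```python
def split_fasta(text):
    """Yield (identifier, record text) for each entry of a multiple-FASTA string."""
    header, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "\n".join(lines) + "\n"
            fields = line[1:].split()
            header, lines = (fields[0] if fields else "unnamed"), [line]
        elif header is not None and line.strip():
            lines.append(line)
    if header is not None:
        yield header, "\n".join(lines) + "\n"

def blast_commands(fasta_text, database, program="blastp"):
    """Build one blastall command per sequence (hypothetical file naming scheme)."""
    commands = []
    for seq_id, _record in split_fasta(fasta_text):
        commands.append([
            "blastall", "-p", program, "-d", database,
            "-i", f"{seq_id}.fasta", "-o", f"{seq_id}.blast",
        ])
    return commands

multi_fasta = ">seq1 first\nMKV\nLLA\n>seq2 second\nGGT\n"
records = list(split_fasta(multi_fasta))
print(records)
for command in blast_commands(multi_fasta, "mydb"):
    print(" ".join(command))
```

In a real run each record would be written to its input file and each command passed to `subprocess.run`; both steps are left out so that the sketch has no side effects.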

  20. Parsing Blast output

          BLASTP 2.2.10 [Oct-19-2004]

          Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

          Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
                   (325 letters)

          Database: ecoli_blast
                     4339 sequences; 1,373,039 total letters

          Searching.........done

                                                                           Score    E
          Sequences producing significant alignments:                      (bits) Value

          ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

  21. Parsing Blast output (2)

          >ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
                    Length = 318

           Score = 266 bits (681), Expect = 1e-72
           Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

          Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
                     L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
          Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

          Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
                     +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
          Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

          Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
                     E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
          Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

          Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
                      EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
          Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

          Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
                      AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
          Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

          Query: 302 VQQRYEKYKAIG 313
                       +RY++  + G
          Sbjct: 305 KNRRYQRLMSYG 316
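
The statistics lines of an alignment block follow a regular pattern, so they can be pulled out with regular expressions. The Python sketch below assumes the plain-text report layout shown on the slide and handles a single alignment block; `parse_hsp_stats` is a hypothetical helper, not a library function. For real work a maintained parser (such as the BioPerl module used in the deck) is the safer choice.

```python
import re

SCORE_LINE = re.compile(r"Score\s*=\s*([\d.]+)\s+bits\s+\((\d+)\),\s+Expect\s*=\s*(\S+)")
COUNT = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")

def parse_hsp_stats(block_text):
    """Extract score, E-value and match counts from one alignment block."""
    score = SCORE_LINE.search(block_text)
    if score is None:
        return None
    expect = score.group(3).rstrip(",")
    if expect.startswith("e"):  # some reports abbreviate 1e-100 as e-100
        expect = "1" + expect
    stats = {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": float(expect),
    }
    for name, matched, total in COUNT.findall(block_text):
        stats[name.lower()] = (int(matched), int(total))
    return stats

sample = """>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
          Length = 318

 Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""
stats = parse_hsp_stats(sample)
print(stats)
```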

  22. Parsing Blast output (3). With BioPerl:

          #!/usr/local/bin/perl
          use strict;
          use warnings;
          use Bio::SearchIO;

          # Open the BLAST report given as first command-line argument.
          my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                                -file   => $ARGV[0]);

          # Column titles of the tab-delimited output.
          print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

          # One result per query, one hit per database sequence, one HSP per alignment.
          while (my $result = $blast_report->next_result) {
              print $result->query_name(), "\t", $result->query_description(), "\n";
              while (my $hit = $result->next_hit()) {
                  print "\t\t", $hit->name(), "\t", $hit->description();
                  while (my $hsp = $hit->next_hsp()) {
                      print "\t", $hsp->evalue(), "\t", $hsp->score();
                  }
                  print "\n";
              }
          }
          exit 0;

  23. MS-Excel import/export. Excel can import: tab delimited, comma delimited. Excel can export: tab delimited, space delimited. Example table:

          AC/ID        desc                           score   e-value
          THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
          THIO_HUMAN   thioredoxin Homo sapiens       120     0.001

  24. MS-Excel import/export. Tab delimited file: \t delimits the columns, \n delimits the lines. The optional first line contains the column titles. Example:

          AC/ID\tdesc\tscore\te-value\n
          THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
          THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n
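
A tab-delimited file like this one can be written and read back with Python's standard `csv` module; the sketch below keeps everything in memory and uses the three rows from the example table. The variable names are illustrative only.

```python
import csv
import io

rows = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

# Write: \t between columns, \n at the end of each line.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(repr(tsv_text.splitlines()[0]))

# Read back: every line becomes a list of column values.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed[1])
```

To write a file that Excel can open, replace the `StringIO` buffer with `open("table.txt", "w", newline="")`; the file name is only an example.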

  25. MS-Excel import/export. Comma delimited file: , delimits the columns, each value is surrounded by ' ', and \n delimits the lines. The optional first line contains the column titles. Example:

          'AC/ID','desc','score','e-value'\n
          'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
          'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n
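
The quoted, comma-delimited layout can be produced with Python's standard `csv` module by choosing the single quote as quote character and quoting every value. This is a minimal in-memory sketch; the variable names are illustrative, and the single-quote convention simply follows the slide.

```python
import csv
import io

rows = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

# Write: comma between columns, every value wrapped in single quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar="'",
                    quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text, end="")

# Read back with the same delimiter and quote character.
parsed = list(csv.reader(io.StringIO(csv_text), delimiter=",", quotechar="'"))
print(parsed[2])
```

Quoting every value keeps a comma inside a description from being mistaken for a column separator.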
