This paper explores the concept of sequential and direct data access in databases, focusing on indexing and tools like EMBOSS and BLAST. It provides examples and instructions for index creation and querying.
Databases indexation
Laurent Falquet, EPFL, March 2005
Swiss Institute of Bioinformatics, Swiss EMBnet node

Overview
- Data access concept: sequential vs direct
- Indexing
  - EMBOSS
  - Fetch
  - Other
- BLAST
  - Why indexing?
  - formatdb
  - Parsing output
- Excel import/export
  - Tab delimited
  - Comma delimited

Data access: sequential vs direct
- Sequential access: access times vary from very short to very long
- Direct access (disk track, sector, head): very small variations in access time

Similar concept for databases
- Flat files = sequential access
- Indexing = simulated direct access

Example flat file (FASTA):
  >seq1
  cgatgtcatgtg
  >seq2
  cgatcgtagctgtagctgtag
  >seq3
  catgtgcatgcgacgt

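A minimal Python sketch of "indexing = simulated direct access" (not part of the original slides). It scans a FASTA flat file once, records the byte offset of every header line, and then jumps straight to one entry with `seek`. The function names `build_fasta_index` and `fetch_sequence` are hypothetical, and the index is kept in memory rather than written to disk as real indexers do.

```python
import io


def build_fasta_index(handle):
    """Scan a binary FASTA stream once; return {sequence id: byte offset of its header}."""
    index = {}
    while True:
        offset = handle.tell()          # position of the line about to be read
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:                  # ignore headers without an identifier
                index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(handle, index, seq_id):
    """Jump directly to one entry and return its sequence as a string."""
    handle.seek(index[seq_id])          # direct access: no scan of earlier entries
    handle.readline()                   # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):       # next entry starts here
            break
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


if __name__ == "__main__":
    flat_file = io.BytesIO(
        b">seq1\ncgatgtcatgtg\n>seq2\ncgatcgtagctgtagctgtag\n>seq3\ncatgtgcatgcgacgt\n"
    )
    idx = build_fasta_index(flat_file)
    print(idx)
    print(fetch_sequence(flat_file, idx, "seq3"))
```

The sequential cost is paid once, at index creation; every later lookup is one dictionary access plus one seek.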
Tools
- EMBOSS
  - Indexing: dbiflat, dbifasta, dbiblast
  - Retrieval: seqret, seqretsplit, entret
- Other examples
  - SRS (Icarus language): http://srs.ebi.ac.uk, http://www.lionbioscience.com/
  - indexer & fetch (warning: local SIB tool)
  - Relational databases (MySQL, Oracle…)

EMBOSS: how to index?
- Where is your file?
- What is the format?
- Where should the indices be stored?
- Where is the emboss.default file (or your own ~/.embossrc)?

Other EMBOSS tools: textsearch, whichdb

EMBOSS example
Input file and directory:
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation:
  dbiflat -idformat swiss -dbname ECOLI.dat -directory . -release 1.0 -date 12/02/05 -fields AC
Generates 4 files:
  acnum.hit  acnum.trg  division.lkp  entrynam.idx
Don't forget to modify ~/.embossrc

Example of queries
  seqret ecoli:thio_ecoli
  seqret ecoli:P00274
  entret ecoli:thio_ecoli
and even
  seqret 'ecoli:*_ECOLI'

.embossrc
  set emboss_filter 1
  # Ecoli
  DB ecoli [
    type: P
    comment: "E.coli proteome"
    method: emblcd
    format: swiss
    dir: "~/embossidx"
    file: "ECOLI.dat"
    release: "1.0"
    indexdir: "~/embossidx"
  ]

Indexer & fetch
Warning: this is a local SIB tool!
Input file and directory:
  ~/embossidx/ECOLI.dat
  cd embossidx
Index creation:
  indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
Generates 1 file:
  ecoli.idx
Don't forget to modify the config file

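The regular expressions on the indexer command line suggest a record-based index: an entry starts at a line matching `^ID`, ends at a line matching `^//`, and is keyed by the first capture group of `^ID\s+(\S+)`. The Python sketch below illustrates that idea only. It is not the SIB tool: the function names are hypothetical, the (offset, length) layout is an assumption, and lowercasing the keys is an assumption made so that a query such as `thio_ecoli` finds `THIO_ECOLI`.

```python
import io
import re


def index_records(handle, header=rb"^ID", terminator=rb"^//", key=rb"^ID\s+(\S+)"):
    """Return {lowercased key: (byte offset, byte length)} for each record in a flat file."""
    header_re, term_re, key_re = (re.compile(p) for p in (header, terminator, key))
    index = {}
    current_key, start = None, None
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if current_key is None and header_re.search(line):
            match = key_re.search(line)
            if match:                                   # record opens here
                current_key = match.group(1).decode("ascii").lower()
                start = offset
        elif current_key is not None and term_re.search(line):
            # record closes here; the terminator line is included in the length
            index[current_key] = (start, handle.tell() - start)
            current_key = None
    return index


def fetch_record(handle, index, key):
    """Read one whole record with a single seek and a single read."""
    start, length = index[key.lower()]
    handle.seek(start)
    return handle.read(length).decode("ascii")


if __name__ == "__main__":
    # Two heavily shortened, made-up Swiss-Prot-style entries
    data = io.BytesIO(b"ID   THIO_ECOLI\nAC   P00274;\n//\nID   THIO_HUMAN\nAC   P10599;\n//\n")
    idx = index_records(data)
    print(fetch_record(data, idx, "thio_human"))
```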
Example of queries
  fetch -c fetch.conf ecoli:thio_ecoli
  fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'

Config file: fetch.conf
  #dbkey  format  indexfile              datafile
  ecoli   sp      ~/embossidx/ecoli.idx  ~/embossidx/ECOLI.dat

BLAST
- Maintained at NCBI
- Source distributed freely with several accessory tools: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
- Requires compilation to install on your local computer
- blastall contains: blastp, blastn, blastx, tblastn, tblastx
- Other tools: blastpgp, megablast, formatdb

Available Blast programs

  Program   Query                               Database
  blastp    protein                             protein
  blastn    nucleotide                          nucleotide
  blastx    nucleotide (translated to protein)  protein
  tblastn   protein                             nucleotide (translated to protein)
  tblastx   nucleotide (translated to protein)  nucleotide (translated to protein)

What makes BLAST so fast?
- Index all words of 3 aa or 11 bp in the sequence database
- For each word of the query, list all words that reach a score > T
- Search the indexed database for all perfect matches to these words
- Try to align matches that are on the same diagonal

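Indexing for Blast (1)
- A substitution matrix is used to compute the word scores
- List of all possible words with 3 amino acid residues (8000): AAA, AAC, AAD, ..., YYY
- Each query word (REL in the figure) is scored against these words: those with a score > T are kept (RSL in the figure), those with a score < T are dropped
- Result: list of words matching the query with a score > T (ACT, RSL, TVF, ...)
[Figure labels: Query, REL, RSL, LKP, score > T, score < T]
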
Indexing for Blast (2)
- Input: list of words matching the query with a score > T (ACT, RSL, TVF, ...)
- Search the database sequences for exact matches to these words
- Result: list of sequences containing words similar to the query (hits)

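The two indexing slides above can be sketched in Python as follows (an illustration, not the NCBI implementation). The scoring function `toy_score` and its `GROUPS` are hypothetical stand-ins for a real substitution matrix such as BLOSUM62, the threshold value is arbitrary, and all function names are made up for this example.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups: a toy replacement for a real substitution matrix
GROUPS = ["DE", "KR", "ILVM", "ST", "FYW", "NQ"]


def toy_score(a, b):
    """+4 for identical residues, +1 within a group, -2 otherwise (toy values)."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def word_score(w1, w2):
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def index_database(sequences, w=3):
    """Map every word of length w to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def neighborhood(query, w=3, threshold=5):
    """Map each of the 20**w possible words scoring > threshold to its query offsets."""
    words = defaultdict(list)
    for q_off in range(len(query) - w + 1):
        q_word = query[q_off:q_off + w]
        for letters in product(AMINO_ACIDS, repeat=w):
            candidate = "".join(letters)
            if word_score(q_word, candidate) > threshold:
                words[candidate].append(q_off)
    return words


def find_hits(query, index, w=3, threshold=5):
    """Exact lookups of neighborhood words; returns sorted (seq id, query off, db off)."""
    hits = []
    for word, q_offsets in neighborhood(query, w, threshold).items():
        for seq_id, db_off in index.get(word, []):
            hits.extend((seq_id, q_off, db_off) for q_off in q_offsets)
    return sorted(hits)


if __name__ == "__main__":
    db = {"s1": "MKRSLAA", "s2": "GGGTVF"}
    print(find_hits("REL", index_database(db)))
```

The database is indexed once and reused for every query; only the neighborhood list depends on the query.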
Indexing for Blast (3)
- Ungapped extension if 2 "hits" are on the same diagonal but at a distance less than A
- Extension using dynamic programming:
  - limited to a restricted region
  - limited through a score drop-off threshold
[Figure axes: Query, Database sequence; A = distance between the two hits]

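A simplified Python sketch of the two-hit rule and of a score drop-off, under stated assumptions: hits are (query offset, database offset) pairs for one database sequence, overlapping hits are not filtered out as a real implementation would do, and only the rightward ungapped extension is shown (the gapped, dynamic-programming step is not). The names `two_hit_seeds` and `extend_ungapped` are hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Return (diagonal, first query offset, second query offset) for consecutive
    hits that lie on the same diagonal at a distance below max_distance (the slide's A)."""
    by_diagonal = defaultdict(list)
    for q_off, db_off in hits:
        by_diagonal[db_off - q_off].append(q_off)   # same diagonal = same difference
    seeds = []
    for diagonal, offsets in by_diagonal.items():
        offsets.sort()
        for first, second in zip(offsets, offsets[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


def extend_ungapped(query, subject, q_start, s_start, score, x_drop):
    """Extend to the right without gaps; stop once the running score falls more
    than x_drop below the best score seen. Returns (best score, best length)."""
    best = running = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score(query[q_start + i], subject[s_start + i])
        if running > best:
            best, best_len = running, i + 1
        elif best - running > x_drop:
            break
        i += 1
    return best, best_len


if __name__ == "__main__":
    print(two_hit_seeds([(0, 5), (10, 15), (40, 45), (3, 20)], max_distance=20))
    match = lambda a, b: 1 if a == b else -1        # toy scoring, not a real matrix
    print(extend_ungapped("AAAAXXXXAA", "AAAAYYYYAA", 0, 0, match, x_drop=2))
```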
BLAST indexing with formatdb
mydb.seq must contain sequences in FASTA format
  formatdb -i mydb.seq -p T -n mydb
Generates 3 files:
  mydb.psq  mydb.pin  mydb.phr
Then start a Blast:
  blastall -p blastp -d mydb -i myseq (-optional parameters)

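Before launching blastall it can help to check that the index files are really there. The Python sketch below lists the three files named on the slide for a protein database; the nucleotide extensions (.nsq, .nin, .nhr) come from general formatdb behaviour rather than from the slide, a single-volume database is assumed, and both function names are hypothetical.

```python
from pathlib import Path


def expected_index_files(name, protein=True):
    """File names formatdb is expected to create for a single-volume database."""
    prefix = "p" if protein else "n"
    return [f"{name}.{prefix}{ext}" for ext in ("sq", "in", "hr")]


def missing_index_files(name, directory=".", protein=True):
    """Return the expected index files that do not exist in the given directory."""
    base = Path(directory)
    return [f for f in expected_index_files(name, protein) if not (base / f).exists()]


if __name__ == "__main__":
    print(expected_index_files("mydb"))
    print(missing_index_files("mydb"))
```

An empty list from `missing_index_files` means the database looks ready for `blastall -d mydb`.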
Blast local vs remote
- blastall
  - Executed locally
  - Slow
  - No need to transfer the db
- blastall.remote
  - Executed remotely
  - Fast
  - Requires special privileges and db transfer

Multiple Blasts?
- 1 seq vs db seq
  - 1 FASTA seq as input
- db seq vs db seq
  - Several single FASTA seq files as input, or
  - 1 multiple FASTA seq file as input
- Possibility to export results as XML
- Use Perl to automate the queries and parse the output

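The slide suggests Perl for automation; the same idea is sketched here in Python to stay consistent with the other examples in this transcript. The sketch only builds the command lines, one per query file. The `-p`, `-d` and `-i` options appear on the formatdb slide; `-o` (output file) and `-m 7` (XML output) are legacy blastall options that are not shown in the slides, so check them against your local `blastall` help. Function names and output file naming are hypothetical.

```python
from pathlib import Path


def blastall_command(query_file, database, program="blastp", xml=True):
    """Build the argument list for one blastall run on one FASTA query file."""
    output = Path(query_file).with_suffix(".xml" if xml else ".out")
    command = ["blastall", "-p", program, "-d", database,
               "-i", str(query_file), "-o", str(output)]
    if xml:
        command += ["-m", "7"]      # assumption: -m 7 selects XML output
    return command


def plan_runs(query_files, database, program="blastp"):
    """One command per single-sequence FASTA file."""
    return [blastall_command(q, database, program) for q in query_files]


if __name__ == "__main__":
    for cmd in plan_runs(["myseq1.fasta", "myseq2.fasta"], "mydb"):
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```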
Parsing Blast output

  BLASTP 2.2.10 [Oct-19-2004]

  Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
  Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
  "Gapped BLAST and PSI-BLAST: a new generation of protein database search
  programs", Nucleic Acids Res. 25:3389-3402.

  Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
  transferase subunit alpha (EC 6.4.1.2).
           (325 letters)

  Database: ecoli_blast
             4339 sequences; 1,373,039 total letters

  Searching.........done

                                                                   Score    E
  Sequences producing significant alignments:                      (bits) Value

  ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

Parsing Blast output (2)

  >ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase
  subunit alpha (EC 6.4.1.2).
            Length = 318

   Score = 266 bits (681), Expect = 1e-72
   Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

  Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
             L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
  Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

  Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
             +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
  Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

  Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
             E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
  Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

  Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
              EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
  Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

  Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
              AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
  Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

  Query: 302 VQQRYEKYKAIG 313
               +RY++  + G
  Sbjct: 305 KNRRYQRLMSYG 316

Parsing Blast output (3)
With BioPerl:

  #!/usr/local/bin/perl
  use strict;
  use warnings;
  use Bio::SearchIO;

  # Open the BLAST report given as first command-line argument
  my $blast_report = Bio::SearchIO->new(-format => 'blast',
                                        -file   => $ARGV[0]);

  print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

  # One result per query, one hit per database sequence, one HSP per alignment
  while (my $result = $blast_report->next_result) {
      print $result->query_name(), "\t", $result->query_description(), "\n";
      while (my $hit = $result->next_hit()) {
          print "\t\t", $hit->name(), "\t", $hit->description();
          while (my $hsp = $hit->next_hsp()) {
              print "\t", $hsp->evalue(), "\t", $hsp->score();
          }
          print "\n";
      }
  }
  exit 0;

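For comparison with the BioPerl script above, here is a dependency-free Python sketch that pulls the same key figures out of the plain-text report with regular expressions. It is tuned to the layout of the example report in this transcript and is an illustration only: other BLAST versions format these lines differently, only the first HSP of each hit is kept, and the function name `parse_hits` is hypothetical.

```python
import re

SCORE_RE = re.compile(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*([\de.+-]+)")
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_hits(report):
    """Return one dict per '>' hit block, filled from its first Score/Identities lines."""
    hits = []
    current = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = {"hit": fields[0] if fields else ""}
            hits.append(current)
            continue
        if current is None:
            continue                                  # still in the report header
        score = SCORE_RE.search(line)
        if score and "bits" not in current:           # keep the first HSP only
            evalue = score.group(3)
            if evalue.startswith("e"):                # some reports print "e-100"
                evalue = "1" + evalue
            current.update(bits=float(score.group(1)),
                           raw=int(score.group(2)),
                           evalue=float(evalue))
        ident = IDENT_RE.search(line)
        if ident and "identities" not in current:
            current.update(identities=int(ident.group(1)),
                           align_len=int(ident.group(2)))
    return hits


if __name__ == "__main__":
    sample = (
        ">ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase\n"
        "          Length = 318\n"
        " Score = 266 bits (681), Expect = 1e-72\n"
        " Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)\n"
    )
    for hit in parse_hits(sample):
        print(hit)
```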
MS-Excel import/export
- Excel can import: tab delimited, comma delimited
- Excel can export: tab delimited, space delimited

Example table:
  AC/ID       desc                          score  e-value
  THIO_ECOLI  thioredoxin Escherichia coli  234    2.1e-5
  THIO_HUMAN  thioredoxin Homo sapiens      120    0.001

MS-Excel import/export
Tab delimited file:
- \t delimits the columns
- \n delimits the lines
- Optional first line contains the column titles
Example:
  AC/ID\tdesc\tscore\te-value\n
  THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
  THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n

MS-Excel import/export
Comma delimited file:
- , delimits the columns; each value is surrounded by ' '
- \n delimits the lines
- Optional first line contains the column titles
Example:
  'AC/ID','desc','score','e-value'\n
  'THIO_ECOLI','thioredoxin Escherichia coli','234','2.1e-5'\n
  'THIO_HUMAN','thioredoxin Homo sapiens','120','0.001'\n

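Both delimited formats can be produced and read back with Python's standard `csv` module, as sketched below with the example table from these slides. The single-quote `quotechar` mirrors the slide; the module's default is the double quote, which may be the safer choice when the file is meant for Excel. The function names are hypothetical.

```python
import csv
import io

ROWS = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]


def to_tab_delimited(rows):
    """Columns separated by tabs, lines by newlines."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def to_comma_delimited(rows, quotechar="'"):
    """Columns separated by commas, every value wrapped in quotechar."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar=quotechar,
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Parse tab delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


if __name__ == "__main__":
    print(to_tab_delimited(ROWS))
    print(to_comma_delimited(ROWS))
```

Quoting every value keeps a description that itself contains a comma in a single column.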