Indexing Genome Sequences Srikanta B. J. Database Systems Lab (DSL) Indian Institute of Science IITB - Bioinformatics Workshop 2001
Genome Sequence Analysis
• Hypothesize
  • Function of Proteins
  • Phylogenetic trees
  • Causes of Diseases
• First step in unraveling the mystery of Life!
• Sequence Similarity, Structural Similarity, Functional Similarity
Sequence Similarity
• Alignment
  • Between two sequences, S1 & S2 (perhaps of unequal length)
  • Insert spaces into, or at the ends of, S1 and S2
  • Place them so that every character or space in either string is opposite a unique character/space in the other. E.g.,
      q a c - d b d
      q a w x - b -
• Global & Local Alignments
Alignment
• Global
  • Given two sequences, find best alignment over full length
  • E.g., between (agtcacaaaact, actcgga)
      a g t c a c a a a a c t
      | | | | | | | | | | | |
      a c t c g g a - - - - -
• Local
  • Look for “islands” of high similarity
  • E.g., between (agtcacaaaact, actcgga)
      a g t c a c a a a a c t
                        | | |
                        a c t c g g a
O(mn) with Dynamic Programming
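The O(mn) dynamic programs behind both alignment styles can be sketched as below. This is a minimal illustration, not the presenter's code: the function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def global_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best global alignment score (Needleman-Wunsch style), O(mn) time."""
    n = len(s2)
    # Row 0: aligning the empty prefix of s1 against s2[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align s1[i-1] with s2[j-1]
                         prev[j] + gap,      # s1[i-1] opposite a space
                         cur[j - 1] + gap)   # s2[j-1] opposite a space
        prev = cur
    return prev[n]


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman style), O(mn) time."""
    n = len(s2)
    best = 0
    prev = [0] * (n + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            # The extra 0 lets an "island" start anywhere in either sequence.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Both functions keep only two rows of the table, so memory is O(n) while time stays O(mn); recovering the alignment itself would need the full table or a traceback structure.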
Search Process
• Given sequence to be studied
• Want all similar (global/local) known sequences
• Collections of sequences
  • NCBI-GenBank, SwissProt etc.
  • Contain millions of sequences
State of the art
• Dynamic Programming
  • Slow but accurate
  • Never misses a significant alignment
• FastA
  • Faster than Dynamic Programming
  • Uses statistical heuristics
  • Reduced sensitivity (false dismissals)
• BLAST
  • Fastest and popular
  • Lower sensitivity than FastA
  • Requires whole database in memory!
BLAST - on $1,000 Budget!
• BODHI experience [DSL, 2001]
  • ~51,000 DNA sequences in database
• CAFÉ experience [Williams and Zobel, 2001]
  • ~120,000 DNA sequences in memory
• Time: 67.1 seconds/BLAST, 10.6 seconds/BLAST
NCBI GenBank Growth
• Doubles every 13 months
• In 1998, estimated 40,000 sequence similarity queries per day
That was 3 years ago!!
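As a rough worked example of what a 13-month doubling time implies, the snippet below computes the growth factor over a given number of months. It assumes clean exponential growth, which is a simplification; the function name is made up for this illustration.

```python
def growth_factor(months, doubling_months=13.0):
    """Size multiplier after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_months)


# Three years (36 months) at a 13-month doubling time.
three_year_factor = growth_factor(36)
print(round(three_year_factor, 1))
```

Under that assumption the database grows a bit under sevenfold in three years, so a search whose cost is linear in database size slows down by about the same factor.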
We Need Indexes for Sequence Similarity Searching NOW!!
Indexed Searching
• Inverted Indexes
  • RAMdb [Fondrat and Dessen, 1995]
  • CAFÉ [Williams and Zobel, 2001]
  • FLASH [Califano and Rigoutsos, 1993]
• Multi-Dimensional Indexes
  • MRS-indexing [Kahveci and Singh, 2001]
  • Persistent Prefix Tree [Hunt et al., 2001]
RAMdb (Rapid Access Motif db)
• Each sequence in repository is indexed by constituent overlapping sequences
• 800-fold speedup over Dynamic Programming
• Prohibitive index size
• No ranking (goodness) of alignments
• False dismissals
Index entries (illustration):
  ACTC → Seq1, Seq2, …
  CTCG → Seq1, Seq4, …
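The idea of indexing each sequence by its overlapping substrings can be sketched with a small in-memory inverted index. This is an illustration of the general technique only, not RAMdb's actual data layout; the helper names, the window length k=4, and the toy sequences are assumptions chosen so that the ACTC and CTCG entries match the illustration on the slide.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seqs, k=4):
    """Map every k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def candidates(index, query, k=4):
    """Ids of sequences sharing at least one k-mer with the query."""
    hits = set()
    for kmer in kmers(query, k):
        hits |= index.get(kmer, set())
    return hits


# Hypothetical toy repository.
db = {"seq1": "ACTCGGA", "seq2": "TTACTCA", "seq4": "GCTCGT"}
index = build_index(db)
print(sorted(index["ACTC"]))  # seq1 and seq2 contain ACTC
print(sorted(index["CTCG"]))  # seq1 and seq4 contain CTCG
```

The sketch also shows where false dismissals come from: the query ACTAGGA differs from seq1 by a single substitution, yet it shares no 4-mer with it, so `candidates(index, "ACTAGGA")` comes back empty. The index size problem is visible too, since every position of every sequence contributes an entry.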
CAFÉ
• Partitioned Search
  • Coarse searching with compressed inverted index
  • Fine searching in small fraction of database, with ranking
• 14-fold speedup over BLAST
• Compression reduces the index size
• Distant sequence relationships are lost
• Lower retrieval effectiveness
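A partitioned (coarse, then fine) search can be sketched as follows. This is not CAFÉ's implementation: the index here is uncompressed, the coarse ranking simply counts shared k-mers, and the fine step is a plain local-alignment score with assumed values (match +1, mismatch -1, gap -1). All names, the `keep` cutoff, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seqs, k):
    """Uncompressed inverted index: k-mer -> set of sequence ids."""
    index = defaultdict(set)
    for seq_id, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score by dynamic programming."""
    best, prev = 0, [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, 1):
            sub = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def coarse_fine_search(db, index, query, k=3, keep=2):
    # Coarse step: one vote per distinct query k-mer found in a sequence.
    votes = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in index.get(kmer, ()):
            votes[seq_id] += 1
    shortlist = [seq_id for seq_id, _ in votes.most_common(keep)]
    # Fine step: align only the shortlisted fraction and rank by score.
    scored = [(local_score(query, db[seq_id]), seq_id) for seq_id in shortlist]
    return sorted(scored, reverse=True)


db = {"s1": "AGTCACAAAACT", "s2": "ACTCGGA",
      "s3": "GGGGGGGG", "s4": "TTTTACTCGGTT"}
index = build_index(db, k=3)
results = coarse_fine_search(db, index, "ACTCGGA")
print(results)
```

The speedup comes from running the expensive alignment on `keep` sequences instead of the whole database. The cost is the one the slide names: a distantly related sequence that shares few exact k-mers with the query never reaches the fine step.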
MRS - Indexing
• Uses progressive wavelet coefficients to represent sequence
MRS-Indexing (contd.)
• Builds a hierarchy of Multi-Dim. Indexes
• Only for edit distances - no general scoring schemes
• Not suited for average DNA/Protein query lengths
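To make the "edit distances only" limitation concrete, here is a minimal sketch of unit-cost edit distance. It is the standard textbook dynamic program, not code from the MRS-index work. Every insertion, deletion, and substitution costs exactly 1, so there is no way to say that some substitutions are more acceptable than others, which is what a general scoring scheme (for example, a substitution matrix for proteins) provides.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, a in enumerate(s1, 1):
        cur = [i]  # distance from s1[:i] to ""
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]
```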
Summary
• Rapid growth in sequence databases
• Existing algorithms do not scale
• Indexed approach to Sequence Similarity is necessary
• Improvements needed in Indexed Searching methods