Indexing Genome Sequences

Srikanta B. J.
Database Systems Lab (DSL), Indian Institute of Science
IITB - Bioinformatics Workshop 2001

Genome Sequence Analysis
• Hypothesize
  • Function of Proteins
  • Phylogenetic trees
  • Causes of Diseases
• First step in unraveling the mystery of Life!
• Sequence Similarity → Structural Similarity → Functional Similarity

Sequence Similarity
• Alignment
  • Between two sequences, S1 and S2 (perhaps of unequal length)
  • Insert spaces into, or at the ends of, S1 (S2)
  • Place the resulting strings one above the other so that every character or space in either string is opposite a unique character or space in the other. E.g.,
      q a c - d b d
      q a w x - b -
• Global & Local Alignments

Alignment
• Global
  • Given two sequences, find the best alignment over their full length
  • E.g., between (agtcacaaaact, actcgga):
      a g t c a c a a a a c t
      | | | | | | | | | | | |
      a c t c g g a - - - - -
• Local
  • Look for “islands” of high similarity
  • E.g., between (agtcacaaaact, actcgga):
      a g t c a c a a a a c t
      | | |
      a c t c g g a
• O(mn) time with Dynamic Programming, for sequences of lengths m and n
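The O(mn) dynamic programming mentioned above can be sketched as follows. This is a minimal illustration, not the presenter's code: the scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for the example. It computes only the best score, using the standard Needleman-Wunsch recurrence for global alignment and the Smith-Waterman recurrence for local alignment.

```python
def global_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch recurrence)."""
    n = len(s2)
    # Row 0: aligning an empty prefix of s1 against j characters costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best score of any pair of substrings (Smith-Waterman recurrence)."""
    n = len(s2)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what finds "islands".
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    a, b = "agtcacaaaact", "actcgga"
    print("global:", global_score(a, b))
    print("local: ", local_score(a, b))
```

Both functions fill an (m+1) by (n+1) table one row at a time, which is where the O(mn) cost comes from; keeping only two rows reduces memory but not time.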

Search Process
• Given sequence to be studied
• Want all similar (global/local) known sequences
• Collections of sequences
  • NCBI-GenBank, SwissProt etc.
  • Contain millions of sequences

State of the art
• Dynamic Programming
  • Slow but accurate
  • Never misses a significant alignment
• FastA
  • Faster than Dynamic Programming
  • Uses statistical heuristics
  • Reduced sensitivity → False dismissals
• BLAST
  • Fastest and popular
  • Lower sensitivity than FastA
  • Requires whole database in memory!

BLAST - on $1,000 Budget!
• BODHI experience [DSL, 2001]
  • ~51,000 DNA sequences in database
• CAFÉ Experience [Williams and Zobel, 2001]
  • ~120,000 DNA sequences in memory
• Time - 67.1 seconds/BLAST → 10.6 seconds/BLAST

NCBI GenBank Growth
• Doubles every 13 months
• In 1998, estimated 40,000 sequence similarity queries per day
That was 3 years ago!!
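As a rough check on what the doubling rate implies, assuming steady doubling every 13 months, three years (36 months) corresponds to a growth factor of 2^(36/13), roughly 6.8. The slide states the doubling rate for database size only, so the sketch below says nothing about query volume; the function name is illustrative.

```python
def growth_factor(months_elapsed, doubling_months=13.0):
    """Size multiplier after `months_elapsed`, assuming steady doubling."""
    return 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Three years after the 1998 estimate quoted on the slide.
    print(round(growth_factor(36), 1))
```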

We Need Indexes for Sequence Similarity Searching NOW!!

Indexed Searching
• Inverted Indexes
  • RAMdb [Fondrat and Dessen, 1995]
  • CAFÉ [Williams and Zobel, 2001]
  • FLASH [Califano and Rigoutsos, 1993]
• Multi-Dimensional Indexes
  • MRS-indexing [Kahveci and Singh, 2001]
  • Persistent Prefix Tree [Hunt et al., 2001]

RAMdb (Rapid Access Motif db)
• Each sequence in the repository is indexed by its constituent overlapping subsequences
• 800-fold speedup over Dynamic Programming
• Prohibitive index size
• No ranking (goodness) of alignments
• False dismissals
[Figure: inverted index entries]
  ACTC → Seq1, seq2, …
  CTCG → Seq1, seq4, …
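The idea of indexing each sequence by its overlapping subsequences can be sketched with a small inverted index. This is an illustration of the general technique, not RAMdb's actual implementation: the window length k=4, the function names, and the toy sequences are assumptions, chosen so that the ACTC and CTCG entries resemble the figure.

```python
from collections import defaultdict


def build_index(sequences, k=4):
    """Map every overlapping length-k substring to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def candidates(index, query, k=4):
    """Count, per sequence id, how many k-length windows of the query it shares."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            hits[seq_id] += 1
    return dict(hits)


if __name__ == "__main__":
    # Toy data, invented for the example.
    db = {"seq1": "ACTCGGA", "seq2": "TTACTCAA", "seq4": "GCTCGT"}
    idx = build_index(db)
    print(sorted(idx["ACTC"]), sorted(idx["CTCG"]))
    print(candidates(idx, "ACTCG"))
```

Every position in every sequence contributes an index entry, which suggests why index size becomes a concern, and a sequence that shares no exact window with the query is never returned, which is one way false dismissals arise.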

CAFÉ
• Partitioned Search
  • Coarse searching with compressed inverted index
  • Fine searching in small fraction of database, with ranking
• 14-fold speedup over BLAST
• Compression reduces the index size
• Distant sequence relationships are lost
• Lower retrieval effectiveness
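The two-stage, coarse-then-fine search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not CAFÉ's implementation: index compression is omitted, the coarse stage simply counts shared windows of length k, the fine stage uses a plain local alignment score with match +1, mismatch -1, gap -1, and all names and parameters (`k`, `keep`) are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(db, k):
    """Uncompressed inverted index: k-mer -> ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def coarse_search(index, query, k, keep):
    """Coarse stage: shortlist the `keep` sequences sharing the most distinct k-mers."""
    hits = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(keep)]


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Fine stage scorer: best local alignment score (Smith-Waterman recurrence)."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(db, index, query, k=3, keep=2):
    """Align only the shortlisted sequences and rank them by score, best first."""
    shortlist = coarse_search(index, query, k, keep)
    return sorted(((local_score(query, db[s]), s) for s in shortlist), reverse=True)


if __name__ == "__main__":
    # Toy database, invented for the example.
    db = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "GGACGTGG"}
    print(search(db, build_index(db, 3), "ACGTAC"))
```

Only shortlisted sequences are ever aligned, which is where the speedup comes from; a distantly related sequence that shares too few exact windows with the query is filtered out before the fine stage sees it.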

MRS-Indexing
• Uses progressive wavelet coefficients to represent sequence

MRS-Indexing (contd.)
• Builds a hierarchy of Multi-Dim. Indexes
• Only for edit distances - no general scoring schemes
• Not suited for average DNA/Protein query lengths
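The slides do not spell out the wavelet representation, so the sketch below does not implement wavelet coefficients or the index hierarchy. As a stand-in, it assumes a plain symbol-count vector per sequence and shows how such a numeric summary can filter candidates when the similarity measure is edit distance: the count-based distance never exceeds the true edit distance, so a candidate whose count-based distance is already above the threshold can be skipped safely. All function names are illustrative.

```python
from collections import Counter


def frequency_vector(seq, alphabet="ACGT"):
    """Number of occurrences of each alphabet symbol in seq."""
    counts = Counter(seq)
    return [counts[ch] for ch in alphabet]


def frequency_distance(u, v):
    """Lower bound on edit distance: each edit changes a surplus or deficit by at most 1."""
    surplus = sum(a - b for a, b in zip(u, v) if a > b)
    deficit = sum(b - a for a, b in zip(u, v) if b > a)
    return max(surplus, deficit)


def edit_distance(s, t):
    """Classic O(mn) edit distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def may_match(query, candidate, max_edits):
    """Cheap filter: False means the candidate cannot be within max_edits of the query."""
    fd = frequency_distance(frequency_vector(query), frequency_vector(candidate))
    return fd <= max_edits


if __name__ == "__main__":
    q, c = "ACGTAC", "TTTTTT"
    print(may_match(q, c, 2), edit_distance(q, c))
```

The bound relies on each edit operation changing symbol counts by at most one, which is consistent with the slide's point that the approach covers edit distance but not general scoring schemes.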

Summary
• Rapid growth in sequence databases
• Existing algorithms do not scale
• Indexed approach to Sequence Similarity is necessary
• Improvements needed in Indexed Searching methods
