A Study of Computational Methods for Storing and Sequencing Genetic Databases
CSC 545 – Advanced Database Systems
By: Nnamdi Ihuegbu
12/2/03
Abstract
• Scope of study (i.e., the aspect of genetic databases covered)
• Types of genetic databases
• Storage, organization, access, and manipulation techniques
• Sequencing (querying) of data in genetic databases
• Logical layout of genetic databases
Brief Introduction
• The Human Genome Project (and others) -> vast amount of biological data
• Joint venture of computer science and biology (bioinformatics and computational biology, BCB) -> genetic databases (map, genomic, proteomic)
• Expected date of completed map of the human genome: end of 2003
• Next stage: sequence comparison and sequence-to-protein-function analysis
• Useful to pharmaceutical companies (computer-aided drug design, CADD; e.g., GlaxoSmithKline's Relenza)
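Results - Sequence
• Current sequence generation technologies
  • Maxam-Gilbert: uses chemicals to cleave DNA at a specific base, so fragment length marks the position of that base
  • Sanger: uses enzymatic synthesis to produce DNA fragments that terminate at a specific base, so fragment length again marks the position of that base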
Results - Sequence
• Types of sequence comparisons/alignments
  • Global ("How similar are these two sequences?")
    • Finds the best overall alignment between two sequences
    • 1970: Needleman and Wunsch (global, dynamic programming)
    • Shortcoming: can miss small regions of similarity within two subsequences
  • Local ("What sequences in a database are most similar to this sequence?")
    • Finds the best subsequence match between two sequences
    • 1981: Smith and Waterman (local, dynamic programming)
    • Shortcoming: computationally expensive, slow
Results - Sequence
• Heuristic search (quick, approximate)
  • Quickly searches for "words" that match the query sequence, then repeatedly performs a local search around each matched word until no other matches are found
  • FASTA (1988), BLAST (1990)
  • Shortcomings: approximate rather than exact; hits are judged by E-value (significant if < 0.05)
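The word-matching idea on this slide can be sketched in a few lines. This is a minimal illustration, not FASTA or BLAST themselves: it assumes exact word matches of length `k` and extends each match only while characters stay identical, whereas the real tools use scoring matrices, drop-off thresholds, and E-value statistics. The function names `find_seeds`, `extend_seed`, and `best_hit` are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k=3):
    """Grow an exact word match left and right while the characters keep matching.

    Simplifying assumption: extension stops at the first mismatch.
    Returns (query_start, subject_start, length).
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


def best_hit(query, subject, k=3):
    """Return the longest extended match, or None if no word matches."""
    hits = {extend_seed(query, subject, i, j, k) for i, j in find_seeds(query, subject, k)}
    return max(hits, key=lambda h: h[2], default=None)


print(best_hit("TTGACCA", "AAGACCAT"))  # (2, 2, 5): "GACCA" in both sequences
```

Indexing the subject's words once is what makes the search quick: each query word is a dictionary lookup rather than a scan, at the cost of missing similarities that contain no exact word match.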
Results – Sequence (CSC Implementation)
• A sequence alignment can be represented as a matrix or a graph (using rules and costs)
• When the alignment is converted into a directed acyclic graph, its solution is the longest path through the graph (maximum-path problem)
Results – Sequence (CSC Implementation)
• Can be solved with dynamic programming as a "running max score" (RMS)
• For each cell D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score), where west = D(i,j-1), north = D(i-1,j), and NW = D(i-1,j-1)
• Replace D(i,j) with this max
• Needleman-Wunsch dynamic program
  • Diagonal edge = characters aligned (match or mismatch); down edge = gap in string 2; across edge = gap in string 1
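The running-max-score recurrence above can be written out as a short sketch. The scoring values (`match=1`, `mismatch=-1`, `gap1=gap2=-2`) and the function name are assumptions chosen for illustration; the slides do not specify them. String 1 indexes the rows and string 2 the columns, so a move from the north is a gap in string 2 and a move from the west is a gap in string 1.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap1=-2, gap2=-2):
    """Global alignment; returns (score, aligned_s1, aligned_s2).

    gap1 = penalty for a gap in string 1 (move from the west),
    gap2 = penalty for a gap in string 2 (move from the north).
    All scoring values are illustrative assumptions.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: s1 aligned against gaps only
        D[i][0] = D[i - 1][0] + gap2
    for j in range(1, cols):          # first row: s2 aligned against gaps only
        D[0][j] = D[0][j - 1] + gap1
    for i in range(1, rows):
        for j in range(1, cols):
            current = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = max(D[i][j - 1] + gap1,         # west
                          D[i - 1][j] + gap2,         # north
                          D[i - 1][j - 1] + current)  # NW (diagonal)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        diag_ok = i > 0 and j > 0
        current = match if diag_ok and s1[i - 1] == s2[j - 1] else mismatch
        if diag_ok and D[i][j] == D[i - 1][j - 1] + current:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap2:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return D[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Filling the table takes time proportional to the product of the two sequence lengths, which is why the heuristic methods are preferred for database-scale searches.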
Results – Sequence (CSC Implementation)
• Similar to Smith-Waterman
• Differences:
  • Restricts the RMS: discontinues if it is < 0 after several iterations
  • For each iteration, saves the max for each cell separately rather than replacing it -> trace back through the max scores for the best local alignment
• BLAST implementation (http://www.ebi.ac.uk/blast2/#)
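For comparison, here is a minimal sketch of the standard Smith-Waterman local alignment. It is the textbook recurrence, not necessarily the exact variant the slide describes: it floors the running score at zero in every cell (rather than "after several iterations") and starts the traceback from the highest-scoring cell. The scoring values and the function name are assumptions.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_s1_part, aligned_s2_part).

    Scoring values are illustrative assumptions.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            current = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,                          # never let the score go negative
                          H[i][j - 1] + gap,          # west
                          H[i - 1][j] + gap,          # north
                          H[i - 1][j - 1] + current)  # NW (diagonal)
            if H[i][j] > best:                        # remember the best cell seen so far
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        current = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + current:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("XXXACGTXXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

The zero floor is what makes the alignment local: a poorly matching prefix cannot drag down the score of a good region that follows it.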
Results - Storage
• EMBL Nucleotide Sequence Database (on Oracle)
• Scale: over 130 tables and 140 relationships (80 GB of data)
• Object-oriented organization with 5 related packages
• Operations return attribute types, which supports on-demand object creation
• "Live object cache": copies the most frequently accessed instances of the DB into a cache keyed by primary key and performs queries on this cache
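The "live object cache" idea can be illustrated with a small sketch. This is not EMBL's implementation: the class name `LiveObjectCache`, the `loader` callback standing in for a database query, and the least-recently-used eviction rule (used here as a simple stand-in for "most accessed") are all assumptions.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keep recently used objects in memory, keyed by primary key (hypothetical sketch)."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader          # assumed callable: primary_key -> object (e.g. a DB query)
        self._capacity = capacity
        self._objects = OrderedDict()  # oldest entry first, most recently used last
        self.loads = 0                 # number of times the loader (database) was hit

    def get(self, primary_key):
        if primary_key in self._objects:
            self._objects.move_to_end(primary_key)  # mark as recently used
            return self._objects[primary_key]
        obj = self._loader(primary_key)             # create the object on demand
        self.loads += 1
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)       # evict the least recently used entry
        return obj


# Usage with a dictionary standing in for the database table.
fake_db = {"pk1": "sequence record 1", "pk2": "sequence record 2"}
cache = LiveObjectCache(fake_db.__getitem__, capacity=10)
cache.get("pk1")
cache.get("pk1")
print(cache.loads)  # 1: the second request was served from the cache
```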
Results - Storage
• 5 EMBL packages:
  • Sequence Info: general information on the biological sequence
  • Feature Info: sequence annotation/comments
  • Reference Info: bibliographic references on the sequence
  • Taxonomy Info: taxonomy of the organism the sequence comes from (i.e., kingdom, phylum, family, genus, species, etc.)
  • Location Info: location of the sequence on the DNA/RNA
Conclusion
• Genetic databases (3 main types) are essential for storing, managing, and querying the massive biological data from studies like the HGP
• Object-oriented design and data organization
• Sequence analysis: global (Needleman-Wunsch), local (Smith-Waterman), heuristic (FASTA, BLAST)
Conclusion - Future Enhancements
• Storage/management: highly dependent on progress in the hardware industry
• Sequence analysis:
  • Use of parallel programming for faster analysis of 2 sequences (BLAZE, Stanford)
  • Faster means of comparing and aligning multiple sequences simultaneously (e.g., comparing a novel protein sequence to a protein family)