
A Study of Computational Methods for Storing and Sequencing Genetic Databases

A Study of Computational Methods for Storing and Sequencing Genetic Databases. CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03. Abstract. Scope of Study (i.e. aspect of Genetic Databases) Types of Genetic Databases Storage/organization/access/manipulation techniques


Presentation Transcript


  1. A Study of Computational Methods for Storing and Sequencing Genetic Databases CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03

  2. Abstract • Scope of Study (i.e. aspect of Genetic Databases) • Types of Genetic Databases • Storage/organization/access/manipulation techniques • Sequencing (querying) of data in Genetic Databases • Logical Layout of Genetic Databases

  3. Brief Introduction • The Human Genome Project (and others) -> vast amount of biological data • Joint venture of Computer Science and Biology (BCB) -> Genetic Databases (map, genomic, proteomic) • Expected completion date for the map of the human genome: end of 2003 • Next stage: sequence comparison and sequence-to-protein-function analysis • Useful to pharmaceutical companies (computer-aided drug design, CADD; e.g. GlaxoSmithKline's Relenza).

  4. Results - Sequence • Current Sequence Generation Technologies • Maxam-Gilbert (uses chemicals to cleave DNA at a specific base, yielding fragments of different lengths) • Sanger (uses enzymatic synthesis that terminates at a specific base, yielding fragments of different lengths)

  5. Derivation of nucleotide sequence from human chromosome

  6. Results - Sequence • Types of Sequence Comparisons/Alignments • Global (“How similar are these two sequences?”) • Finds the best overall alignment between two sequences • 1970: Needleman and Wunsch (global, dynamic programming) • Shortcoming: misses small regions of similarity within two subsequences • Local (“What sequences in a database are most similar to this sequence?”) • Finds the best subsequence match between two sequences • 1981: Smith and Waterman (local, dynamic programming) • Shortcomings: not computationally efficient, slow

  7. Results - Sequence

  8. Results - Sequence • Heuristic Search (Quick, Approximate) • Quickly searches for “words” that match the sequence, then repeatedly extends a local search around each matched word until no further matches are found • FASTA (1988), BLAST (1990) • Shortcomings: results are approximate, not exact, and are judged by E-value (significant if < 0.05)
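
To make the word-matching heuristic on this slide concrete, here is a minimal Python sketch of a seed-and-extend search. It is an illustration under stated assumptions, not the FASTA or BLAST algorithm itself: the function names (`build_index`, `seed_and_extend`), the word size `k`, the match/mismatch scores, and the `x_drop` cutoff are all hypothetical choices, and the extension is ungapped. Real tools add scoring matrices, gapped extension, and E-value statistics.

```python
from collections import defaultdict


def build_index(subject, k):
    """Map every k-letter word of `subject` to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def _extend_one_way(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction `step`; return (score gain, length) of the best extension."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen: stop extending
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, x_drop=2):
    """Return sorted (score, query_start, subject_start, length) hits, best first."""
    index = build_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        # Exact word matches act as seeds for the local extension.
        for s in index.get(query[q:q + k], ()):
            right, right_len = _extend_one_way(
                query, subject, q + k, s + k, +1, match, mismatch, x_drop)
            left, left_len = _extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            score = k * match + left + right
            hits.add((score, q - left_len, s - left_len, left_len + k + right_len))
    return sorted(hits, reverse=True)
```

For example, `seed_and_extend("ACGTAC", "TTACGTACGG")[0]` is `(6, 0, 2, 6)`: the whole query matches the subject starting at position 2. The speed comes from only examining positions that share an exact word, which is also why weak similarities with no shared word can be missed.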

  9. Results – Sequence (CSC Implementation) • Sequence alignment can be represented as matrices and graphs (using rules and costs) • When the problem is converted into a directed acyclic graph, the optimal alignment is the longest (maximum-score) path through the graph.

  10. Results – Sequence (CSC Implementation) • Can be solved with dynamic programming as a ‘running max score’ (RMS). • For each cell D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score) • Replace D(i,j) with this max • Needleman-Wunsch dynamic program: diagonal edge = character match; down edge = gap in string 2; across edge = gap in string 1
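
The running-max-score rule on this slide can be sketched in Python as follows. This is a minimal illustration, assuming a single linear gap penalty (the slide's gap1 and gap2 are set equal) and simple match/mismatch scores; the function name and the default score values are hypothetical, not taken from the presentation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (the diagonal edge).
        return match if a[i - 1] == b[j - 1] else mismatch

    # D[i][j] = best running score for aligning a[:i] with b[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + sub(i, j),  # NW: align the two characters
                D[i - 1][j] + gap,            # north: gap in string 2
                D[i][j - 1] + gap,            # west: gap in string 1
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `needleman_wunsch("ACGT", "AGT")` returns `(1, "ACGT", "A-GT")`: three matches and one gap. The table costs time and memory proportional to the product of the two sequence lengths.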

  11. Results – Sequence (CSC Implementation) • Similar to Smith-Waterman • Differences: • Restricts the RMS: discontinues if it is < 0 after several iterations • For each iteration, saves the max for each cell separately rather than replacing it -> trace back through the max scores for the best local alignment • BLAST Implementation (http://www.ebi.ac.uk/blast2/#)
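
For reference, the standard Smith-Waterman local alignment that this slide compares against can be sketched in Python as below. It is a minimal sketch, not the BLAST implementation linked above: the function name and the score values are assumptions, and a single linear gap penalty is used. The two points to notice are the floor at zero (a running score never goes negative) and the traceback that starts from the highest-scoring cell rather than the corner.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                            # restart: drop a negative running score
                H[i - 1][j - 1] + sub(i, j),  # align the two characters
                H[i - 1][j] + gap,            # gap in string 2
                H[i][j - 1] + gap,            # gap in string 1
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `smith_waterman("TTTACGTAAA", "CCCACGTGGG")` returns `(8, "ACGT", "ACGT")`: only the shared core is reported and the dissimilar flanks are ignored.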

  12. Results - Storage • EMBL Nucleotide Sequence Database (on Oracle) • Scale: over 130 tables, 140 relationships (80 GB of data) • Object-oriented organization into 5 related packages. • Operations return attribute types, which supports on-demand object creation • ‘Live object cache’ – copies the most-accessed instances from the DB into a cache keyed by primary key and performs queries on this cache.
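
The ‘live object cache’ idea on this slide can be illustrated with a small Python sketch. Everything here is hypothetical: the class name, the `loader` callback standing in for a database query, and the least-recently-used eviction policy are assumptions made for illustration, not details of the EMBL implementation.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keeps recently used objects in memory, keyed by primary key."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader        # hypothetical callable: primary key -> object
        self._capacity = capacity
        self._objects = OrderedDict()

    def get(self, primary_key):
        if primary_key in self._objects:
            # Cache hit: mark as most recently used and skip the database.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Cache miss: create the object on demand from the database.
        obj = self._loader(primary_key)
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)  # evict the least recently used entry
        return obj
```

A caller would pass a function that queries the database for one entry, for example `LiveObjectCache(load_entry, capacity=500)` where `load_entry` is whatever lookup the application already has. Repeated requests for the same primary key then return the same in-memory object without another query.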

  13. Results - Storage • 5 EMBL Packages: • Sequence Info – general information on the biological sequence. • Feature Info – sequence annotation/comments • Reference Info – bibliographic references on the sequence. • Taxonomy Info – taxonomy of the sequence's organism (i.e. kingdom, phylum, family, genus, species, etc.) • Location Info – location of the sequence on DNA/RNA

  14. Results – Storage (Gen. Relation B/W 5 packages)

  15. Results – Storage (Sequence Info)

  16. Results – Storage (Feature Info)

  17. Results – Storage (Reference Info)

  18. Results – Storage (Taxonomy Info)

  19. Results – Storage (Location Info)

  20. Conclusion • Genetic Databases (3 main types) are essential to store, manage, and query the massive bio-data from studies like HGP. • Object Oriented Design and data organization • Sequence Analysis: Global (N-W), Local (S-W), Heuristic (FASTA, BLAST)

  21. Conclusion - Future Enhancements • Storage/Management: highly dependent on hardware industry progress • Sequence Analysis: • Use of parallel programming for faster analysis of 2 sequences (BLAZE-Stanford) • Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family).

  22. Any Questions?
