A Study of Computational Methods for Storing and Sequencing Genetic Databases

CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03

Abstract • Scope of Study (i.e. aspect of Genetic Databases) • Types of Genetic Databases • Storage/organization/access/manipulation techniques • Sequencing (querying) of data in Genetic Databases • Logical Layout of Genetic Databases

Brief Introduction • Human Genome Project (and others) -> Vast amount of biological data • Venture: Computer Science and Biology (BCB) -> Genetic Databases (map, genomic, proteomic) • Expected date of completed map of human genome: end of 2003 • Next stage: sequence comparison and sequence-to-protein function. • Useful to pharmaceutical companies (CADD – e.g. GSK’s Relenza).

Results - Sequence • Current Sequence Generation Technologies • Maxam-Gilbert (uses chemicals to cleave DNA at a specific base, giving fragments of specific lengths) • Sanger (uses enzymatic procedures to produce DNA terminated at a specific base, i.e. of a specific length)

Derivation of nucleotide sequence from human chromosome

Results - Sequence • Types of Sequence Comparisons/Alignments • Global (“How similar are these two sequences?”) • To find the best overall alignment between two sequences • 1970: Needleman and Wunsch (global, dynamic programming) • Shortcomings: can miss small regions of similarity within two subsequences • Local (“What sequences in a database are most similar to this sequence?”) • To find the best subsequence match between two sequences • 1981: Smith and Waterman (local, dynamic programming) • Shortcomings: not computationally efficient, slow

Results - Sequence

Results - Sequence • Heuristic Search (Quick, Approximate) • Quickly search for “words” that match the sequence, then recursively perform a local search on each matched word until there are no other matches • FASTA (1988), BLAST (1990) • Shortcomings: approximate, not exact; relies on an E-value (significant if <0.05)
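
The word-matching idea above can be sketched as a toy "seed and extend" search. This is only an illustration of the general heuristic, not the actual FASTA or BLAST algorithm: the function names, the word length `k`, the scoring values, and the `drop` cutoff are all assumptions made for the example, and no E-value is computed.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend_hit(query, subject, qi, si, k, match=1, mismatch=-1, drop=2):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, query_start, query_end), with query_end exclusive.
    """
    def extend(q, s, step):
        # Walk away from the seed; stop once the running score falls
        # more than `drop` below the best score seen in this direction.
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = extend(qi + k, si + k, 1)
    left_score, left_len = extend(qi - 1, si - 1, -1)
    total = k * match + left_score + right_score
    return total, qi - left_len, qi + k + right_len


def seed_and_extend(query, subject, k=3):
    """Return the best-scoring extended hit, or None if no word is shared."""
    index = index_words(subject, k)
    hits = []
    for qi in range(len(query) - k + 1):
        for si in index.get(query[qi:qi + k], []):
            hits.append(extend_hit(query, subject, qi, si, k))
    return max(hits, default=None)
```

Because only exact word hits are extended, a similar region that shares no exact word of length `k` is missed entirely, which is the "approximate, not exact" trade-off noted on the slide.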

Results – Sequence (CSC Implementation) • Sequence alignment can be represented as matrices and graphs (using rules and costs) • When converted into a directed acyclic graph, solution of the sequence alignment is the longest-path (max. path problem).

Results – Sequence (CSC Implementation) • Can be solved dynamically as a ‘running max score’ (RMS). • For each D(i,j), best RMS = max(west+gap1, north+gap2, NW+current_score) • Replace D(i,j) with this max • Needleman-Wunsch Dynamic Program: diagonal edge = character match; down edge = gap in string 2; across edge = gap in string 1
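
The running-max recurrence above can be sketched in a few lines. This is a minimal version under stated assumptions: a single linear gap cost is used for both gap1 and gap2, the scoring values (+1 match, -1 mismatch, -1 gap) are illustrative, and string 1 is laid out along the rows so that a down edge is a gap in string 2.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a (string 1, rows) and b (string 2, columns).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    # Border cells: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        D[i][0] = i * gap
    for j in range(1, cols):
        D[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell keeps the max of west+gap, north+gap, NW+current score.
    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = max(
                D[i][j - 1] + gap,                   # across edge: gap in string 1
                D[i - 1][j] + gap,                   # down edge: gap in string 2
                D[i - 1][j - 1] + pair_score(i, j),  # diagonal edge
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return D[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

The table takes time and memory proportional to the product of the two sequence lengths, which is why exact dynamic programming becomes slow when one sequence is compared against a whole database.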

Results – Sequence (CSC Implementation) • Similar to Smith-Waterman • Differences: • Restricts the RMS: discontinues if it is <0 after several iterations • For each iteration, saves the max for each cell separately rather than replacing it, then traces back through the max scores for the best local alignment • BLAST Implementation (http://www.ebi.ac.uk/blast2/#)
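
For comparison, here is a minimal sketch of the textbook Smith-Waterman local alignment, which is one way to read the two differences listed above: the running score is never allowed to fall below zero, and the traceback starts from the highest-scoring cell instead of the corner. Whether the slide describes exactly this variant is an assumption, as are the scoring values (+2 match, -1 mismatch, -1 gap).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment between a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Flooring at 0 discards any alignment whose running score
            # has gone negative, so a new local alignment can start here.
            H[i][j] = max(
                0,
                H[i][j - 1] + gap,
                H[i - 1][j] + gap,
                H[i - 1][j - 1] + pair_score(i, j),
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

In this sketch, `smith_waterman("TTTACGTAAA", "GGGACGTCCC")` picks out the shared `ACGT` core and ignores the unrelated flanks, which a global alignment would be forced to align.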

Results - Storage • EMBL Nucleotide Sequence Database (on Oracle) • Scale: over 130 tables, 140 relationships (80 GB of data) • Object-oriented organization with 5 related packages. • Operations that return an attribute type -> supports on-demand object creation • ‘Live object cache’ – copies the most accessed instances of the DB into a cache by primary key and performs queries on this cache.
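
The ‘live object cache’ with on-demand object creation can be illustrated with a small sketch. This is not the EMBL implementation: the class name, the `loader` callback (standing in for a database query by primary key), the `capacity` limit, and the least-recently-used eviction policy are all assumptions chosen for the example.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keeps recently used objects in memory, keyed by primary key."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader          # hypothetical: fetches one object from the DB
        self._capacity = capacity      # hypothetical size limit
        self._objects = OrderedDict()  # primary key -> live object

    def get(self, primary_key):
        if primary_key in self._objects:
            # Cache hit: mark as most recently used, no database access.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Cache miss: create the object on demand and remember it.
        obj = self._loader(primary_key)
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            # Evict the least recently used entry.
            self._objects.popitem(last=False)
        return obj

    def __len__(self):
        return len(self._objects)
```

With a cache like this, repeated requests for the same entry reach the underlying database only once, as long as the entry has not been evicted in the meantime.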

Results - Storage • 5 EMBL Packages: • Sequence Info – general information on the biological sequence. • Feature Info – sequence annotation/comment • Reference Info – bibliographic references on the sequence. • Taxonomy Info – taxonomy of the organism’s sequence (i.e. kingdom, phylum, family, genus, species, etc.) • Location Info – location of the sequence on DNA/RNA
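
One way to picture how the five packages relate is as a small set of linked classes. The sketch below is purely illustrative: every class name, field name, and the 1-based inclusive coordinate convention is a hypothetical choice for the example and is not taken from the actual EMBL schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:            # Location Info: where something sits on the sequence
    start: int             # hypothetical: 1-based, inclusive
    end: int               # hypothetical: 1-based, inclusive


@dataclass
class Feature:             # Feature Info: an annotation attached to a location
    key: str
    location: Location
    comment: str = ""


@dataclass
class Reference:           # Reference Info: a bibliographic reference
    title: str
    authors: List[str] = field(default_factory=list)


@dataclass
class Taxonomy:            # Taxonomy Info: lineage from kingdom down to species
    lineage: List[str]

    @property
    def species(self):
        return self.lineage[-1]


@dataclass
class SequenceEntry:       # Sequence Info: ties the other four together
    accession: str
    residues: str
    taxonomy: Taxonomy
    features: List[Feature] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)

    def feature_residues(self, feature):
        """Return the stretch of sequence that a feature annotates."""
        loc = feature.location
        return self.residues[loc.start - 1:loc.end]
```

In this arrangement the sequence entry is the hub: features point into it through locations, while taxonomy and references hang off it as separate objects.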

Results – Storage (Gen. Relation B/W 5 packages)

Results – Storage (Sequence Info)

Results – Storage (Feature Info)

Results – Storage (Reference Info)

Results – Storage (Taxonomy Info)

Results – Storage (Location Info)

Conclusion • Genetic Databases (3 main types) are essential to store, manage, and query the massive bio-data from studies like HGP. • Object Oriented Design and data organization • Sequence Analysis: Global (N-W), Local (S-W), Heuristic (FASTA, BLAST)

Conclusion - Future Enhancements • Storage/Management: highly dependent on hardware industry progress • Sequence Analysis: • Use of parallel prog. for faster analysis of 2 sequences (BLAZE-Stanford) • Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing novel protein sequence to family).

Any Questions?
