The MoBIoS Project: Molecular Biological Information System. Daniel P. Miranker, Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics, University of Texas. Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang.
Problem: In the life sciences, database management systems (DBMS) serve as glorified file managers. • Little use of sophisticated data and pattern-based retrieval • Real scientific and technological problems
When Biological Data Is Put into an RDBMS • Primary data is stored in text or blob fields • Annotations may be relational • Data retrieval: filter the DB, then a sequential dump, O(n), to utilities, e.g. BLAST
Linear Data Scans, O(n), Endemic in Life Sciences • Sequences: DNA, RNA, protein databases • Mass spectra: proteomics • Small molecules & protein structure: protein interaction, rational drug design • Pathways (graphs) • Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground • Both biology and DBMSs have to move • Diagram labels: DBMS, Biological Information System, Metric-Space Database as the Common Ground
A Metric Space Is • a pair, M = (D, d), where • D is a set of points • d is a [metric] distance function with the following properties: • d(x,y) = d(y,x) (symmetry) • d(x,y) > 0 for x ≠ y, and d(x,x) = 0 (non-negativity and identity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
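The three properties can be checked mechanically on a finite sample. The following is a minimal sketch, assuming Hamming distance over equal-length strings as the example metric; the helper names `hamming`, `shared_positions`, and `is_metric_on` are illustrative and not part of MoBIoS.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def shared_positions(a: str, b: str) -> int:
    """A similarity score (higher = more alike); deliberately NOT a metric."""
    return sum(ca == cb for ca, cb in zip(a, b))

def is_metric_on(points, d) -> bool:
    """Check symmetry, identity/non-negativity, and the triangle inequality
    for every pair and triple drawn from a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):                      # symmetry
            return False
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):  # d = 0 only for x == y
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):             # triangle inequality
            return False
    return True

sample = ["AAA", "AAC", "ACG", "TTT"]
print(is_metric_on(sample, hamming))           # True
print(is_metric_on(sample, shared_positions))  # False: d(x, x) is 3, not 0
```

Passing on a sample does not prove a function is a metric, but a single failure is enough to show that it is not.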
Definition, by Analogy • A spatial database management system: extends a relational DBMS; special indexes for 2D and 3D data (k-d trees and R-trees); new data types; geographic information systems (topographic maps, buildings and the like) • A metric-space database management system: extends a relational DBMS; special indexes for metric spaces; new data types; biological information system (life science data types)
Develop index structures to support distance & nearest-neighbor queries • Well studied in main memory, but by no means a closed problem • In databases (external/disk-based methods): embryonic; many myths; often assumed to be the basis of multimedia database systems
How to build a metric-space index • Three algorithmic classes [Tasan, Ozsoyoglu 04] • Vantage points • Hyperplanes • Bounding spheres
Vantage Point Method • Choose a point, VP • And a radius, R
Vantage Point Method • Given VP, R • The predicates • d(VP,x) < R • d(VP,x) >= R • divide the set into two halves (equal halves when R is the median distance) • Apply recursively
Query q, range r • If d(q,VP) > R + r, then all neighbors are outside the sphere, so the inside partition need not be searched
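The vantage point slides can be put together as a small in-memory sketch. It assumes the first point of each subset is taken as the vantage point and the median distance as R; both choices, and the names `VPNode`, `build`, and `range_search`, are illustrative assumptions, not the MoBIoS implementation. Hamming distance on short strings stands in for the metric.

```python
import statistics
from dataclasses import dataclass
from typing import Optional

def hamming(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

@dataclass
class VPNode:
    vp: object                          # the vantage point
    radius: float = 0.0                 # R, the split radius
    inside: Optional["VPNode"] = None   # points with d(VP, x) <  R
    outside: Optional["VPNode"] = None  # points with d(VP, x) >= R

def build(points, d):
    """Recursively split the points by the predicates d(VP,x) < R and d(VP,x) >= R."""
    if not points:
        return None
    vp, rest = points[0], points[1:]    # assumption: first point is the vantage point
    if not rest:
        return VPNode(vp)
    dists = [d(vp, x) for x in rest]
    radius = statistics.median(dists)   # median radius gives roughly equal halves
    inside = [x for x, dx in zip(rest, dists) if dx < radius]
    outside = [x for x, dx in zip(rest, dists) if dx >= radius]
    return VPNode(vp, radius, build(inside, d), build(outside, d))

def range_search(node, q, r, d, out=None):
    """Collect every indexed point within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    if dq - r < node.radius:    # otherwise every neighbor lies outside the sphere
        range_search(node.inside, q, r, d, out)
    if dq + r >= node.radius:   # otherwise every neighbor lies inside the sphere
        range_search(node.outside, q, r, d, out)
    return out

points = ["AAA", "AAC", "ACC", "CCC", "GGG", "GGA", "TTT", "TAT"]
tree = build(points, hamming)
print(sorted(range_search(tree, "AAA", 1, hamming)))  # ['AAA', 'AAC']
```

Both pruning tests follow from the triangle inequality alone, which is why the same code works for any metric supplied as `d`.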
Multi-Vantage Point Method • Consider d(VP_i, x) a projection onto an axis • Looks like a k-d tree • Choose number k & d
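The projection idea can be illustrated without building a tree. The sketch below is a flat pivot table, not the mvp-tree itself: each point is mapped to the vector of its distances to k vantage points, and the triangle inequality gives the lower bound max_i |d(VP_i,q) - d(VP_i,x)| <= d(q,x), so points whose bound exceeds r are skipped without computing d(q,x). The function names and the choice of pivots are assumptions made for the example.

```python
def hamming(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

def project(x, pivots, d):
    """Coordinates of x: its distance to each vantage point."""
    return tuple(d(p, x) for p in pivots)

def lower_bound(qv, xv):
    """Largest per-axis gap; never exceeds the true distance d(q, x)."""
    return max(abs(a - b) for a, b in zip(qv, xv))

def pivot_range_search(points, pivots, q, r, d):
    """Return (hits, number of real distance computations against data points)."""
    # In a real index the data projections are computed once, at build time.
    table = [(x, project(x, pivots, d)) for x in points]
    qv = project(q, pivots, d)
    hits, evaluated = [], 0
    for x, xv in table:
        if lower_bound(qv, xv) > r:
            continue                  # pruned by the triangle inequality
        evaluated += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, evaluated

points = ["AAA", "AAC", "ACC", "CCC", "GGG", "GGA", "TTT", "TAT"]
hits, evaluated = pivot_range_search(points, ["AAA", "TTT"], "AAC", 1, hamming)
print(sorted(hits), evaluated, "of", len(points))
```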
Myths • "Solved problem; M-trees" [Ciaccia et al. 96, 97]: I can't get them to work on anything but their original synthetic data generator • "A good choice for vantage points is to find corners" [Yianilos93] (farthest-first clustering): might be true for Euclidean spaces; an early result, not true for our data • "High-dimensional indexing always asymptotically reduces to linear scans": a formal result based on an assumption of uniform data distributions
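For readers unfamiliar with the "corners" heuristic, here is a minimal sketch of farthest-first selection: each new vantage point is the point farthest from all those already chosen. It only shows what the heuristic does; as the slide notes, it did not turn out to be a good choice on the authors' data. The function name and the one-dimensional example are assumptions for illustration.

```python
def farthest_first(points, k, d, start=0):
    """Pick up to k pivots, each maximizing its distance to the nearest chosen pivot."""
    chosen = [points[start]]
    # nearest[i] = distance from points[i] to its closest chosen pivot so far
    nearest = [d(points[start], x) for x in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[i])
        nearest = [min(n, d(points[i], x)) for n, x in zip(nearest, points)]
    return chosen

values = [0, 1, 2, 10, 11, 50]
print(farthest_first(values, 3, lambda a, b: abs(a - b)))  # [0, 50, 11]
```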
Comparison of Three Methods of Metric-Space Indexing • Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Open Problems • Is there a general metric-space index structure that is good for most workloads? • We are optimistic that mvp-trees, with further tuning, will be a useful answer • Hyperplane methods are fair game: there is circumstantial evidence that they are a key component of Google's search engine • No work addresses clustering data pages on disk • Metric-space join algorithms
Biological Models Are Usually Based on Similarity • Similarity: biologists like scoring functions that reward each similar feature with a positive number; intuitive • Distance: more similar means smaller numbers; identical means 0
But Do Metric Models Capture Biology? • Metrics are a subset of possible mathematical models.
Sequence Problem 1 • Sequence similarity is based on weighted edit distance • Accepted weight matrices, PAM & BLOSUM, are not metric • Log-odds matrices: negative values • Defy simple algebraic normalization [TaylorJones93, Linialetal97]
Our First Result: mPAM [Xu&Miranker04] • Dayhoff et al.'s PAM derivation [74] • Took a set of closely related protein sequences • Developed a phylogenetic tree • Counted substitutions to transform one sequence to another • Tree determines a measure of time
PAM vs. mPAM: t = 1/f • Using original substitution counts • PAM: frequency of substitution, S(a,b|t) = log(P(b|a,t) / q_b) • mPAM: expected time between substitutions, D(a,b) = 1 / log(1 - sum_x P(a,x) P(b,x))
Sequence Problem 2 • Sequences are long units (identity for storage and retrieval) • Genes • Chromosomes • Analysis comprises comparing small substrings
Solution: Sequence View • New view type • Breaks sequences into q-grams
CREATE SEQUENCEVIEW rice_sview AS
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
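What "breaks sequences into q-grams" produces can be shown in a few lines. This sketch assumes the two numeric arguments of CREATE FRAGMENTS are the fragment length (3) and the step between fragment starts (1); the slide does not spell that out, and the function name `fragments` is hypothetical.

```python
def fragments(sequence: str, q: int = 3, step: int = 1):
    """Yield (offset, fragment) for each length-q window, moving `step` characters at a time."""
    for offset in range(0, len(sequence) - q + 1, step):
        yield offset, sequence[offset:offset + q]

print(list(fragments("ACGTAC")))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (3, 'TAC')]
```

Keeping the offset with each fragment is what lets a later query constrain how far apart two matching fragments lie.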
Materialize as an Index • Figure label: range D(AAA) ≤ 2
Status • Started with McKoi • A Java open-source object-relational DBMS • (Think of Postgres written in Java) • Added • Biological data types • Metric-space index • Extending SQL engine (in progress)
Compare Arabidopsis Genome × Rice Genome • Locate nucleotide patterns of the form of a primer pair candidate: 18 matching nucleotides, a gap 400 to 3000 long (in both rice and Arabidopsis), then 18 matching nucleotides • Eliminate non-unique primer candidates • Merge overlapping primer candidates • Usual implementations are O(n^2), n = 10^9 • Computed in MoBIoS
mSQL Query to Locate Candidate Primer Pairs
SELECT merge(R1.fragment, A1.fragment)
FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
  AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) >= 400
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) <= 3000
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) >= 400
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
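To make the predicates of the query concrete, here is a brute-force sketch of the same conditions in Python. It is the quadratic approach the index is meant to avoid, it omits the merge and GROUP BY steps, and the function name and parameters are assumptions for illustration; the defaults mirror the constants in the query (fragment length 18 comes from the preceding slide).

```python
def hamming(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

def fragments(seq: str, q: int):
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_pairs(rice, arab, q=18, max_dist=1, min_gap=400, max_gap=3000):
    """Return ((r1, a1), (r2, a2)) offset pairs where both fragment pairs match
    within max_dist and the offset differences fall in [min_gap, max_gap] in both genomes."""
    # Fragment matches between the two genomes (the two distance(...) predicates).
    matches = [(ro, ao)
               for ro, rf in fragments(rice, q)
               for ao, af in fragments(arab, q)
               if hamming(rf, af) <= max_dist]
    pairs = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            # The four FRAGOFFSET predicates.
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                pairs.append(((r1, a1), (r2, a2)))
    return pairs

# Toy input with scaled-down parameters so the example runs instantly.
rice = "ACGT" + "AAAAAA" + "GGCC"
arab = "ACGT" + "CCCCCCC" + "GGCC"
pairs = candidate_pairs(rice, arab, q=4, max_dist=1, min_gap=5, max_gap=12)
print(((0, 0), (10, 11)) in pairs)  # True
```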
Query Plan • Inputs: Arabidopsis genome, O(n); rice genome, O(m) • Offline: build sequence view, O(n log n) • Compare, O(m log n): indexed nested loop • Eliminate duplicates • Eliminate low-complexity primers (LZ compression) • Merge overlapping primers • Result: ~10,000 conserved primer pair candidates
Preliminary Results • Found 13,418 possible primer pairs from MoBIoS • 100 best candidates BLASTed for matches in GenBank • 15 matched other plant genes and the primers • At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.
MoBIoS Architecture (Molecular Biological Information System)
Analysing Mass Spectra • Spectrum = histogram of mass/charge ratios of a collection of peptides • Similarity = shared peaks count = inner product: (0100101) · (0111100) = 2
Cosine Distance Approximates Inner Product • D_rs = 1 - (x_r' x_s) / ((x_r' x_r)^(1/2) (x_s' x_s)^(1/2)) • Shown: we can store and retrieve mass spectra using cosine distance, and it scales
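Both quantities are easy to compute by hand for the binary spectra on the previous slide. A minimal sketch, with hypothetical function names, treating each spectrum as a plain list of peak indicators:

```python
import math

def inner(x, y):
    """Shared peaks count for 0/1 spectra; ordinary inner product in general."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y."""
    nx, ny = math.sqrt(inner(x, x)), math.sqrt(inner(y, y))
    if nx == 0 or ny == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - inner(x, y) / (nx * ny)

r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(inner(r, s))             # 2 shared peaks
print(cosine_distance(r, s))   # 1 - 2 / (sqrt(3) * sqrt(4))
```

The normalization by vector length means a spectrum with many peaks does not score as "close" merely because it has more chances to share one.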
mSQL Query for Protein Identification by Mass-Spec. Signature Database Lookup
SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E
  AND Cosine_Distance(S, MS.spectrum, range1)
  AND DS.accession_id = MS.accession_id = Prot.accesion_id
  AND DS.ms_peak = P
  AND MPAM250(PS, DS.sequence, range2);
Still Benefit from Grid Services • Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6 • Rational drug design: O(log n) finite element solutions to traverse the search tree • Make a service call to the grid for these operations only • Mirror data contents to minimize I/O • Since the need is intermittent, one grid serves many MoBIoS servers • Diagram labels: MoBIoS server, recluster, new index, shape match (FEM), distance (real), high-speed I/O, mirror DB contents
Hyperplanes [Uhlmann91] • If d(x,h1) < d(x,h2) then x is assigned to h1, otherwise to h2
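A minimal sketch of one hyperplane split follows. The assignment rule is the one on the slide; the pruning test is not on the slide but follows from the triangle inequality: if d(q,h1) - d(q,h2) > 2r, every point within r of q is closer to h2, so the h1 side can be skipped. Function names and the one-dimensional example are assumptions for illustration.

```python
def hyperplane_split(points, h1, h2, d):
    """Assign each point to the closer of the two reference points."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2

def can_skip_h1_side(q, r, h1, h2, d):
    """True when no point within r of q can have been assigned to h1."""
    return d(q, h1) - d(q, h2) > 2 * r

dist = lambda a, b: abs(a - b)
side1, side2 = hyperplane_split([1, 2, 3, 7, 8, 9], 2, 8, dist)
print(side1, side2)                        # [1, 2, 3] [7, 8, 9]
print(can_skip_h1_side(9, 1, 2, 8, dist))  # True: the query ball is well on the h2 side
print(can_skip_h1_side(5, 1, 2, 8, dist))  # False: the ball straddles the boundary
```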
Bounding Spheres: Develop a Hierarchical Clustering • Hierarchy of bounding spheres (center, radius) • Bounding spheres may overlap • Inspired by R-trees
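Searching such a hierarchy can be sketched as follows. The tree here is built by hand, since the slide does not say how the clustering is produced; the `Sphere` class and `search` function are illustrative names. A subtree is skipped when the query ball cannot intersect its bounding sphere, that is, when d(q, center) > radius + r. Because spheres may overlap, more than one child may have to be visited.

```python
from dataclasses import dataclass, field

@dataclass
class Sphere:
    center: float
    radius: float
    children: list = field(default_factory=list)  # nested bounding spheres
    points: list = field(default_factory=list)    # data stored at leaves

def search(node, q, r, d, out=None):
    """Collect all points within r of q, skipping spheres the query ball cannot reach."""
    if out is None:
        out = []
    if d(q, node.center) > node.radius + r:
        return out                       # no overlap with this sphere: prune
    out.extend(x for x in node.points if d(q, x) <= r)
    for child in node.children:
        search(child, q, r, d, out)
    return out

dist = lambda a, b: abs(a - b)
leaf_a = Sphere(center=2, radius=2, points=[0, 2, 4])   # covers [0, 4]
leaf_b = Sphere(center=5, radius=2, points=[3, 5, 7])   # covers [3, 7], overlaps leaf_a
root = Sphere(center=4, radius=4, children=[leaf_a, leaf_b])

print(sorted(search(root, 3.5, 0.6, dist)))  # [3, 4]: one hit from each overlapping leaf
print(search(root, 20, 1, dist))             # []: pruned at the root
```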