Semantic Modeling of Biological Sequences Sudha Ram Eller Professor Department of Management Information Systems Eller School of Management The University of Arizona March 5, 2004
Road Map • Background • Semantics of DNA sequences and Primary protein structures • Semantics of 3-D protein structures • Summary and Future Work
Background • Human Genome Project (HGP) started in 1990 by the Department of Energy • Goal: to sequence the 24 distinct chromosomes comprising the human genome • Completed in April 2003, earlier than expected • Achievements: • Determined the complete sequence of 3 billion DNA subunits, identified all human genes • Stored all the data in databases
Post-Genomic Era “New generalizations and higher order biological laws are being approached but may be obscured by the simple mass of data” ---Morowitz et al., 1987
More Challenges • Usage and analysis of the data require: • Ad hoc and complicated queries • Efficient data browsing and retrieval • Integrated data sources • Effective and user-friendly data presentation • Example query: Find all genes that are structurally similar to a given gene and expressed similarly over a specific DNA microarray dataset
Current Databases • Major DNA sequence databases: • GenBank (Gene Bank) • DDBJ (DNA Data Bank of Japan) • EMBL (European Molecular Biology Laboratory) • Other databases: • Different Types • Different Scales • Different Models --Bioinformatics Databases and Systems
Current Data Models • Data models: • Flatfile (ASN.1) • Relational • XML and its extensions (BSML) • Others • Drawbacks?
Research Motivation • Usage and analysis of the data require: • Ad hoc and complicated queries • Efficient data browsing and retrieval • Integrated data sources • Effective and user-friendly data presentation • Existing sequence/structure databases are not able to provide these capabilities: • Flatfile format hides the semantics of the data • Relationships/hierarchies are not clear • Ad hoc and complicated queries are not supported
DNA Sequences • Linear Sequences • DNA sequences • Genetic information carrier • Composed of nucleotides • Primary protein sequences • Composed of amino acids
Protein Building Blocks • Proteins are the most important macromolecules in the factory of living cells; they perform various biological tasks • A protein is a chain built from 20 kinds of amino acids, also known as subunits or residues
Protein Structures • Protein 3-D structures • Intermolecular and intramolecular chemical forces force the linear primary sequence to be folded into 3-D structures to reach the minimum energy/most stable state • Structures determine properties or functions
Levels of Protein Structures—I • Primary (Linear): each building block (amino acid) can be represented by a letter (of the English alphabet) • Secondary: The chain of covalently linked amino acids is further organized into regularly repeating patterns by hydrogen bonding
Levels of Protein Structures—II • Tertiary: Alpha helices and beta sheets fold themselves further into a "chain", cross-linking with one another via their side chains. • Quaternary: For proteins with more than one chain, interaction can occur between the chains themselves.
Previous Work • “A sequence is a mapping between a collection of similarly structured records and the positions of an ordering domain” ----Seshadri et al., 1995 • Different kinds of sequences are just different combinations of an Ordering Domain and a Collection of Records
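The mapping view of a sequence quoted above can be illustrated with a small sketch. It is only an illustration, not the formalism of the cited work: the helper name `ordered_records` and the sample data are hypothetical, and a plain dictionary stands in for the mapping from ordering-domain positions to records.

```python
from datetime import date
from typing import Any, List, Mapping, TypeVar

R = TypeVar("R")  # type of the records being ordered


def ordered_records(mapping: Mapping[Any, R]) -> List[R]:
    """Return the records in the order given by their ordering-domain positions."""
    return [mapping[position] for position in sorted(mapping)]


# Ordering domain: integer positions; records: nucleotide letters (hypothetical data).
dna = {3: "G", 1: "A", 2: "T"}

# Ordering domain: calendar dates; records: observed values (hypothetical data).
readings = {date(2004, 3, 2): 10.7, date(2004, 3, 1): 10.5}

print(ordered_records(dna))       # records ordered by integer position
print(ordered_records(readings))  # records ordered by date
```

The same function serves both cases, which is the point of the second bullet: only the ordering domain and the record collection change.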
Time Sequences—I • “For time is just this—number of movements in respect to the ‘before’ and ‘after’”—Aristotle • We want to capture attributes of the movements • We want to know the order/time of the movements • Time is continuous • Temporal databases deal with the semantics of ordered sequences of data values in the time domain.
Time Sequences—II • “Time sequence is basically the sequence of values in the time domain for a single entity instance” ---Segev et al., 1987 • Time sequences can be: • Step-wise constant • Discrete • Continuous
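The three kinds of time sequence differ in what value, if any, they assign to a time between two recorded samples. The following is a minimal sketch under stated assumptions: the function name `value_at` is hypothetical, and linear interpolation is assumed for the continuous case (the slides do not specify an interpolation rule).

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_at(samples: List[Tuple[float, float]], t: float, kind: str) -> Optional[float]:
    """Look up the value at time t; samples are (time, value) pairs sorted by time."""
    times = [time for time, _ in samples]
    i = bisect_right(times, t) - 1  # index of the last sample at or before t
    if i < 0:
        return None  # t precedes the first sample
    t0, v0 = samples[i]
    if kind == "discrete":
        # A value exists only at the recorded time points.
        return v0 if t0 == t else None
    if kind == "stepwise":
        # The last recorded value holds until the next sample.
        return v0
    if kind == "continuous":
        if t0 == t:
            return v0
        if i == len(samples) - 1:
            return None  # no extrapolation past the last sample
        t1, v1 = samples[i + 1]
        # Assumption: values between samples are linearly interpolated.
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError(f"unknown kind: {kind}")


samples = [(0.0, 10.0), (2.0, 20.0)]
for kind in ("discrete", "stepwise", "continuous"):
    print(kind, value_at(samples, 1.0, kind))
```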
Biological Sequences—III • Basic model for sequence can be adapted:
Other Sequences • Process Sequences • Sequences of processes and subprocesses • Multimedia sequence • Streams of multimedia data • Image • Audio
Why a New Model for Biological Sequences? • Time sequences are continuous; biological sequences are discrete • Biological sequences carry more semantics, such as sequences and their subsequences • In time sequences, some time points have no data; in biological sequences, each position has its own data • Usage of biological data requires that the sequence data be represented and analyzed in different ways
Relation to Gene Ontologies? • An ontology defines biological properties associated with sequence data, whereas we model the semantics of the sequence data itself • No protein structure ontology exists • Both contribute to database integration.
What about Relational technology? • Relational technology doesn’t support data structures as complicated as biological sequences • Hierarchy and semantics are hidden in relations We can never emphasize semantics too much!
DNA Sequences Semantic Model • DNA sequences and primary protein sequences: Ram, S. and Wei, W., Semantic Modeling of Biological Sequences. In Thirteenth Annual Workshop on Information Technology and Systems (WITS'03), Seattle, Washington, December 2003.
Entity Classes - I • ATOMS • Superclass of families of atoms • Collection of atomic components of a biological sequence • Domain: Possible components • LINEARORDER • Set of positions (integers) in the sequence • Domain: integers 1 through j, where j is the length of the sequence
Entity Classes - II • SEQUENCES • Ordered list of (ATOM, LINEARORDER) pairs • SUBSEQUENCES • Part of a sequence • Associated with biological activities
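The four entity classes can be sketched as plain data structures. This is a hypothetical illustration rather than the schema from the cited paper: the class and field names, the four-letter DNA domain, and the example annotation are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# ATOMS: the domain of possible components (assumed here: the four DNA bases).
DNA_ATOMS: FrozenSet[str] = frozenset("ACGT")


@dataclass(frozen=True)
class Sequence:
    """SEQUENCES: an ordered list of (ATOM, LINEARORDER) pairs."""
    name: str
    pairs: Tuple[Tuple[str, int], ...]

    @classmethod
    def from_string(cls, name: str, text: str,
                    domain: FrozenSet[str] = DNA_ATOMS) -> "Sequence":
        for atom in text:
            if atom not in domain:
                raise ValueError(f"{atom!r} is not in the ATOMS domain")
        # LINEARORDER: positions run from 1 to j, the sequence length.
        return cls(name, tuple((atom, i) for i, atom in enumerate(text, start=1)))

    def __len__(self) -> int:
        return len(self.pairs)


@dataclass(frozen=True)
class Subsequence:
    """SUBSEQUENCES: part of a sequence, associated with a biological activity."""
    parent: Sequence
    start: int      # first position, inclusive
    end: int        # last position, inclusive
    activity: str   # free-text annotation (hypothetical field)

    def atoms(self) -> str:
        return "".join(a for a, pos in self.parent.pairs
                       if self.start <= pos <= self.end)


seq = Sequence.from_string("demo", "ATGCCG")
sub = Subsequence(seq, 1, 3, "start codon")
print(len(seq), seq.pairs[0], sub.atoms())
```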
New Constructs - I • Sequential Aggregate • An aggregation of ATOMS and LINEARORDER • Sequential because order matters • Normal Aggregate • Indicates a whole-part relationship • Example: Course and students
New Constructs - II • Fragment • Sequences are segmented • Fragments can overlap
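The Fragment construct, where a sequence is segmented and the segments may overlap, can be sketched as follows. The windowing scheme (fixed size and step) and the function names are assumptions chosen for illustration; the slides do not say how fragments are chosen.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions, 1-based


def fragment(length: int, size: int, step: int) -> List[Span]:
    """Segment positions 1..length into windows of `size`, advancing by `step`.

    A step smaller than the size yields overlapping fragments.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    spans: List[Span] = []
    start = 1
    while start <= length:
        end = min(start + size - 1, length)
        spans.append((start, end))
        if end == length:
            break
        start += step
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True if the two fragments share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


spans = fragment(length=10, size=4, step=3)
print(spans)
print(overlaps(spans[0], spans[1]), overlaps(spans[0], spans[2]))
```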
Relationships • Ternary Sequential Aggregation • Fragment:
Utility of DNA Sequence Model • Semantics of sequence data captured • Ad hoc queries are possible • Example queries: • How many subsequences are fragmented from a specific sequence? • Find a particular sequence and display a segment from 2nd to 200th • Find all the sequences that share one or more specific subsequences
Protein Databases • The Protein Data Bank, PDB (http://www.rcsb.org/pdb/) is the only worldwide archive of experimentally determined three-dimensional structures of proteins. • Data stored in flatfiles • This format records primary and secondary structure of proteins using groups of coordinates. It does not record the tertiary and quaternary structures. No relationships among structures at different levels are captured.
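The three example queries on this slide can be sketched over a toy in-memory store. Everything here is hypothetical: the identifiers, the sample sequences, the fragment table, and the function names stand in for whatever query language a database built on the model would provide.

```python
from typing import Dict, List, Tuple

# Hypothetical store: sequence id -> sequence text.
sequences: Dict[str, str] = {
    "seq1": "ATGCCGTA",
    "seq2": "GGATGCA",
    "seq3": "TTTTAA",
}

# Hypothetical fragment table: (parent sequence id, start, end), 1-based inclusive.
fragments: List[Tuple[str, int, int]] = [
    ("seq1", 1, 3),
    ("seq1", 3, 6),
    ("seq2", 3, 5),
]


def count_fragments(seq_id: str) -> int:
    """How many subsequences are fragmented from a specific sequence?"""
    return sum(1 for parent, _, _ in fragments if parent == seq_id)


def segment(seq_id: str, start: int, end: int) -> str:
    """Display the segment of a sequence from position `start` to `end`, inclusive."""
    return sequences[seq_id][start - 1:end]


def sequences_sharing(subsequence: str) -> List[str]:
    """Find all sequences that contain the given subsequence."""
    return sorted(sid for sid, text in sequences.items() if subsequence in text)


print(count_fragments("seq1"))
print(segment("seq1", 2, 4))
print(sequences_sharing("ATG"))
```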
Entity Classes • ATOMS: This entity class is used to model chemical atoms (C, H, O, N, etc.) in the protein structure, with each of them identified uniquely. • RESIDUES: This entity class represents amino acid subunits, which are the basic building blocks of protein structures. • PRIMARY STRUCTURE • SECONDARY STRUCTURE • TERTIARY STRUCTURE • QUATERNARY STRUCTURE
Relationships—I • Spatial-Aggregate • P: represents a point using x, y and z coordinates in degrees • T is the temperature at which the structure is determined
Relationships—II • Sequential-Aggregate • LL is the list length • X is the position of the residue in the list • Position of any atom has to be less than or equal to the length
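The position constraint in the Sequential-Aggregate can be sketched as a simple check. The names below are hypothetical, and the lower bound of 1 is an assumption carried over from the 1-based positions used in the DNA sequence model; the slide itself states only that a position cannot exceed the list length LL.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlacedResidue:
    residue: str  # one-letter amino acid code
    x: int        # X: position of the residue in the list


def is_valid_position(x: int, ll: int) -> bool:
    """Check the constraint 1 <= X <= LL (lower bound assumed, see lead-in)."""
    return 1 <= x <= ll


def place_residues(residues: str) -> List[PlacedResidue]:
    """Pair each residue with its position; LL is the length of the input."""
    ll = len(residues)
    placed = [PlacedResidue(r, x) for x, r in enumerate(residues, start=1)]
    if not all(is_valid_position(p.x, ll) for p in placed):
        raise ValueError("position outside 1..LL")
    return placed


chain = place_residues("MKV")
print(chain)
print(is_valid_position(4, len(chain)))
```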
Relationships—III • Spatial-Bonding • Represent the strength and length of the chemical forces among atoms • By describing the semantics of these bonds at each level using additional annotations, we can differentiate between these bonds as they apply to different levels of protein structures
Relationships—IV • An example of an annotated relationship • For secondary structures • B: Bond • BE: Bond energy • BL: Bond length
Utility of Protein Structure Model • Example queries: • Find the sequence of amino acids for this protein structure • Find a set of forces similar to this, and the resulting 3-D structure • Give me all the hydrogen bonds that contribute to the secondary structure
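The annotated Spatial-Bonding relationship (B, BE, BL) and the hydrogen-bond query can be sketched together. This is a hypothetical illustration: the field names, the atom labels, the units (kJ/mol and ångströms), and all numeric values are placeholders invented for the sketch, not measured data and not the authors' schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bond:
    atom_a: str    # identifier of the first atom (placeholder labels)
    atom_b: str    # identifier of the second atom
    kind: str      # B: type of bond
    energy: float  # BE: bond energy (assumed unit: kJ/mol)
    length: float  # BL: bond length (assumed unit: angstroms)
    level: str     # annotation: structural level the bond contributes to


def bonds_for(bonds: List[Bond], kind: str, level: str) -> List[Bond]:
    """Select bonds of one kind that contribute to one structural level."""
    return [b for b in bonds if b.kind == kind and b.level == level]


# Placeholder values for illustration only.
bonds = [
    Bond("N12", "O8", "hydrogen", 20.0, 2.9, "secondary"),
    Bond("N30", "O26", "hydrogen", 18.0, 3.0, "secondary"),
    Bond("S45", "S101", "disulfide", 250.0, 2.05, "tertiary"),
    Bond("N77", "O140", "hydrogen", 15.0, 3.1, "tertiary"),
]

secondary_h_bonds = bonds_for(bonds, kind="hydrogen", level="secondary")
print([(b.atom_a, b.atom_b) for b in secondary_h_bonds])
```

Carrying the level as an annotation on each bond is what lets the same hydrogen-bond type be distinguished by the structural level it applies to.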
New Operators based on Semantics • Sequence • Subsequence • Aggregate • Comparison of Sequences and Subsequences -- Identical -- Similar -- Partial • Allen’s Predicates: Before, After, Meets, During, Starts, Finishes, Contains, Overlaps.
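The Allen predicates listed above can be sketched for subsequences identified by their positions. This sketch assumes half-open integer spans `[start, end)` so that the usual interval definitions carry over unchanged to discrete positions; the `Span` type and function names are hypothetical, and how the authors adapt the predicates to sequences is not stated in the slides.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # first position covered
    end: int    # one past the last position covered (half-open)


def before(a: Span, b: Span) -> bool:
    return a.end < b.start          # a ends with a gap before b starts


def after(a: Span, b: Span) -> bool:
    return before(b, a)


def meets(a: Span, b: Span) -> bool:
    return a.end == b.start         # b starts exactly where a stops


def during(a: Span, b: Span) -> bool:
    return b.start < a.start and a.end < b.end


def contains(a: Span, b: Span) -> bool:
    return during(b, a)


def starts(a: Span, b: Span) -> bool:
    return a.start == b.start and a.end < b.end


def finishes(a: Span, b: Span) -> bool:
    return a.end == b.end and a.start > b.start


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.start < a.end < b.end


gene = Span(1, 8)
exon = Span(3, 5)
print(during(exon, gene), contains(gene, exon), meets(Span(1, 4), Span(4, 7)))
```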
Future Research • Our ultimate goal is biological sequence database integration • Additional semantic constructs • Semantic reconciliation among databases • Case studies