
MotifML A Novel Ontology-based XML Model for Data-Exchange of Regulatory DNA Motif Profiles

Eric Neumann, Beyond Genomics; Tian Niu, Harvard University; Ken Baclawski, Northeastern University


Presentation Transcript


  1. MotifML A Novel Ontology-based XML Model for Data-Exchange of Regulatory DNA Motif Profiles Eric Neumann, Beyond Genomics Tian Niu, Harvard University Ken Baclawski, Northeastern University

  2. Motifs: DNA Motifs
     human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
     bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
     mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
     Position markers below the alignment: -70, -45, -20, +1. A row of "=" above the alignment marks identical bases (column alignment lost in extraction).
     Alignment, Profile, Functional Significance?

  3. Motif Finding Tools • AlignACE • GIBBS • Consensus • BioProspector

  4. The Need for motifML • Information resides at multiple sources • Data follow multiple Structures • Multiple Interfaces Integrated XML view MotifML BioProspector Consensus Gibbs AlignACE

  5. Motif Function • Gene expression regulation that is dependent on activated transcriptional factors • Key element of Gene Networks: Complex analysis of microarrays Transcriptional Factors Regulated Gene Expression + Cis-Elements Associated with a Gene

  6. motifML Goals • to allow the full specification of all experimental information known about motifs • to provide an extensible framework for this annotation and provide a common vehicle for exchanging the motif information • to provide a single document interface to integrate all project information, complete with protocols for network data retrieval.

  7. motifML Design • formal and concise; ontology-based • motifML documents should be easy to create • clarity is more important than brevity • uses both XML Schema and XML DTD

  8. motifML Semantics • Annotation • The collection of features for a given set of sequence(s) that have built in semantics • Features • Characteristics supported by analytic evidence • Analyses • Computational • Experimental

  9. motifML Semantics Ontology Property Semantically Definable & Searchable Pragmatic Objects Annotation Analyses Features Motifs Intentional Extraction Results

  10. motifML Sequence Item
      <seq id="demo_seq" name="Human HAL Gene Exon 18">
        <dbxref>
          <database>GenBank</database>
          <unique_id>14588658</unique_id>
        </dbxref>
        <feature>
          <motif type="cis-regulatory" name="CBE" id="dm312"/>
          <description>CRX Binding Element</description>
          <position start="21" end="32"/>
          <evidence>
            <reference paper="Davies, J Mol Biol. 1993 296:1205-14"/>
          </evidence>
        </feature>
        <residues type="dna">ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGGGTCCTTTAGACAGCTCATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCTGAGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGCGCAGCCACGGAGGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGCCCTCAGGGTCATCGAGCATGTGGAGCAAGGTAATGCTGATGAGTTCGGGGTGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGGTTCTCAAGTGGGATCTCCCACCAGCAACATCAGCATC ACCTGGAAAC</residues>
      </seq>
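A minimal Python sketch of how a consumer might read a sequence item like the one on slide 10. It rests on several assumptions that the slide does not confirm: the root tag is spelled `<seq id=...>`, the residues tag is `<residues type=...>`, and `position` is 1-based and inclusive. Only the first 42 residues of the slide's sequence are used, and `motif_subsequences` is a hypothetical helper name, not part of motifML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide-10 item (assumed spelling of <seq> and <residues>).
SEQ_ITEM = """
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
  </feature>
  <residues type="dna">ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGG</residues>
</seq>
"""


def motif_subsequences(xml_text):
    """Map each motif id to the residues covered by its <position>.

    Assumes 1-based, inclusive start/end coordinates.
    """
    seq = ET.fromstring(xml_text)
    # Drop any whitespace embedded in the residue string.
    residues = "".join(seq.findtext("residues", default="").split())
    found = {}
    for feature in seq.findall("feature"):
        motif = feature.find("motif")
        pos = feature.find("position")
        if motif is None or pos is None:
            continue
        start, end = int(pos.get("start")), int(pos.get("end"))
        found[motif.get("id")] = residues[start - 1:end]
    return found


result = motif_subsequences(SEQ_ITEM)
print(result)
```

Keeping coordinates on the feature and residues on the sequence means the motif text is never duplicated in the document; the subsequence is always derived on demand.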

  11. Computational Analysis
      <!ELEMENT computational_analysis (date?, program, version?, parameter*, database?, result_set+)>
      <!ATTLIST computational_analysis seq IDREF #REQUIRED>
      <!ELEMENT program (#PCDATA)>
      <!ELEMENT result_set (score?, output*, result*)>
      <!ELEMENT result (score, type, subtype?, seq_relationship+, output*)>
      <!ATTLIST result id ID #IMPLIED>
      <!ELEMENT seq_relationship (location, alignment?)>
      <!ATTLIST seq_relationship seq IDREF #REQUIRED
                                 type (query | subject | peer) #REQUIRED>
      <!ELEMENT alignment (#PCDATA)>
      <!ELEMENT type (#PCDATA)>
      <!ELEMENT value (#PCDATA)>
      <!ELEMENT parameter (type, value)>
      <!ELEMENT output (type, value)>
      <!ELEMENT database (name, date?, version?)>
      <!ELEMENT version (#PCDATA)>
      <!ELEMENT score (#PCDATA)>
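The content model above fixes the order of child elements. The following Python sketch builds one `computational_analysis` element with only the required children, in that order. It is an illustration under assumptions: `build_analysis` is a hypothetical helper, the `type` value "motif" and the text content of `location` are placeholders (the slide does not declare `location`), and the score is borrowed from the MAP score on the AlignACE results slide purely as sample data. `xml.etree.ElementTree` does not validate against a DTD, so the ordering is enforced by construction only.

```python
import xml.etree.ElementTree as ET


def build_analysis(seq_id, program, score, location):
    """Build a minimal computational_analysis element.

    Child order follows the slide's DTD: program, then result_set;
    inside result: score, type, seq_relationship.
    """
    analysis = ET.Element("computational_analysis", {"seq": seq_id})
    ET.SubElement(analysis, "program").text = program
    result_set = ET.SubElement(analysis, "result_set")
    result = ET.SubElement(result_set, "result")
    ET.SubElement(result, "score").text = score
    ET.SubElement(result, "type").text = "motif"  # placeholder value
    rel = ET.SubElement(
        result, "seq_relationship", {"seq": seq_id, "type": "query"}
    )
    # The content model of <location> is not shown on the slide;
    # plain text is an assumption.
    ET.SubElement(rel, "location").text = location
    return analysis


analysis = build_analysis("demo_seq", "AlignACE", "794.004", "788")
print(ET.tostring(analysis, encoding="unicode"))
```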

  12. HSP and HSE • Heat shock and other environmental and pathophysiologic stresses stimulate synthesis of heat shock proteins (Hsps). These proteins enable the cell to survive and recover from stressful conditions by as yet incompletely understood mechanisms. • A conserved 14 base pair regulatory sequence, referred to as the heat shock element (HSE), is found in multiple imperfect copies upstream of the TATA box of all heat shock genes. • Genes with an HSE at the upstream region may be co-regulated

  13. Dataset (Vertebrates)* • > gid 3004462, start=1, end=1027 • > gid 7861931, start=1, end=666 • > gid 7108904, start=1, end=1519 • > gid 7739662, start=1, end=800 • > gid 64795, start=1, end=487 • > gid 64791, start=1, end=614 • > gid 64789, start=1, end=1128 • > gid 64786, start=1, end=374 • > gid 32480, start=1, end=483 • > gid 32484, start=1, end=711 • > gid 7669470, start=1, end=424 • > gid 5729878, start=1, end=313 • > gid 5031770, start=1, end=760 • > gid 1816451, start=1, end=2179 • > gid 184422, start=1, end=2634 • > gid 184416, start=1, end=488 • > gid 188491, start=1, end=959 • > gid 4691417, start=1, end=2631 • > gid 188489, start=1, end=485 • > gid 188487, start=1, end=489 • > gid 184416, start=1, end=488 • > gid 211940, start=1, end=391 • > gid 63508, start=1, end=1421 • > gid 63512, start=1, end=2300 • > gid 409185, start=1, end=1231 • > gid 163160, start=1, end=491 • > gid 414974, start=1, end=426 *Data are from GenBank

  14. AlignACE program • uses a Gibbs sampling strategy which is similar to that described by Neuwald et al., 1995 • An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set • Reference: Hughes et al., J Mol Biol. 2000 296:1205-14

  15. AlignACE Results ... Motif 1 GGGGAGGGGGTGGGGGGGC 23 788 0 GGCGGGCGGGCGGCGGGGG 23 867 1 GGACAGCGGCGGCTGGCTG 11 107 0 GGGGTGCGGGGGCAGGCGC 23 1417 1 CCGCGGGGGCGGGCGGGGC 13 2034 1 ... ** * ***** ** *** * MAP Score: 794.004 Motif 2 GGGGAGGGGGTGGGGGGGCGGGG 23 784 0 GTGCGGGGGCAGGCGCGGAGAGC 23 1420 1 GCGGAGCGGGAGGGGGCGTGGCC 13 1932 1 GGGGTGCGGGAGGGCGGGCGGGC 23 1448 1 GGGCAGTGGGCGGCTGGCAGCTG 14 1452 1 ...

  16. Gibbs Motif Sampler Program • Uses Stochastic Iterative Sampling • The Bernoulli motif sampler assumes that each sequence can contain zero or more ungapped motif elements of each motif type • Reference: • Lawrence et al., Science 1993;262(5131):208-14; • Neuwald et al., Protein Sci. 1995 Aug;4(8):1618-32.

  17. Gibbs Results ... 4, 1 284 agtgc AGAGTCTGGAGAGC cgaat 271 0.87 R gid 7739662, start=1, end=800 4, 2 425 ggtat AGATGTCGGAGAGT cgttt 412 0.79 R gid 7739662, start=1, end=800 4, 3 643 atgga AGCCTCGGGAAACT tcggg 656 0.86 F gid 7739662, start=1, end=800 5, 1 239 atgga AGCCTCGGGAAACT tcggg 252 0.86 F gid 64795, start=1, end=487 7, 1 401 agtgt GGGTGCTGGAGGCT gacgg 388 0.99 R gid 64789, start=1, end=1128 9, 1 26 ggagt GGCGGTGGGAAGGG tgttg 13 0.99 R gid 32480, start=1, end=483 ... ************** ...

  18. Consensus Program • Uses entropy-based scoring functions • References: • Stormo and Hartzell, PNAS 1989;86:1183-1187 • Hertz et al., 1990, CABIOS, 6:81-92
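To make "entropy-based scoring" concrete, here is a small Python sketch of per-position information content measured against a uniform background. This is the generic Shannon measure, offered as an assumption-laden illustration; it is not claimed to be the exact statistic computed by the Consensus program, and `information_content` is a hypothetical name.

```python
import math


def information_content(probs, alphabet_size=4):
    """Bits of information at one motif position.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    base probabilities; zero-probability bases contribute nothing.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(alphabet_size) - entropy


uniform = information_content([0.25, 0.25, 0.25, 0.25])
conserved = information_content([0.0, 0.0, 1.0, 0.0])
print(uniform, conserved)
```

A fully conserved DNA position carries 2 bits and a uniform one carries 0, so summing over positions gives a simple way to rank candidate alignments.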

  19. Consensus Results MATRIX 1 ... 1|23 : 1/593 TGCAAGATTTTTAA 2|9 : 2/8 TGGAGGCTTCCAGA 3|10 : 3/889 TGGAGGCTTCCAGA ... MATRIX 2 ... 1|23 : 1/593 TGCAAGATTTTTAA 2|9 : 2/8 TGGAGGCTTCCAGA 3|10 : 3/889 TGGAGGCTTCCAGA ... MATRIX 3 1|23 : 1/593 TGCAAGATTTTTAA 2|9 : 2/8 TGGAGGCTTCCAGA 3|10 : 3/889 TGGAGGCTTCCAGA ... MATRIX 4 1|21 : 1/38 GGGAAAGCTCGAGA 2|9 : 2/8 TGGAGGCTTCCAGA 3|10 : 3/889 TGGAGGCTTCCAGA ...

  20. BioProspector Program • a program that examines the upstream region of genes in the same gene expression pattern group to search for regulatory sequence motifs. • uses zero to third-order Markov background models • allows for the searching of gapped motifs and motifs with palindromic patterns • Reference: Liu et al., Pac Symp Biocomput. 2001:127-38

  21. BioProspector Results ... Motif #1: ... Seq #1 seg 1 r998 TCATCCAATCAGAG Seq #2 seg 1 f91 TCAACCGAACAGAA Seq #3 seg 1 r638 TCGACCAATCAAAA ... Motif #2: ... Seq #1 seg 1 f38 GGGAAAGCTCGAGA Seq #2 seg 1 r648 TGGAAGCCTCCAGT Seq #3 seg 1 r620 TGGAAGCCTCCAGT ... Motif #3: ... Seq #1 seg 1 r997 CTCATCCAATCAGA Seq #2 seg 1 f90 CTCAACCGAACAGA Seq #3 seg 1 r637 TTCGACCAATCAAA ...
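Each tool reports a motif as a list of aligned sites, which can be reduced to per-position base frequencies. The Python sketch below does this for the three sites shown under Motif #1; the output above is elided, so these are only the visible sites, not the complete motif. `frequency_profile` is a hypothetical helper, and no pseudocounts are applied.

```python
from collections import Counter

# The three sites visible under "Motif #1" (the slide elides the rest).
SITES = ["TCATCCAATCAGAG", "TCAACCGAACAGAA", "TCGACCAATCAAAA"]


def frequency_profile(sites, alphabet="AGCT"):
    """Return one {base: frequency} dict per aligned column."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    profile = []
    for column in zip(*sites):
        counts = Counter(column)
        profile.append({b: counts[b] / len(sites) for b in alphabet})
    return profile


profile = frequency_profile(SITES)
for index, row in enumerate(profile, start=1):
    print(index, row)
```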

  22. Conceptions and Interactions of the Underlying Statistical Algorithms Used by the Motif Searching Programs [diagram; labels: Gibbs, AlignACE, BioProspector, CONSENSUS, Information Content, Gibbs Sampler, Iterative Updating Strategy, Two Block Motif Model]

  23. Motif Data Representation • Common data representation for motif information. • Uses XML Schema to specify format. • Both human and machine readable. • Supports “knowledge mining”. • Statements can be asserted about a motif such as a role in gene regulation.

  24. Example of a motif
      <motif id="GXY1">
        <block>
          <base type="G">0.21</base>
          <base type="C">0.21</base>
          <base type="T">0.59</base>
        </block>
        <block>
          <base type="G">0.44</base>
          <base type="C">0.50</base>
          <base type="T">0.06</base>
        </block>
        <block>
          <base type="A">0.70</base>
          <base type="G">0.29</base>
        </block>
        ...
      </motif>

      Blk1  A     G     C     T
      1     0.00  0.21  0.21  0.59
      2     0.00  0.44  0.50  0.06
      3     0.70  0.29  0.00  0.00
      4     0.32  0.62  0.00  0.06
      5     0.03  0.00  0.97  0.00
      6     0.00  0.00  1.00  0.00
      7     0.85  0.09  0.03  0.03
      8     0.88  0.12  0.00  0.00
      9     0.03  0.00  0.03  0.94
      10    0.03  0.09  0.88  0.00
      11    0.70  0.12  0.18  0.00
      ...
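The XML on this slide omits bases whose probability is zero, while the table lists all four. A short Python sketch of the conversion from the XML form to table rows follows; it covers only the three blocks shown, treats omitted bases as 0.0 (an assumption inferred from the table), and `motif_to_matrix` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The three blocks visible on the slide; the remainder is elided there.
MOTIF_XML = """
<motif id="GXY1">
  <block>
    <base type="G">0.21</base><base type="C">0.21</base><base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base><base type="C">0.50</base><base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base><base type="G">0.29</base>
  </block>
</motif>
"""


def motif_to_matrix(xml_text, alphabet="AGCT"):
    """Return (motif id, list of {base: probability}) for each block."""
    motif = ET.fromstring(xml_text)
    rows = []
    for block in motif.findall("block"):
        row = dict.fromkeys(alphabet, 0.0)  # omitted bases default to 0.0
        for base in block.findall("base"):
            row[base.get("type")] = float(base.text)
        rows.append(row)
    return motif.get("id"), rows


motif_id, rows = motif_to_matrix(MOTIF_XML)
for number, row in enumerate(rows, start=1):
    print(number, row)
```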

  25. XML Schema • Extends the XML document type language: • Data format restrictions. • Data value (min and max) restrictions. • Element occurrence (min and max) restrictions. • No sophisticated restrictions: • Probability distribution.
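Because a schema of this kind cannot state that the base probabilities in a block form a distribution, that rule has to be checked in application code. The Python sketch below is one possible check, not part of MotifML: `check_block` is a hypothetical helper, and the 0.02 tolerance is an assumption chosen because values rounded to two decimals may sum to 0.99 or 1.01.

```python
def check_block(probabilities, tolerance=0.02):
    """Return a list of problems for one block of base probabilities.

    An empty list means every value lies in [0, 1] and the values sum
    to 1 within the given tolerance.
    """
    problems = []
    for base, p in probabilities.items():
        if not 0.0 <= p <= 1.0:
            problems.append(f"{base}: probability {p} outside [0, 1]")
    total = sum(probabilities.values())
    if abs(total - 1.0) > tolerance:
        problems.append(f"probabilities sum to {total:.2f}, expected 1.00")
    return problems


ok = check_block({"G": 0.21, "C": 0.21, "T": 0.59})
bad = check_block({"A": 0.70, "G": 0.10})
print(ok, bad)
```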

  26. XML Schema for MotifML
      <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <xsd:element name="motif" type="MotifType"/>
        <!-- A motif consists of a sequence of blocks. -->
        <xsd:complexType name="MotifType">
          <xsd:sequence>
            <xsd:element name="block" minOccurs="0" maxOccurs="unbounded" type="BlockType"/>
          </xsd:sequence>
        </xsd:complexType>
        <!-- A block specifies a probability for each DNA base type. -->
        <xsd:complexType name="BlockType">
          <xsd:sequence>
            <xsd:element name="base" minOccurs="1" maxOccurs="4">
            ...

  27. Statements about motifs
      <?xml version="1.0"?>
      <RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:mml="http://www.beyondgenomics.com/2001/07/motifml#"
           xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
        <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
          <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
          <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
          <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
        </Description>
      </RDF>
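Statements of this kind reduce to subject, predicate, object triples. The Python sketch below extracts them with the standard library alone; it assumes a well-formed version of the slide's document (straight quotes, an unclosed `<RDF>` start tag), handles only this flat `Description` pattern rather than general RDF/XML, and `extract_triples` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_DOC = """
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
"""


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples from flat Descriptions.

    Predicates are returned in ElementTree's "{namespace}local" form.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Accept both the unqualified and the rdf:-qualified attribute.
        subject = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            obj = prop.get(f"{{{RDF_NS}}}resource")
            triples.append((subject, prop.tag, obj))
    return triples


triples = extract_triples(RDF_DOC)
for triple in triples:
    print(triple)
```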

  28. The Need for Bio-Ontologies • How do biologists learn the element structure of a document describing the heterogeneous sequence alignment output? • How do biologists share the structure and meta-data on motif profiles efficiently and unambiguously?

  29. TRANSFAC: A multiple sequence alignment linked with TRANSFAC/TRANSPATH
      human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
      bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
      mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
      Element labels under the alignment (column alignment lost in extraction): PCE-I, CBE, AP-4, 8888, cETS, cETS
      Position markers: -70, -45, -20, +1
      Alignment Profile
      Shown here is the alignment from -70 to +1. The numbering shown corresponds to the mouse sequence. Identical bases are shown by the = above each nucleotide. Consensus sequence matches conserved among all three species are: the Ret-1/PCE-I element at -65 to -60, the CRX-binding element (CBE) at -55 to -50, an AP-4 consensus core sequence at -37 to -34, a cETS consensus core at -35 to -31 and another at positions -57 to -54, and an S8 homeodomain, shown by "8888", at -64 to -61. Only the core bases are marked. The criteria for searching the TRANSFAC Database by MatInspector were a match to the core sequence of at least 80% and to the entire consensus sequence of at least 85%. The GenBank entries for human, bovine, and mouse are X53044, M32733, and M32734, respectively. (Boatright, Mol Vis 1997; 3:15)

  30. Transcriptional Factors Ontology [diagram; node labels: Composite Element, Site, Gene, Context (Tissue, Stage, Disease, Env. Cond., Induced), Transcript, Observation, Transcriptional Motif Elements, Transcriptional Factors; relation labels: contains, Upstream to, Within, Kind of, produces, Part of, Found in, Binds to]

  31. MotifML Applications • Develop a data exchange format for DNA motif data • Handling output from motif analyses • Annotation and data mining of micro-array data • Important in modeling transcriptional regulatory networks in eukaryotes

  32. Future Directions • Distributed Annotation System –Lincoln Stein, Open-Bio • Exchange with Other XML Dialects • DAML development
