XML Databases in BMI

XML Databases in BMI. UCONN, Spring 2008, CSE 300: BMI. Taught by: Prof. Steve Demurjian. Presented by: James Lindsay.
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mif="urn:hl7-org:v3/mif" xmlns="urn:hl7-org:v3">
  <realmCode code="US"/>
  <typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
  <templateId root="2.16.840.1.113883.3.117.1.1.1"/>
  <templateId root="2.16.840.1.113883.3.117.1.1.3.1"/>
  ...
</ClinicalDocument>

Overview • What is XML: • Overview, tags, schema. • XML query languages: • XPath, XQuery. • XML data models: • Data/document-centric, biomedical data. • Storage Strategy + XML DBMS: • Relational, CMS, native. • Native XML DBMS: • Pros / Cons. • Biomedical Information: • BMI databases overview, XML. • HL7 and CDA: • Overview, examples. • Examples of BMI XML: • UCONN BMI XML. • Survey of Technology.

XML overview • eXtensible Markup Language. • Similar to HTML. • Meta-language that describes the content of the document (self-describing). • XML is primarily used as a data storage and interchange medium. • XML is plain text; however, it may be compressed or otherwise altered for transfer.

XML overview cont. • There are no predefined tags or grammar inherent in XML. • XML tags give an XML document structure and meaning. • Available tags are defined by a schema. • Every opening tag in an XML document has a matching closing tag; an empty element may use the self-closing form, e.g. <realmCode code="US"/>. • Tags are properly nested, so there is no ambiguity in their order.

XML tags • XML tags may carry attributes, which store information (meta-data) within the tag. • Plain text can be placed between tags. Text inside a CDATA section is not parsed for markup. • As an attribute type, CDATA means character data: any string of non-markup characters is legal as the attribute value. • The ENTITY attribute type indicates that the attribute value names an external entity declared for the document. • The ID attribute type is used to specify a unique identifier for each element.
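A minimal sketch of the points on the last two slides, using Python's standard xml.etree.ElementTree parser. The fragment, its tag names and its attribute names are invented for illustration and do not come from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <patient> carries two attributes, <name> holds
# plain text, and <note/> is an empty (self-closing) element.
SAMPLE = '<patient id="p1" status="active"><name>Jane Doe</name><note/></patient>'


def describe(xml_text):
    """Return (root tag, root attributes, text of the <name> child)."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib), root.findtext("name")


def is_well_formed(xml_text):
    """True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Calling is_well_formed on improperly nested tags such as <a><b></a></b> returns False, which is the "no ambiguity in their order" rule enforced by the parser.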

XML Schema • The structure of an XML document is defined by its schema. • Dozens of languages exist to define an XML schema: • DTD • W3C XML Schema (XSD) • RELAX NG • Any instance of an XML document can be validated against the schema file. • The schema also defines the allowable tags.

Schema Example (XSD)
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

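XML Structure • XML employs a tree structure model for representing data (see the schema on the previous slide). [Figure: tree of the shiporder schema. Root: shiporder, with attribute orderid and children orderperson, shipto (name, address, city, country) and item (title, note, quantity, price).]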
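The tree model can be made concrete with a short sketch that parses one instance of the shiporder schema and prints its element hierarchy. The data values in ORDER are invented for illustration; only the element and attribute names come from the schema slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; element names follow the shiporder schema.
ORDER = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
</shiporder>"""


def outline(elem, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(ORDER))))
```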

Querying XML - XPath • Many languages exist to query XML. We'll focus on XPath and XQuery as they are W3C standards. • XPath is a compact method of traversing the tree on the previous slide. • Designed to be usable within URLs/URIs. • /shiporder/item/title ← view all items' titles • Extensible to add user defined behaviors. • Treats each tag as a node in the tree.
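A small sketch of path-based traversal, assuming Python's xml.etree.ElementTree, which implements only a limited subset of XPath. Paths are written relative to the root element because ElementTree does not accept absolute paths on an element; the sample order and its values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical order with two items; names follow the shiporder schema.
ORDER = (
    '<shiporder orderid="889923">'
    "<item><title>Empire Burlesque</title><quantity>1</quantity></item>"
    "<item><title>Hide your heart</title><quantity>2</quantity></item>"
    "</shiporder>"
)


def item_titles(root):
    """Equivalent of /shiporder/item/title, evaluated from the root element."""
    return [t.text for t in root.findall("./item/title")]


def titles_with_quantity(root, quantity):
    """Titles of items whose <quantity> child has exactly the given text."""
    return [t.text for t in root.findall(f"./item[quantity='{quantity}']/title")]
```

A full XPath engine would also accept the absolute form and functions such as count(); those are outside what this subset supports.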

Querying XML - XQuery • A functional language that extends XPath. • The XML equivalent of SQL. • Navigate and manipulate document nodes. • Works on collections of documents, or even fragments. Example: for $b in doc("bib.xml")//book where $b/publisher = "Morgan Kaufmann" and $b/year = "1998" return $b/title
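To show what the query on this slide computes, here is a sketch of the same for / where / return logic written directly in Python. It is a stand-in, not an XQuery engine, and the contents of BIB (titles, publishers, years) are invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical bib.xml contents; only the element names match the slide.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>"""


def titles(root, publisher, year):
    """for $b in //book where publisher and year match, return $b/title."""
    return [
        book.findtext("title")
        for book in root.iter("book")           # for $b in ...//book
        if book.findtext("publisher") == publisher  # where ...
        and book.findtext("year") == year
    ]
```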

XML Models • Naively there are two models of XML use: • Data-centric • Document-centric • In reality, most XML use is a hybrid of the two. • More important is the database strategy used with XML. • Relational • Content Management • Native XML

Data – centric model • Information is generally stored in a relational database. • XML is a transport medium, nothing more. • It is irrelevant to the application that the data exists as XML for some period of time. • Characteristics: • Fine grained data. • Data relationship is insignificant. • Need to transfer relational information. • Means of storing new information.

Document – centric Model • XML is utilized solely as a document (e.g. this presentation in Open Office). • The documents are stored and retrieved in part or in full. • Does not originate from a relational database. • Document used for human consumption. • Usually information written by hand in a format like PDF or RTF, then converted to XML.

Reality: Hybrid Model • Most documents, like a PDF, will also contain fine grained information (last edited date, character set). • Data from a relational DB may even be a document, or require self description. • Various database technologies support all models. • Important to understand your data, and choose the DB technology that is most compatible.

Medical Data Model • Medical data is non-homogeneous. • But there exist general trends in medical data: • Fine grained data such as dates, times, images. • Documents and human generated descriptions and observations. • Human interaction creates semi-structured data. • Ability to transfer information is essential. • Medical data fits into the hybrid model.

Data – centric Comparison • Advantages: • Utilizes existing database software (IBM, Oracle, MS). • Quick (existing DBs are already fast). • Dual role (not limited only to XML). • Many even support XQuery. • Disadvantages: • More configuration (mapping relational -> XML). • Slower when creating complex XML files due to the middle step.

Document – centric Comparison • Advantages: • Good integration into workflow. • Document management made easy. • Collaboration, and web publishing. • Disadvantages: • Not able to extract data from document directly. • Not designed for high availability, high load systems. • Non-uniformity in implementations.

Storage Strategy: Relational • Utilizing a relational database to store XML documents and data is very popular. • In a very data – centric application this approach is intuitive. • Most top tier database applications support XML in some way. • Oracle, SQL server, IBM, etc... • Software is highly supported and well developed.

XML Schema mapping • Using a relational DB requires mapping the XML schema to the DB schema. • Table based: • Often implemented as a middleware layer. • Schema structure must follow row-column convention. • Object – relational: • XML is a tree of objects. • Mapped to DB using well established OR methods. • Natively supported in some DB apps.
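A minimal sketch of the table based mapping, assuming an in-memory SQLite database and the shiporder element names from the earlier schema slide. The table layout and the sample data are assumptions made for illustration, not a mapping prescribed by the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical instance; the second item omits the optional <note>.
ORDER = (
    '<shiporder orderid="889923">'
    "<item><title>Empire Burlesque</title><note>Special Edition</note>"
    "<quantity>1</quantity><price>10.90</price></item>"
    "<item><title>Hide your heart</title>"
    "<quantity>2</quantity><price>9.90</price></item>"
    "</shiporder>"
)


def load_order(conn, xml_text):
    """Flatten each <item> into one row of a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item "
        "(orderid TEXT, title TEXT, note TEXT, quantity INTEGER, price REAL)"
    )
    root = ET.fromstring(xml_text)
    for item in root.findall("item"):
        conn.execute(
            "INSERT INTO item VALUES (?, ?, ?, ?, ?)",
            (
                root.get("orderid"),
                item.findtext("title"),
                item.findtext("note"),  # None (SQL NULL) when the element is absent
                int(item.findtext("quantity")),
                float(item.findtext("price")),
            ),
        )
```

The optional note element becomes a NULL column for every item that lacks it, which is one small instance of the null space that semi-structured data produces in a relational layout.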

Storage Strategy: CMS • Used exclusively in the document-centric model. • Various programs allow indexing, storage, manipulation, and publication of XML documents. • Application specific. • Numerous implementations, most recently Open Office and MS Word 2007. • Not very interesting or useful in the context of biomedical information.

Storage Strategy: Native • Semi – structured data. • Mapping to relational DB causes inflation and null space. • Need more functionality and granularity than CMS • Performance increase over relational DB by avoiding joins. • Assuming data is in appropriate order on disk. • Only returns XML, need to convert for non XML manipulation. • Development still in infancy as of Winter 2007.

Native XML Databases • Definition: • ”A database that has an XML document as its fundamental unit of (logical) storage and defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order.” • Data types: No support in XML, need a mapping. • Document or database schema can be used. • External user defined mapping. • Not necessary when only transferring data. • No requirement on underlying medium or implementation. • Two architectures: text and model based.

Native: Text-based • Use any DB. • Rather than mapping schemas, store entire XML documents. • Usually involves saving the entire document as a BLOB / Character LOB. • Utilize various text field searches to retrieve info from the XML document. • Some DB text search features are being made XML aware. • Speed: because the document is located together on disk, this approach favors full or partial document retrieval.
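The text-based approach can be sketched with SQLite standing in for "any DB": each document is stored whole in a text column and located with a plain substring search. The table name, document ids and sample notes are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_store():
    """Create an in-memory store with one text column per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc (id TEXT PRIMARY KEY, body TEXT)")
    return conn


def store(conn, doc_id, xml_text):
    """Save the whole document as text; no schema mapping is performed."""
    ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    conn.execute("INSERT INTO doc VALUES (?, ?)", (doc_id, xml_text))


def search(conn, phrase):
    """XML-unaware substring search over the stored text."""
    rows = conn.execute(
        "SELECT id FROM doc WHERE body LIKE ? ORDER BY id", (f"%{phrase}%",)
    )
    return [row[0] for row in rows]


def fetch(conn, doc_id):
    """Retrieve a full document and parse it back into a tree."""
    row = conn.execute("SELECT body FROM doc WHERE id = ?", (doc_id,)).fetchone()
    return ET.fromstring(row[0])
```

Because the search is not XML aware, a phrase that happens to equal a tag name matches every document using that tag, which is the limitation that XML-aware text search is meant to remove.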

Native: Model-based • Internal object model of the document schema. • Store this model in a database. • Relational / object-oriented database. • Proprietary. • Performance similar to chosen DB engine. • Still limited by the hierarchy of XML data. • Retrieving all orderid's from hundreds of docs is slow. • Support for common XML query languages: • XPath, XQuery, etc...

Native XML: TLC • In the traditional database world, transactions, locking and concurrency are paramount. • Native XML databases aren't mature enough to support everything. • Most support transactions, but what about LC? • Document level locking is easy, but too coarse. • Only a few implementations support node level locking. • Commercial products generally support ACID, free ones are just starting to (2008).

Native XML: API's • Ubiquity of ODBC interfaces. • Still applies to native XML databases. • Most implementations provide their own interface for a variety of languages. • Industry standardization: • XML:DB API from XML:DB.org, programming language neutral. • JSR 225: XQuery API for Java (XQJ). IBM and Oracle.

Native XML: The Rest • Referential integrity is supported in an ad hoc manner at best. • Database cannot enforce user defined (via schema) integrity. • Some standard mechanisms allow it. • Eventually both mechanisms will be supported. • Currently relies heavily on the application for normalization and integrity. • Certainly a drawback for medical applications.

Native XML: Scalability • A limitation of any DB is time spent seeking on the hard disk. • XML only needs to find the pointer to the head of the doc. • Therefore an XML DB should scale well in the context of retrieving data. • The only caveat is if the retrieval breaks the document hierarchy. • More pointers must be followed, potentially slowing retrieval greatly. • Where there is money, there is a way.

Biomedical Information • Overview of the field. • Data storage and transfer problem. • XML as a solution. • BMI XML examples. • Next section: Choosing a native DB.

BMI Overview • The convergence of computation and biomedicine. • The NIH BMI Science and Tech Initiative: • Define biomedical computing as a science. • Many sources of information: • Clinical, surgical, genetics, drug design, biology. • Standardization in software. • Algorithm development, high speed computing. • All of this relies on efficient storage and transfer of information.

BMISTI: Databases • ”Biomedical computing is entering an age where creative exploration of huge amounts of data will lay the foundation of hypotheses.” ~NIH Director • Problems: • Standards. Terminology, syntax and semantics need to be defined and agreed upon to allow integration of data. • Curation. Database submissions need to be checked and cross-referenced to avoid the transitive propagation of error. • Interoperability. Data should be as consistent as possible across databases so that researchers can compare and contrast it. • Computational and systems issues: • Utilize and manipulate information. • Process large volumes of information.

BMI: XML • Data sharing and semantic interoperability. • Case study: Electronic Health Record. • The development and use of an integrated health record for a patient. • Heterogeneous data, e.g. clinical, clinical-trial, genomic data. • Primary obstacle: Proprietary data formats. • Uniformity on technical level: Text file. • Step towards semantic goal.

XML in Clinical Data • HL7 standards organization. • V2: ASCII bar (pipe-delimited) format. Example: HL7V3|1|2.02 Message|2.16.840.1.113883.1122^CNTRL-3456|2002081614303516^- ---> 06:00||3.0|2.16.840.1.113883^POLB_IN004410||P|I|ER|ER respondTo|RSP|tel:555-555-5555^^WP entityRsp|||{FAM^^Hippocrates~GIV^^Harold~GIV^^H~SFX^AC^MD}|tel:555-555-5555^^WP sender|SND|nfs:127.127.127.255 device||2.16.840.1.113883.1122^GHH LAB|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H agencyFor representedOrganization||\NOTH\ location|||2.16.840.1.113883.1122^ELAB-3|{^^GHH Lab}^TN receiver|RCV|nfs:127.127.127.0 device|||2.16.840.1.113883.1122^GHH O E|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H agencyFor representedOrganization|||2.16.840.1.113883.19.3.1001|{^^GHH Outpatient Clinic}^TN location|||2.16.840.1.113883.1122^BLDG4|{^^GHH Outpatient Clinic}^TN • Awkward, inflexible, unclear meaning of values.
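A short sketch of why the bar format leaves the meaning of values unclear: fields are separated by "|" and components by "^", so a value is identified only by its position. The segment in the usage example is a made-up illustration, not taken from the message above.

```python
def parse_segment(segment, field_sep="|", component_sep="^"):
    """Split one bar-delimited segment into its name and positional fields.

    Each field is returned as a list of components; empty positions are
    kept as empty strings because position is the only key to meaning.
    """
    fields = segment.split(field_sep)
    return fields[0], [field.split(component_sep) for field in fields[1:]]


if __name__ == "__main__":
    # Hypothetical segment for illustration only.
    name, fields = parse_segment("PID|1||12345^^^GHH||Hippocrates^Harold")
    print(name, fields)
```

Nothing in the parsed result says that the fifth field is a person's name; the receiver has to know that from the specification, whereas an XML tag would state it inline.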

HL7 V3 Specification • Built around the Reference Information Model (RIM): • Entity, Role, Participation, and Act. • Utilizes dedicated vocabularies and data types. • Every specification must begin from the RIM. • Clinical Document Architecture: • Utilizes XML with tags like ”observation, code, value and id”.
<observation classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96" codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>
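As a sketch of how such an observation might be consumed, the snippet below reads the peak flow value with Python's ElementTree. It assumes the fragment sits in the urn:hl7-org:v3 default namespace, as in the ClinicalDocument shown on the title slide, and it declares the xsi prefix so that the fragment parses on its own; the returned dictionary keys are hypothetical names.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

# The slide's observation, with namespace declarations added so that it
# can be parsed as a standalone document.
OBSERVATION = """<observation xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96"
        codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>"""


def read_observation(xml_text):
    """Extract the coded name, date and ratio value of one observation."""
    obs = ET.fromstring(xml_text)
    code = obs.find("hl7:code", NS)
    num = obs.find("hl7:value/hl7:numerator", NS)
    den = obs.find("hl7:value/hl7:denominator", NS)
    return {
        "name": code.get("displayName"),
        "date": obs.find("hl7:effectiveTime", NS).get("value"),
        "rate": float(num.get("value")) / float(den.get("value")),
        "unit": f"{num.get('unit')}/{den.get('unit')}",
    }
```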

XML in Clinical Trials • Example: Drug studies • Utilizing XML would eliminate manual transcription when moving data from one system to another. • XML is a universal datatype as it stores everything in text. • Therefore can handle new tech. seamlessly. • Clinical Data Interchange Standards Consortium. • Industry standardization.

CDISC: ODM • Operational Data Model: • XML based. • Facilitate moving data from any collection system to clinical trial sponsor. • Addresses real world issues: • Incomplete data • Partial data transfer • Versioning and branching. • ODM 1.1 current version.

ODM: Layout

XML in Genomic Data • Various groups export their data in XML: • NCBI, EBI • They do not follow the same schema, which only allows partial semantic interoperability. • Microarray Gene Expression Group (MAGE) publishes a schema. • MAGE files are often several gigabytes. • Illustrates the overhead of XML; however, researchers still use it because of interoperability.

XML Complexity • Clinical Genomics Special Interest Group (HL7) • Use genomic data in clinical environment. • Utilize several models such as MAGE, BSML (for DNA seqs) • All information in raw models not necessary. • ”Bubbling up” analyzes large raw data sets, extracts useful information. • Transfer useful information to new schema / model. • Bottom line: there exist complex workflows to extract usable information.

XML BMI Issues • Clinical information like a verbal description or advice is unstructured. • How do you query this? • Schemas and Models are extremely complex, with nesting, recursion and compound data types. • Difficult mapping to relational databases. • XML instances may be gigabytes in size. • What database solutions exist to handle such large files?

XML BMI Examples • A closer look at the Clinical Document Arch. • Mayo Clinic's implementation of CDA. • Case study using native XML database to facilitate research based upon clinical texts. • Tamino XML DB. • Querying native DB. • UCONN BMI, CSE 300 Spring 2008

XML BMI: CDA (www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf) • A clinical document has: • Persistence: exists for a defined time period. • Stewardship: Maintained by a designated care taker. • Potential for authentication: May be legally authenticated. • It must be human readable on a standard web browser. • Utilizes standard XML syntax.

XML BMI: CDA (www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf) • Mayo Clinic's use of CDA:

A Native XML Database Design for Clinical Document Research (Johnson, Campbell, et al.) • Facilitate research, especially research on clinical text. • User needs to be accounted for: • Process queries against text. • Process queries against annotations. • Standard method for querying. • Non-hierarchical document selection (by patient, date,...) • Return varying level of document granularity. • A schema which adapts to new information without breaking old query formulations. • A schema which adapts to new annotations.

cont. • Tamino XML DBMS: A commercial product. • Supports XQuery and text search, which address many of the querying needs. • Utilizes the CDA for structuring meta-information. • A schema structures documents on a sentence by sentence level. • Allows high level of granularity. • Tags to link words to a semantic and vocabulary library.

UCONN BMI • Utilize a native XML DB to store documents. • Documents could be PHR, health data / statistics, or system meta-data (registration). • Our goal is to provide secure submission and retrieval of a variety of XML data. • For spring 2008, only focusing on submitting registration data.

UCONN BMI: Overview • Current state: [Diagram: User -> Browser: HTML Form -> Java Server: Create XML document -> Submit to DB. Data domains: HTML, Java, XML.] • Data exists in three different domains: • It is in HTML, a text datatype, when the user enters it. • The server maps the HTML to Java strings to create the XML. • The XML is written to a file on the server, and submitted to the database via a Java API.
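The second step, mapping submitted form fields to an XML document, can be sketched as follows. The project itself uses a Java server; Python is used here only to keep the example short, and the root element name "registration" and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_registration(form):
    """Turn a mapping of form field names to strings into an XML string.

    Each field becomes one child element; the serializer escapes
    characters such as & and < in the submitted values.
    """
    root = ET.Element("registration")  # hypothetical root element name
    for field, value in form.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_registration({"username": "jdoe", "email": "a&b@example.org"}))
```

This is the hand coded mapping step: every new form field needs a matching element name, which is where the maintenance cost comes from.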

UCONN BMI: Problems • There are 2 transformations of data. • Each requires a hand coded mapping. • This leads to sloppy code and wasted resources. • Only does XML as input, what about output? • The database (Sedna) is obtuse; what other options exist? • Do we want to store / transmit application data?

UCONN BMI: Model (potential) • Utilize client side JS to create XML. • Use Java API to manipulate XML. • Problems: • Document verified through schema, and XQuery. • Awkward to cross reference input with any other data. • Advantages: • No server side data type conversion. • This model applies to user driven input and systems interactions. [Diagram labels: User; Browser: HTML Form; js -> XML; System; Java Server; XQuery; Submit to DB. Data domains: HTML, Java, XML.]

UCONN BMI: Model retrieval • Client queries in XQuery or uses a predefined query in the server. • Server uses API to execute XQuery against the DB. • Java Server is given the XML document; it can: • Apply Java based XSLT and return to requestor. (more reliable) • Return raw document, client side JS applies XSLT. (less server load) • Both. [Diagram labels: User / System; JS; Java Server; DB; Query; XQuery; XSLT; HTML; XML.]
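The server side option, transforming a retrieved XML document into HTML before returning it, is sketched below. Python's standard library has no XSLT processor, so a hand written transform stands in for the XSLT step; the input document shape and the table layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def to_html_table(xml_text):
    """Render each child of the root element as one row of an HTML table.

    A stand-in for an XSLT stylesheet: element names become header
    cells and element text becomes data cells.
    """
    root = ET.fromstring(xml_text)
    table = ET.Element("table")
    for child in root:
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "th").text = child.tag
        ET.SubElement(row, "td").text = child.text
    return ET.tostring(table, encoding="unicode", method="html")
```

With the client side option the same mapping would run in the browser instead, which moves the work off the server at the cost of depending on each client's XSLT support.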
