XML Databases in BMI
UCONN Spring 2008, CSE 300: BMI
taught by: Prof. Steve Demurjian
presented by: James Lindsay

<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mif="urn:hl7-org:v3/mif" xmlns="urn:hl7-org:v3">
  <realmCode code="US"/>
  <typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
  <!-- Conformant to NHSN Generic Constraints -->
  <templateId root="2.16.840.1.113883.3.117.1.1.1" />
  <!-- Conformant to the NHSN Constraints for BSI Numerator Report -->
  <templateId root="2.16.840.1.113883.3.117.1.1.3.1" />
  ...
</ClinicalDocument>
Overview
• What is XML:
  • Overview, tags, schema.
• XML query languages:
  • XPath, XQuery.
• XML data models:
  • Data/document-centric, biomedical data.
• Storage Strategy + XML DBMS:
  • Relational, CMS, native.
• Native XML DBMS:
  • Pros / Cons.
• Biomedical Information:
  • BMI Databases Overview, XML.
  • HL7 and CDA Overview, examples.
• Examples of BMI XML.
• UCONN BMI XML.
• Survey of Technology.
XML overview
• eXtensible Markup Language.
• Similar to HTML.
• Meta-language that describes the content of the document (self-describing).
• XML is primarily used as a data storage and interchange medium.
• XML exists in plain text format; however, it may be compressed or otherwise altered for transfer.
XML overview cont.
• XML itself has no predefined tags and no inherent grammar.
• XML tags give an XML document structure and meaning.
• Available tags are defined by a schema.
• All tags in an XML document come in pairs, open and close (an empty element may be written as a single self-closing tag).
• Tags are properly nested, and there is no ambiguity in their order.
XML tags
• XML tags may carry attributes, which store information (meta-data) within the tag.
• Plain text can be placed between tags. Text inside a CDATA section is not parsed for markup.
• Attributes have types:
  • CDATA is character data: any string of non-markup characters is legal as the attribute value.
  • The ENTITY attribute type indicates that the attribute value names an external entity declared for the document.
  • The ID attribute type is used to give each element a unique identifier.
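The three places a tag can hold information (attribute, text between tags, CDATA section) can be sketched with Python's standard `xml.etree.ElementTree`. The `order` document and its values are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one attribute, ordinary text, and a CDATA section.
doc = """<order orderid="889923">
  <orderperson>John Smith</orderperson>
  <note><![CDATA[if qty < 5 & rush, call <first>]]></note>
</order>"""

root = ET.fromstring(doc)

order_id = root.get("orderid")         # attribute value stored within the tag
person = root.findtext("orderperson")  # plain text placed between tags
note = root.findtext("note")           # CDATA content, handed back as literal text

print(order_id)
print(person)
print(note)
```

The `<` and `&` inside the CDATA section would be illegal in ordinary element text; inside CDATA the parser passes them through unchanged and does not create a `first` element.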
XML Schema
• The structure of an XML document is defined by its schema.
• Dozens of languages exist to define an XML schema:
  • DTD
  • W3C XML Schema (XSD)
  • RELAX NG
• Any instance of an XML document can be validated against the schema file.
• The schema also defines the allowable tags.
Schema Example (XSD)

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Structure
• XML employs a tree structure model for representing data (see the schema on the previous slide).
[Diagram: tree rooted at shiporder, with children shipto, orderperson and orderid; shipto has address, country, city and name; item has title, name, quantity and price.]
Querying XML - XPath
• Many languages exist to query XML. We'll focus on XPath and XQuery, as they are W3C standards.
• XPath is a compact method of traversing the tree on the previous slide.
• Designed to facilitate use via URLs/URIs.
• /shiporder/item/name ← view all items' names
• Extensible to add user-defined behaviors.
• Treats each tag as a node in the tree.
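A path query of this kind can be tried with the limited XPath subset in Python's standard `xml.etree.ElementTree`. This is only a sketch: the sample values are hypothetical, and because the schema slide declares `title` (not `name`) under `item`, the sketch queries `title`.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the shiporder schema.
doc = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item><title>Empire Burlesque</title><quantity>1</quantity><price>10.90</price></item>
  <item><title>Hide your heart</title><quantity>1</quantity><price>9.90</price></item>
</shiporder>"""

root = ET.fromstring(doc)  # root is the <shiporder> node

# Same steps as the absolute path /shiporder/item/title, written relative to the root.
titles = [node.text for node in root.findall("./item/title")]

# A predicate step: the <item> whose <title> child has exactly this text.
hide = root.find("./item[title='Hide your heart']")

print(titles)
print(hide.findtext("price"))
```

Each step in the path selects child nodes of the nodes matched so far, which is what "treats each tag as a node in the tree" means in practice.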
Querying XML - XQuery
• Functional extension of XPath.
• XML equivalent of SQL.
• Navigate and manipulate document nodes.
• Works on collections of documents, or even fragments.

for $b in doc("bib.xml")//book
where $b/publisher = "Morgan Kaufmann" and $b/year = "1998"
return $b/title
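For readers without an XQuery engine at hand, the for / where / return query above can be approximated in plain Python. This is a rough analogue rather than XQuery itself; the contents of `bib.xml` are hypothetical and inlined as a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
bib = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>"""

root = ET.fromstring(bib)

titles = [
    b.findtext("title")                              # return $b/title
    for b in root.iter("book")                       # for $b in ...//book
    if b.findtext("publisher") == "Morgan Kaufmann"  # where $b/publisher = ...
    and b.findtext("year") == "1998"                 #   and $b/year = ...
]

print(titles)
```

One difference worth noting: XQuery returns the matching `title` nodes, while this sketch returns only their text.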
XML Models
• Naively, there are two models of XML use:
  • Data-centric
  • Document-centric
• In reality, most XML use is a hybrid of the two.
• More important is the database strategy used with XML:
  • Relational
  • Content Management
  • Native XML
Data-centric model
• Information is generally stored in a relational database.
• XML is a transport medium, nothing more.
• It is irrelevant to the application that the data exists as XML for some period of time.
• Characteristics:
  • Fine-grained data.
  • Data relationship is insignificant.
  • Need to transfer relational information.
  • Means of storing new information.
Document-centric Model
• XML is utilized solely as a document (e.g. this presentation in Open Office).
• The documents are stored and retrieved in part or in full.
• Does not originate from a relational database.
• Document is used for human consumption.
• Usually the information is written by hand in a format like PDF or RTF, then converted to XML.
Reality: Hybrid Model
• Most documents, like a PDF, will also contain fine-grained information (last edited date, character set).
• Data from a relational DB may even be a document, or require self-description.
• Various database technologies support all models.
• Important to understand your data, and to choose the DB technology that is most compatible.
Medical Data Model
• Medical data is non-homogeneous.
• But there exist general trends in medical data:
  • Fine-grained data such as dates, times, images.
  • Documents and human-generated descriptions and observations.
• Human interaction creates semi-structured data.
• Ability to transfer information is essential.
• Medical data fits into the hybrid model.
Data-centric Comparison
• Advantages:
  • Utilizes existing database software (IBM, Oracle, MS).
  • Quick (existing DBs are already fast).
  • Dual role (not limited only to XML).
  • Many even support XQuery.
• Disadvantages:
  • More configuration (mapping relational -> XML).
  • Slower when creating complex XML files due to the middle step.
Document-centric Comparison
• Advantages:
  • Good integration into workflow.
  • Document management made easy.
  • Collaboration and web publishing.
• Disadvantages:
  • Not able to extract data from the document directly.
  • Not designed for high-availability, high-load systems.
  • Non-uniformity in implementations.
Storage Strategy: Relational
• Utilizing a relational database to store XML documents and data is very popular.
• In a very data-centric application this approach is intuitive.
• Most top-tier database applications support XML in some way:
  • Oracle, SQL Server, IBM, etc.
• Software is highly supported and well developed.
XML Schema mapping
• Using a relational DB requires mapping the XML schema to the DB schema.
• Table-based:
  • Often implemented as a middleware layer.
  • Schema structure must follow the row-column convention.
• Object-relational:
  • XML is a tree of objects.
  • Mapped to the DB using well-established OR methods.
  • Natively supported in some DB apps.
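A table-based mapping can be sketched by hand with Python's standard `sqlite3` and `xml.etree.ElementTree`. The table layout and sample values are assumptions chosen for illustration; a real middleware layer would derive them from the schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical shiporder instance; the second item omits the optional <note>.
doc = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <item><title>Empire Burlesque</title><note>Special Edition</note><quantity>1</quantity><price>10.90</price></item>
  <item><title>Hide your heart</title><quantity>1</quantity><price>9.90</price></item>
</shiporder>"""

root = ET.fromstring(doc)
db = sqlite3.connect(":memory:")

# One table per repeating element; the parent/child nesting becomes a foreign key.
db.execute("CREATE TABLE shiporder (orderid TEXT PRIMARY KEY, orderperson TEXT)")
db.execute(
    "CREATE TABLE item (orderid TEXT REFERENCES shiporder(orderid), "
    "title TEXT, note TEXT, quantity INTEGER, price REAL)"
)

orderid = root.get("orderid")
db.execute("INSERT INTO shiporder VALUES (?, ?)", (orderid, root.findtext("orderperson")))
for item in root.findall("item"):
    db.execute(
        "INSERT INTO item VALUES (?, ?, ?, ?, ?)",
        (
            orderid,
            item.findtext("title"),
            item.findtext("note"),  # None (SQL NULL) when the optional element is absent
            int(item.findtext("quantity")),
            float(item.findtext("price")),
        ),
    )

rows = db.execute("SELECT title, note, quantity FROM item ORDER BY title").fetchall()
print(rows)
```

Rebuilding the original document from these tables needs a join on `orderid`, and every optional element that is missing becomes a NULL column.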
Storage Strategy: CMS
• Used exclusively in the document-centric model.
• Various programs allow indexing, storage, manipulation, and publication of XML documents.
• Application specific.
• Numerous implementations, most recently Open Office and MS Word 2007.
• Not very interesting or useful in the context of biomedical information.
Storage Strategy: Native
• Semi-structured data.
• Mapping to a relational DB causes inflation and null space.
• Need more functionality and granularity than a CMS.
• Performance increase over a relational DB by avoiding joins.
  • Assuming data is in the appropriate order on disk.
• Only returns XML; need to convert for non-XML manipulation.
• Development still in infancy as of Winter 2007.
Native XML Databases
• Definition:
  • "A database that has an XML document as its fundamental unit of (logical) storage and defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order."
• Data types: no support in XML, need a mapping.
  • Document or database schema can be used.
  • External user-defined mapping.
  • Not necessary when only transferring data.
• No requirement on the underlying medium or implementation.
• Two architectures: text-based and model-based.
Native: Text-based
• Use any DB.
• Rather than mapping schemas, store entire XML documents.
• Usually involves saving the entire document as a BLOB / Character LOB.
• Utilize various text field searches to retrieve info from the XML document.
• Some DB text searches are being made XML-aware.
• Speed: storing the document contiguously on disk favors full or partial document retrieval.
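The text-based approach can be sketched with SQLite standing in for "any DB": each document is stored whole in a text column, found by a plain text search, and parsed only after retrieval. The table name and sample documents are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")  # body plays the role of a CLOB

# Hypothetical documents, stored unmodified.
documents = [
    '<shiporder orderid="A1"><orderperson>Ann</orderperson></shiporder>',
    '<shiporder orderid="B2"><orderperson>Bob</orderperson></shiporder>',
]
db.executemany("INSERT INTO docs (body) VALUES (?)", [(d,) for d in documents])

# Plain text search: it knows nothing about XML structure.
hits = db.execute("SELECT id, body FROM docs WHERE body LIKE ?", ("%Bob%",)).fetchall()

# The caller parses the retrieved text to reach individual nodes.
order_ids = [ET.fromstring(body).get("orderid") for _, body in hits]
print(order_ids)
```

Because the search is not XML-aware, `LIKE '%Bob%'` would also match "Bob" inside a tag name or attribute, which is the gap that XML-aware text search is meant to close.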
Native: Model-based
• Internal object model of the document schema.
• Store this model in a database:
  • Relational / object-oriented database.
  • Proprietary.
• Performance similar to the chosen DB engine.
• Still limited by the hierarchy of XML data.
  • Retrieving all orderids from hundreds of docs is slow.
• Support for common XML query languages:
  • XPath, XQuery, etc.
Native XML: TLC
• In the traditional database world, transactions, locking and concurrency are paramount.
• Native XML databases aren't mature enough to support everything.
• Most support transactions, but what about locking and concurrency?
  • Document-level locking is easy, but too coarse.
  • Only a few implementations support node-level locking.
• Commercial products generally support ACID; free ones are just starting to (2008).
Native XML: APIs
• Ubiquity of ODBC interfaces.
  • Still applies to native XML databases.
• Most implementations provide their own interface for a variety of languages.
• Industry standardization:
  • XML:DB API from XML:DB.org, programming language neutral.
  • JSR 225: XQuery API for Java (XQJ). IBM and Oracle.
Native XML: The Rest
• Referential integrity is supported in an ad hoc manner at best.
  • Database cannot enforce user-defined (via schema) integrity.
  • Some standard mechanisms allow it.
  • Eventually both mechanisms will be supported.
• Currently relies heavily on the application for normalization and integrity.
• Certainly a drawback for medical applications.
Native XML: Scalability
• A limitation of any DB is the time spent seeking on the hard disk.
• An XML DB only needs to find the pointer to the head of the document.
• Therefore an XML DB should scale well in the context of retrieving data.
• The only caveat is if the retrieval breaks the document hierarchy.
  • More pointers must be followed, potentially slowing retrieval greatly.
• Where there is money, there is a way.
Biomedical Information
• Overview of the field.
• Data storage and transfer problem.
• XML as a solution.
• BMI XML examples.
• Next section: Choosing a native DB.
BMI Overview
• The convergence of computation and biomedicine.
• The NIH BMI Science and Tech Initiative:
  • Define biomedical computing as a science.
• Many sources of information:
  • Clinical, surgical, genetics, drug design, biology.
• Standardization in software.
• Algorithm development, high-speed computing.
• All of this relies on efficient storage and transfer of information.
BMISTI: Databases
• "Biomedical computing is entering an age where creative exploration of huge amounts of data will lay the foundation of hypotheses." ~NIH Director
• Problems:
  • Standards. Terminology, syntax and semantics need to be defined and agreed upon to allow integration of data.
  • Curation. Database submissions need to be checked and cross-referenced to avoid the transitive propagation of error.
  • Interoperability. Data should be as consistent as possible across databases so that researchers can compare and contrast it.
• Computational and systems issues:
  • Utilize and manipulate information.
  • Process large volumes of information.
BMI: XML
• Data sharing and semantic interoperability.
• Case study: Electronic Health Record.
  • The development and use of an integrated health record for a patient.
  • Heterogeneous data, e.g. clinical, clinical-trial, genomic data.
• Primary obstacle: proprietary data formats.
• Uniformity on a technical level: text file.
• Step towards the semantic goal.
XML in Clinical Data
• HL7 standards organization.
• V2: ASCII bar format. Example:

HL7V3|1|2.02
Message|2.16.840.1.113883.1122^CNTRL-3456|2002081614303516^- ---> 06:00||3.0|2.16.840.1.113883^POLB_IN004410||P|I|ER|ER
respondTo|RSP|tel:555-555-5555^^WP
entityRsp|||{FAM^^Hippocrates~GIV^^Harold~GIV^^H~SFX^AC^MD}|tel:555-555-5555^^WP
sender|SND|nfs:127.127.127.255
device||2.16.840.1.113883.1122^GHH LAB|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
agencyFor
representedOrganization||\NOTH\
location|||2.16.840.1.113883.1122^ELAB-3|{^^GHH Lab}^TN
receiver|RCV|nfs:127.127.127.0
device|||2.16.840.1.113883.1122^GHH O E|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
agencyFor
representedOrganization|||2.16.840.1.113883.19.3.1001|{^^GHH Outpatient Clinic}^TN
location|||2.16.840.1.113883.1122^BLDG4|{^^GHH Outpatient Clinic}^TN

• Awkward, inflexible, unclear meaning of values.
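Why the bar format leaves the meaning of values unclear can be seen by splitting one line of the sample. The sketch assumes `|` separates fields and `^` separates components, as the sample suggests; `parse_segment` is a hypothetical helper, not part of any HL7 library.

```python
def parse_segment(segment):
    """Split one bar-delimited line into fields, and each field into components."""
    return [field.split("^") for field in segment.split("|")]


# The respondTo line from the sample message.
fields = parse_segment("respondTo|RSP|tel:555-555-5555^^WP")

print(fields)
# Meaning is purely positional: nothing in the data says that the third
# component of the third field ("WP") qualifies the telephone number.
print(fields[2][2])
```

An empty string marks a component that was left out (`^^`), so a reader needs the specification at hand to know what is missing and what each position means.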
HL7 V3 Specification
• Built around the Reference Information Model (RIM):
  • Entity, Role, Participation, and Act.
  • Utilizes dedicated vocabularies and data types.
  • Every specification must begin from the RIM.
• Clinical Document Architecture:
  • Utilizes XML with tags like "observation", "code", "value" and "id".

<observation classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96" codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>
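Reading the observation back out shows how the tags carry their own meaning. This is a minimal sketch with Python's standard library; the two namespace declarations are added to the fragment so that it parses on its own (they are the ones declared on the `ClinicalDocument` root of the title slide).

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# The slide's fragment, with namespace declarations added so it is standalone.
fragment = """<observation xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    classCode="OBS" moodCode="EVN">
  <id root="10.23.4573.15879"/>
  <code code="313193002" codeSystem="2.16.840.1.113883.6.96"
        codeSystemName="SNOMED CT" displayName="Peak flow"/>
  <effectiveTime value="20000407"/>
  <value xsi:type="RTO_PQ_PQ">
    <numerator value="260" unit="l"/>
    <denominator value="1" unit="min"/>
  </value>
</observation>"""

obs = ET.fromstring(fragment)
code = obs.find("hl7:code", NS)
value = obs.find("hl7:value", NS)
num = value.find("hl7:numerator", NS)
den = value.find("hl7:denominator", NS)

name = code.get("displayName")   # human-readable label of the coded concept
value_type = value.get(XSI_TYPE)  # data type named by xsi:type
rate = float(num.get("value")) / float(den.get("value"))
unit = f'{num.get("unit")}/{den.get("unit")}'

print(name, value_type, rate, unit)
```

Unlike the bar format, every value here is reached by name (`code`, `numerator`, `unit`) rather than by position.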
XML in Clinical Trials
• Example: drug studies.
  • Utilizing XML would eliminate manual transcription when moving data from one system to another.
• XML is a universal datatype, as it stores everything in text.
  • Therefore it can handle new tech seamlessly.
• Clinical Data Interchange Standards Consortium (CDISC).
  • Industry standardization.
CDISC: ODM
• Operational Data Model:
  • XML based.
  • Facilitates moving data from any collection system to the clinical trial sponsor.
• Addresses real-world issues:
  • Incomplete data.
  • Partial data transfer.
  • Versioning and branching.
• ODM 1.1 is the current version.
XML in Genomic Data
• Various groups export their data in XML:
  • NCBI, EBI.
  • They do not follow the same schema, which only allows partial semantic interoperability.
• Microarray Gene Expression Group (MAGE) publishes a schema.
  • MAGE files are often several gigabytes.
  • Illustrates the overhead of XML; however, researchers still use it because of interoperability.
XML Complexity
• Clinical Genomics Special Interest Group (HL7):
  • Use genomic data in a clinical environment.
  • Utilize several models such as MAGE and BSML (for DNA sequences).
• Not all information in the raw models is necessary.
• "Bubbling up" analyzes large raw data sets and extracts useful information.
• Transfer useful information to a new schema / model.
• Bottom line: there exist complex workflows to extract usable information.
XML BMI Issues
• Clinical information like a verbal description or advice is unstructured.
  • How do you query this?
• Schemas and models are extremely complex, with nesting, recursion and compound data types.
  • Difficult mapping to relational databases.
• XML instances may be gigabytes in size.
  • What database solutions exist to handle such large files?
XML BMI Examples
• A closer look at the Clinical Document Architecture.
  • Mayo Clinic's implementation of CDA.
• Case study using a native XML database to facilitate research based upon clinical texts.
  • Tamino XML DB.
  • Querying a native DB.
• UCONN BMI, CSE 300 Spring 2008.
XML BMI: CDA
Source: www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf
• A clinical document has:
  • Persistence: exists for a defined time period.
  • Stewardship: maintained by a designated caretaker.
  • Potential for authentication: may be legally authenticated.
• It must be human readable on a standard web browser.
• Utilizes standard XML syntax.
XML BMI: CDA
Source: www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf
• Mayo Clinic's use of CDA:
A Native XML Database Design for Clinical Document Research
Johnson, Campbell, et al.
• Facilitate research, especially research on clinical text.
• User needs to be accounted for:
  • Process queries against text.
  • Process queries against annotations.
  • Standard method for querying.
  • Non-hierarchical document selection (by patient, date, ...).
  • Return varying levels of document granularity.
  • A schema which adapts to new information without breaking old query formulations.
  • A schema which adapts to new annotations.
cont.
• Tamino XML DBMS: a commercial product.
  • Supports XQuery and text search, which address many of the querying needs.
• Utilizes the CDA for structuring meta-information.
• A schema structures documents on a sentence-by-sentence level.
  • Allows a high level of granularity.
  • Tags link words to a semantic and vocabulary library.
UCONN BMI
• Utilize a native XML DB to store documents.
• Documents could be PHR, health data / statistics, or system meta-data (registration).
• Our goal is to provide secure submission and retrieval of a variety of XML data.
• For spring 2008, only focusing on submitting registration data.
UCONN BMI: Overview
• Current state:
[Diagram: User → Browser: HTML Form → Java Server: Create XML document → Submit to DB; data domains HTML, Java, XML.]
• Data exists in three different domains:
  • It is in HTML, a text datatype, when the user enters it.
  • The server maps the HTML to Java strings to create the XML.
  • The XML is written to a file on the server, and submitted to the database via a Java API.
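The middle step, mapping submitted form fields to an XML document, can be sketched as follows. The project does this in Java; Python is used here only to keep the example short, and the element name `registration`, the field names and the values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def registration_to_xml(form):
    """Build a flat XML document from form fields (hypothetical layout).

    Assumes every field name is also a valid XML element name.
    """
    root = ET.Element("registration")
    for field, value in form.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical submitted form, already decoded into strings by the server.
form = {"username": "jdoe", "email": "jdoe@example.edu", "comment": "a < b & c"}
xml_text = registration_to_xml(form)
print(xml_text)
```

Building the document through an XML library rather than by string concatenation means markup characters in user input (`<`, `&`) are escaped automatically.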
UCONN BMI: Problems
• There are two transformations of data.
  • Each requires a hand-coded mapping.
  • This leads to sloppy code and wasted resources.
• Only handles XML as input; what about output?
• The database (Sedna) is obtuse; what other options exist?
• Do we want to store / transmit application data?
UCONN BMI: Model (potential)
• Utilize client-side JS to create XML.
• Use Java API to manipulate XML.
• Problems:
  • Document verified through schema and XQuery.
  • Awkward to cross-reference input with any other data.
• Advantages:
  • No server-side data type conversion.
  • This model applies to user-driven input and systems interactions.
[Diagram: User → Browser: HTML Form (JS -> XML), or System (XQuery) → Java Server → Submit to DB; data domains HTML, Java, XML.]
UCONN BMI: Model retrieval
• Client queries in XQuery, or uses a predefined query on the server.
• Server uses the API to execute the XQuery against the DB.
• Java Server is given an XML document; it can:
  • Apply Java-based XSLT and return the result to the requestor (more reliable).
  • Return the raw document; client-side JS applies XSLT (less server load).
  • Both.
[Diagram: User / System → XQuery query → Java Server → DB; XSLT applied in JS or Java; data domains HTML, XML.]