A Methodology for Implementing a Distributed Storage System for Structured XML Data in a Health Care Environment
By: Samson Kiware and Janelle Schroeder
Overview
• Problems of RDBMS
• Why is it interesting to Health Care?
• Health Care XML Data Model
• What are Hadoop and Hypertable?
• Hadoop/Hypertable Architecture
• Hadoop/Hypertable Solution
• Server Config File
• Hypertable Schema
• HQL Sample
• Research Contributions
• Related and Future Work
• Questions and Answers
Problems of RDBMS
• Scalability
• High cost of licensing, servers, memory, and disks
• Applications vary in the volume of data they need to access
• Applications vary in frequency of access: batch processing versus real time
Why is it interesting to Health Care?
• Requires different methods of data access:
  • batch processing for historical decision support
  • trending and research
  • real-time patient care
• Large data applications
• Write-once, read-many access pattern
• Reduction in server and licensing costs
• Centralized management of servers
Health Care XML Data Model
<Visit>
  <VisitNumber>67868687687</VisitNumber>
  <VisitDate>01/01/2008</VisitDate>
  <PrimaryPhy>Dr. Kiware</PrimaryPhy>
  <ReferringPhy>Dr. Schroeder</ReferringPhy>
  <ICD9Diagnosis>120.45</ICD9Diagnosis>
  <CPTProcedure>888.88</CPTProcedure>
</Visit>
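A visit record of this shape can be read with any standard XML parser. The following is a minimal Python sketch, assuming the closing tags are well-formed; the function name `parse_visit` and the flat dictionary it returns are illustrative choices, not part of the original design.

```python
import xml.etree.ElementTree as ET

# Sample record mirroring the slide's data model (closing tags assumed well-formed).
VISIT_XML = """
<Visit>
  <VisitNumber>67868687687</VisitNumber>
  <VisitDate>01/01/2008</VisitDate>
  <PrimaryPhy>Dr. Kiware</PrimaryPhy>
  <ReferringPhy>Dr. Schroeder</ReferringPhy>
  <ICD9Diagnosis>120.45</ICD9Diagnosis>
  <CPTProcedure>888.88</CPTProcedure>
</Visit>
"""


def parse_visit(xml_text):
    """Return a dict mapping each child element name of <Visit> to its text."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "Visit":
        raise ValueError("expected a <Visit> root element, got <%s>" % root.tag)
    return {child.tag: (child.text or "").strip() for child in root}


visit = parse_visit(VISIT_XML)
print(visit["VisitNumber"], visit["PrimaryPhy"])
```

Keeping every field as a string avoids guessing at types the slides do not specify, such as the date format or the precision of diagnosis codes.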
What are Hadoop and Hypertable?
• Hadoop is an open source platform for distributed storage (HDFS) and batch processing (MapReduce) across a cluster of commodity servers
• Hypertable is an open source, high-performance, scalable distributed storage system for structured and unstructured data, modeled on Google's Bigtable
Hadoop/Hypertable Solution
• Handles applications with large datasets
• Detects faults and recovers quickly
• High-throughput processing
• Centralized scheduling of server tasks and execution of batch processes
• Deploys on low-cost hardware
• Eliminates or reduces the need for table joins
• Access group mechanism: column families that are read together are stored together, which improves I/O performance
Server config file
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>nematode</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>rat</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/logs</value>
  </property>
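Server config file (cont.)
  <property>
    <name>dfs.data.dir</name>
    <value>/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/MapReduceData</value>
  </property>
  <!-- One map slot and one reduce slot per task tracker; each limit is its own property. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- Host include/exclude lists, left empty here; each is its own property. -->
  <property>
    <name>dfs.hosts</name>
    <value></value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value></value>
  </property>
  <property>
    <name>mapred.hosts</name>
    <value></value>
  </property>
  <property>
    <name>mapred.hosts.exclude</name>
    <value></value>
  </property>
</configuration>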
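Because the site file is plain XML made of name/value pairs, it is easy to check before a cluster is started. The sketch below is one possible way to do that in Python, using an abbreviated copy of the properties shown on the slides; `load_properties` and `unset_properties` are hypothetical helper names, not Hadoop tools.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the site configuration shown on the slides.
SITE_XML = """<configuration>
  <property><name>fs.default.name</name><value>nematode</value></property>
  <property><name>mapred.job.tracker</name><value>rat</value></property>
  <property><name>dfs.data.dir</name><value>/data</value></property>
  <property><name>dfs.hosts</name><value></value></property>
</configuration>
"""


def load_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop-style site file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            raise ValueError("<property> element without a <name>")
        # findtext returns "" for an element that is present but empty.
        props[name.strip()] = prop.findtext("value", default="").strip()
    return props


def unset_properties(props):
    """Return the sorted names of properties whose value is empty."""
    return sorted(name for name, value in props.items() if not value)


props = load_properties(SITE_XML)
print(props["fs.default.name"])
print(unset_properties(props))
```

Listing the empty properties makes it obvious which settings, such as the host include lists, were left at their defaults.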
Hypertable Schema
hypertable> describe table Pages;
<Schema generation="1">
  <AccessGroup name="default">
    <ColumnFamily id="1">
      <Name>refer-url</Name>
    </ColumnFamily>
    <ColumnFamily id="2">
      <Name>http-code</Name>
    </ColumnFamily>
    <ColumnFamily id="3">
      <Name>date</Name>
    </ColumnFamily>
  </AccessGroup>
</Schema>
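The schema output nests column families inside access groups. As a rough illustration, the Python sketch below extracts that grouping from the XML shown on the slide; it works on the text of the `describe table` output only and does not call Hypertable, and `column_families_by_group` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Schema text as printed by "describe table Pages;" on the slide.
SCHEMA_XML = """
<Schema generation="1">
  <AccessGroup name="default">
    <ColumnFamily id="1"><Name>refer-url</Name></ColumnFamily>
    <ColumnFamily id="2"><Name>http-code</Name></ColumnFamily>
    <ColumnFamily id="3"><Name>date</Name></ColumnFamily>
  </AccessGroup>
</Schema>
"""


def column_families_by_group(xml_text):
    """Map each access group name to the column family names it contains."""
    root = ET.fromstring(xml_text.strip())
    return {
        group.get("name"): [cf.findtext("Name") for cf in group.findall("ColumnFamily")]
        for group in root.findall("AccessGroup")
    }


groups = column_families_by_group(SCHEMA_XML)
print(groups)
```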
HQL Sample
• Sample for a patient-centric model: the patient's medical record number (MRN) serves as the row key, and a column family is created for each category of health information
• See next slide
HQL Sample (cont.)
CREATE TABLE "Patient"
ROWKEY: <MRN>12234434</MRN>
Column_Family_Name: "Visit", "Insurance", "Genetic Profile"
Column: "Visit"
Value:
  <Visit>
    <VisitNumber>67868687687</VisitNumber>
    <VisitDate>01/01/2008</VisitDate>
    <PrimaryPhy>Dr. Kiware</PrimaryPhy>
    <ReferringPhy>Dr. Schroeder</ReferringPhy>
    <ICD9Diagnosis>120.45</ICD9Diagnosis>
    <CPTProcedure>888.88</CPTProcedure>
  </Visit>
Timestamp: (TODAY'S DATE/TIME)
HQL Sample • Add insert statement
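The slides leave the insert statement as a placeholder, so no HQL is reproduced here. To make the patient-centric model concrete, the following is a toy in-memory Python sketch of the same idea: the MRN is the row key, each column family holds timestamped versions of a value, and reads return the newest version. The class `PatientTable`, its method names, and the second visit number are all assumptions for illustration; this is not the Hypertable API.

```python
from collections import defaultdict
from datetime import datetime, timezone


class PatientTable:
    """Toy stand-in for a Bigtable-style table keyed by MRN (not Hypertable itself)."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> column family -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def insert(self, mrn, family, value, timestamp=None):
        """Store a new version of a cell; defaults to the current UTC time."""
        if family not in self.column_families:
            raise KeyError("unknown column family: " + family)
        ts = timestamp if timestamp is not None else datetime.now(timezone.utc)
        versions = self._rows[mrn][family]
        versions.append((ts, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)

    def latest(self, mrn, family):
        """Return the newest value for a cell, or None if nothing is stored."""
        versions = self._rows.get(mrn, {}).get(family, [])
        return versions[0][1] if versions else None

    def history(self, mrn, family):
        """Return all stored versions of a cell, newest first."""
        return list(self._rows.get(mrn, {}).get(family, []))


table = PatientTable(["Visit", "Insurance", "Genetic Profile"])
mrn = "12234434"
table.insert(mrn, "Visit",
             "<Visit><VisitNumber>67868687687</VisitNumber></Visit>",
             datetime(2008, 1, 1, tzinfo=timezone.utc))
# Hypothetical second visit, added only to show versioning.
table.insert(mrn, "Visit",
             "<Visit><VisitNumber>67868687688</VisitNumber></Visit>",
             datetime(2008, 2, 1, tzinfo=timezone.utc))
print(table.latest(mrn, "Visit"))
print(len(table.history(mrn, "Visit")))
```

Because every category of information for a patient lives under one row key, a single row lookup returns what a relational design would assemble with joins across visit, insurance, and genetics tables.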
Research Contributions
• Installed Hadoop and Hypertable, an open source distributed storage system, in a cluster environment
• Created installation documentation for Linux and Windows environments
• Designed and implemented a data model for a health care environment
Related and Future Work
• Google's BigTable
• Web crawler
• Solution for managing XML schema versions
• Conduct comparative performance research
• Investigate job tracking and task scheduling
• Apply 3-dimensional data warehousing techniques (Type 1, Type 2, or Type 3)