Application of Semantic Technology: Semantic Medline on the Cray XMT2. Victor J. Pollara. 2 October 2012.
Overview
Background
• Government and industry are housing data of every kind.
• Structured data sets may hold rich information that is hidden because of how the database is designed. Most are in tabular formats (either relational DBs or flat files).
• A tremendous amount is in the form of text.
• ‘Big Data’ does not mean ‘Useful Data’ if you can’t answer the questions you need to answer.
• Often we can greatly increase the utility of existing data by:
  • Augmenting the existing data set with a small model (e.g. ontology, taxonomy)
  • Integrating multiple sets on common data elements
  • Extracting and structuring information from text
Example using the XMT2: Semantic Medline (Rindflesch, Shin, et al.)
• 60M+ high-confidence ‘facts’ extracted from 22M biomedical (PubMed) citations
• Augment it with biomedical knowledge models (e.g. UMLS Metathesaurus, NCBI Taxonomy)
• Integrate with other resources (e.g. Geonames)
This talk:
• Tabular data and semantic data: bridging the gap
• Text data and semantic data: Semantic Medline application
• Cray XMT2
• Semantic services with the XMT2
• Augmentation and integration
• Going beyond semantic data
Technologies for Bridging the Tabular-Semantic Gap
There is a class of software products that creates ‘semantic’ views of the data in a relational database. Examples:
• D2R/D2RQ
  • Does not disturb a live relational database.
  • Renders all the data in triple format.
  • Fully automatic (does not require a subject matter expert)
  • Is ignorant of the semantics of the data.
• R2RML
  • A language in which a subject matter expert builds a model that adds semantics to the data in the database
  • When used with a tool like Revelytix Spyder, it provides a more semantic view of the data
  • Builds on SQL (mappings can embed SQL queries), so it relies on SQL to do all the heavy lifting
  • Not practical if data values map to non-regular URLs
• Scripting languages (e.g. Perl)
  • Can do anything you want
  • Traditional ETL: creates another version of the data
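To make the "fully automatic but ignorant of the semantics" point concrete, here is a minimal Python sketch of naive, D2R-style triplification. It is not D2R's actual implementation: the `BASE` namespace and the `row_to_triples` helper are hypothetical names chosen for illustration. Every column simply becomes a predicate named after the column.

```python
from urllib.parse import quote

BASE = "http://example.org/db/"  # hypothetical namespace, not from the talk


def row_to_triples(table, key_column, row):
    """Turn one table row (a dict) into (subject, predicate, object) triples.

    Every column becomes a predicate named after the column, which is why
    this kind of mapping knows nothing about what the data means.
    """
    subject = f"<{BASE}{table}/{quote(str(row[key_column]))}>"
    triples = []
    for column, value in row.items():
        if value is None:  # SQL NULLs produce no triple
            continue
        predicate = f"<{BASE}{table}#{quote(column)}>"
        # Escape backslashes and quotes so the literal stays well formed.
        literal = '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'
        triples.append((subject, predicate, literal))
    return triples


# Illustrative row: a child tax_id pointing at parent tax_id 562.
row = {"tax_id": 670904, "parent_tax_id": 562}
for s, p, o in row_to_triples("nodes", "tax_id", row):
    print(s, p, o, ".")
```

Note that the parent id comes out as a plain literal rather than a link to another node, so the hierarchy is not visible to a query engine.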
The XMT2 vs. Bridge Technologies
• The architecture of the XMT2 is suited for data that is not easily subdivided
• Efficiency of computation requires the entire set to be held in shared memory
• Data with little semantic content is not the best candidate (e.g. triplifying huge tabular arrays of numerical data is not appropriate)
• Since you are going to create a copy of the data for the XMT2, the best approach is to remodel it to contain as rich a semantic structure as possible.
• Any ontology that adds semantic richness can support new queries that might be valuable
• Since you are doing ETL, a scripting language is appropriate.
Augmentation
Why add more data to an already large set?
• For example, Medicare collects vast amounts of claims data
• Researchers can use it to evaluate the effectiveness of procedures or drugs
• But the format makes it difficult to explore the data in medically meaningful ways
[Figure: tabular claims data linked to the UMLS knowledge model; drug hierarchy labels: Antibacterials, Cephalosporin, Penicillins, Ampicillin, Amoxicillin]
Why Modeling is Crucial – e.g., NCBI Taxonomy
• The NCBI taxonomy is only available as an RDB (tabular) dump.
• We wrote Perl scripts to remodel the NCBI Taxonomy relational tables into a single RDF file.
• R2RML would work well for this because names are uniform, e.g. http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
• The other possibility would be to use an automatic D2R map of the DB.
nodes.dmp column names:
  tax_id -- node id in GenBank taxonomy database
  parent tax_id -- parent node id in GenBank taxonomy database
  rank -- rank of this node (superkingdom, kingdom, ...)
  embl code -- locus-name prefix; not unique
  division id -- see division.dmp file
D2R mapping leaves ids as integers: rows r1, r2, and r3 have tax_id 670904, 745156, and 866768, each with parent tax_id 562.
After remodeling you get the right meaning: ncbitax:670904, ncbitax:745156, and ncbitax:866768 are each subclassOf ncbitax:562, and the structure is tree shaped.
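The remodeling step can be sketched in a few lines. The talk used Perl; the Python below is only an illustrative equivalent, not the author's script. It assumes the usual NCBI dump layout (fields separated by tab, pipe, tab, with a trailing tab and pipe), a hypothetical namespace for the `ncbitax:` prefix, and `rdfs:subClassOf` as the property behind the slide's "subclassOf". The rank values in the sample lines are placeholders.

```python
RDFS_SUBCLASS_OF = "<http://www.w3.org/2000/01/rdf-schema#subClassOf>"
NCBITAX = "http://example.org/ncbitax/"  # hypothetical namespace for ncbitax:


def parse_node_line(line):
    """Split one nodes.dmp line into (tax_id, parent_tax_id, rank)."""
    text = line.rstrip("\n")
    if text.endswith("\t|"):  # drop the end-of-record marker
        text = text[:-2]
    fields = text.split("\t|\t")
    return fields[0].strip(), fields[1].strip(), fields[2].strip()


def subclass_triples(lines):
    """Yield one N-Triples subClassOf statement per non-root taxon."""
    for line in lines:
        if not line.strip():
            continue
        tax_id, parent_id, _rank = parse_node_line(line)
        if tax_id == parent_id:  # the root node lists itself as its parent
            continue
        yield f"<{NCBITAX}{tax_id}> {RDFS_SUBCLASS_OF} <{NCBITAX}{parent_id}> ."


# Sample lines using ids from the slide; the rank column is a placeholder.
sample = [
    "670904\t|\t562\t|\tno rank\t|\n",
    "745156\t|\t562\t|\tno rank\t|\n",
    "866768\t|\t562\t|\tno rank\t|\n",
]
for triple in subclass_triples(sample):
    print(triple)
```

Unlike the automatic mapping, the parent id becomes a link to another node, so a query can walk the tree.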
Text Extraction and Triples
• The first task in text extraction is to identify entities (e.g. people, places, things, events)
  • Good for document characterization, document matching, categorization.
• Natural language processing can go much further by:
  • Tagging each term with its part of speech
  • Using the part-of-speech tags to extract ‘subject-verb-object’ triples
• These triples mirror the triple structure of semantic data
• Use controlled vocabularies and ontologies to manage entities and relations
• Example: “Tamoxifen has been shown in vitro to inhibit protein kinase C through estrogen receptor-independent antineoplastic effects.”
  subject: tamoxifen (urn:nlm.nih.gov:UMLS/CUI/C0039286)
  relation: inhibits (urn:nlm.nih.gov:semmed/relation/inhibits)
  object: protein kinase C (urn:nlm.nih.gov:UMLS/CUI/C0033634)
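The tamoxifen example above can be written down as a single semantic triple. The sketch below does not perform any language processing; it only shows the stored form, using the URNs from the slide. The `Fact` class and its method name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str    # URI of the subject concept
    predicate: str  # URI of the relation
    obj: str        # URI of the object concept

    def to_ntriples(self):
        """Serialize the fact as one N-Triples statement."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


UMLS = "urn:nlm.nih.gov:UMLS/CUI/"
SEMMED = "urn:nlm.nih.gov:semmed/relation/"

# "Tamoxifen ... inhibit[s] protein kinase C" as subject, relation, object.
fact = Fact(UMLS + "C0039286", SEMMED + "inhibits", UMLS + "C0033634")
print(fact.to_ntriples())
```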
Semantic Medline
• The National Library of Medicine hosts a website that contains over 22M citations from the biomedical literature (PubMed).
• Even though they are only titles and abstracts, there is a lot of knowledge in them
• But the site only provides access to the citations by ‘search’
• NLM scientists (Rindflesch, Shin, et al.) built a web app for exploring high-confidence ‘facts’ extracted from PubMed citations (Semantic Medline)
• The ‘facts’ are represented most naturally as a graph
• Without a high-performance triplestore server, they currently use a relational database (MySQL) to store the facts
• We think the Cray XMT has potential to support a graph database as a replacement for MySQL.
• We proposed to port Semantic Medline to the Noblis XMT2
• Cray has provided a beta-version triplestore server named uRiKA
• It provides a SPARQL endpoint (analogous to a SQL connector for a MySQL database)
• First let’s look at Semantic Medline’s functionality…
Porting the Data to the Noblis XMT2
• The Computing Environment:
  • 4TB of shared memory
  • 128 cores, each capable of running 128 independent threads (16,384 threads in total)
  • Maximum recommended size: 20 billion triples (occupies 2TB, but uRiKA uses the remaining 2TB as scratch space)
  • uRiKA provides a SPARQL endpoint as well as a web client a user can interact with directly.
  • ‘Service nodes’ are Linux machines separate from the ‘compute nodes’, and there is a communication latency between them that must be managed
• Phase 1: Naïve triplification to test uRiKA as a triplestore server
  • Converted a key Semantic Medline MySQL table into triples (similar to D2R)
  • Included UMLS concepts (6 M) and instances of relations (21 M)
  • NCBI taxonomy (~1M taxa)
  • http://www.geonames.
  • Modified Semantic Medline code to issue SPARQL queries to uRiKA
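As a rough illustration of what "issuing SPARQL queries to an endpoint" involves, the sketch below builds an HTTP request of the kind described by the SPARQL protocol. It does not send anything. The endpoint URL is hypothetical, and whether uRiKA accepted exactly this form is an assumption; the query reuses the URNs from the tamoxifen example.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:3030/sparql"  # hypothetical URL, not the real uRiKA address

# Which concepts does tamoxifen inhibit?
QUERY = """
SELECT ?object WHERE {
  <urn:nlm.nih.gov:UMLS/CUI/C0039286>
      <urn:nlm.nih.gov:semmed/relation/inhibits> ?object .
}
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP POST carrying a SPARQL query."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen and parse the response.
```

Each such request is a full round trip between a service node and the compute nodes, which is where the communication latency mentioned above shows up.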
Initial Observations
• Initial results for the beta version of uRiKA:
  • uRiKA processes complex SPARQL 1.0 queries properly
  • uRiKA is set up to cut through very complex queries that would stymie an RDB
  • But we were issuing trivial queries (lots of them), and it is not tuned for that kind of usage, so the aggregate response time was too slow for an acceptable user experience of the Semantic Medline web app
• Collaborators’ experiences with an alternative software library (Speed-MT) from Sandia Labs show faster results, and we are looking at this library to see if it can be used in place of uRiKA or as an adjunct to it
• uRiKA shows that the XMT2 can support web services
• Cray will release an improved version of uRiKA in approximately 6 months
• We believe there are many other services that could coexist on the same machine.
The XMT2 Supporting Multiple Services
• A separate system called MeGraphs does the following:
  • Maintains a directory of graphs resident in memory
  • Provides an engine that can run different algorithms on a chosen graph in the directory
  • Supports job queueing
  • Provides an API for building client applications
• The only thing missing, in our opinion, is a way for external processes to access the engine.
• We are currently experimenting with:
  • Developing a client for MeGraphs that receives requests from the outside world and acts as a general REST service
  • Building custom services that support graph data that goes beyond ordinary triples
• A key focus is on responsiveness of the services. Because of the XMT2 architecture, process initiation can be time consuming, so the goal will be to keep the data ‘live’ in shared memory and be sure that each service has a memory map of the data relevant to it, so that it can respond to requests as quickly as possible.
[Figure: an XMT REST bridge connecting external processes to the XMT2]
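The idea of a REST front end over graphs that stay resident in memory can be sketched as a plain dispatch function. This is not the MeGraphs API, which the talk does not show: the graph directory, the algorithm registry, and the URL layout below are all hypothetical. The point is only that the data is loaded once and each request is answered from memory.

```python
import json

# Hypothetical in-memory directory: graph name -> adjacency list.
# It is built once and stays live between requests.
GRAPHS = {
    "demo": {"a": ["b", "c"], "b": ["c"], "c": []},
}


def degree_counts(graph):
    """Example algorithm: out-degree of every node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


# Hypothetical registry of algorithms the engine can run.
ALGORITHMS = {"degree": degree_counts}


def handle_request(path):
    """Map a path like '/graphs/<name>/<algorithm>' to (status, JSON body)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["graphs"]:
        # List the directory of resident graphs.
        return 200, json.dumps(sorted(GRAPHS))
    if len(parts) == 3 and parts[0] == "graphs":
        _, name, algorithm = parts
        if name not in GRAPHS or algorithm not in ALGORITHMS:
            return 404, json.dumps({"error": "unknown graph or algorithm"})
        result = ALGORITHMS[algorithm](GRAPHS[name])
        return 200, json.dumps(result, sort_keys=True)
    return 400, json.dumps({"error": "bad request"})


print(handle_request("/graphs"))
print(handle_request("/graphs/demo/degree"))
```

In a real service this function would sit behind an HTTP server on a service node, with the graph data held on the compute side.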
Representing Data in Graph Form
[Figure: the same tabular data (columns SSN, Name, Addr; people Sam, Joe, Moe, Pam, Zoe) drawn as three kinds of graph]
• A “graph” has “nodes” and “edges”
  • Social networks
  • “Link analysis”
  • “Degrees of separation”
  • Edges may have weights representing strength or certainty
• A semantic graph has named relations with direction (hasArmsSupplier, hasMom, hasDad, hasGrandparent, hasSpouse)
  • Permits much more sophisticated queries
  • Supports reasoning
  • <Moe> <hasDad> <Joe> is called a “triple” in the semantic world
• Enhanced semantic graphs with weighted edges (hasArmsSupplier P=0.8, hasMom P=1.0, hasDad P=0.67, hasGrandparent P=0.67, hasSpouse P=1.0)
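A weighted semantic graph can be modeled as triples that carry a certainty value. In the sketch below only <Moe> <hasDad> <Joe> comes from the slide; the endpoints of the second edge are assumed, and multiplying certainties along a chain is one simple combination rule chosen for illustration, not something the talk states.

```python
from typing import NamedTuple


class WeightedTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    p: float = 1.0  # certainty of the edge


# <Moe> <hasDad> <Joe> is from the slide; the second edge's endpoints are assumed.
EDGES = [
    WeightedTriple("Moe", "hasDad", "Joe", 0.67),
    WeightedTriple("Joe", "hasMom", "Sam", 1.0),
]

PARENT_RELATIONS = {"hasDad", "hasMom"}


def infer_grandparents(edges):
    """Chain two parent edges and multiply their certainties (assumed rule)."""
    parents = [e for e in edges if e.predicate in PARENT_RELATIONS]
    inferred = []
    for first in parents:
        for second in parents:
            if first.obj == second.subject:
                inferred.append(WeightedTriple(
                    first.subject, "hasGrandparent", second.obj,
                    first.p * second.p))
    return inferred


for triple in infer_grandparents(EDGES):
    print(triple)
```

This is the kind of reasoning that a plain node-and-edge graph cannot express, because it depends on the relation names and directions.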
Challenges
• An important factor that makes the XMT2 outstanding for graph computation is a very efficient internal representation of a graph.
  • The efficiency comes from packing a lot of information tightly into a special data structure.
  • Deleting edges is relatively easy. Deleting nodes is more expensive but still not prohibitive. Adding new edges between existing nodes is very costly.
• This seems to imply that transactional applications are not a good fit, but we are not convinced of that yet, and we plan to find out.
  • This is certainly true in the short term
  • Our goal is to build experimental services that support the basic “Create, Read, Update, Delete” (CRUD) operations with acceptable latency for web apps.
• Analytical services are a good fit.
  • Dynamic fraud detection: build the graph in memory and, as it is updated, re-run analytical agents that look for the emergence of triggering conditions.
  • Entity resolution: as new attributes are assigned to entities, analytical agents would check if certain thresholds are crossed.
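The talk does not say which packed structure is used internally. As an assumption for illustration, the sketch below uses compressed sparse row (CSR) storage, one common tightly packed graph layout, to show why the costs are lopsided: deleting an edge overwrites one slot, while adding an edge shifts everything stored after it.

```python
class CSRGraph:
    """Directed graph in compressed sparse row form with tombstones."""

    def __init__(self, num_nodes, edges):
        # offsets[n]..offsets[n+1] delimits the slots holding n's out-edges.
        self.offsets = [0] * (num_nodes + 1)
        for src, _ in edges:
            self.offsets[src + 1] += 1
        for i in range(num_nodes):
            self.offsets[i + 1] += self.offsets[i]
        self.targets = [None] * len(edges)
        cursor = list(self.offsets[:-1])
        for src, dst in edges:
            self.targets[cursor[src]] = dst
            cursor[src] += 1

    def neighbors(self, node):
        start, end = self.offsets[node], self.offsets[node + 1]
        return [t for t in self.targets[start:end] if t is not None]

    def delete_edge(self, src, dst):
        """Cheap: overwrite one slot with a tombstone, nothing moves."""
        for i in range(self.offsets[src], self.offsets[src + 1]):
            if self.targets[i] == dst:
                self.targets[i] = None
                return True
        return False

    def add_edge(self, src, dst):
        """Costly: shift every later slot and bump every later offset."""
        self.targets.insert(self.offsets[src + 1], dst)  # O(E) shift
        for i in range(src + 1, len(self.offsets)):      # O(V) update
            self.offsets[i] += 1


g = CSRGraph(3, [(0, 1), (0, 2), (1, 2)])
g.delete_edge(0, 1)
g.add_edge(1, 0)
print(g.neighbors(0), g.neighbors(1), g.neighbors(2))
```

On a graph with billions of edges, the shift in `add_edge` is what would make frequent inserts expensive under this layout, which matches the asymmetry described above.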
Conclusion
• We believe that the XMT2 shows potential as a platform for providing semantic services on large semantic data sets
• Over the next 12 months we will build a variety of services and test their utility and responsiveness
• The internal graph representation is extensible in ways that could simultaneously support the logical queries of SPARQL and analytical methods that use other graph properties
• This would enable us to tackle a much wider class of real-world graph problems. We will build a catalogue of these problems and describe how this computational resource can be part of their solution