
Applications of Semantic Technology

Applications of Semantic Technology. Victor J. Pollara. 24 January 2013. Overview. The development of “Semantic Technology” represents the confluence of several fields: The Internet/Web Knowledge modeling (most notably the field of ontology) Mathematical logic and Computational logic


Presentation Transcript


  1. Applications of Semantic Technology Victor J. Pollara 24 January 2013

  2. Overview
     The development of “Semantic Technology” represents the confluence of several fields:
     • The Internet/Web
     • Knowledge modeling (most notably the field of ontology)
     • Mathematical logic and computational logic
     • Database technology
     • The general advancement of complex applications with rich GUIs
     Because of these diverse origins, there is a variety of ways in which the technology is commonly used:
     • Adding machine-readable information to web components to support interoperation and autonomous action by software agents
     • Augmenting an existing data set with a model (e.g. ontology, taxonomy)
     • Integrating multiple data sets on common data elements with well-defined meanings
     • Extracting and structuring information from text
     • Implementing knowledgebases (reservoirs of knowledge that support simple reasoning)
     • Analysis of “graph-based” problems:
       • Social network analysis (benign – e.g. Facebook, malign – e.g. terrorist networks)
       • Cybersecurity analysis
       • Fraud detection and surveillance

  3. Representing Data in Graph Form
     [Figure: tabular data (columns SSN, Name, Addr) for Sam, Joe, Moe, Pam and Zoe, redrawn three ways: as a plain graph, as a semantic graph with the named relations hasMom, hasDad, hasGrandparent, hasSpouse and hasArmsSupplier, and as an enhanced semantic graph whose edges carry probabilities such as P=1.0, P=0.8 and P=0.67.]
     • Plain graph: a “graph” has “nodes” and “edges”
       • Social networks
       • “Link analysis”
       • “Degrees of separation”
     • Semantic graph: has named relations with direction
       • Permits much more sophisticated queries
       • Supports reasoning
       • <Moe> <hasDad> <Joe> is called a “triple” in the semantic world
     • Enhanced semantic graph: has weighted edges
       • Edges may have weights representing strength or certainty
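
The ideas on this slide (directed named relations, edge weights, and "degrees of separation") can be sketched in a few lines of Python. This is only an illustration: the triple <Moe> <hasDad> <Joe> comes from the slide, while the other edges, the weight assigned to each edge, and the function names `objects` and `degrees_of_separation` are assumptions made up for the example.

```python
from collections import deque

# Weighted triples: (subject, predicate, object, probability).
# Only <Moe> <hasDad> <Joe> is taken from the slide; the remaining edges
# and every weight below are illustrative assumptions.
TRIPLES = [
    ("Moe", "hasDad", "Joe", 0.67),
    ("Joe", "hasSpouse", "Pam", 1.0),
    ("Pam", "hasSpouse", "Joe", 1.0),
    ("Joe", "hasMom", "Zoe", 1.0),
    ("Zoe", "hasArmsSupplier", "Sam", 0.8),
]


def objects(subject, predicate, min_p=0.0):
    """Return objects of matching triples whose weight is at least min_p."""
    return [o for s, p, o, w in TRIPLES
            if s == subject and p == predicate and w >= min_p]


def degrees_of_separation(start, goal):
    """Breadth-first search over the graph, ignoring edge direction."""
    neighbors = {}
    for s, _, o, _ in TRIPLES:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes
```

A named, directed relation lets a query ask "who is Moe's dad?" rather than only "who is Moe connected to?", and the weight lets the same query discard low-certainty edges.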

  4. Cybersecurity: Security Event Analysis
     • Organizations commonly deploy “event logging software” to record the events that occur in their networks (e.g. ArcSight)
     • The most basic data collected for each event is the source and destination IP address
     • This can be naturally represented as a network graph of nodes (IP addresses) and edges (events that link two addresses)
     • The number of events generated for a mid-sized company is in the billions, so the graph to be analyzed is large
     • The kinds of queries needed to identify problems range over the entire graph, so subdividing the graph (e.g. Hadoop-MapReduce) can make some queries infeasible
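
As a rough sketch of the graph described above, the following Python turns (source, destination) event pairs into a directed graph and runs one simple whole-graph query. The sample events use reserved documentation IP ranges, and the "fan-out" query and its threshold are assumptions chosen for illustration, not a method taken from the slides.

```python
from collections import defaultdict

# Hypothetical security events as (source IP, destination IP) pairs.
EVENTS = [
    ("192.0.2.10", "198.51.100.1"),
    ("192.0.2.10", "198.51.100.2"),
    ("192.0.2.10", "198.51.100.3"),
    ("192.0.2.11", "198.51.100.1"),
    ("192.0.2.10", "198.51.100.1"),
]


def build_graph(events):
    """Map each source IP to its destinations, counting events per edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in events:
        graph[src][dst] += 1
    return graph


def high_fan_out(graph, threshold):
    """Return sources that contacted at least `threshold` distinct destinations."""
    return sorted(src for src, dsts in graph.items() if len(dsts) >= threshold)
```

Even this small query has to see every edge leaving a node, which hints at why partitioning a multi-billion-edge graph can get in the way of queries that span the whole graph.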

  5. Graph Model of Events
     Transform security event data into a semantic graph and examine all relationships to identify unknown cyber threats

  6. Knowledgebase: Semantic Medline
     • Example using the XMT2: Semantic Medline (Rindflesch, Shin, et al.)
       • 60M+ high-confidence ‘facts’ extracted from 22M biomedical (PubMed) citations
       • Augment it with biomedical knowledge models (e.g. UMLS Metathesaurus, NCBI Taxonomy)
       • Integrate with other resources (e.g. Geonames)
     • The Computing Environment:
       • 4TB of shared memory
       • 128 cores, each capable of running 128 independent threads (16384 threads)
       • Maximum recommended size: 20 billion triples (occupies 2TB, but uRiKA uses the remaining 2TB as scratch space)
       • uRiKA provides a SPARQL endpoint as well as a web client a user can interact with directly
       • ‘Service nodes’ are Linux machines separate from the ‘compute nodes’, and there is a communication latency between them that must be managed
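
The sizing figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal terabytes (10^12 bytes), which the slide does not specify, so the per-triple figure is only a rough estimate.

```python
# Figures quoted on the slide.
CORES = 128
THREADS_PER_CORE = 128
MAX_TRIPLES = 20_000_000_000

# Assumption: "2TB" means 2 * 10**12 bytes (decimal terabytes).
TRIPLE_STORAGE_BYTES = 2 * 10**12

total_threads = CORES * THREADS_PER_CORE
bytes_per_triple = TRIPLE_STORAGE_BYTES / MAX_TRIPLES

print(total_threads)     # 16384
print(bytes_per_triple)  # 100.0
```

Under that assumption the recommended maximum works out to roughly 100 bytes of shared memory per stored triple.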

  7. The XMT2
     • The architecture of the XMT2 is suited for data that is not easily subdivided
     • Efficiency of computation requires the entire set to be held in shared memory
     • Data with little semantic content is not the best candidate (e.g. triplifying huge tabular arrays of numerical data is not appropriate)
     • Since you are going to create a copy of the data for the XMT2, the best approach is to remodel it to contain as rich a semantic structure as possible
     • Any ontology that adds semantic richness can support new queries that might be valuable
     • Since you are doing ETL, a scripting language is appropriate
     • The XMT2 is not intended to serve as a persistence layer for transactional applications. It is best suited to non-subdividable graph analytic problems that have billions of nodes and edges
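
To make the ETL point concrete, here is a minimal sketch of "triplifying" a small table with a scripting language. The column names SSN, Name and Addr echo the tabular example on slide 3; the `http://example.org/` namespace, the sample rows, and the function name `rows_to_triples` are hypothetical, and the output is only N-Triples-like text, not a validated serialization.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for illustration

# Made-up sample rows; the second row has an empty Addr cell.
SAMPLE_CSV = """SSN,Name,Addr
000-00-0001,Moe,1 Main St
000-00-0002,Joe,
"""


def rows_to_triples(csv_text, key_column="SSN"):
    """Emit one N-Triples-like line per non-empty cell, keyed on key_column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}person/{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # skip the key itself and empty cells
            predicate = f"<{BASE}schema/{column}>"
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples
```

A direct column-to-predicate mapping like this adds little semantic content on its own; the remodeling the slide recommends would replace the literal values with links to shared entities and ontology terms.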

  8. Text Extraction and Triples
     • The first task in text extraction is to identify entities (e.g. people, places, things, events)
       • Good for document characterization, document matching, categorization
     • Natural language processing can go much further by:
       • Tagging each term with its part of speech
       • Using the part-of-speech tags to extract ‘subject-verb-object’ triples
     • These triples mirror the triple structure of semantic data
     • Use controlled vocabularies and ontologies to manage entities and relations
     • Example: “Tamoxifen has been shown in vitro to inhibit protein kinase C through estrogen receptor-independent antineoplastic effects.”
       • Subject: tamoxifen (urn:nlm.nih.gov:UMLS/CUI/C0039286)
       • Predicate: inhibits (urn:nlm.nih.gov:semmed/relation/inhibits)
       • Object: protein kinase C (urn:nlm.nih.gov:UMLS/CUI/C0033634)
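
The pipeline on this slide can be illustrated with a toy sketch. It assumes the sentence has already been part-of-speech tagged (the tags below were assigned by hand for this example), and it uses a deliberately crude rule: the first run of nouns is the subject, the first base-form verb is the predicate, and the nouns right after it are the object. Real extraction systems are far more sophisticated. The three URIs are the ones shown on the slide; the function names are made up.

```python
# Hand-tagged tokens (Penn Treebank style tags, assigned for illustration).
# The tail of the sentence after "through" is omitted for brevity.
TAGGED = [
    ("Tamoxifen", "NNP"), ("has", "VBZ"), ("been", "VBN"), ("shown", "VBN"),
    ("in", "FW"), ("vitro", "FW"), ("to", "TO"), ("inhibit", "VB"),
    ("protein", "NN"), ("kinase", "NN"), ("C", "NN"), ("through", "IN"),
]

# Controlled vocabulary: surface forms mapped to the URIs from the slide.
CONCEPTS = {
    "tamoxifen": "urn:nlm.nih.gov:UMLS/CUI/C0039286",
    "protein kinase c": "urn:nlm.nih.gov:UMLS/CUI/C0033634",
}
RELATIONS = {"inhibit": "urn:nlm.nih.gov:semmed/relation/inhibits"}


def noun_run(tokens, start):
    """Join consecutive noun tokens beginning at index `start`."""
    words = []
    i = start
    while i < len(tokens) and tokens[i][1].startswith("NN"):
        words.append(tokens[i][0])
        i += 1
    return " ".join(words)


def extract_triple(tokens):
    """Toy rule: first noun run, first base-form verb, nouns after the verb."""
    subj_start = next(i for i, (_, tag) in enumerate(tokens) if tag.startswith("NN"))
    verb_idx = next(i for i, (_, tag) in enumerate(tokens) if tag == "VB")
    return noun_run(tokens, subj_start), tokens[verb_idx][0], noun_run(tokens, verb_idx + 1)


def to_uris(triple):
    """Replace surface strings with controlled-vocabulary identifiers."""
    s, p, o = triple
    return CONCEPTS[s.lower()], RELATIONS[p.lower()], CONCEPTS[o.lower()]
```

Mapping the extracted strings to vocabulary identifiers is what lets facts from different sentences, which may word the same drug or protein differently, land on the same graph nodes.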

  9. Semantic Medline
     • The National Library of Medicine hosts a website that contains over 22M citations from the biomedical literature (PubMed)
     • Even though they are only titles and abstracts, there is a lot of knowledge in them
     • But the site only provides access to the citations by ‘search’
     • NLM scientists (Rindflesch, Shin, et al.) built a web app for exploring high-confidence ‘facts’ extracted from PubMed citations (Semantic Medline)
     • The ‘facts’ are represented most naturally as a graph
     • Without a high-performance triplestore server, they currently use a relational database (MySQL) to store the facts
     • We are testing the Cray XMT to see if it has potential to support a graph database as a replacement for MySQL
     • We proposed to port Semantic Medline to the Noblis XMT2
     • Cray has provided a beta-version triplestore server named uRiKA
     • It provides a SPARQL endpoint (analogous to a SQL connector for a MySQL database)
     • First let’s look at Semantic Medline’s functionality…
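
To show what talking to a SPARQL endpoint looks like in practice, the sketch below builds a query and the GET URL that the standard SPARQL protocol uses (the query text goes in a `query` parameter). The endpoint address is a placeholder, not the real uRiKA address, and the helper names are made up; the two URIs are the ones shown on slide 8. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the actual uRiKA endpoint URL is not given in the slides.
ENDPOINT = "http://localhost:8080/sparql"

TAMOXIFEN = "urn:nlm.nih.gov:UMLS/CUI/C0039286"
INHIBITS = "urn:nlm.nih.gov:semmed/relation/inhibits"


def build_query(subject_uri, predicate_uri, limit=10):
    """SPARQL SELECT for the objects of one subject/predicate pair."""
    return (
        f"SELECT ?object WHERE {{ <{subject_uri}> <{predicate_uri}> ?object }} "
        f"LIMIT {limit}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = build_query(TAMOXIFEN, INHIBITS)
url = build_request_url(ENDPOINT, query)
```

The rough SQL counterpart would be a `SELECT` with a `WHERE` clause on subject and predicate columns; the difference is that the SPARQL pattern can be extended to multi-hop graph paths without adding self-joins by hand.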

  10. A Network Presentation of Biomedical Facts

  11. Comments
     • Many problems can (at least in part) be represented as graphs
     • Semantic technology can be used in a variety of ways to solve a wide range of problems
     • Some application areas (knowledgebases, fraud detection) give rise to very large graphs that are not easily subdividable
     • The XMT2 is showing potential as a platform for providing analytical services on large semantic data sets
     • Current versions of uRiKA do not have a fast enough response time to support transactional applications like Semantic Medline. We are expecting a new release of uRiKA this winter that is designed to work an order of magnitude faster on the type of query we are testing
     • Noblis has ongoing projects to explore the use of the XMT for cybersecurity, law enforcement, and network analysis
