270 likes | 385 Views
An Ontological Approach to the Document Access Problem of Insider Threat. ISI 2005 , (May 20). Boanerges Aleman-Meza 1 Phillip Burns 2 Matthew Eavenson 1 Devanand Palaniswami 1 Amit P. Sheth 1. (1) LSDIS Lab, Computer Science Dept., University of Georgia, USA
E N D
An Ontological Approach to the Document Access Problem ofInsider Threat ISI 2005, (May 20) Boanerges Aleman-Meza1 Phillip Burns2 Matthew Eavenson1 Devanand Palaniswami1 Amit P. Sheth1 (1) LSDIS Lab, Computer Science Dept., University of Georgia, USA (2) CTA – Computer Technology Associates USA
Objective & Approach • Determine if (classified) documents reviewed an IC analyst satisfy his/her “need to know” • Characterization of “need to know” w.r.t. ontology • Characterizing document content in terms of ontology • Discovering weighted semantic relationships between document content and “need to know”
Characterizing “Need to Know” using an a Semantic Approach (using Ontology) • Requires domain ontology • models important concepts & relationships of domain (schema), captures factual knowledge (instances) • Relate analyst’s need to know to concepts & relationships in ontology • e.g. terrorist organization, funding sources, facilitators, members, methods
“Need to know” = context of investigation 26,489 entities 34,513 (explicit) relationships Add relationship to context
Characterizing document content in terms of ontology “Semantic Annotation” • Correlate words/phrases from document with entities/relationships in ontology • Entity identification • Meta-data added to document (from associated ontological knowledge) • Active area of research but practically useful technology now available • Constrained to content of ontology
Semantic Relationships between Document & “Need to Know” • Semantic associations: relationships between document concepts & “need to know” concepts are discovered and ranked • Ranking based on multiple factors • no. of links, types of links, location in ontology, … • Ranking indicates degree of semantic “closeness” • and therefore, how related document is to “need to know”
DocumentsRanking • Highly relevant • Closely related • Ambiguous • Not relevant • Undeterminable
Research Content • Discovery & Ranking of semantic semantic associations • Characterizing “need to know” in terms of ontological concepts & relationships • Meta-data annotation of data and (semi-structured & unstructured) documents • correlation of document content & concepts in ontology
Research Challenges In this project we are addressing: • Discovery of Semantic Associations per entity per document • Input/Visualization/Management of Context of Investigation • Scalability on number of documents & ontology size • Performs well with thousand documents • Ranking of documents
Ranking of Documents Relevance “Closely related entities are more relevant than distant entities” E = {e | e Document } Ek = {f | distance(f, eE) = k }
1. Entities belong to classes in the Context type(entity) Context Entities match a list of entities of interest (in the Context) entity Entities-List 3. Components of Document Relevance 2. Relationships constrains Relationship [Class] Context of Investigation (specific entities) • Abu Abdallah • Turkmenistan • Konduz Province • …
Semagix Freedom Semagix Freedom Schematic of Ontological Approach to the Legitimate Access Problem
Conclusions • New Semantic Approach to the challenging problem • Viability demonstrated on a small scale • Significant new research that builds upon the latest Semantic Platform • Many applications of this approach: vendor vetting, knowledge discovery, ….
Acknowledgements • Semagix provided technology to populate ontology using knowledge extraction, and (semi-)automatic metadata extraction from documents (Freedom toolkit). • NSF-funded projects provided core research: "Semantic Association Identification and Knowledge Discovery for National Security Applications" (Grant No. IIS-0219649) and "Semantic Discovery: Discovering Complex Relationships in Semantic Web" (Grant No. IIS-0325464)
References • 1. B. Aleman-Meza, C. Halaschek, I.B. Arpinar, A. Sheth, Context-Aware Semantic Association • Ranking. Proceedings of Semantic Web and Databases Workshop, Berlin, September 7- • 8 2003, pp. 33-50 • 2. B. Aleman-Meza, C. Halaschek, A. Sheth, I.B. Arpinar, and G. Sannapareddy. SWETO: • Large-Scale Semantic Web Test-bed. Proceedings of the 16th International Conference on • Software Engineering and Knowledge Engineering (SEKE2004): Workshop on Ontology in • Action, Banff, Canada, June 21-24, 2004, pp. 490-493 • 3. R. Anderson and R. Brackney. Understanding the Insider Threat. Proceedings of a March • 2004 Workshop. Prepared for the Advanced Research and Development Activity (ARDA). • http://www.rand.org/publications/CF/CF196/ • 4. K. Anyanwu and A. Sheth ρ-Queries: Enabling Querying for Semantic Associations on the • Semantic Web The Twelfth International World Wide Web Conference, Budapest, Hungary, • 2003, pp. 690-699 • 5. K. Anyanwu, A. Maduko, A. Sheth, SemRank: Ranking Complex Relationship Search Results • on the Semantic Web, In Proceedings of the 14th International World Wide Web Conference, • Japan 2005 (accepted, to appear) • 6. K. Anyanwu, A. Maduko, A. Sheth, J. Miller. Top-k Path Query Evaluation in Semantic • Web Databases. (submitted for publication), 2005 • 7. C. Halaschek, B. Aleman-Meza, I.B. Arpinar, A. Sheth Discovering and Ranking Semantic • Associations over a Large RDF Metabase Demonstration Paper, VLDB 2004, 30th International • Conference on Very Large Data Bases, Toronto, Canada, 30 August - 3 September, • 2004 • 8. B. Hammond, A. Sheth, and K. Kochut, Semantic Enhancement Engine: A Modular Document • Enhancement Platform for Semantic Applications over Heterogeneous Content, in • Real World Semantic Web Applications, V. Kashyap and L. Shklar, Eds., IOS Press, December • 2002, pp. 29-49
References (cont) • 9. M. Rectenwald, K. Lee, Y. Seo, J.A. Giampapa, and K. Sycara. Proof of Concept System for • Automatically Determining Need-to-Know Access Privileges: Installation Notes and User • Guide. Technical Report CMU-RI-TR-04-56, Robotics Institute, Carnegie Mellon University, • October, 2004. • http://www.ri.cmu.edu/pub_files/pub4/rectenwald_michael_2004_3/rectenwald_michael_20 • 04_3.pdf • 10. C. Rocha, D. Schwabe, M.P. Aragao. A Hybrid Approach for Searching in the Semantic • Web, In Proceedings of the 13th International World Wide Web, Conference, New York, • May 2004, pp. 374-383. • 11. M.A. Rodriguez, M.J. Egenhofer, Determining Semantic Similarity Among Entity Classes • from Different Ontologies, IEEE Transactions on Knowledge and Data Engineering 2003 • 15(2):442-456 • 12. A. Sheth, C. Bertram, D. Avant, B. Hammond, K. Kochut, and Y. Warke. Managing Semantic • Content for the Web. IEEE Internet Computing, 2002. 6(4):80-87 • 13. A. Sheth, B. Aleman-Meza, I.B. Arpinar, C. Halaschek, C. Ramakrishnan, C. Bertram, Y. • Warke, D. Avant, F.S. Arpinar, K. Anyanwu, and K. Kochut. Semantic Association Identification • and Knowledge Discovery for National Security Applications. Journal of Database • Management, Jan-Mar 2005, 16 (1):33-53 • 14. Boanerges Aleman-Meza, Phillip Burns, Matthew Eavenson,Devanand Palaniswami, Amit Sheth. An Ontological Approach to the Document Access Problem of Insider Threat
Semantic Annotation • Document searched for entity names (or synonyms) contained in ontology • Then document entities are annotated with additional information from corresponding entities in ontology including named relationships to other entities • Following chart is example • Highlighted text are entities found corresponding to concepts in ontology • XML is corresponding meta-data annotation
Relevance Measures for Documents(relating document content to IA “need to know” • Relevance engine input • the set of semantically annotated documents • the context of investigation for the assignment • the ontology schema represented in RDFS, and the ontology instances represented in RDF • Relevance measure function used to verify whether the entity annotations in the annotated document can be fit into the entity classes, entity instances, and/or keywords specified in the context of investigation.
SWETO – Ontology Schema Visualization See SemDis project of LSDIS Lab, University of Georgia
Relevance Measures for Documents(relating document content to IA “need to know” (cont) • Documents classified as: • Highly relevant • Document entities directly related • Closely related • Document entities related through strong semantic associations • Ambiguous • Document entities related through weak semantic associations • Not relevant • Document entities not related to “need to know” • Undeterminable • Document entities not found in ontology
IA Context of Investigation(characterization of “Need to Know”) We define the context of investigation as a combination of the following: • A set of entity classes and relationships, and/or a negation of a set of entity classes and relationships • A set of entity instance names, and/or a negation of a set of entity instance names • A set of keyword values that might appear at any attribute of the populated instance data, and/or a negation of a set of keyword values
Context of Investigation (cont) • Goal is to capture, at a high level, the types of entities, (or relationships), that are considered important. • Relationships can be constrained to be associated with specified class types • E.G. It can be specified that a relation ‘affiliated with’ is part of the context only when it is connected with an entity that belongs to a specific class, say, ‘Terror Organization’
Ranking of Documents Relevance Four groups of document-ranking: • Not Related Documents • unable to determine relation to context • Ambiguously Related Documents • some relationship exists to the context • Somehow Related Documents • Entities are closely related to the context • Highly Related Documents • Entities are a direct match to the context Cut-off values determine grouping of documents w.r.t. relevance • These are customizable cut-off values (more control and more meaningful parameters compared to say automatic classification or statistical approaches) “Inspection” of a document is possible via (a) original document or (b) original document with highlighted entities