220 likes | 376 Views
Scalable Knowledge Extraction from Legacy Sources with SEEK. Joachim Hammer Dept. of CISE University of Florida jhammer@cise.ufl.edu 3-June-2003. Outline of Talk. Motivation SEEK Information Architecture Knowledge Extraction Schema Extraction Code Analysis Status and Future Work.
E N D
Scalable Knowledge Extraction from Legacy Sources with SEEK Joachim Hammer Dept. of CISE University of Florida jhammer@cise.ufl.edu 3-June-2003
Outline of Talk • Motivation • SEEK Information Architecture • Knowledge Extraction • Schema Extraction • Code Analysis • Status and Future Work
SEEK Project Faculty Joachim Hammer Mark Schmalz William O’Brien Ray Issa Joe Geunes Sherman Bai Current Students Oguzhan Topsakal Mingxi Wu Ivan Mutis Haiyan Xie Bibo Yang Bo Lu Computer Science Building Construction Industrial Engineering • Sponsored by NSF • Year 2 of 4
Motivation • Need for integrated access to intelligence sources in support of national security related applications • Ability to rapidly sift through massive volumes of data for situational analysis and planning • Hard … sources have unique and often incompatible information systems, varying levels of sophistication • E.g., Web, public records, sensor networks, state and federal databases, etc. • Current integration approaches rely on manual coding of connection software - Not scalable • Development of a toolkit to facilitate integration of heterogeneous legacy data and knowledge
public/ information access agency 1 agency 2 external data sources Information Environment reporting authority/ collaborative analysis Many computers, many users, many information needs
Application Areas • Homeland Defense • Threat prediction and detection • Emergency Management • Emergency response planning, damage assessment • Extended Enterprise/Supply Network • Decision/negotiation support to improve performance and customization
SEEK Environment & Context coordinator/ lead Agency Agency Agency … SEEK SEEK SEEK Decision Support/ Analysis Secure Hosting Infrastructure
SEEK Information Architecture Connection Toolkit Legacy Source SEEK Components Domain Expert Application Knowledge Extraction Module Analysis Module Legacy Data and Systems Secure, value-added extraction of source data Wrapper AM: query analysis, knowledge composition (mediation) W: source connection and translation KEM: configuration of W and AM at build-time
Run-Time: Querying and Analysis • Different information contexts: application, analysis module, source • Translator needed to convert between information contexts • Assume existence of translator between AM and application contexts • Analysis module provides robust (value-added) mediation • Solution strategy based on information available in source • Capable of composing final answer out of multiple source results • SEEK wrapper responsible for syntactic and semantic conversions • Formulates source queries based on capabilities of source • Restructures source results to conform to information context of AM
Build-Time: Knowledge Extraction • Extract information about legacy source to facilitate development of wrapper and configuration of AM • Produces “description” of accessible knowledge in source • Schema extraction from data source • Analysis of application code to augment schema with semantics and extract business rules • Schema Matching to infer mappings between information context of AM/application with that of legacy source • Quality and accuracy of extracted knowledge (and hence the wrapper and AM) improves over time and with human input
Architectural Overview Domain Model Domain Ontology Data Reverse Engineering (DRE) revise, validate Schema Information Schema Extractor (SE) Semantic Analyzer (SA) Embedded Queries train, validate Schema Matching (SM) Schema, semantics business rules Legacy Application Code Legacy DB to wrapper generator Mapping rules Legacy Source
Schema Extraction • Based on data reverse engineering algorithms, e.g., Chiang 94/95, Petit et al. 96 • Reduced dependency on human input • Eliminated limitations (e.g., consistent naming, legacy schema in 3-NF) • Use database catalog to directly extract concepts and simple constraints • Use database instances to infer relationships and constraints • Interact with code analysis to augment schema with semantics • Produces E/R-like representation of the entities, relationships, and constraints
Semantic Analysis • Identify semantic descriptions for schema items in database in application code • E.g., trace database schema names back to output statements • Using code slicing to reduce application code to only those statements that are of interest to the analyzer (Horwitz, Reps 92) • Apply pattern matcher • discover associations among variables • identify patterns that encode business information • E.g., business rules encoded in IF-THEN-ELSE statements • Versions for C, C++, and Java
DRE Implementation Legacy Source Application Code DB Interface Module Data configuration AST Generation 1 Dictionary Extraction 2 Queries AST Code Analysis 3 Metadata Repository Inclusion Dependency Mining 4 Business Knowledge Relation Classification 5 Schema Attribute Classification 6 Knowledge Encoder XML DTD Entity Identification 7 XML DOC Relationship Classification 8 To Schema Matcher
Legacy Database Schema Proj [P_ID,P_NAME,DES_S,DES_F,A_S,A_F, …] Avail[PROJ_ID,AVAIL_UID,RES_ID,AVAIL_FROM,AVAIL_TO,UNITS] Res [PROJ_ID,RES_UID,RES_NAME,R_ACWP,R_BCWP,R_BCWS, …] T [PROJ_ID,T_UID,T_ID,T_NAME,T_DUR,T_ST_D,T_FIN_D, …] Assn [PROJ_ID,ASSN_UID,T_UID,R_ID,ASSN_BASE_C,ASSN_ACT_W, …] . .
Scheduling Application /* program for task scheduling */ char *aValue; char *cValue; int bValue = 0; /* more code … */ EXEC SQL SELECT T_ST_D,T_FIN_D INTO :aValue, :cValue FROM T WHERE T_PRITY = :bValue; /* more code … */ int flag = 0; IF (cValue <= aValue) { flag = 1; /* exception handling */ } /* more code … */ printf (“Task Start Date %d“, aValue); printf (“Task Finish Date %d”, cValue); /* more code … */
Extracted Conceptual Schema Proj_ID P_Name P_ID Des_S Res_UID N has Proj Res 1 1 1 Res_Name N has Assn has Res_ID N M N T Avail Avail_UID Proj_ID T_ID Proj_ID T_UID
Extracted Business Rules • Variables have been replaced by their extracted meaning (to the extent that they are known)
Current Status & Future Research Current • Implemented interactive knowledge extraction prototype consisting of SE and SA (supply chain & construction domains) • Developing schema matching module • Application of SEEK toolkit to emergency response system • Data collection in cooperation with City of Gainesville Fire & Rescue • Application to management of EOC planned Future • Development of analysis module • Enhance DRE with ability to improve with time and usage cases
Summary and Conclusion • SEEK is a structured approach to integrating domain-specific legacy sources • Modular architecture provides several important capabilities • (Semi)automatic knowledge extraction • DRE, semantic analysis, schema matching • Important contributions to theory of knowledge capture and integration • Requirement for building scalable sharing architectures • Enabling technology for (semi)automatic ontology creation • Enabler for Semantic Web?
More Info • M. S. Schmalz, J. Hammer, M. Wu, and O. Topsakal, "EITH - A unifying representation for database schema and application code in enterprise knowledge extraction." To be presented at 22nd International Conference on Conceptual Modeling (ER 2003), Chicago, IL, 2003. • “Scalable Extraction of Enterprise Knowledge.” Conditionally accepted for publication in Research Frontiers in Supply Chain Management and E-Commerce, E. Akcaly, J. Geunes, P.M. Pardalos, H.E.Romeijn, and Z.J. Shen, (eds). Kluwer Science Series in Applied Optimization. (accepted for publication in 2004.) • “SEEKing knowledge in legacy information systems to support interoperability.” ECAI-02 Workshop on Ontologies and Semantic Interoperability, Lyon, France, July 21-26, 2002. • “SEEK: accomplishing enterprise information integration across heterogeneous sources,” ITCON – Journal of Information Technology in Construction – Special Edition on Knowledge Management, 7, pp. 101-124, 2002. • “Robust mediation of supply chain information.” ASCE Specialty Conference on Fully Integrated and Automated Project Processes (FIAPP) in Civil Engineering, Blacksburg, VA, September 26-28, 2001, 415-425. Web Site: http://www.dbcenter.cise.ufl.edu/seek/