Knowledge representation, modeling, acquisition and management with applications in Natural Language Processing (PhD Thesis proposal) Pavlin Dobrev <p.dobrev@prosyst.com> Scientific Advisor: Galia Angelova Linguistic Modeling Department, Institute for Parallel Processing, Bulgarian Academy of Sciences
The concept of knowledge • Meaning of knowledge in the PhD thesis • Linguistic knowledge • Knowledge related to the particular domain: concept type hierarchy, relation type hierarchy, definitions • Formalisms for knowledge representation • Conceptual graphs • Ontologies/hierarchical structures
Current results and publications • Workgroup on conceptual graphs: • Formats for knowledge representation • Representation and modeling of knowledge with conceptual graphs • Editing of knowledge represented as conceptual graphs • Architectures of applications for processing conceptual graphs • Extraction of conceptual graphs from controlled English • Semantic web and natural language processing • Use of conceptual graphs in the semantic web for semantic annotations • Integration of applications for processing conceptual graphs
Conceptual Graph • As defined in the Conceptual Graph Standard, a conceptual graph (CG or graph) is an abstract representation of logic with nodes called concepts and conceptual relations, linked together by arcs • They express meaning in a form that is: • logically precise • humanly readable • computationally tractable • In CGWorld, a conceptual graph is any collection of concepts and relations linked by their appropriate arrows or co-referent links
Representation of the sentence "John is going to Boston by bus" as a conceptual graph
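The slide's figure is not reproduced here, so the following is only a minimal Python sketch of how such a graph could be held in memory and printed in a linear form. The type labels (Go, Person, City, Bus) and relation labels (Agnt, Dest, Inst) follow the common textbook rendering of this sentence and are assumptions; the classes Concept and ConceptualGraph are hypothetical helpers, not part of CGWorld.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept node: a type label plus an optional referent."""
    type_label: str
    referent: str = ""

    def __str__(self):
        if self.referent:
            return f"[{self.type_label}: {self.referent}]"
        return f"[{self.type_label}]"


@dataclass
class ConceptualGraph:
    """A graph stored as (relation, source concept, target concept) triples."""
    relations: list = field(default_factory=list)

    def add(self, relation, source, target):
        self.relations.append((relation, source, target))

    def linear_form(self):
        # One line per conceptual relation, arcs drawn as arrows.
        return "\n".join(f"{s} -> ({r}) -> {t}" for r, s, t in self.relations)


# Assumed analysis of "John is going to Boston by bus".
go = Concept("Go")
graph = ConceptualGraph()
graph.add("Agnt", go, Concept("Person", "John"))
graph.add("Dest", go, Concept("City", "Boston"))
graph.add("Inst", go, Concept("Bus"))

print(graph.linear_form())
```

Storing binary relations as triples keeps the sketch short; contexts and coreference links, which CGWorld also supports, would need a richer structure.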
CGWorld – from Conceptual Graph Theory to the Implementation CGWorld was first introduced at ICCS 2000. Further development was presented at the ICCS conferences in the following years. The main goals followed in the design and development of the CGWorld workbench are: • to allow for collaborative, distributed acquisition and editing of a CG knowledge base; • to provide easy search and navigation in a large KB; • to maintain different representation languages, thus accommodating the needs of different users of CGWorld and the different applications the KB of CGs is used in; • to provide a graphical editor and viewer for CGs that is easy to use by non-experts in CG theory; • to integrate and add Web access to previously developed CG applications, written in different programming languages.
Conceptual Graph editor Primary Market is defined as Financial Market where newly issued financial instruments are traded.
Main features of the CGWorld Editor • Portable across all platforms (it has been tested with the most popular browsers: Opera, Mozilla, Netscape and Internet Explorer); • Any number of graph windows may be opened for editing; • Concepts, relations, arcs, coreference links and contexts are supported for editing via a simple Drag & Drop interface; • Ability to customize the color, the position and the size of conceptual objects; • Ability to assign any number of additional properties to the conceptual objects (e.g. number, definite marker, comment); • Zooming capability; • Storing and retrieving of conceptual graphs to/from the application server.
Conceptual Graph Knowledge Base (Visual and CGIF) • A convertible bond is one which is convertible into the company common stock. • When a bond is converted to common stock, the corporate debt is reduced. • A bond is converted into common stock.
Formats for knowledge representations (1/2) A bond is a security which represents the debt of a corporation • Different formats of knowledge representation: • CGIF • FOL • CGLex • CGXML
Formats for knowledge representations (2/2)
NL: A bond is converted into common stock.
FOL: exists(A1,exists(A0,convert_into(A0,A1) & bond(A0) & common_stock(A1)))
XML:
<relation type="convert_into">
  <concept type="bond">
    <number type="single" />
  </concept>
  <concept type="common_stock">
    <number type="single" />
  </concept>
</relation>
CGLex:
cgc(55,simple,'bond',[fs(num,sing)],[]).
cgc(53,simple,'common_stock',[fs(num,sing)],[]).
cg(155,[cgr(convert_into, [55, 53], _)], none, [fs(kind,'body_of_context'), fs(comment,'A bond is converted into common stock')]).
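To illustrate how one graph can be rendered in several of these formats, here is a minimal Python sketch that produces the FOL string and the XML element shown on this slide from a single relation and its concept types. The function names to_fol and to_xml are hypothetical, the number attribute is hard-coded to "single" as on the slide, and this is not the conversion algorithm used in CGWorld.

```python
from xml.etree import ElementTree as ET


def to_fol(relation, concept_types):
    """Render one relation over existentially quantified concepts."""
    variables = [f"A{i}" for i in range(len(concept_types))]
    atoms = [f"{relation}({','.join(variables)})"]
    atoms += [f"{t}({v})" for t, v in zip(concept_types, variables)]
    body = " & ".join(atoms)
    # Wrap the quantifiers so that the last variable ends up outermost,
    # which matches the nesting order used on the slide.
    for v in variables:
        body = f"exists({v},{body})"
    return body


def to_xml(relation, concept_types):
    """Render the same relation as a <relation> element with <concept> children."""
    rel = ET.Element("relation", {"type": relation})
    for t in concept_types:
        concept = ET.SubElement(rel, "concept", {"type": t})
        # Assumption: every concept is singular, as in the slide's example.
        ET.SubElement(concept, "number", {"type": "single"})
    return ET.tostring(rel, encoding="unicode")


fol = to_fol("convert_into", ["bond", "common_stock"])
xml = to_xml("convert_into", ["bond", "common_stock"])
print(fol)
print(xml)
```

Keeping one in-memory structure and generating each notation from it avoids pairwise converters between every two formats.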
Challenge: to acquire formal specifications from NL • the task is too complicated in general -> at present it is feasible for controlled NL only • there are many approaches to acquire CGs from controlled NL (see the proceedings); in general: • with limited vocabulary, • recognition of phrases and/or simple sentences, • often missing syntactic analysis as a separate module, • type labels are juxtaposed to sentence words, • relation types: either thematic roles, or key-words, • limited capacity to acquire contexts and coreferences.
Our approach: start from a NLU machine • PARASITE (Allan Ramsay, UMIST, Manchester, UK) provides syntactic analysis and processes extended discourse (i.e. recognises coreferences) • builds a "model" for every semantically correct discourse (and logical forms for each sentence) • checks contradictions with the given meaning postulates • Our prototype CGExtract is focused on the proper KB issues
CGExtract • Acquires from English sentences • type hierarchy • type definitions • graphs • checks for loops in definitions and for contradictions between the newly defined graph and the existing KB facts • Visualisation provided by CGWorld's modules (see windows and menus in the text)
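The check for loops in definitions can be pictured as cycle detection over a "is defined in terms of" graph. The Python sketch below is an illustration under that assumption only: the dictionary layout and the function name find_definition_loop are hypothetical, and the slides do not describe how CGExtract implements the check.

```python
def find_definition_loop(definitions):
    """Return a list of type names forming a loop, or None if there is none.

    `definitions` maps each defined type to the types used in its definition.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {}

    def visit(name, path):
        state[name] = GREY
        for used in definitions.get(name, ()):
            if state.get(used, WHITE) == GREY:
                # `used` is already on the current path: a loop is closed.
                return path[path.index(used):] + [used]
            if state.get(used, WHITE) == WHITE:
                loop = visit(used, path + [used])
                if loop:
                    return loop
        state[name] = BLACK
        return None

    for name in definitions:
        if state.get(name, WHITE) == WHITE:
            loop = visit(name, [name])
            if loop:
                return loop
    return None


# "Primary market is defined as financial market where newly issued
# financial instruments are traded."
kb = {"primary_market": ["financial_market", "financial_instrument"]}
print(find_definition_loop(kb))          # no loop in this small KB

# A hypothetical bad definition that closes a loop.
kb["financial_market"] = ["primary_market"]
print(find_definition_loop(kb))
```

A newly acquired definition would be accepted only if the function returns None for the KB extended with it.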
Logical Model of the Input Text • for(n541, n546(n541)). • issue(n541). • theta(n541, agent, n543). • theta(n541, object, n544). • local_government(n543). • authority(n543). • municipal_bond(n544). • theta(n544, purpose, n545). • theta(n545, agent, n543). • pay(n545). • community_infrastructure_project(n546(n541)). • predication(n925). • theta(n925, topic, n929). • theta(n925, pred, n929). • investor(n928). • income_tax(n929). • free(n929). • of(n929, n928). • interest(n929). Input text: A local government authority issues a municipal bond to pay for a community infrastructure project. An interest of the municipal bond is an income tax free.
Generated Conceptual Graph from the Input Text
cgc(101,simple,interest,[],_3661).
cgc(102,simple,free,[],_4244).
cgc(103,simple,income_tax,[],_4788).
cgc(104,simple,predication,[],_5332).
cgc(105,simple,community_infrastructure_project,[],_5902).
cgc(106,simple,pay,[],_6459).
cgc(107,simple,municipal_bond,[],_7029).
cgc(108,simple,government_authority,[],_7612).
cgc(109,simple,issue,[],_8182).
cg(110,[cgr(of,[101,107],_8734),cgr(pred,[104,101],_8776),cgr(topic,[104,101],_8818),cgr(agent,[106,108],_8860),cgr(purpose,[107,106],_8902),cgr(object,[109,107],_8944),cgr(agent,[109,108],_8986),cgr(for,[109,105],_9028)],[],[fs(kind,normal),fs(comment,)]).
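To make these facts easier to read, the following Python sketch transcribes the concept identifiers and relations from the slide into plain data and prints each relation in a linear form. Two assumptions are made: the first identifier in each cgr list is treated as the source of the arc and the second as its target, and the function name render is hypothetical. The sketch also lists concepts that no relation refers to.

```python
# Concept identifiers and type labels, transcribed from the cgc facts.
concepts = {
    101: "interest",
    102: "free",
    103: "income_tax",
    104: "predication",
    105: "community_infrastructure_project",
    106: "pay",
    107: "municipal_bond",
    108: "government_authority",
    109: "issue",
}

# (relation, first concept id, second concept id), from the cgr terms of cg 110.
relations = [
    ("of", 101, 107),
    ("pred", 104, 101),
    ("topic", 104, 101),
    ("agent", 106, 108),
    ("purpose", 107, 106),
    ("object", 109, 107),
    ("agent", 109, 108),
    ("for", 109, 105),
]


def render(concepts, relations):
    """Return one readable line per relation (arc direction is an assumption)."""
    return [f"[{concepts[a]}] -> ({r}) -> [{concepts[b]}]" for r, a, b in relations]


for line in render(concepts, relations):
    print(line)

# Concepts that appear in no relation of the generated graph.
linked = {cid for _, a, b in relations for cid in (a, b)}
unlinked = set(concepts) - linked
print(sorted(concepts[cid] for cid in unlinked))
```

In this transcription the concepts free and income_tax are not attached to any relation, which is the kind of gap a KB consistency check could report.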
Negative sides of our approach • Too many resources are required for filling in the linguistic data (e.g. the lexicon; fortunately most of the English syntax is embedded in PARASITE) • Special effort is needed to understand PARASITE's existing prover
Positive sides of our approach • (1) A lexicon, (2) meaning postulates (similar to canonical graphs) and (3) an initial type hierarchy are always obligatory for automatic KA; all systems need to have them, so what we gain is the syntactic analysis and the embedded analysis of the linguistic semantics • This allows us to focus on the proper KB consistency
Semantic Web Challenges V. Richard Benjamins, Jesús Contreras, Oscar Corcho and Asunción Gómez-Pérez, The six challenges for the Semantic Web. White paper 2002 • Challenge 1: The Availability of Content • Challenge 2: Ontology Availability, Development and Evolution • Challenge 3: Scalability of Semantic Web Content • Challenge 4: Multilinguality • Challenge 5: Visualization • Challenge 6: Semantic Web Languages Standardization
Semantic Web Challenges • Challenge 1: Almost no annotated content • Challenge 2: According to the results from the Interoperability Working Days in Madrid (October 10th - 11th 2005), we are still far from achieving interoperability of ontology development tools using RDF(S) as an interchange format. • Challenge 3: We cannot talk about scalability because the content is not available. • Challenge 4: Most work in the Semantic Web area covers only English • Challenge 5: No standards for visualization – maybe CG • Challenge 6: Semantic Web Languages Standardization – RDF and OWL were expected to be available in 2002; the World Wide Web Consortium issued the RDF and OWL Recommendations on 10 February 2004
Ontology visualization for the semantic web • Simultaneous view of a concept in the ontology hierarchy and its instances on the web page • draw lines between concept and its instances • Showing both language context of concept’s usage as well as its ontological environment • Application: support user’s comprehension while reading a web page
Semantic Annotations • Manual annotation strictly depends on the individual -> result is ambiguous • Fully automatic annotation is impossible - human intervention is always necessary • Prototype that uses: • NLP for automatic extraction of formal knowledge (CGs) • CGs for visualization and enrichment of annotations
Extract conceptual graphs from texts • Visualization of concepts' properties and their relationships as Conceptual Graphs (CGs) • CGs' querying and inference capabilities can be exploited • Concept -> view assertions relevant to this concept • CGs are extracted from the CG KB developed by a previous project and extended by the prototype
Simplify conceptual graph • Simplify Conceptual Graph – Type Contraction Operation • typedef "verb" is • [AgentType] <- (agnt) <- [verb] -> (obj) -> [ObjectType]. • [Concept] -> (def) -> [Concept: CG].
CG Tools Integration • The formal approach for integration is chosen based on the definition of the Levels of Conceptual Interoperability Model (LCIM): • Andreas Tolk, James Muguira, The Levels of Conceptual Interoperability Model (LCIM), Fall Simulation Interoperability Workshop, Orlando, FL, September 2003 • Andreas Tolk, Composable Mission Spaces and M&S Repositories - Applicability of Open Standards, Spring Simulation Interoperability Workshop, Washington, D.C., April 2004 • LCIM is expected to become part of the Software Integration standard of the Software Engineering Institute of CMU: http://www.sei.cmu.edu/isis/guide/introduction/lcim.htm
Levels of Conceptual Interoperability (LCIM) • On level 0, no connection is established at all. • On level 1, the technical level, physical connectivity is established allowing bits and bytes to be exchanged. • On level 2, the syntactical level, data can be exchanged in standardized formats, i.e., the same protocols and formats are supported. • On level 3, the semantic level, not only data but also its contexts, i.e. information, can be exchanged. The unambiguous meaning of data is defined by common reference models. • On level 4, the pragmatic/dynamical level, information and its use and applicability, i.e. knowledge, can be exchanged. The applicability of information is here defined in an unambiguous form. • On level 5, the conceptual level, a common view of the world is established, i.e. an epistemology. This level not only comprises the implemented knowledge, but also the interrelations between these elements.
Amine Platform Man eats food with spoon
CharGer CharGer uses an XML file for CGs
Interoperability problems of current systems based on conceptual graphs • All available CG Tools are the result of research projects • Implementation of CGIF; non-availability of content • Software platform and test suites • No standardized display form • No Web services available • No standardization of CGs in the Semantic Web languages • Reasoning with Conceptual Graphs • Persistence and Scalability • Internal formats of CG representations used in the tools
Levels of Conceptual Interoperability (LCIM) for existing systems (1/2) • Level 1, the technical level – we have it. Most of the tools are in Java and/or can interact with Java (e.g. HTTP access to WebKB). The authors of CG Tools must find information exchange points that can be used to exchange data and/or components. • Level 2, the syntactical level – we do not have it. We are far from achieving CGIF interoperability. If we implement common standards like XML, WSDL and UDDI and register services in a common registry, we will have it.
Levels of Conceptual Interoperability (LCIM) for existing systems (2/2) • Level 3, the semantic level – we do not have it. We will achieve it if we agree on a common understanding of what CGIF and CG are and how they must be processed and visualized by the CG Tools. • Level 4, the pragmatic/dynamical level – we do not have it. We need to define a common architecture or standard that is open enough to allow components from one tool to be reused in another, together with test data sets and services interoperability. • Level 5, the conceptual level – we do not have it. Ontology standards are required in order to achieve it. A good direction is the Standard Upper Ontology (http://suo.ieee.org/).
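The assessment on these two slides can be recorded as data. The Python sketch below is only an illustration: the enum member names and the function highest_level are hypothetical, and it assumes the levels are cumulative, so a level counts as reached only if every lower level is reached as well.

```python
from enum import IntEnum


class LCIM(IntEnum):
    """The six LCIM levels, numbered as on the slides."""
    NONE = 0
    TECHNICAL = 1
    SYNTACTIC = 2
    SEMANTIC = 3
    PRAGMATIC = 4
    CONCEPTUAL = 5


# Assessment of the existing CG tools as stated on the slides.
assessment = {
    LCIM.TECHNICAL: True,    # tools are in Java and/or can interact with Java
    LCIM.SYNTACTIC: False,   # no CGIF interoperability yet
    LCIM.SEMANTIC: False,    # no common understanding of CGIF and CG
    LCIM.PRAGMATIC: False,   # no common architecture or standard
    LCIM.CONCEPTUAL: False,  # ontology standards still required
}


def highest_level(assessment):
    """Return the highest level reached, treating the levels as cumulative."""
    reached = LCIM.NONE
    for level in sorted(assessment):
        if not assessment[level]:
            break
        reached = level
    return reached


print(highest_level(assessment).name)
```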
Current results (1/2) • The formal methods for knowledge representation, modeling, acquisition and management have been analyzed and classified. • The main methods for visualization of knowledge bases have been studied. • Algorithms for representing knowledge in different formalisms have been created, as well as algorithms for converting between them. • The component architecture of a system for knowledge representation, modeling, acquisition and management has been designed. It is based on natural language processing technologies. • An approach for using automatically annotated texts has been developed, including editing of the annotations by knowledge engineers.
Current results (2/2) • CGWorld - A Web Based Workbench for Conceptual Graphs Management and Applications • Application of CGWorld in Larflast – financial domain. • Integration of existing systems (DBR-MAT) • CGExtract – extraction of conceptual graphs from controlled English • ViSem – prototype for semantic annotations using conceptual graphs. • CGWorld is available at: http://larflast.bas.bg:8080/
Conclusion and Further Work • The general idea is to provide a set of components that can be used as building blocks for CG applications • The integration of CGs in web page annotation enables: • better visualization • easy editing and enrichment of annotations • Future directions • more on visualization of semantic web knowledge • A lot of work must be done before we can really say that we have interoperability between the CG Tools.
Publications and references • Number of references related to the PhD thesis – 10 • Conferences: ICCS 2000, ICCS 2001, ICCS 2002, AIMSA 2004, ICCS 2005, BIS 21++ Information Days, ICCS 2006 (accepted for publication) • References of these publications – 12. With a large number of references – 4: • CGWorld - A Web Based Workbench for Conceptual Graphs Management and Applications • CGExtract: towards Extraction of Conceptual Graphs from Controlled English • References to the CGWorld home page – 8
Publications related to the PhD thesis • P. Dobrev. CG Tools Interoperability and the Semantic Web Challenges, accepted for publication in Contributions to ICCS 2006 - 14th International Conference on Conceptual Structures, Aalborg University Press • P. Dobrev. Knowledge Management in Natural Language Processing Using Conceptual Graph, BIS 21++ Information Days, 21 - 23 March 2006, Velingrad, Hotel Kamena, http://www.euromap.bas.bg/velingrad/PDobrev.zip • P. Dobrev, A. Strupchanska. Conceptual Graphs and Annotated Semantic Web Pages. In Common Semantics for Sharing Knowledge: Contributions to ICCS 2005, 13th International Conference on Conceptual Structures, ICCS 2005, Kassel, Germany, pp. 54-65, ISBN 3-89958-138-5 • Dobrev P., Strupchanska A., Angelova G., Towards a Better Understanding of the Language Content in the Semantic Web, AIMSA 2004, Varna, Bulgaria, September 2004 • Dobrev P., Toutanova K., CGWorld - Architecture and Features, ICCS 2002, Borovets, Bulgaria, July 2002, Lecture Notes in Computer Science 2393, Springer 2002, ISBN 3-540-43901-3 • Dobrev P., Strupchanska A., Toutanova K., CGWorld - from Conceptual Graph Theory to the Implementation, Applications with Conceptual Structures Workshop at ICCS-2002, Borovets, Bulgaria, July 2002 • Boytcheva Sv., P. Dobrev and G. Angelova. CGExtract: towards Extraction of Conceptual Graphs from Controlled English. In: G. Mineau (Ed.), Conceptual Structures: Extracting and Representing Semantics, Contributions to ICCS-2001, the 9th Int. Conference on Conceptual Structures, Stanford, California, August 2001, pp. 89-116. • Dobrev P., Strupchanska A., Toutanova K., CGWorld-2001 - New Features and New Directions, CGTools Workshop at ICCS-2001, Stanford, CA, USA, August 2001, electronic proceedings at http://www.cs.nmsu.edu/~hdp/CGTools/proceedings/papers/CGWorld.pdf • A. Strupchanska, P. Dobrev, S. Boytcheva, T. Nikolov, K. Toutanova, Sample Knowledge Base in Finance, Contribution to CGTools Workshop at ICCS 2001 (http://www.ksl.stanford.edu/iccs2001/CGTools/) • Dobrev P., Toutanova K., CGWorld - A Web Based Workbench for Conceptual Graphs Management and Applications, In Proceedings of the ICCS-2000 (Working with Conceptual Structures), Darmstadt, Germany, August 2000
Plan for finishing the PhD thesis • A paper for the journal Information and Control that presents a summary of the results of the PhD thesis. • A paper for ICCS 2007 that includes new results of the PhD student. • Finishing the full text of the PhD thesis in 9-12 months: • Extending the survey of the current methods and systems for knowledge representation, modeling, acquisition and management • Including new results of the PhD student • Extending the explanation of the existing systems that are based on the results included in the PhD thesis • Extending the English-Bulgarian dictionary for the domain