Izmit, June 12, 2014. Natural Language Technologies at the Faculty of Mathematics and Computer Science of the Adam Mickiewicz University in Poznań. Zygmunt Vetulani. Adam Mickiewicz University in Poznań Dept of Computer Linguistics and Artificial Intelligence vetulani@amu.edu.pl.
Natural language technologies have been developed at UAM in Poznań for many years. Some of these activities started in the 1970s and resulted in significant achievements, e.g. in the area of speech synthesis (text-to-speech) (M. Steffen-Batóg). Systematic NL-related activities at the Faculty of Mathematics and Computer Science are more recent and started after a research visit by Vetulani to the University Aix-Marseille II, in the Artificial Intelligence Group headed by Alain Colmerauer (1984). (Individual research started in the 1980s.)
Recent works
On the basis of the know-how and technologies obtained so far, we started in 2006 a large project (POLINT-112-SMS) which integrates several NL technologies. This project was funded by the Polish Government from 2006 to 2010 (within a larger program "Text processing technologies for public security purposes", Grant MNiSzW nr R00 028 02) and is being continued (Zygmunt Vetulani). Now: main focus on the development of PolNet (a Polish Wordnet).
POLINT-112-SMS Team
Recent research
Department of Computer Linguistics and Artificial Intelligence
PL-61614 Poznań, ul. Umultowska 87
tel. +48-61-8295380, fax +48-61-8295315
Head of Department: Prof. dr hab. Zygmunt Vetulani
http://main/amu.edu.pl/~zlisi
vetulani@amu.edu.pl
Department members and close collaborators of the Polint-112-SMS project in 2010
Polint-112-SMS is intended to collect and process information reported to the system by human operators in natural language (text). The objective of the processing is to produce summary reports describing, in real time, a dynamically evolving situation.
Applications of various kinds may be built on the basis of such a system:
• monitoring natural disasters (earthquakes, floods, forest fires, volcanic eruptions)
• monitoring the crowd at mass events (e.g. "high-risk" football matches)
• supporting human explorers (groups) in difficult, unknown environments
• various possible military use cases
Processing tasks:
- visualisation
- decision support
The main technology of the project is language understanding. As a case study we proposed monitoring a soccer stadium during a match with a large number of supporters. Such situations usually generate a number of risks. High-risk matches are typically covered by video and human monitoring (often at 100% coverage). The main functionality of POLINT-112-SMS is to cooperate interactively with humans and to assist decision making in critical situations that require immediate action. Decisions must be taken on the basis of an analysis of the current situation. The system compiles a representation of the current situation on-line from information elements and processes it to obtain decision-supporting elements. Still, the stadium security experts we consulted consider that video monitoring is usually insufficient and that human on-site supervision is necessary. There remains the problem of how to ensure communication and how to fully interpret the messages sent by the informers.
The current situation is monitored on-line by humans. The observers/informers are expected to supply information. The blue arrows represent pieces of information sent directly to the CCM, while the red arrows stand for messages exchanged between the informers and the computer. The blue messages may also be seen by the computer. The system is interactive: it may take control of the dialogue and address questions and messages to the informers. It is also interactive with respect to the CCM: it informs the CCM and may receive questions from it.
The POLINT-112-SMS system is a proposal for a technological solution. Its communication mode is short text messages (similar to ordinary SMS messages) sent from mobile phones. This mode avoids the use of voice, which is of little use in a noisy environment and which may unmask the police informers (something to be avoided wherever possible). Messages are sent by the informers to the machine through the SMS gate and are then processed by the system.
The name POLINT-112-SMS refers to:
- the family of versions of the POLINT system developed so far,
- the 112 emergency number (services such as 112 may be supported by systems such as POLINT-112-SMS),
- SMS technology, which is cheap and popular.
Messages are exchanged in natural (human) language (Polish). This means that the system must be endowed with language competence as well as communicative competence. The prototype has been tested by public security experts both in simulated situations and in real-life situations during a football match at the city stadium in Poznań. Test messages (SMS) were exchanged using public cellular phones.
POLINT-112-SMS system architecture
In red: modules using natural language technologies.
a) SMS gate (capturing text)
b) Natural Language Processing module
 - understanding
 - generation
c) Dialogue Maintenance Module
d) Situation Analysis Module
 - disambiguation
 - reasoning
 - information search/query answering
e) Temporal analysis module
f) Knowledge processing module
g) Ontology (PolNet)
h) Knowledge Bases
 - about events
 - about acts of communication
i) CCM terminal (admin)
 - visualisation
 - administration
 - capturing and displaying text
The POLINT-112-SMS system is a product of man-machine communication technology, an AI technology that uses, in particular, various (lower-level) natural language technologies. Man-machine communication requires the implementation of appropriate man-machine interfaces. In the case of POLINT-112-SMS we use NL text interfaces dedicated to two kinds of users: information suppliers (informers) and target beneficiaries (CCM staff). The informers' messages, queries and answers enter the system from mobile phones through the SMS gate. The other input-output device is the terminal at the CCM. It receives and outputs text and displays the images received from the visualisation submodule. It can also display the past dialogue in the form of structured text.
The POLINT-112-SMS system is a product of man-machine communication technology (which is an AI technology). Of the language technologies involved, understanding is the highest-level one. The NLP and Dialogue Maintenance modules both contribute to understanding. The understanding software takes an element of the text and interprets it, i.e. it computes a representation of it, which is then submitted to further processing. Typically, the procedure of understanding a question produces a formal object which initiates a procedure responsible for finding the answer.
The technologies which contribute to comprehension are parsing and discourse analysis. Parsing in POLINT-112-SMS is carried out in the PROLOG programming language (PROLOG may be considered a shell for expert systems), whose interpreter can be used as a parser for a properly formalized grammar (e.g. a CFG /context-free grammar/ or a DCG). The main drawback: PROLOG may be inefficient. Our solution: heuristic parsing, where the main module (expensive when backtracking) is preceded by a cheap pre-analysis which simplifies the input and generates heuristics whose role is to control the execution of parsing (reduction of nondeterminism). (By "heuristics" we mean a procedure which guides the parser to make correct choices.)
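The idea of heuristic parsing can be illustrated with a minimal sketch. It is written in Python rather than PROLOG, and the toy English grammar, the lexicon and the names `pre_analyse`, `parse` and `recognise` are all assumptions made for illustration; this is not the POLINT-112-SMS grammar or parser. A cheap pre-analysis tags the words and decides whether the message looks like a question, and the resulting heuristic lets the backtracking parser skip alternatives that cannot succeed.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical toy grammar: each non-terminal has alternative right-hand sides.
GRAMMAR: Dict[str, List[List[str]]] = {
    "S": [["QUESTION"], ["REPORT"]],
    "QUESTION": [["WH", "VERB", "NP"]],
    "REPORT": [["NP", "VERB", "NP"], ["NP", "VERB"]],
    "NP": [["DET", "NOUN"], ["NOUN"]],
}
# Hypothetical toy lexicon mapping words to terminal categories.
LEXICON = {"who": "WH", "the": "DET", "fans": "NOUN", "gate": "NOUN",
           "block": "VERB", "blocks": "VERB"}

Allow = Callable[[str, List[str]], bool]


def pre_analyse(message: str) -> Tuple[List[str], Allow]:
    """Cheap pass: tag the words and build a heuristic that filters rules."""
    tokens = message.lower().replace("?", " ").replace(".", " ").split()
    tags = [LEXICON[t] for t in tokens]  # raises KeyError on unknown words
    is_question = bool(tags) and tags[0] == "WH"

    def allow(symbol: str, rhs: List[str]) -> bool:
        # Only the choice at the top level is guided in this sketch.
        if symbol == "S":
            return rhs == (["QUESTION"] if is_question else ["REPORT"])
        return True

    return tags, allow


def parse(symbol: str, tags: List[str], pos: int,
          allow: Allow, stats: Dict[str, int]) -> Iterator[int]:
    """Backtracking recogniser: yield every position where `symbol` can end."""
    if symbol not in GRAMMAR:  # terminal category
        if pos < len(tags) and tags[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        if not allow(symbol, rhs):
            continue  # alternative pruned by the heuristic
        stats["tried"] += 1
        ends = [pos]
        for part in rhs:
            ends = [e2 for e in ends for e2 in parse(part, tags, e, allow, stats)]
        yield from ends


def recognise(message: str, use_heuristics: bool = True) -> Tuple[bool, int]:
    """Return (accepted?, number of rule alternatives tried)."""
    tags, allow = pre_analyse(message)
    if not use_heuristics:
        allow = lambda symbol, rhs: True
    stats = {"tried": 0}
    ends = list(parse("S", tags, 0, allow, stats))
    return len(tags) in ends, stats["tried"]
```

Calling `recognise("The fans block the gate.")` with and without heuristics accepts the sentence in both cases, but the guided run tries fewer rule alternatives, which is the effect the pre-analysis is meant to have.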
Correct and efficient parsing requires several lower-level technologies, which perform: segmentation (into sentences and words), lemmatisation, spell checking, simplification, disambiguation and named-entity recognition. In many cases, complete understanding is not possible on the basis of syntactic analysis alone (some context has to be taken into consideration).
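A minimal sketch of how three of these steps (segmentation, spell checking and lemmatisation) might be chained before parsing is given below. The tiny English dictionaries and the function names are assumptions for illustration only; a real system for Polish would rely on full morphological dictionaries, and the remaining steps (simplification, disambiguation, named-entity recognition) are not covered here.

```python
import re
from typing import List

# Hypothetical toy resources, standing in for real dictionaries.
SPELLING_FIXES = {"gat": "gate", "fnas": "fans"}
LEMMAS = {"fans": "fan", "gates": "gate", "running": "run", "are": "be"}


def split_sentences(text: str) -> List[str]:
    """Segment text into sentences at whitespace following ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence: str) -> List[str]:
    """Segment a sentence into lower-cased word tokens."""
    return re.findall(r"\w+", sentence.lower())


def preprocess(text: str) -> List[List[str]]:
    """Return one list of lemmas per sentence."""
    result = []
    for sentence in split_sentences(text):
        # Spell checking: replace known misspellings.
        words = [SPELLING_FIXES.get(w, w) for w in split_words(sentence)]
        # Lemmatisation: map inflected forms to base forms when known.
        result.append([LEMMAS.get(w, w) for w in words])
    return result
```

For example, `preprocess("Fnas are running. Gates closed!")` yields two sentences whose words have been corrected and reduced to base forms wherever the toy dictionaries know them.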
The role of discourse analysis is to produce a description of the organization of the discourse. It makes it possible to disambiguate those elements which remain ambiguous at the end of syntactic (compositional) analysis. In particular, knowledge of the discourse structure is necessary for anaphora resolution and therefore contributes to discourse understanding. In POLINT-112-SMS, discourse analysis is performed by the dialogue maintenance module (detection of co-references, anaphora resolution).
Parsing, as well as generation, depends on basic resources: grammars and dictionaries. The POLINT grammars were integrated and directly applied in the project. These grammars were developed for the successive versions of the POLINT question-answering systems produced since the 1990s. They are formally equivalent to definite clause grammars (DCGs) directly translated into PROLOG. They were then adapted so that they can be controlled by heuristics in order to minimize the nondeterminism of parsing. As a result, heuristics make parsing run in practically linear time. The POLINT dictionaries are of the lexicon-grammar kind.
An ontology: PolNet
Understanding, but also reasoning, may benefit from lexical databases of the wordnet type. Reasoning is essential for POLINT-112-SMS because we expect the system to have the characteristics of an expert system serving as a decision aid to a human agent. For that purpose a precise representation of the real situation must be generated (and visualized). /This is a knowledge engineering task./ Ontologies, which make it possible to systematize knowledge about individuals and classes /sets/ of individuals using attributes, relations and associations, are useful knowledge engineering tools for these purposes (cf. Linnaeus).
(Figure source: Wikipédia, "Réseau sémantique")
Formal ontologies are considered a means of "formalization of conceptualization" (Gruber). Ontologies may support reasoning because of their mathematical structure, e.g. hierarchies of concepts, which make it possible to implement default reasoning and inheritance. We have chosen WordNet-like ontologies. The term "WordNet" refers to the lexical database created at Princeton University (1985, George A. Miller), also known as the Princeton WordNet (PWN).
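How a concept hierarchy supports default reasoning with inheritance can be sketched in a few lines. The concepts, attributes and the `lookup` function below are hypothetical examples chosen to match the stadium scenario; they are not taken from PolNet. An attribute is looked up at the most specific concept first and then inherited from more general concepts, so a specific default overrides a general one.

```python
from typing import Any, Dict, Optional

# Hypothetical hierarchy: each concept points to its more general concept.
PARENT: Dict[str, Optional[str]] = {
    "hooligan": "supporter",
    "supporter": "person",
    "steward": "person",
    "person": None,
}
# Hypothetical default attributes attached to concepts.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "person": {"animate": True, "risk": "low"},
    "hooligan": {"risk": "high"},  # overrides the default inherited from person
}


def lookup(concept: str, attribute: str) -> Any:
    """Return the nearest value of `attribute` found on the way up the hierarchy."""
    node: Optional[str] = concept
    while node is not None:
        if attribute in DEFAULTS.get(node, {}):
            return DEFAULTS[node][attribute]
        node = PARENT.get(node)
    return None  # no concept on the path defines the attribute
```

With these data, `lookup("supporter", "risk")` falls back to the default stored for `person`, while `lookup("hooligan", "risk")` returns the more specific value.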
The main idea is simple: to gather synonyms into equivalence classes (which represent concepts) and to consider the relations holding between these classes. In practice, this idea is difficult to implement because of word polysemy. One has to consider disambiguated words (or, more precisely, word+word_sense pairs) instead of words. The equivalence classes of synonymous disambiguated words are called synsets.
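A synset can be pictured as a set of (word, sense number) pairs. The sketch below uses invented English entries and hypothetical identifiers and function names; it does not reproduce the PolNet or Princeton WordNet data format. It shows why polysemy forces the use of pairs: the word "bank" belongs to two different synsets, one per sense.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

WordSense = Tuple[str, int]  # (word, sense number)

# Hypothetical synsets: each one groups synonymous disambiguated words.
SYNSETS: Dict[str, Set[WordSense]] = {
    "bank_institution": {("bank", 1), ("depository", 1)},
    "bank_riverside": {("bank", 2), ("riverbank", 1)},
}


def build_index(synsets: Dict[str, Set[WordSense]]) -> Dict[str, Set[str]]:
    """Map each plain word to the identifiers of all synsets it occurs in."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for synset_id, members in synsets.items():
        for word, _sense in members:
            index[word].add(synset_id)
    return index


def synonyms(word: str, sense: int) -> Set[str]:
    """Return the other words in the synset of the given word sense."""
    for members in SYNSETS.values():
        if (word, sense) in members:
            return {w for w, _ in members if w != word}
    return set()  # unknown word sense
```

Here `build_index(SYNSETS)["bank"]` contains both synset identifiers, whereas `synonyms("bank", 1)` and `synonyms("bank", 2)` return different words, one set per sense.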
Another major issue with wordnets is that the language of a given community reflects a conceptualization which is specific to that community. Therefore a wordnet created for this language will not necessarily be isomorphic with the Princeton WordNet. For example, the Polish language does not make a distinction similar to the French distinction between the concept of fleuve and the concept of rivière (this is the reason for a typical mistake of Polish students of French, who tend to say "Seine est une rivière" instead of "Seine est un fleuve").
The above-mentioned problems were at the origin of our decision to create our wordnet from scratch. This decision resulted in PolNet (the full name of the project is "PolNet-Polish Wordnet"). PolNet (version v1.0) is distributed free of charge for non-commercial use under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
The PolNet synsets are linked by relations. The main relations are hyponymy and hyperonymy for noun synsets, and semantic roles for verbal synsets. The concepts to be represented in PolNet were selected on the basis of word frequencies observed in the National Corpus of Polish (IPI PAN, Przepiórkowski) and in small experimental corpora collected within the project. Word meanings were identified manually on the basis of traditional dictionaries of Polish. Synsets were also created manually, by lexicographers assisted by specialised software, namely the DEBVisDic system developed at Masaryk University in Brno (Czech Republic /Pala, Rambousek/).
The computer-aided manual processing supported by DEBVisDic allowed us to reach a quality that cannot be achieved in wordnets built entirely or mainly by automatic / statistical methods. This quality, however, comes at a high cost in expert work and verification.
PolNet was evaluated at 3 levels:
• on-line manual evaluation at coding time,
• with the help of a software tool (WQuery, Kubis),
• and within the POLINT-112-SMS application.
Initially, PolNet (v.0....) contained only noun synsets composed of simple words. Now:
• Nouns: about 11,700 synsets for 20,300 meanings (for 12,000 common nouns).
• Verbs: in 2011, the verbal part of PolNet consisted of about 1,500 synsets (for 900 verbs).
The verbal part is under development. Work on including compound nouns (in particular verb-noun collocations) is also at an advanced stage. For reference, we draw the reader's attention to the fact that the basic vocabulary sufficient for ordinary, everyday conversation has been estimated at 1000-2000 words. (According to Ogden /1930/, the size of Basic English is about 850 words.)
THANKS!