Formal Modelling of Natural Language

Explore the role of computational linguistics (CL) in the digital age, from NLP to linguistic knowledge formalization. Discover tools and concepts for language technology and text evaluation.

Presentation Transcript

  1. Formal Modelling of Natural Language Kais Allkivi-Metsoja Information Society Approaches and ICT Processes 10.04.2019

  2. “The idea of giving computers the ability to process human language is as old as the idea of computers themselves.” (Jurafsky & Martin, 2009) process = understand + produce

  3. To get started… Questions to you: • Which language technology tools do you know? • Which tools have you used yourself? (Maybe even recently?) Why and how?

  4. Outline for today • WHAT? Main concepts related to computational linguistics (CL) • HOW? Levels and types of language processing • WHY? CL’s contribution to the smart society • WHAT EXACTLY? CL in language learning/teaching → automated text evaluation

  5. Dear child has many names • Computational linguistics (CL) • (Human) language technology (LT) • Natural language processing (NLP) • Speech and language processing • Language engineering • …

  6. Computational linguistics • Discipline between linguistics and computer science • Study of human language from a computational perspective, i.e., with the help of and for computers • Provides computational (rule-based or statistical) models of linguistic phenomena

  7. Language technology • Applied side of CL • Uses formal models of language for practical purposes • Broad definition: • 1) Process and methods of applying knowledge of human language to create software systems; • 2) The resulting computer programs and electronic devices • Narrow definition: • Set of software systems designed to handle natural language

  8. So what about NLP? • A method to translate between computer and human languages. (Techopedia) • Study concerned with the interactions between computers and human languages, in particular how to program computers to process and analyse large amounts of natural language data. (Wikipedia) • Coincides with the methodological aspect of LT.

  9. LT in the context of ICT (META-NET series “Europe's Languages in the Digital Age”, for Estonian: Liin et al., 2012)

  10. Levels of linguistic knowledge • Phonetics and Phonology – knowledge about linguistic sounds; • Morphology – knowledge of the meaningful components of words; • Syntax – knowledge of the structural relationships between words; • Semantics – knowledge of meaning; • Pragmatics – knowledge of the relationship of meaning to the goals and intentions of the speaker; • Discourse – knowledge about linguistic units larger than a single utterance.

  11. Formalizing linguistic knowledge • Generative grammar proposed by Chomsky (1956) • For sentences like Väike Mari laulab hästi ‘Little Mari sings well’ • T (word forms) = {väike:Adj, tubli:Adj, Mari:N, Jüri:N, kirjutab:V, laulab:V, hästi:Adv, meelsasti:Adv} • N (grammatical categories) = {S, VP, NP, N, V, Adj, Adv} • P (production rules) = {S -> NP VP, NP -> Adj NP, NP -> N, VP -> VP Adv, VP -> V}
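
The grammar on this slide is small enough to run. The following is a minimal sketch, assuming a CYK-style chart recognizer (chosen here because the rule VP -> VP Adv is left-recursive and would loop in a naive top-down parser). The names `LEXICON`, `UNARY`, `BINARY`, `span_categories` and `is_sentence` are illustrative and do not come from the slides.

```python
# Word forms (T) with their categories, as listed on the slide.
LEXICON = {
    "väike": "Adj", "tubli": "Adj",
    "Mari": "N", "Jüri": "N",
    "kirjutab": "V", "laulab": "V",
    "hästi": "Adv", "meelsasti": "Adv",
}

# Production rules (P), indexed by their right-hand side.
UNARY = {"N": ["NP"], "V": ["VP"]}          # NP -> N, VP -> V
BINARY = {
    ("NP", "VP"): ["S"],                    # S  -> NP VP
    ("Adj", "NP"): ["NP"],                  # NP -> Adj NP
    ("VP", "Adv"): ["VP"],                  # VP -> VP Adv
}


def span_categories(words):
    """Return the set of categories that can cover the whole word sequence."""
    n = len(words)
    if n == 0:
        return set()
    chart = {}
    for i, word in enumerate(words):
        # Fall back to the lower-cased form for sentence-initial capitals.
        cat = LEXICON.get(word) or LEXICON.get(word.lower())
        cats = set()
        if cat:
            cats.add(cat)
            cats.update(UNARY.get(cat, []))
        chart[(i, i + 1)] = cats
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            cats = set()
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        cats.update(BINARY.get((left, right), []))
            chart[(start, end)] = cats
    return chart[(0, n)]


def is_sentence(text):
    """True if the grammar derives the text from the start symbol S."""
    return "S" in span_categories(text.split())
```

Under these assumptions, `is_sentence("Väike Mari laulab hästi")` holds, while a word order the rules do not license, such as `"Mari hästi laulab"`, is rejected.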

  12. Formalizing linguistic knowledge • Constraint Grammar formalism (Karlsson et al., 1995) • Gradual exclusion: attaching all possible readings to each word, then removing the unfitting ones based on constraints (rules) • Example: REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN); remove a finite verb reading if the word itself (0) can also be a noun (N), and if there is an article (ART), possessive (<poss>) or genitive (GEN) one position to the left (-1). (Bick & Didriksen, 2015)
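
To make the "gradual exclusion" idea concrete, here is a minimal sketch of the single rule quoted above, written in plain Python rather than in a real Constraint Grammar engine. The data layout (a cohort as a word form plus a list of tag sets) and the names `LEFT_CONTEXT` and `remove_vfin` are assumptions made for illustration. The sketch follows two conventions of the formalism as generally described: a context test succeeds if any reading of the target position matches, and the last remaining reading of a word is never removed.

```python
# Tags that license the removal when found one position to the left.
LEFT_CONTEXT = {"ART", "<poss>", "GEN"}


def remove_vfin(cohorts):
    """Apply REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN).

    cohorts: list of (word_form, readings), each reading being a set of tags.
    Returns a new list; the input is left unchanged.
    """
    result = []
    for i, (form, readings) in enumerate(cohorts):
        kept = readings
        if i > 0:
            left_readings = result[i - 1][1]
            self_can_be_noun = any("N" in r for r in readings)
            left_matches = any(r & LEFT_CONTEXT for r in left_readings)
            if self_can_be_noun and left_matches:
                without_vfin = [r for r in readings if "VFIN" not in r]
                if without_vfin:  # never delete the last reading
                    kept = without_vfin
        result.append((form, kept))
    return result


# "walk" is ambiguous between a noun and a finite verb reading.
AMBIGUOUS_WALK = [{"N", "SG"}, {"VFIN", "PRES"}]
after_article = remove_vfin([("the", [{"ART"}]), ("walk", AMBIGUOUS_WALK)])
after_pronoun = remove_vfin([("they", [{"PRON"}]), ("walk", AMBIGUOUS_WALK)])
```

In `after_article` only the noun reading of "walk" survives; in `after_pronoun` the context test fails and both readings are kept for later rules to resolve.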

  13. Morphosyntactic analysis Using EstCG parser (University of Tartu) • Part of speech analysis → full morphological analysis → shallow syntactic analysis → dependency syntactic analysis • "<Mina>" (‘I’) "mina" L0 P pers ps1 sg nom cap @SUBJ #1->2 Syntactic tagset: https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf

  14. Semantic analysis (English) – using the USAS semantic tagger (Lancaster University) • Every_N5.1+ human_S2mf has_A9+ a_Z5 unique_N5--- personality_S1.2 ._PUNC Tagset description: http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf

  15. EstCG 1: “light” parsing + disambiguator
      Iga ‘Every’ iga+0 //_P_ det sg nom #cap // **CLB @NN>
      inimene ‘human’ inimene+0 //_S_ com sg nom // @SUBJ @ADVL (ambiguity but correct choice)
      on ‘is’ ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
      kordumatu ‘unique’ kordumatu+0 //_A_ pos sg nom // @AN>
      isiksus ‘personality’ isiksus+0 //_S_ com sg nom // @SUBJ @PRD (ambiguity and incorrect choice)
      . //_Z_ Fst //

  16. EstCG 2: dependency parsing + disambiguator
      "<s>"
      "<Iga>" "iga" L0 P det sg nom cap @NN> #1->2
      "<inimene>" "inimene" L0 S com sg nom @SUBJ @ADVL #2->3 (ambiguity but correct choice)
      "<on>" "ole" L0 V main indic pres ps3 sg ps af <FinV> <Intr> @FMV #3->3
      "<kordumatu>" "kordumatu" L0 A pos sg nom @AN> #4->5
      "<isiksus>" "isiksus" L0 S com sg nom @PRD @SUBJ #5->3 (ambiguity but correct choice)
      "<.>" "." Z Fst #6->6
      "</s>"

  17. EstCG 3: dependency parser + disambiguator
      "<s>"
      "<Iga>" "iga" L0 P det sg nom cap @NN> #1->2
      "<inimene>" "inimene" L0 S com sg nom @SUBJ #2->3
      "<on>" "ole" L0 V main indic pres ps3 sg ps af @FMV #3->0
      "<kordumatu>" "kordumatu" L0 A pos sg nom @AN> #4->5
      "<isiksus>" "isiksus" L0 S com sg nom @PRD #5->3
      "<.>" "." Z Fst CLB #6->6
      "</s>"
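
Parser output of this kind is easy to post-process. The sketch below reads the slide 17 analysis into a list of tokens with their syntactic function labels and dependency links (`#id->head`). It assumes the usual Constraint Grammar stream layout, with the word form on one line and its reading on the next, and the sample is lightly normalized from the slide (spaces restored between tags). The function and field names are illustrative, not part of EstCG.

```python
import re

# Slide 17 output, laid out one cohort per line pair (an assumed layout).
SAMPLE = '''"<s>"
"<Iga>"
    "iga" L0 P det sg nom cap @NN> #1->2
"<inimene>"
    "inimene" L0 S com sg nom @SUBJ #2->3
"<on>"
    "ole" L0 V main indic pres ps3 sg ps af @FMV #3->0
"<kordumatu>"
    "kordumatu" L0 A pos sg nom @AN> #4->5
"<isiksus>"
    "isiksus" L0 S com sg nom @PRD #5->3
"<.>"
    "." Z Fst CLB #6->6
"</s>"
'''

WORD_FORM = re.compile(r'^"<(.+)>"$')       # e.g. "<Iga>"
READING = re.compile(r'^"([^"]*)"\s+(.*)$')  # e.g. "iga" L0 P det ...
DEPENDENCY = re.compile(r'^#(\d+)->(\d+)$')  # e.g. #1->2


def parse_cg(output):
    """Return one dict per analysed token: form, lemma, functions, id, head."""
    tokens, form = [], None
    for raw in output.splitlines():
        line = raw.strip()
        word_match = WORD_FORM.match(line)
        if word_match:
            form = word_match.group(1)
            continue
        reading = READING.match(line)
        if not reading:
            continue
        tags = reading.group(2).split()
        token = {
            "form": form,
            "lemma": reading.group(1),
            "functions": [t for t in tags if t.startswith("@")],
        }
        for tag in tags:
            dep = DEPENDENCY.match(tag)
            if dep:
                token["id"] = int(dep.group(1))
                token["head"] = int(dep.group(2))
        tokens.append(token)
    return tokens


tokens = parse_cg(SAMPLE)
# The root is the token whose head is 0.
root = next(t for t in tokens if t["head"] == 0)
```

With this sample, `root["form"]` is the finite verb "on", and the sentence boundary markers `"<s>"` and `"</s>"` are skipped because they carry no reading.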

  18. Classification of LT • Form of language (text/speech) • Function (mostly analysis/synthesis) • Monolingual (language-specific) vs. multilingual (language-independent) • Target audience (users vs. researchers/developers) • Underlying methodology (rule-based/statistical/hybrid)

  19. Rule-based approach • Language models are hand-built encoded representations of linguistic analyses. • E.g., rule-based morphological analysis and synthesis (production of all possible forms of a word) require lists of word stems, case-endings and affixes together with instructions for combining them. • See earlier examples!
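
As a toy illustration of "lists of word stems and case endings together with instructions for combining them", the sketch below synthesizes some singular case forms of two Estonian nouns and analyses a word form by reverse lookup. The two-word lexicon, the case abbreviations and the function names are assumptions for illustration; the only combination rule used is that the listed endings attach to the genitive stem. A real analyser needs far more (stem alternations, plural, short illative forms and so on).

```python
# Hand-built lexicon: the three basic singular forms of each noun.
STEMS = {
    "raamat": {"nom": "raamat", "gen": "raamatu", "part": "raamatut"},  # 'book'
    "maja": {"nom": "maja", "gen": "maja", "part": "maja"},             # 'house'
}

# Case endings that attach to the genitive stem.
ENDINGS = {
    "ill": "sse",  # illative: into
    "ine": "s",    # inessive: in
    "ela": "st",   # elative: out of
    "all": "le",   # allative: onto
    "ade": "l",    # adessive: on
    "abl": "lt",   # ablative: off
    "tr": "ks",    # translative
    "kom": "ga",   # comitative: with
}


def synthesise(lemma):
    """Produce the case forms of a lemma covered by this toy lexicon."""
    stems = STEMS[lemma]
    forms = {"nom": stems["nom"], "gen": stems["gen"], "part": stems["part"]}
    for case, ending in ENDINGS.items():
        forms[case] = stems["gen"] + ending
    return forms


def analyse(word):
    """Return every (lemma, case) pair whose synthesized form equals the word."""
    return [
        (lemma, case)
        for lemma in STEMS
        for case, form in synthesise(lemma).items()
        if form == word
    ]
```

Here `analyse("raamatuga")` yields a single reading, whereas `analyse("maja")` yields three (nominative, genitive, partitive). That kind of ambiguity is exactly what a disambiguator such as the one in the EstCG examples has to resolve.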

  20. Statistical (data-driven) approach • Large collections of example texts are used to train language models through machine learning algorithms. • E.g., binary decisions (correct/incorrect) • one word at a time – classifiers: decision trees, support vector machines, Gaussian mixture models, logistic regression • all words in a sequence – hidden Markov models, maximum entropy Markov models, conditional random fields • For spell checkers, morphological and syntactic analysers – single language training data; for machine translation – parallel text collections
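
To contrast with the hand-written rules, the following is a minimal sketch of the sequence-model idea: a hidden Markov model part-of-speech tagger whose probabilities are estimated by counting in a tagged corpus and decoded with the Viterbi algorithm. The four-sentence training set, the tag names and the add-one smoothing are assumptions chosen to keep the example tiny; real taggers are trained on large annotated corpora.

```python
import math
from collections import Counter, defaultdict

# Toy training data: sentences as lists of (word, tag) pairs.
TRAIN = [
    [("the", "ART"), ("walk", "N"), ("helps", "V")],
    [("they", "PRON"), ("walk", "V")],
    [("the", "ART"), ("dog", "N"), ("sleeps", "V")],
    [("we", "PRON"), ("sleep", "V")],
]


def train(corpus):
    """Count tag-to-word emissions and tag-to-tag transitions."""
    emit = defaultdict(Counter)   # tag -> word counts
    trans = defaultdict(Counter)  # previous tag -> next tag counts
    vocab = set()
    for sentence in corpus:
        prev = "<s>"              # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            vocab.add(word)
            prev = tag
    return emit, trans, vocab


def viterbi(words, emit, trans, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)

    def log_emit(tag, word):      # add-one smoothed P(word | tag)
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + 1) / (total + len(vocab)))

    def log_trans(prev, tag):     # add-one smoothed P(tag | prev)
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    # best[tag] = (log probability, path) of the best sequence ending in tag
    best = {t: (log_trans("<s>", t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_trans(p, t), best[p][1]) for p in tags
            )
            step[t] = (score + log_emit(t, word), path + [t])
        best = step
    return max(best.values())[1]


MODEL = train(TRAIN)
after_article = viterbi(["the", "walk"], *MODEL)
after_pronoun = viterbi(["they", "walk"], *MODEL)
```

The ambiguous word "walk" comes out as a noun after "the" and as a verb after "they". No rule says so explicitly; the decision follows from the transition counts learned from the examples.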

  21. Example: recurrent neural network for language generation (Lepik, 2015)

  22. Hybrid approaches • Combining rule-based and statistical approaches, e.g., • Feature enrichment – output of a rule-based approach used as good-quality features for a statistical model • Rule extraction – a set of rules is extracted from tagged text using machine learning (Soriano Morales, 2013)

  23. Text vs. speech technologies • Speech technologies • Speech recognition, speech synthesis, speaker recognition and verification • Text technologies • Web search, spelling and grammar checking, lemmatization, automated tagging (morphological, syntactic and semantic), text summarisers, sentiment analysis tools (e.g. emotion detectors) • What about dialogue systems and machine translation?

  24. CL and LT in Estonia • Started in the 1950s (machine translation) • Four academic institutions • University of Tartu, Institute of the Estonian Language, Tallinn University of Technology, Tallinn University • Nationally funded, emphasized in development plans and political strategies • Core language resources and LT tools are at a satisfactory level. • Advancements needed in semantic, pragmatic, discourse and multimodal analysis and in language generation • Tools: https://www.keeleressursid.ee/en/resources

  25. Implications for smart society • Language is the prime vehicle in which information is encoded, by which it is accessed and through which it is disseminated. (Cole et al., 1997)

  26. What is a smart society? • Smart society “successfully harnesses the potential of digital technology and connected devices and the use of digital networks to improve people’s lives.” (Levy & Wong, 2014) • It is a society where the leaders and citizens thoughtfully deploy digital technology and make data-based decisions, which can improve social well-being, productivity, economic strength, the effectiveness of governing institutions and, ultimately, the quality of life. (Chakravorti & Chaturvedi, 2017; Haupt, 2017; Levy & Wong, 2014)

  27. How can LT contribute? Some examples • Connecting a multilingual global society • Machine translation makes it possible to hold conversations and browse online content in various foreign languages. • More comfortable and accessible ICT • Speech recognition and synthesis → AI voice assistants, car navigation systems, smart home appliances; help in case of visual/hearing impairment • Reducing human workload • Dialogue systems answering routine questions, automated assessment tools helping teachers

  28. More examples • Big data analysis • Structuring vast amounts of largely unstructured (textual) data to make sense of them (patterns, connections, trends) • E.g., EstNLTK toolkit for Python: • detecting word, clause and sentence boundaries, lemmatisation, morphological and syntactic analysis, named entity recognition, and defining synonymous (conceptually related) words • Human-level AI development • High-level NLP (text semantics, pragmatics, discourse)

  29. CL and LT at TLU • Research group founded at the former Institute of Estonian Language and Culture • New sub-area of Applied Informatics at the School of Digital Technologies • Focus on learner language analysis • Largest Estonian learner language resource Estonian Interlanguage Corpus (EIC) http://evkk.tlu.ee/?language=en • Tools: morphosyntactic clustering of words / word sequences, character sequence clustering, word frequency listing, syllabification.

  30. Current research aims • To determine the linguistic features that distinguish the communicative proficiency levels (A1, A2, B1, B2, C1, C2) of Estonian as a second and foreign language (L2). • To develop a statistical model for automated assessment of written learner texts (proficiency level + more detailed feedback). • To use these models to develop level-specific e-exercises and language learning games, and to compile level dictionaries. • Expected outcomes: • e-learning environment for L2 Estonian (from language learning researcher corpus → language learner corpus) • improved quality of language testing and study materials

  31. Estonian Interlanguage Corpus • Collection of texts mostly written by learners of Estonian as L2. • Development began in 2005. New material constantly added. • Contains 3.5 million words, 12,500 texts • Native speakers of mostly Russian but also Finnish, English, German, Lithuanian, Ukrainian, Hungarian, Polish, Swedish, Latvian, Belarusian. • Essays, personal and official letters, narrative texts, answers to questions, grammar exercises etc.

  32. Our linguistic approach • Usage-based study of Estonian grammar • How does the language evolve, both in history (e.g. the fiction of the 1890s and 1990s) and in language acquisition (through levels A1–C2, L2 vs. L1)? • Detecting patterns of language use: • How are a) parts of speech (verbs, nouns, adverbs etc.); b) words with a certain morphological form (e.g. singular noun in the genitive case, 3rd person verb in the simple past tense); and c) words with a certain syntactic function (e.g. subject, object, adverbial) combined with each other? • Which words/phrases are used in these structures?
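
One simple way to look for such patterns is to count sequences of part-of-speech tags and record which word sequences realize each of them. The sketch below does this for tag bigrams. It is an illustrative assumption, not a description of how the TLU tools work, and the two tagged sentences are hand-assigned using the tag letters from the EstCG slides (P pronoun, S noun, V verb, A adjective).

```python
from collections import Counter, defaultdict

# Hand-tagged toy input: sentences as lists of (word, part-of-speech) pairs.
TAGGED = [
    [("Iga", "P"), ("inimene", "S"), ("on", "V"),
     ("kordumatu", "A"), ("isiksus", "S")],
    [("Tubli", "A"), ("inimene", "S"), ("laulab", "V")],
]


def pos_ngrams(sentences, n=2):
    """Count part-of-speech n-grams and collect the words realizing each."""
    counts = Counter()
    examples = defaultdict(list)
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            pattern = tuple(tag for _, tag in window)
            counts[pattern] += 1
            examples[pattern].append(" ".join(word for word, _ in window))
    return counts, examples


counts, examples = pos_ngrams(TAGGED, n=2)
```

In this toy input the adjective + noun pattern `("A", "S")` occurs twice, realized by "kordumatu isiksus" and "Tubli inimene". Running the same counts separately on texts from different proficiency levels would show which structures become more frequent as learners advance.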

  33. Linguistic Cluster Analysis Tool Klastrileidja (Cluster Catcher)

  34. Automated writing assessment • E.g., IntelliMetric and e-rater used for scoring essays, SpeechRater for speech scoring • CEFR-based (A1–C2) grading prototypes developed for German (Hancke, 2013), Czech (Rysová et al., 2016), Swedish (Pilán, 2018), English short answers (Tack et al., 2017) • Promising attempts at predicting the CEFR level of Estonian learner texts based on EIC • Vajjala & Lõo, 2014 (accuracy: up to 79%) • Hallik, 2016 • Kossinski, 2018 (accuracy: approx. 72%)

  35. Let’s give our prototype a try! • http://minitorn.tlu.ee/~jaagup/oma/too/19/03/tasemed1.php (Kossinski, 2018)

  36. Discussion! 15 minutes • What would you as a language learner expect from language technology tools? • Why would you use them? In which situations?

  37. End discussion • Which connections do you see between your research area and language technology? • What suggestions could you give me?

