Explore the role of computational linguistics (CL) in the digital age, from NLP to linguistic knowledge formalization. Discover tools and concepts for language technology and text evaluation.
Formal Modelling of Natural Language Kais Allkivi-Metsoja Information Society Approaches and ICT Processes 10.04.2019
“The idea of giving computers the ability to process human language is as old as the idea of computers themselves.” (Jurafsky & Martin, 2009) process = understand + produce
To get started… Questions to you: • Which language technology tools do you know? • Which tools have you used yourself? (Maybe even recently?) Why and how?
Outline for today • WHAT? Main concepts related to computational linguistics (CL) • HOW? Levels and types of language processing • WHY? CL’s contribution to the smart society • WHAT EXACTLY? CL in language learning/teaching: automated text evaluation
Dear child has many names • Computational linguistics (CL) • (Human) language technology (LT) • Natural language processing (NLP) • Speech and language processing • Language engineering • …
Computational linguistics • Discipline between linguistics and computer science • Study of human language from a computational perspective, i.e., with the help of and for computers • Provides computational (rule-based or statistical) models of linguistic phenomena
Language technology • Applied side of CL • Uses formal models of language for practical purposes • Broad definition: • 1) Process and methods of applying knowledge of human language to create software systems; • 2) The resulting computer programs and electronic devices • Narrow definition: • Set of software systems designed to handle natural language
So what about NLP? • A method to translate between computer and human languages. (Techopedia) • Study concerned with the interactions between computers and human languages, in particular how to program computers to process and analyse large amounts of natural language data. (Wikipedia) • Coincides with the methodological aspect of LT.
LT in the context of ICT (META-NET series “Europe's Languages in the Digital Age”, for Estonian: Liin et al., 2012)
Levels of linguistic knowledge • Phonetics and Phonology – knowledge about linguistic sounds; • Morphology – knowledge of the meaningful components of words; • Syntax – knowledge of the structural relationships between words; • Semantics – knowledge of meaning; • Pragmatics – knowledge of the relationship of meaning to the goals and intentions of the speaker; • Discourse – knowledge about linguistic units larger than a single utterance.
Formalizing linguistic knowledge • Generative grammar proposed by Chomsky (1956) • For sentences like Väike Mari laulab hästi ‘Little Mari sings well’ • T (word forms) = {väike:Adj, tubli:Adj, Mari:N, Jüri:N, kirjutab:V, laulab:V, hästi:Adv, meelsasti:Adv} • N (grammatical categories) = {S, VP, NP, N, V, Adj, Adv} • P (production rules) = {S -> NP VP, NP -> Adj NP, NP -> N, VP -> VP Adv, VP -> V}
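To see how such a grammar decides whether a sentence is well formed, here is a minimal Python sketch of a recognizer. It assumes the rules are read as S -> NP VP, NP -> Adj NP, NP -> N, VP -> VP Adv, VP -> V; the function name `accepts` and the span-based search are illustrative choices, not part of the original slides.

```python
from functools import lru_cache

# Terminals (word forms) with their category, as listed on the slide.
LEXICON = {
    "väike": "Adj", "tubli": "Adj",
    "Mari": "N", "Jüri": "N",
    "kirjutab": "V", "laulab": "V",
    "hästi": "Adv", "meelsasti": "Adv",
}

# Production rules, assuming the reading given in the lead-in.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Adj", "NP"), ("N",)],
    "VP": [("VP", "Adv"), ("V",)],
}


def category(word):
    """Look the word up as written, then lower-cased (sentence-initial capital)."""
    return LEXICON.get(word) or LEXICON.get(word.lower())


def accepts(sentence):
    """Return True if the grammar can derive the whole sentence from S."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # Does `symbol` derive exactly the words in positions i..j-1?
        if symbol not in RULES:  # lexical category such as N or Adv
            return j == i + 1 and category(words[i]) == symbol
        for rhs in RULES[symbol]:
            if len(rhs) == 1:
                if derives(rhs[0], i, j):
                    return True
            else:
                left, right = rhs
                # Both parts must be non-empty, so left recursion terminates.
                if any(derives(left, i, k) and derives(right, k, j)
                       for k in range(i + 1, j)):
                    return True
        return False

    return bool(words) and derives("S", 0, len(words))
```

For example, `accepts("Väike Mari laulab hästi")` succeeds, while a verb-first order such as `accepts("laulab Mari")` is rejected because no rule lets a sentence start with VP.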
Formalizing linguistic knowledge • Constraint Grammar formalism (Karlsson et al., 1995) • Gradual exclusion: attaching all possible readings to each word, then removing the unfitting ones based on constraints (rules) • Example: REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN); remove a finite verb reading if the word itself (0) can also be a noun (N), and if there is an article (ART), possessive (<poss>) or genitive (GEN) one position to the left (-1). (Bick & Didriksen, 2015)
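The REMOVE rule above can be imitated in a few lines of Python. This is only a simplified sketch, not the VISL CG-3 engine: each word is assumed to be a list of readings, each reading a set of tags, and the function name `remove_vfin` and the English sample words are made up for illustration. As in Constraint Grammar, the last remaining reading of a word is never removed.

```python
# Tags that license the rule when found one position to the left (-1).
LEFT_CONTEXT = {"ART", "<poss>", "GEN"}


def remove_vfin(cohorts):
    """Apply REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN) to a sentence.

    cohorts: one list of readings per word; each reading is a set of tags.
    Returns a new list of cohorts; the input is left unchanged.
    """
    result = []
    for i, readings in enumerate(cohorts):
        kept = readings
        if i > 0:  # the rule needs a word at position -1
            can_be_noun = any("N" in r for r in readings)
            left_ok = any(r & LEFT_CONTEXT for r in cohorts[i - 1])
            if can_be_noun and left_ok:
                without_vfin = [r for r in readings if "VFIN" not in r]
                if without_vfin:  # never delete the last reading
                    kept = without_vfin
        result.append(kept)
    return result


# "the walks": article + word that is ambiguous between noun and finite verb
the_walks = [[{"ART"}], [{"N", "PL"}, {"VFIN", "PRS"}]]
# "she walks": no article, possessive or genitive to the left
she_walks = [[{"PRON"}], [{"N", "PL"}, {"VFIN", "PRS"}]]
```

After `remove_vfin(the_walks)` only the noun reading of the second word is left, whereas `remove_vfin(she_walks)` keeps both readings because the left-context condition fails.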
Morphosyntactic analysis Using EstCG parser (University of Tartu) • Part-of-speech analysis, full morphological analysis, shallow syntactic analysis, dependency syntactic analysis • "<Mina>" (‘I’) "mina" L0 P pers ps1 sg nom cap @SUBJ #1->2 Syntactic tagset: https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf
Semantic analysis (English) using the USAS semantic tagger (Lancaster University) • Every_N5.1+ human_S2mf has_A9+ a_Z5 unique_N5--- personality_S1.2 ._PUNC Tagset description: http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf
EstCG 1: “light” parsing + disambiguator
Iga ‘Every’ iga+0 //_P_ det sg nom #cap // **CLB @NN>
inimene ‘human’ inimene+0 //_S_ com sg nom // @SUBJ @ADVL (ambiguity but correct choice)
on ‘is’ ole+0 //_V_ main indicpres ps3 sg psaf #FinV #Intr // @+FMV
kordumatu ‘unique’ kordumatu+0 //_A_ pos sg nom // @AN>
isiksus ‘personality’ isiksus+0 //_S_ com sg nom // @SUBJ @PRD (ambiguity and incorrect choice)
. //_Z_ Fst //
EstCG 2: dependency parsing + disambiguator
"<s>"
"<Iga>" "iga" L0 P det sg nom cap @NN> #1->2
"<inimene>" "inimene" L0 S com sg nom @SUBJ @ADVL #2->3 (ambiguity but correct choice)
"<on>" "ole" L0 V main indicpres ps3 sg psaf <FinV> <Intr> @FMV #3->3
"<kordumatu>" "kordumatu" L0 A pos sg nom @AN> #4->5
"<isiksus>" "isiksus" L0 S com sg nom @PRD @SUBJ #5->3 (ambiguity but correct choice)
"<.>" "." Z Fst #6->6
"</s>"
EstCG 3: dependency parser + disambiguator
"<s>"
"<Iga>" "iga" L0 P det sg nom cap @NN> #1->2
"<inimene>" "inimene" L0 S com sg nom @SUBJ #2->3
"<on>" "ole" L0 V main indicpres ps3 sg psaf @FMV #3->0
"<kordumatu>" "kordumatu" L0 A pos sg nom @AN> #4->5
"<isiksus>" "isiksus" L0 S com sg nom @PRD #5->3
"<.>" "." Z Fst CLB #6->6
"</s>"
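Parser output in this shape is easy to post-process. The Python sketch below reads the EstCG 3 example into a list of tokens with their syntactic function tags (those starting with @) and head indices (the number after ->). The regular expression and the names `TOKEN`, `SAMPLE` and `parse` are assumptions made for this illustration; they are not part of the EstCG distribution and only cover the token layout shown above.

```python
import re

# One token: "<form>" "lemma" tags... #index->head
TOKEN = re.compile(
    r'"<(?P<form>[^>]*)>"\s+'       # surface form, e.g. "<Iga>"
    r'"(?P<lemma>[^"<][^"]*)"\s+'   # lemma, e.g. "iga" (never starts with <)
    r'(?P<tags>[^"#]*?)\s*'         # morphological and syntactic tags
    r'#(?P<index>\d+)->(?P<head>\d+)'
)

SAMPLE = '''"<s>"
"<Iga>" "iga" L0 P det sg nom cap @NN> #1->2
"<inimene>" "inimene" L0 S com sg nom @SUBJ #2->3
"<on>" "ole" L0 V main indicpres ps3 sg psaf @FMV #3->0
"<kordumatu>" "kordumatu" L0 A pos sg nom @AN> #4->5
"<isiksus>" "isiksus" L0 S com sg nom @PRD #5->3
"<.>" "." Z Fst CLB #6->6
"</s>"'''


def parse(output):
    """Turn parser output into a list of token dictionaries."""
    tokens = []
    for m in TOKEN.finditer(output):
        tags = m.group("tags").split()
        tokens.append({
            "index": int(m.group("index")),
            "form": m.group("form"),
            "lemma": m.group("lemma"),
            "tags": [t for t in tags if not t.startswith("@")],
            "functions": [t for t in tags if t.startswith("@")],
            "head": int(m.group("head")),  # 0 marks the root in this output
        })
    return tokens


tokens = parse(SAMPLE)
root = [t["form"] for t in tokens if t["head"] == 0]
```

Here `root` contains the finite verb on, and every other word points to its head by index, so the subject inimene (#2->3) depends on the verb.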
Classification of LT • Form of language (text/speech) • Function (mostly analysis/synthesis) • Monolingual (language-specific) vs. multilingual (language-independent) • Audience (users vs. researchers/developers) • Underlying methodology (rule-based/statistical/hybrid)
Rule-based approach • Language models are hand-built encoded representations of linguistic analyses. • E.g., rule-based morphological analysis and synthesis (production of all possible forms of a word) require lists of word stems, case-endings and affixes together with instructions for combining them. • See earlier examples!
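As a toy illustration of the stem and ending lists mentioned above, the following Python sketch synthesizes a few singular case forms of two Estonian nouns and analyses a form by reversing the same tables. The two-word lexicon, the choice of cases and the function names `synthesize` and `analyze` are assumptions for this example; a real analyser needs far larger lists and rules for stem alternation.

```python
# Hand-built resources: genitive stems and singular case endings.
GENITIVE_STEMS = {"raamat": "raamatu", "maja": "maja"}  # 'book', 'house'
CASE_ENDINGS = {
    "genitive": "",
    "inessive": "s",
    "elative": "st",
    "allative": "le",
    "adessive": "l",
    "comitative": "ga",
}


def synthesize(lemma):
    """Produce the covered singular forms of a noun in the lexicon."""
    stem = GENITIVE_STEMS[lemma]
    forms = {"nominative": lemma}
    for case, ending in CASE_ENDINGS.items():
        forms[case] = stem + ending
    return forms


def analyze(form):
    """Return every (lemma, case) pair whose synthesized form equals `form`."""
    return [
        (lemma, case)
        for lemma in GENITIVE_STEMS
        for case, candidate in synthesize(lemma).items()
        if candidate == form
    ]
```

`synthesize("raamat")` yields forms such as raamatus and raamatuga, while `analyze("maja")` returns both a nominative and a genitive reading. That kind of ambiguity is what a disambiguator has to resolve afterwards.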
Statistical (data-driven) approach • Large collections of example texts are used to train language models through machine learning algorithms. • E.g., binary decisions (correct/incorrect) • one word at a time – classifiers: decision trees, support vector machines, Gaussian Mixture Models, logistic regression • all words in a sequence – hidden Markov models, maximum entropy Markov models, conditional random fields • For spell checkers, morphological and syntactic analysers – single language training data; for machine translation – parallel text collections
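To make the sequence-model idea concrete, here is a minimal Python sketch of a hidden Markov model part-of-speech tagger: it counts tag transitions and word emissions in a tiny hand-tagged sample and picks the most probable tag sequence with the Viterbi algorithm. The three training sentences, the add-one smoothing and the function names are assumptions for illustration; real taggers are trained on large annotated corpora.

```python
import math
from collections import Counter, defaultdict

TRAINING = [
    [("Mari", "N"), ("laulab", "V"), ("hästi", "Adv")],
    [("Jüri", "N"), ("kirjutab", "V"), ("meelsasti", "Adv")],
    [("väike", "Adj"), ("Mari", "N"), ("laulab", "V")],
]


def train(tagged_sentences):
    """Count emissions (tag -> word) and transitions (previous tag -> tag)."""
    emit, trans, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            vocab.add(word)
            prev = tag
    return emit, trans, sorted(emit), len(vocab)


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability of `key` among `n_outcomes` outcomes."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the model."""
    emit, trans, tags, vocab_size = model
    if not words:
        return []
    n_words = vocab_size + 1  # one extra slot for unseen words
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{
        t: (log_prob(trans["<s>"], t, len(tags))
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (columns[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


model = train(TRAINING)
```

Calling `viterbi(["väike", "Jüri", "laulab", "hästi"], model)` tags a sentence that never occurred in the training sample by combining what was learned about individual words and about tag order.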
Example: recurrent neural network for language generation (Lepik, 2015)
Hybrid approaches • Combining rule-based and statistical approaches, e.g., • Feature enrichment – output of a rule-based approach as good-quality features for a statistics-based model • Rule extraction – a set of rules is extracted from tagged text using machine learning (Soriano Morales, 2013)
Text vs. speech technologies • Speech technologies • Speech recognition, speech synthesis, speaker recognition and verification • Text technologies • Web search, spelling and grammar checking, lemmatization, automated tagging (morphological, syntactic and semantic), text summarisers, sentiment analysis tools (e.g. emotion detectors) • What about dialogue systems and machine translation?
CL and LT in Estonia • Started in the 1950s (machine translation) • Four academic institutions • University of Tartu, Institute of the Estonian Language, Tallinn University of Technology, Tallinn University • Nationally funded, emphasized in development plans and political strategies • Core language resources and LT tools are on a satisfactory level. • Advancements needed in semantic, pragmatic, discourse, multimodal analysis and language generation • Tools: https://www.keeleressursid.ee/en/resources
Implications for smart society • Language is the prime vehicle in which information is encoded, by which it is accessed and through which it is disseminated. (Cole et al., 1997)
What is a smart society? • Smart society “successfully harnesses the potential of digital technology and connected devices and the use of digital networks to improve people’s lives.” (Levy & Wong, 2014) • It is a society where the leaders and citizens thoughtfully deploy digital technology and make data-based decisions, which can improve social well-being, productivity, economic strength, the effectiveness of governing institutions and, ultimately, the quality of life. (Chakravorti & Chaturvedi, 2017; Haupt, 2017; Levy & Wong, 2014)
How can LT contribute? Some examples • Connecting multilingual global society • Machine translation makes it possible to hold conversations and browse online content in various foreign languages. • More comfortable and accessible ICT • Speech recognition and synthesis –> AI voice assistants, car navigation systems, smart home appliances; help in case of visual/hearing impairment • Reducing human workload • Dialogue systems answering routine questions, automated assessment tools helping teachers
More examples • Big data analysis • Structuring vast amounts of largely unstructured (textual) data to make sense of them (patterns, connections, trends) • E.g., EstNLTK toolkit for Python: • detecting word, clause and sentence boundaries, lemmatisation, morphological and syntactic analysis, named entity recognition, and defining synonymous (conceptually related) words • Human-level AI development • High-level NLP (text semantics, pragmatics, discourse)
CL and LT at TLU • Research group founded at the former Institute of Estonian Language and Culture • New sub-area of Applied Informatics at the School of Digital Technologies • Focus on learner language analysis • Largest Estonian learner language resource Estonian Interlanguage Corpus (EIC) http://evkk.tlu.ee/?language=en • Tools: morphosyntactic clustering of words / word sequences, character sequence clustering, word frequency listing, syllabification.
Current research aims • To determine the linguistic features that distinguish the communicative proficiency levels (A1, A2, B1, B2, C1, C2) of Estonian as a second and foreign language (L2). • To develop a statistical model for automated assessment of written learner texts (proficiency level + more detailed feedback). • To use these models to develop level-specific e-exercises and language learning games; to compile level dictionaries. • Expected outcomes: • e-learning environment for L2 Estonian (from language learning researcher corpus language learner corpus) • improved quality of language testing and study materials
Estonian Interlanguage Corpus • Collection of texts mostly written by learners of Estonian as L2. • Development began in 2005. New material is constantly added. • Contains 3.5 million words, 12,500 texts • Native speakers of mostly Russian but also Finnish, English, German, Lithuanian, Ukrainian, Hungarian, Polish, Swedish, Latvian, Belarusian. • Essays, personal and official letters, narrative texts, answers to questions, grammar exercises etc.
Our linguistic approach • Usage-based study of Estonian grammar • How does the language evolve, both in history (e.g. the fiction of the 1890s and 1990s) and in language acquisition (through levels A1–C2, L2 vs. L1)? • Detecting patterns of language use: • How are a) parts of speech (verbs, nouns, adverbs etc.); b) words with a certain morphological form (e.g. singular noun in genitive case, 3rd person verb in simple past tense); and c) words with a certain syntactic function (e.g. subject, object, adverbial) combined with each other? • Which words/phrases are used in these structures?
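The pattern questions above boil down to counting sequences of tags and recording which words fill them. The Python sketch below does this for part-of-speech sequences in already tagged text. It is a simplified illustration under that assumption, not the implementation of any TLU tool, and the function name `pos_ngrams` and the sample sentences are made up.

```python
from collections import Counter, defaultdict


def pos_ngrams(tagged_sentences, n=2):
    """Count part-of-speech n-grams and the word sequences that realise them.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (pattern counts, word-sequence counts per pattern).
    """
    counts = Counter()
    examples = defaultdict(Counter)
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            pattern = tuple(tag for _, tag in window)
            counts[pattern] += 1
            examples[pattern][" ".join(word for word, _ in window)] += 1
    return counts, examples


tagged = [
    [("väike", "Adj"), ("Mari", "N"), ("laulab", "V"), ("hästi", "Adv")],
    [("tubli", "Adj"), ("Jüri", "N"), ("kirjutab", "V")],
]
counts, examples = pos_ngrams(tagged, n=2)
```

`counts.most_common()` then ranks the patterns by frequency, and `examples[("Adj", "N")]` lists the word pairs that realise the adjective + noun pattern. Comparing such counts between proficiency levels is one way to look for distinguishing features.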
Linguistic Cluster Analysis Tool Klastrileidja (Cluster Catcher)
Automated writing assessment • E.g., IntelliMetric and e-rater used for scoring essays, SpeechRater for speech scoring • CEFR-based (A1–C2) grading prototypes developed for German (Hancke, 2013), Czech (Rysová et al., 2016), Swedish (Pilán, 2018), English short answers (Tack et al., 2017) • Promising attempts at predicting the CEFR level of Estonian learner texts based on EIC • Vajjala & Lõo, 2014 (accuracy: up to 79%) • Hallik, 2016 • Kossinski, 2018 (accuracy: approx. 72%)
Let’s give our prototype a try! • http://minitorn.tlu.ee/~jaagup/oma/too/19/03/tasemed1.php (Kossinski, 2018)
Discussion! 15 minutes • What would you as a language learner expect from language technological tools? • Why would you use it? In which situations?
End discussion • Which connections do you see between your research area and language technology? • What suggestions could you give me?