Corpus and Computational Linguistic Methods and Tools beyond corpus linguistics in CLARIAH

Corpus and Computational Linguistic Methods and Tools beyond corpus linguistics in CLARIAH • Jan Odijk • Birmingham, 2017-07-24

Overview • Introduction • CLARIN-NL • NLP Tools • Dedicated Applications • 5 Example Cases • CLARIAH-CORE • Text -> Structured Data • CLARIAH-eScience Projects • Research Pilots • Conclusions

Introduction • CLARIN: European research infrastructure for researchers who work with language resources • DARIAH: European research infrastructure for researchers in the Arts and Humanities

Introduction • CLARIN-NL 2009-2015 • CLARIN-TCC Talk of Europe 2014-2015 • 3 international creative camps around the European Parliament Data curated as Linked Open Data • CLARIAH contributes to CLARIN + DARIAH • CLARIAH-SEED 2014-2015 • CLARIAH-CORE 2015-2018 • 3 core disciplines (linguistics, social economic history, media studies) • CLARIAH-PLUS 2019-2024 (submitted)

Introduction • Independent but related projects • CKCC (Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic) • Developed ePistolarium, a web-based Humanities' Collaboratory on Correspondences • Main funding from NWO, CLARIN-NL funded a small part • Nederlab 2012-2018 • Develops a web application for the longitudinal study of Dutch language, literature and culture • Main funding from NWO, CLARIAH funds a small part

Introduction • Before 2009: use of corpora and of NLP tools was limited to computational and corpus linguists • CLARIN-NL and CLARIAH: make these accessible to and usable by other humanities researchers

NLP Tools • TTNWW project (together with Flanders) • Predefined workflows of NLP tools as web services • In an easy-to-use web application • Upload data, push button, download results • All based on the CLAM web service mediator • And the FoLiA format for linguistically annotated corpora (de facto standard in NL)

NLP Tools • TTNWW predefined workflows: • Orthographic normalisation (spelling & OCR correction) through TICCLops • Tokenisation, lemmatisation, PoS tagging, named entity recognition (NER), limited multiword unit recognition, limited dependency relations (Frog) • Tokenisation, lemmatisation, PoS tagging, limited multiword unit recognition, full syntactic parses, limited NER (Alpino) • Semantic role labelling • Co-reference assignment • Speech conversion and transcription

NLP Tools • Other NLP tools • Adelheid: PoS tagger for 13th-century Dutch • INPOLDER: (experimental) parser for 13th-century Dutch • FrogGen: user-trainable Frog (language-independent) • Tested on Classical Greek • Soon to be tested on 17th-century Dutch • Word2Vec and GloVe semantic spaces based on the Dutch SoNaR corpus

Dedicated Applications • Dedicated (search) applications or (meta)data standardised: • Literary studies: COBWWWEB, BNM-I, Arthurian Fiction, C-DSD Song Database (Liederenbank), EMIT-X • History: INTER-VIEWS / IPNV, Oral History (CLARIAH Media Suite), Verrijkt Koninkrijk (VK), Dutch Ships and Sailors (DSS), CKCC

Dedicated Applications • Political Science: War in Parliament (WIP), CLARIN-TCC Talk of Europe • Media Studies: Polimedia, AVResearcherXL & TrOve (Media Suite), NISV Academia collection • Religion Studies: PILNAR

Case 1: Linguists • Parse and Query (PaQu) and GrETEL • Access to the LASSY and Spoken Dutch Corpus treebanks • Upload one's own (Dutch) corpus, have it parsed and made searchable • Search • Via a dedicated interface for grammatical dependencies (PaQu) • Via an example-based interface (GrETEL) • And via XPath queries (both) • Extensive analysis options on data and metadata • Support for multiple formats (FoLiA, TEI, plain text, CHAT, …)
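To give a feel for what an XPath query over a dependency treebank looks like, here is a minimal sketch in Python. It assumes a hand-written, simplified fragment in the style of Alpino's XML output (nested `node` elements carrying `rel`, `cat`, `pos` and `word` attributes); the sentence, the function names and the use of Python's `xml.etree.ElementTree` are illustrative assumptions, not how PaQu or GrETEL are implemented. Note that `ElementTree` supports only a subset of XPath, so queries that run in those applications may need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified Alpino-style parse of "Jan leest een boek"
# (an assumption for illustration, not actual treebank data).
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pos="name" word="Jan"/>
      <node rel="hd" pos="verb" word="leest"/>
      <node cat="np" rel="obj1">
        <node rel="det" pos="det" word="een"/>
        <node rel="hd" pos="noun" word="boek"/>
      </node>
    </node>
  </node>
</alpino_ds>"""


def words_with_relation(xml_text, rel):
    """Return the words of all leaf nodes bearing dependency relation `rel`."""
    root = ET.fromstring(xml_text)
    # [@word] keeps only leaf nodes, i.e. nodes that carry a word form.
    hits = root.findall(f".//node[@rel='{rel}'][@word]")
    return [node.get("word") for node in hits]


def object_heads(xml_text):
    """Return the head words of all direct-object (obj1) phrases."""
    root = ET.fromstring(xml_text)
    hits = root.findall(".//node[@rel='obj1']/node[@rel='hd']")
    return [node.get("word") for node in hits]


if __name__ == "__main__":
    print(words_with_relation(SAMPLE, "su"))
    print(object_heads(SAMPLE))
```

The same path expressions (for example `//node[@rel='obj1']/node[@rel='hd']`) illustrate the kind of structural constraint a linguist can state directly once a corpus has been parsed.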

Case 1: Linguists • OpenSoNaR & AutoSearch • Access to • Token-annotated SoNaR written Dutch corpus (540m) • One's own token-annotated corpus (AutoSearch) • Exploration interface • Multiple search interfaces of varying complexity • Upgrade + access to the full Spoken Dutch Corpus in OpenSoNaR+ (to be released in autumn 2017)

Case 2: Philosophers • @PHILOSTEI: • Philosopher & computational linguist • OCR correction, conversion to TEI for (non-Dutch) philosophical works • Based on TICCLops, extended and made language-independent • Basis for VICI project by Arianna Betti (UvA) • Ideas at scale – Towards a computational history of ideas (e-Ideas) • A tool that allows you to trace how ideas such as tolerance, evolution, or science have changed throughout history

Case 3: Historians • WAHSP / BILAND: CLARIN-NL text mining applications, replaced by TexCavator • Basis for NWO Horizon Translantis project 2013-2018 by the same research team • Uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century • And for the ShiCo project (with NL eScience Centre) • Mining shifting concepts through time

Case 4: Literary Scholars • NameScape • Search and visualise Named Entities in modern Dutch novels • NE Recognition in one's own corpora • Through a web application with a dedicated interface for literary scholars

Case 5: Linguists + Literary Scholars • Language Dynamics of the Dutch Golden Age • Language innovations partly driven by migration, literary innovations and standardisation processes • Variation within authors and genres • Closely collaborates with Nederlab • Uses CLARIN standards and tools • FoLiA, FrogGen, … • AutoSearch

Case 5: Linguists + Literary Scholars

Other Examples • More linguistic search / analysis applications: • MIMORE • Search / analysis in multiple dialectal databases / corpora • FESLI • Search in enriched Specific Language Impairment (SLI) corpora • COAVA • Combined search in dialect lexicons and CHILDES corpora • Stylene • System for stylometry and readability research • Religion Studies: • SHEBANQ • A web application to perform linguistic queries on the WIVU Hebrew Text Database

CLARIAH-CORE • Core disciplines: linguistics, social economic history, media studies • Cross-discipline information extraction from text (text -> structured data) • Research Pilots • Projects with eScience Centre

Text Structured Data • If buildings could talk • we explore linking the enriched buildings dataset to information extracted from newspapers, aiming to build towards a rich and varied source on the history of buildings. • Distilling careers • augmenting biographies with occupational information based on HISCO byanoccupation tagger • Experiments in fine-grained entity typing for Dutch • Fine-grained tagger (59 / 269 NE types)

CLARIAH-eScience • ADAH Call (Accelerating Discovery in the Arts and Humanities) • Bridging the gap: Digital Humanities and the Arabic-Islamic corpus • Seeks to develop a web-based application that will • enable easy access to existing Arabic corpora on online repositories and offer the opportunity for researchers to upload their own corpus • offer a set of tools for Arabic text mining and computational analysis, and • provide opportunities to link search results to other datasets in Islamic and Middle Eastern Studies.

CLARIAH-eScience • TICCLAT: Text-Induced Corpus Correction and Lexical Assessment Tool • Builds on TICCL • extend TICCL's correction capabilities with classification facilities based on Nederlab corpus data: word statistics, document and time references and linguistic annotations, i.e. Part-of-Speech and Named-Entity labels.

CLARIAH-eScience • EViDENse: Ego Documents Events modelliNg - how individuals recall mass violence • new ways of analysing and contextualising historical sources by applying state-of-the-art entity and event modelling and semantic web technologies. • Tested in two case studies

CLARIAH-eScience • NewsGac: News Genres: Advancing Media History by Transparent Automatic Genre Classification • Automatic genre detection in newspapers and television news using machine learning. • Revises our current understanding of the interrelated development of genre conventions in print and television journalism; • Metrics and guidelines for evaluating the bias and error of the different preprocessing and machine learning approaches and off-the-shelf software packages; • A dashboard that integrates, compares and visualises different algorithms and underlying machine learning approaches, and which can be integrated in the CLARIAH Media Suite.

Research Pilots • DB-CCC: Diamonds in Borneo: Commodities as Concepts in Context • detect the diamond mining, manufacturing and trading places and people in Borneo based on a selection of texts from Delpher using Entity recognition, classification & linking and Ontotagger • HHUCAP: The History of Human Capital • Robust Semantic Parsing and Linked Data conversion tools to automatically derive career patterns from 35,000 biographies in the Biography Portal in the period 1815-1940.

Research Pilots • LinkSyr: Linking Syriac Data • How do the Biblical heritage and Hellenistic culture interact in the oldest documents of Syriac Christianity? • compare the Hebrew Bible and its ancient Syriac translation (the Peshitta) with the Syriac Book of the Laws of the Countries (ca. 200 AD) using linguistic data processing, especially topic modelling.

Research Pilots • SERPENS: Contextual search and analysis of pest and nuisance species through time in the KB newspaper collection • SERPENS aims to study the historical impact of pest and nuisance species on human practices and changes in the public perception of these animals. • The KB newspaper collection will be the primary source of information for this study. Problems: spelling variations, vernacular vs. Latin names, ambiguity • To remedy this, the WP2-3 diachronic lexicons will be used for query expansion in combination with topic modelling to filter out irrelevant results.
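The query-expansion step can be sketched as follows. This is a minimal illustration only: the tiny lexicon below is invented for the example (it is not taken from the WP2-3 diachronic lexicons), the function names are hypothetical, and the topic-modelling filter mentioned on the slide is left out. Each search term is looked up and replaced by all of its known spelling variants and names, which are then joined into one boolean OR query.

```python
# Illustrative mini-lexicon: modern term -> spelling variants and Latin names.
# These entries are assumptions made up for this sketch.
LEXICON = {
    "rat": ["rat", "rot", "ratte"],
    "muis": ["muis", "muys"],
    "bruine rat": ["bruine rat", "Rattus norvegicus"],
}


def expand_query(terms, lexicon):
    """Expand each term with its lexicon variants, keeping first-seen order."""
    expanded = []
    for term in terms:
        # Terms missing from the lexicon are passed through unchanged.
        for variant in lexicon.get(term.lower(), [term]):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def to_boolean_query(variants):
    """Join variants into a quoted OR query string for a search engine."""
    return " OR ".join(f'"{v}"' for v in variants)


if __name__ == "__main__":
    variants = expand_query(["rat", "bruine rat"], LEXICON)
    print(to_boolean_query(variants))
```

Expansion of this kind raises recall at the cost of precision (ambiguous variants pull in unrelated articles), which is why the pilot pairs it with a filtering step.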

Overview • Introduction • CLARIN-NL • NLP Tools • Dedicated Applications • 5 Example Cases • CLARIAH-CORE • Text -> Structured Data • CLARIAH-eScience Projects • Research Pilots • Conclusions

Conclusions • CLARIN-NL & CLARIAH projects • Enabled and stimulated use of corpus and computational linguistic methods and tools in other humanities disciplines • Many projects successfully finished • Many still ongoing or about to start

More information • http://portal.clarin.nl • http://www.clariah.nl • Odijk & Van Hessen (eds.). To appear 2017. CLARIN in the Low Countries. London: Ubiquity Press. (Open Access) • Spyns & Odijk (eds.). 2013. Essential Speech and Language Technology for Dutch. Berlin: Springer. (Open Access) DOI: 10.1007/978-3-642-30910-6 12

Thanks for your attention
