Language Archiving at the MPI
Peter Wittenburg, MPI for Psycholinguistics
DOBES Archive (DOkumentation BEdrohter Sprachen: Documentation of Endangered Languages), funded by the Volkswagen Foundation
[Map labels: NL, G, Rhein, Nijmegen]
Still a large variety of languages
• currently 6500 languages world-wide
• Distribution
  • Africa 1995
  • S/SE Asia 1400
  • New Guinea 1109
  • South America 419
  • North Asia 380
  • Central America 300
  • Pacific area 250
  • Australia 250
  • North America 209
  • Europe 209
Language endangerment
• 97% of the people use 4% of the languages
• 96% of the languages are spoken by 3% of the people
• approx. 6000 of the languages are spoken by about 200 million people
• on average: 30,000 speakers per language
• 50% have fewer than 10,000 speakers, 25% fewer than 1000
• for 50% the number of speakers is decreasing dramatically
• pessimistic view (according to Crystal):
  • 90% of the languages will be extinct around 2100!
  • i.e. every second week a language becomes extinct!
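The figures on this slide can be cross-checked with simple arithmetic. The following is only a back-of-the-envelope sketch: the function names and the 100-year horizon with evenly spread losses are assumptions, not part of the talk. Under those assumptions the average comes out at roughly 33,000 speakers per language, close to the quoted 30,000. A 90% loss would mean one language about every six days, while a loss of about half the languages gives one roughly every eleven days, which is nearer to "every second week".

```python
def average_speakers(total_speakers: float, languages: int) -> float:
    """Mean number of speakers per language."""
    return total_speakers / languages


def days_between_extinctions(languages: int, fraction_lost: float, years: float) -> float:
    """Average number of days between two extinctions if the given fraction
    of the languages disappears evenly over the given number of years."""
    languages_lost = languages * fraction_lost
    return years * 365.25 / languages_lost


# Figures from the slide: about 6000 languages share about 200 million speakers.
avg = average_speakers(200e6, 6000)

# Assumption (not from the slide): losses are spread evenly over 100 years.
pessimistic = days_between_extinctions(6500, 0.90, 100)
half_lost = days_between_extinctions(6500, 0.50, 100)

print(f"average speakers per language: {avg:,.0f}")
print(f"90% lost: one language every {pessimistic:.1f} days")
print(f"50% lost: one language every {half_lost:.1f} days")
```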
What can we do?
• Documentation + Revitalization
• 2000: DOBES Programme of the Volkswagen Foundation
• many other initiatives and institutions, all meant to be complementary
• the Volkswagen Foundation is primarily devoted to supporting research
• teams get funds for documentation (in general 3 years or more)
• there was a very intensive pilot phase full of useful discussions
• it was obvious that all teams (including the archiving team) felt the need to help the language communities
How to do a language documentation?
• based on N. Himmelmann, “Documentary and Descriptive Linguistics”
• Documentation: primary focus is on collection, transcription and translation of primary data (observations, elicitations, ...)
• Description: primary focus is on linguistic analysis and special phenomena
• the methods and the results are different
How to do a language documentation?
• there is an overlap between the two poles, documentation and description
  • no interlinear description without a morphological analysis
• Documentation has to
  • deliver a comprehensive representation of the “linguistic practices and traditions”
  • document spoken language in its communicative and cultural context
  • cover observed linguistic practices and metalinguistic knowledge
  • take a holistic view of language
  • be interesting for other disciplines, in particular through its primary data
  • help the language community
• therefore a natural focus on audio and video recordings
DOBES language documentation
• language against its cultural background
• “theory-neutral” representation
• lots of multimedia (audio, video) recordings as the basis
• where possible, base everything on primary data
• linguistic goals
  • annotations (orthographic transcription, translation, ...)
  • morphological/syntactic analysis only for a small part
  • sketch grammar, limited topic-oriented lexicon
• ethnologists, musicologists and ethnobiologists are also involved
• in total about 3 years
• idea: later generations should be able to reproduce the language
• material could be extended later
Traditional annotation
[Figure: text annotation]
Modern annotation
[Figure: multimedia annotation]
DOBES Map
[Map labels: Svan/Udi/Tsova-Tush, Chintang/Puma, Tofa, Nenets, Archive, Sami, Hocank, Beaver, Wichita, Mawe/Bakairi/Katxuyana, Salar/Monguor, Chontal, Totoli, Lacandon, Tsafiki, Sri Lanka Malay, Kuikuro, Bora/Ocaina, Semang, Teop, Trumai, Saliba, Waima’a, Aweti, Chipaya, Akhoe Hai//om, !Xoo, Iwaidja, Marquesan, Chaco Languages, Jaminjung]
• 30 documentation teams (at MPI also 30 expeditions per year)
• 1 Archiving Team
Waima’a (East Timor)
• Mauricio Belo, Caisido village
• John Bowden, Australian National University
• John Hajek, University of Melbourne
• Nikolaus Himmelmann, Ruhr-Universität Bochum

Consonant inventory:

                                   Labial   (Post-)alveolar   Velar   Glottal
Stops        voiceless unaspirated (p)      t                 k       '
             voiceless aspirated   (ph)     th                kh
             voiceless ejective    p'       t'                k'
             voiced                b        d                 g
Fricatives   plain                 (f)      s                         h
             ???                            s'
Nasals       plain                 m        n
             ???                   mh       nh
             ???                   m'       n'
Laterals     plain                          l
             ???                            lh
             ???                            l'
Tap / trill                                 r, rr
Glides       plain                 w
             ???                   wh
             ???                   w'

Sample text:

la   enen    i
at   before  PTL
“Once upon a time”

bu   taha  k’omu  ruo  bu   wai-dura  loo   ligasaun   ini
HON  mud   ball   and  HON  cricket   make  closeness  RCP
“A ball of mud and a cricket were friends”

sire  ruo  laka  khuu   rahmhutu  busa
3p    two  go    clean  together  garden
“The two of them went to clean the gardens together”
Trumai (Amazonas)
• Stephen Levinson, MPI Nijmegen
• Raquel Guirardello-Damian, Museu Paraense Emílio Goeldi
• about 100 people
• about 51 speaking Trumai
Salar/Monguor (China)
• Shaman in Huzhu Mongghul county
• Drummers in the Nadun festival, Minhe county
• Salar villages along the Yellow River
• Salar children above Dashyinix village
• Painting the faces of possessed Wutu, Niandehu township
Tofa, Tozhu, Tsengel Tuva, Tuha (Siberia)
• David Harrison (Yale)
• Brian Donahoe (Manchester)
• Sven Grawunder (Halle)
• Language: its structure and sounds.
• Oral folklore: texts, narratives and personal stories, belief systems, naming systems.
• Music: singing and sound mimesis.
• Traditional ecology: nomadism, pastoralism, hunting and reindeer herding
[Photo: Shaman ceremony]
Language documentation for whom?
• for interested researchers
• for students and schools
• for journalists
• for the interested public
• for the language communities
• for future generations
For language communities
• language maintenance or even revitalization
• maintenance of the language, identity and self-confidence
• creation of school and other educational material
• support for local/regional centers (create and download complete copies)
• improved access to archives
• communities show great interest in recordings, in particular video
For future generations
• in a future world of monocultures it will become important to know about earlier diversity
• as now, it will be important to know one’s own roots
• it may be relevant to point to the different types of languages
• let’s be honest: we don’t know what future generations will do with the material
Why archives?
• many reasons
• Dietrich Schüller: 80% of our recordings about culture and languages are endangered!
• storage is inadequate (media, formats, PCs, ...)
• selecting suitable technologies requires expert knowledge
• creating redundant storage and migrating data is important
  • requires discipline and has to be independent of individual persons
  • migration to new technologies can be very expensive
  • only centres can do this
• AND: it requires explicitness; in the end, a viewable corpus
• international trend:
  • DOBES, AILLA, ELAR, PARADISEC, LACITO, ...
What is a “modern” digital archive?
• traditional archives
  • focus on preserving the physical object
  • access not permitted
• digital archives
  • physical object (tape, CD-ROM) is almost irrelevant
  • content has to be preserved
• why this revolutionary change?
  • copies can be made without loss (but let’s be careful with compression)
  • copies can be created at low cost
• modern digital archive
  • long-term preservation of the content (migration, distribution)
  • access to the content
  • enrichment without affecting the content
  • sensitive management of access
• DOBES has to be a living archive (interactive, expandable)
Long-term preservation
[Figure: time scale from 2000 years to 0 years, with the labels “clay tablets” and “various e-media”]
• can we guarantee the survival of bit-streams? NO
• can we increase the chances of survival? YES
• our storage media are not adequate
• how to do it
  • continuous migration (copies to each new generation)
  • world-wide distribution (now within Germany/NL)
• the problem of interpretability is not solvable
• have to take care of ethical/legal aspects
• crucial for survival are the maintenance costs
• all MPI material is available in 7 copies at different locations
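Keeping several copies at different locations only helps if damaged copies are noticed before the next migration. The slide does not say how copies are verified, so the following is only a sketch of one common approach, checksum comparison: the function names, file names and the majority rule are assumptions, not the MPI procedure.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum of a file, read in chunks so large recordings fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deviating_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose checksum differs from the most common one."""
    digests = {copy: sha256_of(copy) for copy in copies}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return [copy for copy, digest in digests.items() if digest != majority_digest]


# Demonstration with three hypothetical storage sites in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    copies = [Path(workdir) / f"site{n}" / "recording.wav" for n in range(3)]
    for copy in copies:
        copy.parent.mkdir()
        copy.write_bytes(b"primary recording")
    copies[2].write_bytes(b"primary recXrding")  # simulate damage at one site
    damaged = [copy.parent.name for copy in deviating_copies(copies)]

print(damaged)
```

A damaged copy found this way can be replaced from any intact site before the material is migrated to the next storage generation.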
Pillars of Digital Archives I
• strict separation of physical and logical access layers
• the physical domain is for system managers and archive managers, and it changes
• the logical domain (created by linguists) remains and is stable
• metadata is the glue and has to be maintained
[Figure labels: domain of physical resources; conceptual domain of resources; system manager, corpus manager, user, creator]
Pillars of Digital Archives II
[Figure: archive organization, with a layer of languages and a layer of sessions. Labels: Song Book, Video Recording, Intro Films, Notes, Sound Recording, Lexicon, Annotations]
Pillars of Digital Archives III
• separation between object and instance
• need Unique Resource IDs (URIDs)
• and a robust “resolving” mechanism
[Figure labels: MPI Portal metadata, XYZ Portal metadata; URID Resolver; mappings to the MPI Repository and the GWDG Repository]
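The separation between object and instance can be illustrated with a small lookup table: metadata cites a stable ID, and a resolver maps that ID to wherever the copies currently live. This is a minimal sketch only; the class name, the ID scheme and the example URLs are hypothetical and do not describe the actual MPI resolver.

```python
class UridResolver:
    """Maps stable resource IDs to the current locations of their copies."""

    def __init__(self) -> None:
        self._locations: dict[str, list[str]] = {}

    def register(self, urid: str, location: str) -> None:
        """Record one more instance (copy) of the object."""
        self._locations.setdefault(urid, []).append(location)

    def relocate(self, urid: str, old: str, new: str) -> None:
        """Update a location after a file has been moved."""
        locations = self._locations[urid]
        locations[locations.index(old)] = new

    def resolve(self, urid: str) -> list[str]:
        """Return all known locations for the object."""
        if urid not in self._locations:
            raise LookupError(f"unknown URID: {urid}")
        return list(self._locations[urid])


resolver = UridResolver()
urid = "urid:example:0001"  # hypothetical ID scheme
resolver.register(urid, "https://repo-a.example/media/0001.wav")
resolver.register(urid, "https://repo-b.example/mirror/0001.wav")

# A system manager moves the file; metadata that cites the URID is unaffected.
resolver.relocate(urid,
                  "https://repo-a.example/media/0001.wav",
                  "https://repo-a.example/store7/0001.wav")
locations = resolver.resolve(urid)
print(locations)
```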
Pillars of Digital Archives IV
• need versioning
• nothing may be deleted, but annotations will be changed!
• the research world is dynamic: we want enrichment and extension
[Figure labels: URID Resolver; access rules “userx=read, usery=read, etc.” and “userx=write, usery=read, etc.”]
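"Nothing may be deleted, but annotations will be changed" suggests an append-only model in which every change creates a new version and old versions stay readable. The sketch below illustrates that idea under assumed names; it is not the versioning scheme the archive team was building.

```python
from typing import Optional


class VersionedObject:
    """Append-only history of one archived object."""

    def __init__(self, urid: str) -> None:
        self.urid = urid
        self._versions: list[bytes] = []

    def add_version(self, content: bytes) -> int:
        """Store new content and return its 1-based version number."""
        self._versions.append(content)
        return len(self._versions)

    def get(self, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the latest one by default."""
        if not self._versions:
            raise LookupError("no versions stored yet")
        if version is None:
            return self._versions[-1]
        if not 1 <= version <= len(self._versions):
            raise LookupError(f"no such version: {version}")
        return self._versions[version - 1]


annotation = VersionedObject("urid:example:0002")  # hypothetical ID
first = annotation.add_version(b"first transcription")
second = annotation.add_version(b"corrected transcription")

latest = annotation.get()
original = annotation.get(1)
print(first, second, latest, original)
```

There is deliberately no delete method: a citation of version 1 keeps pointing at the same bytes after the annotation has been corrected.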
Principles V – Authentication & Authorization
• authentication and authorization have to be separated
• URIDs are the central link to authorization information
• need to have space for policies, procedures, declarations etc.
• but administrative effort has to be minimized!
[Figure labels: URID Resolver; access rules “userx=read, usery=read, etc.” and “userx=write, usery=read, etc.”]
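A minimal sketch of the separation described here: one step establishes who the user is, and a second, independent step looks up what that user may do with a given URID. The token table, the IDs and the decision to list write access as an explicit pair of rights are assumptions made for illustration.

```python
from typing import Optional

# Authentication: who is this? (hypothetical session tokens)
SESSIONS = {"token-abc": "userx", "token-def": "usery"}

# Authorization: what may this user do with this object? Keyed by URID.
POLICIES = {
    "urid:example:0001": {"userx": {"read"}, "usery": {"read"}},
    "urid:example:0002": {"userx": {"read", "write"}, "usery": {"read"}},
}


def authenticate(token: str) -> Optional[str]:
    """Return the user name for a valid token, otherwise None."""
    return SESSIONS.get(token)


def is_authorized(user: Optional[str], urid: str, action: str) -> bool:
    """Check the policy table; unknown users and objects get no access."""
    if user is None:
        return False
    return action in POLICIES.get(urid, {}).get(user, set())


user = authenticate("token-abc")
can_write = is_authorized(user, "urid:example:0002", "write")
other_can_write = is_authorized(authenticate("token-def"), "urid:example:0002", "write")
print(user, can_write, other_can_write)
```

Because the two steps share only the user name, the way users log in can change without touching the per-object policies.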
Principles VI – Formats
• only open, well-documented and widely used formats (encoding standards) should be used in the archive
• where possible, generic schemas should be the basis
• in DOBES strong recommendations for a few archival formats
  • JPEG/TIFF/PNG, MPEG2, Linear PCM, Unicode, XML
  • Plain Text, HTML, (PDF) possible
• at MPI less restrictive (therefore great danger with some types)
• for presentation purposes also MPEG1/4, MP3, HTML
• as import formats a large variety (Shoebox, CHAT, WORD, ...)
• conversion as far as possible towards generic files (LMF, EAF, ?)
• archived objects have to be stored in a neutral way and be accessible as individual objects
  • no encapsulation for primary objects
• nevertheless: the MPI archive takes almost all data (even 16mm films)
  • but conversion can be very costly
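The format recommendations amount to a small policy table. The sketch below encodes the formats named on the slide and classifies an incoming format name; the category labels and the function are assumptions, and Unicode is left out because it is a character encoding rather than a file format.

```python
# Format names as listed on the slide.
ARCHIVAL = {"JPEG", "TIFF", "PNG", "MPEG2", "Linear PCM", "XML",
            "Plain Text", "HTML", "PDF"}
PRESENTATION = {"MPEG1", "MPEG4", "MP3", "HTML"}
IMPORT_ONLY = {"Shoebox", "CHAT", "WORD"}


def classify(format_name: str) -> str:
    """Say how a resource in the given format would be treated."""
    if format_name in ARCHIVAL:
        return "archival"
    if format_name in PRESENTATION:
        return "presentation only"
    if format_name in IMPORT_ONLY:
        return "convert before archiving"
    return "unknown: review manually"


results = {name: classify(name) for name in ["Linear PCM", "MP3", "Shoebox", "RealAudio"]}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```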
MPI Archive – state
• more than 150,000 objects (in the online archive, ~1/3 of the data)
• in total more than 15 TB
• about 4 TB added per year
• several sub-archives (EL, SL, ESF, CGN, ...)
• MPI archive ingest is open for other people!
• completely structured by open XML files based on the IMDI schema
• a complete machinery is available
• currently working on URIDs and versioning
MPI Archive – Access
[Figure: archive architecture. Labels: Archive Utility Layer (Ontological Knowledge, User Authentication, Access Rights, Metadata Tools); Archive Access (Annotation Exploitation, Lexicon Exploitation, Text Exploitation); Data Ingestion & Management; Archive Enrichment (Lexical Encoding, Web Commentary, Media Annotation); The Archive: Domain of Registered Primary and Secondary Resources; User Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies]
MPI Archive – Metadata and Simple Access
• metadata is open!
• what is minimal metadata? (ongoing discussion)
• IMDI Editor
• BatchModifier (to change lots of IMDI files)
• IMDI XML Browser (operates in the distributed XML domain)
• IMDI HTML Browsing (on-the-fly transformation of XML)
• structured search in the XML and HTML domain
• unstructured search in the XML and HTML domain
• searchable via Google
• geographic browsing via Google Earth (work in progress)
• DC/OLAC bridge via OAI port (all IMDI stuff can be harvested)
• manuals and training courses
• direct access to simple objects via plug-ins
• complete sub-tree download
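Harvesting through an OAI port means sending OAI-PMH requests such as `ListRecords` with a `metadataPrefix` argument. The sketch below only builds such a request URL; the base URL is a placeholder, not the archive's real endpoint, and `oai_dc` is the Dublin Core prefix that the OAI-PMH protocol requires every repository to support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; replace with the address a repository actually publishes.
url = list_records_url("https://archive.example/oai")
print(url)
```

A harvester would fetch this URL, parse the XML response and follow any resumption token the repository returns until the list is complete.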
Geographic Browsing
MPI Archive – Upload Access
• two options
  • manual integration: exceptions are easy, but there are too many teams (~60)
  • LAMUS-controlled integration: exceptions are difficult, users do it themselves (?)
• LAMUS features
  - web-based operation
  - request of a work space
  - specification of an accepted upload node (archive anchor)
  - extend and manipulate the corpus structure
  - upload metadata descriptions
  - upload any type of resources (configurable format control)
  - create a linked sub-archive in the workspace and integrate this into the archive
  - checks to guarantee consistency and format compliance
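The consistency and format checks mentioned in the last item can be pictured as a comparison between what the metadata descriptions link to and what was actually uploaded. The following is an illustrative sketch, not LAMUS code: the function, the message texts, the file names and the accepted extensions are all assumptions.

```python
from pathlib import PurePosixPath


def check_workspace(metadata_links: dict[str, set[str]],
                    uploaded: set[str],
                    accepted_extensions: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the workspace is consistent."""
    problems = []
    linked = set().union(*metadata_links.values())

    # Every resource a metadata file links to must have been uploaded.
    for metadata_file in sorted(metadata_links):
        for resource in sorted(metadata_links[metadata_file] - uploaded):
            problems.append(f"missing resource: {resource} (linked from {metadata_file})")

    # Every uploaded resource must be linked from some metadata file.
    for resource in sorted(uploaded - linked):
        problems.append(f"unlinked resource: {resource}")

    # Configurable format control, here simply by file extension.
    for resource in sorted(uploaded):
        if PurePosixPath(resource).suffix.lower() not in accepted_extensions:
            problems.append(f"format not accepted: {resource}")

    return problems


problems = check_workspace(
    metadata_links={"session1.imdi": {"session1.wav", "session1.eaf"}},
    uploaded={"session1.wav", "notes.doc"},
    accepted_extensions={".wav", ".eaf"},
)
for problem in problems:
    print(problem)
```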
MPI Archive – Utilization Access
• tool is ANNEX
MPI Archive – Utilization Access
• tool is LEXUS
MPI Archive – Utilization Access
• Problem
  • different structures and formats
  • different terminologies
• tools are ANNEX/LEXUS
MPI Archive – State of Access
• at the moment almost everything from DOBES is closed
• lots of requests by journalists
• the first 15 teams have to finish in the coming months
  • working hard
  • changing a lot until the last minute, of course
• expect some material to become open
• but much will have to be handled on request
End
Mark Abley (Canadian):
“Each time we lose a language the ghosts who made use of it cast a new bell. The voices magnify. Soon, listen, they’ll outpeal the tongues of earth.”
Thanks for your attention.
Lots of differences
• Differences at all linguistic layers
  • Phonemics
  • Prosody
  • Phonology
  • Morphology
  • Syntax
  • Semantics
  • Pragmatics
• Reduced Languages
  • Whistling of Gomera fishermen
  • Sign Language of Plains Indians
  • “Computer” Languages
  • ...
Sound Systems
[Figures: vowel distribution (28 languages), plotted by F1 and F2; spectra and formants (F1 to F5); formants over time (F1 to F5)]
• Rotokas (Papua New Guinea)
  • vowels a/e/i/o/u
  • 6 consonants p/t/k/v/r/g
• !Xoo (southern Africa)
  • 141 sounds incl. click sounds
Tone Systems
• modulation of segmental information by prosody
• Intonation: stretches across phrases and sentences
• Tones: meaning of words
  • Swedish: 2 tones (anden – ándén)
  • German: aufbäumen – aufBäumen
  • Mandarin Chinese: 4 tones
  • Cantonese: 9 tones
  • Vietnamese: 8 tones
  • some Southeast Asian languages: up to 15 tones
[Figure: the 4 Mandarin Chinese tones on the syllable “i”, with the meanings “stuff”, “to suspect”, “chair/armchair” and “meaning”; intonation example]
Morphosyntax
• rules for the generation of words and grammatical structures
• strictly isolating languages: one morpheme, one word
  • Chinese is an isolating language
• the other extreme are the polysynthetic languages
• example from Yup’ik (Inuit)
    tuntussurqatarniksaitengqiggtuq
    tuntu     ssur  qatar  ni   ksaite  ngqiggt  uq
    reindeer  hunt  FUT    say  NEG     again    3SG:IND
    “he had not yet said that he wanted to hunt reindeer”
• basic principle: a stem (here a verb stem) is inflected by many affixes
• unusual for us: isolated core morphemes cannot be interpreted
  • “ssur” uttered in isolation does not make sense
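The segmentation of the Yup’ik word can be checked mechanically: the morphemes must concatenate back to the full word, and each morpheme needs exactly one gloss. The sketch below does that and prints an aligned interlinear display. The function names are assumptions, and the glosses are given in English here.

```python
word = "tuntussurqatarniksaitengqiggtuq"
morphemes = ["tuntu", "ssur", "qatar", "ni", "ksaite", "ngqiggt", "uq"]
glosses = ["reindeer", "hunt", "FUT", "say", "NEG", "again", "3SG:IND"]


def interlinear(word: str, morphemes: list[str], glosses: list[str]) -> list[tuple[str, str]]:
    """Pair morphemes with glosses after checking the segmentation."""
    if "".join(morphemes) != word:
        raise ValueError("morphemes do not add up to the word")
    if len(morphemes) != len(glosses):
        raise ValueError("each morpheme needs exactly one gloss")
    return list(zip(morphemes, glosses))


def render(pairs: list[tuple[str, str]]) -> str:
    """Format morphemes and glosses as two column-aligned lines."""
    widths = [max(len(morpheme), len(gloss)) for morpheme, gloss in pairs]
    top = "  ".join(m.ljust(w) for (m, _), w in zip(pairs, widths))
    bottom = "  ".join(g.ljust(w) for (_, g), w in zip(pairs, widths))
    return top.rstrip() + "\n" + bottom.rstrip()


pairs = interlinear(word, morphemes, glosses)
print(render(pairs))
```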
Dialog style
• norms for expressing things and activities differ
• example from Kilivila (Trobriand Islands, New Guinea)
    Person: Ambeya
            “Where are you going?”
    Gunter: (wants to say: I will wash myself)
            Bala bakakaya
            “I will go, I will take a bath”
    Host:   Bila bikakaya bike’ita bisisu bipaisewa
            3.FUT-go 3.FUT-bathe 3.FUT-come.back 3.FUT-stay 3.FUT-work
            “He will go, he will take a bath, he will come back, he will stay, he will work.”
            i.e. “He will take a bath, come back again and work with us.”
Pronouns
• in Kilivila
  • the inclusive and exclusive dual
  • exclusive “we two”: myself and the other, but not you
• in Paamese (Vanuatu archipelago)
  • in addition the paucal
  • “a few”
Spatial orientation
[Figure labels: absolute system (north, east, south, west); egocentric system (right, behind); above, below]
• Herberger would use the egocentric system to describe the scene
• Aborigines would choose the absolute system, which is hardly possible for us:
  • “the ball lies east of the player”
Awareness
• since 1866 efforts to preserve diversity in nature
• 1991 problem in the focus of the Linguistic Society of America
• 1992 discussion at the International Congress of Linguists
• 1992 working group for endangered languages in the German linguistic society
• 1993 UNESCO project to create the red list
• 2000 DOBES programme of the Volkswagen Foundation
• within 2 decades broad awareness amongst linguists
• David Crystal, on first-semester students:
  • 75% don’t know anything about the problem
  • most don’t see a problem
• why is that:
  • attention for tigers etc. but not for languages?
Factors are known
• external factors
  • military suppression
  • religious conversion
  • economic dominance
  • cultural dominance
  • educational suppression
• internal factors
  • negative attitude towards one’s own language
  • avoidance of discrimination
  • hope to earn (more) money
  • improvement of mobility
  • youngsters are trend followers
  • ...
MPI Archive – Content Overview