The Rosetta Project ALL Language Archive Presented by: Laura Buszard-Welcher The Rosetta Project / University of California, Berkeley A Project of the Long Now Foundation & A National Science Digital Library www.rosettaproject.org
Primary Goals • Support the documentation of the world’s nearly 7000 languages through building • A digital archive of language documentation • A linguistically sophisticated site that is also useful and interesting for the general public • Networks of speakers, educators, linguists • Contributes to the effort to document endangered languages • Promotes linguistic diversity by educating the public about languages with small numbers of speakers.
Secondary Goals • Support metadata standardization and interoperability • OLAC • EMELD • Develop tools for collaborative linguistic research • Endangered Language Query Room • Wordlist Tool • Collaborative document editing/creation (new site)
Roles • The Long Now Foundation • Parent organization of The Rosetta Project • Projects, seminars on topics that foster long term thinking • The National Science Digital Library • U.S. National Science Foundation Program • Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education • Sponsor of Rosetta Project (NSF 333727) • Stanford University • Online and offline storage of Rosetta materials
Project History: The 1000 Language Archive • Initiated by The Long Now Foundation • Wanted to experiment with new microetching technology, looking for suitable content • Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages
Why language information? • Most natural human languages are products of millennia of human history (therefore a good long term thinking project) • Repositories of cultural information • Languages showcase • Human intellectual sophistication • Cultural diversity • To draw attention to the critical issue of language endangerment
The Rosetta Disk • Next generation microfiche • Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk • Developed by Los Alamos Laboratories and Norsam Technologies • Reading the disk requires a microscope, either optical or electron, depending on the density of encoding
The Rosetta Stone • Not us! (196 BC) • Parallel text written in three scripts: • Hieroglyphic • Demotic (script form) • Greek • The key to deciphering Egyptian Hieroglyphs
Design of the Disk • Original design has human-eye readable text (Genesis text) and micro-etched text inside an index • New design has human-eye readable text (instructions) on one side and microetched images on the reverse
In-House Scanning • HP CapShare Scanners • Scan printed page in multiple passes, any direction • Page is ‘assembled’ into one image • Stores about 50 pages at a time (300 dpi bitmap .tif) • Uploads numerically sequenced images to computer by infrared port
In-House Scanning • Minolta PS 7000 Overhead • Bitmap and grayscale scans up to 600 dpi • Multiple sizes, orientations • Single page / double page spread (good for text collections with verso annotations) • Best for fragile books, manuscripts that would be damaged by hand scanning
Rosetta Project Web Site • Welcome • Search for a language • Language overview page • Browse (by name, family, country) • Wordlist tool
Projects • Endangered Language Query Rooms • Digital Online Curation Services for Endangered Language Archives (DOCS) • Wordlist Tool • LangGator
Endangered Language Query Rooms http://emeld.rosettaproject.org/
Potawatomi Query Room Re: Bozho by Donald Perrot (host) on July 9 2004, 8:53 PM Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin. I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI. Re: Bozho by Justin Neely on September 7 2004, 1:16 PM Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas [Hello Neaseno and Lameen my name is Zagnenibi. I’m Native and Potawatomi. I belong to the Citizen Band. I’m Crane Clan . I’m from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.]
Taking Conversational Risks by [TL] on July 17 2004, 10:30 AM mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego [I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too, maybe. So long for now, Mtego.] Re: onago egi zhejkeyak by [JN] on July 17 2004, 8:12 PM mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se [I should go to the lake today. The water is cold here. I wish the water were warm. I’ll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]
DOCS Project • Digital Online Curation Services for Endangered Language Archives • Many small language archives are beginning to digitize their materials • Lack technical infrastructure to bring resources online • Goal is to provide access through Rosetta
DOCS Project Archives • Endangered Language Fund (ELF) • Survey for California and Other American Indian Languages (SCOIL) • The Alaska Native Languages Center (ANLC) • Max-Planck Institute for Evolutionary Anthropology (Leipzig)
Wordlist Tool • Swadesh lists (100, 200, 207 terms) from: • Tryon's Comparative Austronesian Dictionary (rekeyed) • Tim Usher's Indo-Pacific database (2002 version) • Paul Whitehouse's Australian and New Guinea database (2002 version) • George Starostin's Dravidian database • Ilya Peiros' Mon Khmer database • Total of 1,384 languages, 3,090 lists online • Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.
LangGator • A linguistic “Wayback Machine” • Language resource location and aggregation • Use alternate language names, spellings • Deutsch, Hochdeutsch, High German, Allemande • Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca • Character identification (inventory, distribution) • Dera (Chadic, Nigeria) • Dera (Trans-New Guinea, Indonesia) • Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms • Archiving through Internet Archive • Serve results through the Rosetta site
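The alternate-name lookup and the length-weighted seeding described on the LangGator slide can be sketched roughly as follows. This is an illustration only, not LangGator's actual code: the function names, the canonical label "German", and the use of term length as the sampling weight are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Alternate names taken from the slide; the canonical label "German" is an assumption.
LANGUAGES = {
    "German": ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
    "Dera (Chadic, Nigeria)": ["Dera"],
    "Dera (Trans-New Guinea, Indonesia)": ["Dera"],
}

def build_name_index(languages):
    """Map each case-folded alternate name to the set of languages that use it."""
    index = defaultdict(set)
    for language, names in languages.items():
        for name in names:
            index[name.casefold()].add(language)
    return index

def lookup(index, name):
    """Return every language known under this name (more than one if ambiguous)."""
    return sorted(index.get(name.casefold(), set()))

def pick_seed_terms(terms, k, rng):
    """Sample k seed terms with replacement, weighted by length (hypothetical weighting)."""
    weights = [len(term) for term in terms]
    return rng.choices(terms, weights=weights, k=k)
```

A name such as "Dera" resolves to two unrelated languages here, which is why some further signal (the slide mentions character inventory and distribution) would be needed to tell them apart.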
Collaborations • Electronic Metastructure for Endangered Languages Data (E-MELD) • General Ontology for Linguistic Description (GOLD) • Open Language Archives Community (OLAC)
E-MELD • Electronic Metastructure for Endangered Language Data • School of Best Practice http://emeld.org/school/index.html • Guidelines and examples for putting linguistic data into best practice digital formats • XML with XML Schema or DTD • Mapping terminology to ontology (GOLD) • FIELD lexical database tool http://emeld.org/tools/field/beta/ • Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)
GOLD • General Ontology for Linguistic Description • Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology) • Best practice resources should include a mapping between the researcher’s terms, and a standard set, known as the ‘profile’ • ‘independent’ (mine) = ‘main clause’ (GOLD) • ‘obviative’ (mine) = ‘fourth person’ (GOLD) • The standard terminology set can then allow sophisticated searches across disparate resources.
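To illustrate how a profile lets one search reach differently labeled resources, here is a minimal sketch. It does not use OWL or the real GOLD ontology: the profiles are plain dictionaries, the first profile reuses the two mappings from the slide, and the second resource and its term "matrix clause" are hypothetical.

```python
# Each profile maps a researcher's own term to the standard (GOLD) term.
PROFILES = {
    # Mappings from the slide.
    "my_grammar": {"independent": "main clause", "obviative": "fourth person"},
    # Hypothetical second resource with different local terminology.
    "other_grammar": {"matrix clause": "main clause"},
}

# Hypothetical annotations: (resource, local term) pairs found in the data.
ANNOTATIONS = [
    ("my_grammar", "independent"),
    ("other_grammar", "matrix clause"),
    ("my_grammar", "obviative"),
]

def to_standard(resource, term):
    """Translate a local term to the standard term; unmapped terms pass through."""
    return PROFILES.get(resource, {}).get(term, term)

def search(standard_term, annotations):
    """Find annotations in any resource whose local term maps to standard_term."""
    return [
        (resource, term)
        for resource, term in annotations
        if to_standard(resource, term) == standard_term
    ]
```

Searching for "main clause" returns the matching annotation from both resources, even though neither uses that label locally.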
OLAC • Open Language Archives Community • Set of 23 metadata elements and controlled vocabularies (based on Dublin Core) • Subject.language (language described, rather than audience language) uses SIL language codes • Type.linguistic (grammar, lexicon, text) • IMDI (ISLE Metadata Initiative) has 135 elements • Recommended extensions (Discourse Types, Linguistic Field, Participant roles) • Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
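A cross-archive search over shared metadata can be sketched as below. This is a simplified illustration, not the OLAC XML format or its search service: records are plain dictionaries keyed by the two elements named on the slide, and the archive names, titles, and language codes ("XXA", "XXB") are placeholders rather than real SIL codes.

```python
# Hypothetical archives, each holding records that share the same element names.
ARCHIVES = {
    "archive_a": [
        {"title": "Sketch grammar", "Subject.language": "XXA", "Type.linguistic": "grammar"},
        {"title": "Field wordlist", "Subject.language": "XXB", "Type.linguistic": "lexicon"},
    ],
    "archive_b": [
        {"title": "Narrative texts", "Subject.language": "XXA", "Type.linguistic": "text"},
    ],
}

def search_archives(archives, language_code, linguistic_type=None):
    """Return (archive, title) pairs for records describing the given language.

    Because every archive uses the same element names and vocabularies,
    one query works across all of them.
    """
    hits = []
    for archive_name, records in archives.items():
        for record in records:
            if record["Subject.language"] != language_code:
                continue
            if linguistic_type and record["Type.linguistic"] != linguistic_type:
                continue
            hits.append((archive_name, record["title"]))
    return hits
```

Filtering on Subject.language finds resources about a language regardless of which archive holds them; adding Type.linguistic narrows the result to, say, grammars only.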
URLs • Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool). • Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/. • The Ethnologue http://www.ethnologue.com. • General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org. • ISLE MetaData Initiative (IMDI) http://www.mpi.nl/IMDI/. • National Science Digital Library (NSDL) http://nsdl.org • Open Language Archives Community (OLAC) http://www.language-archives.org. • The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org.
Credits • This project is funded by the US National Science Digital Library (NSF 333727)