A Field Linguist’s Guide to Making Long Lasting Texts and Databases LSA Organized Session January 4, 2007 Anaheim, California
Organized by: Jeff Good and Heidi Johnson Open Language Archives Community (OLAC) Outreach Committee Moderator: Laura Welcher Speakers: Debbie Anderson, Michael Appleby, Jessica Boynton, Naomi Fox, Connie Dickinson
Presentations from this session will be posted at http://www.language-archives.org/news.html#olac07
Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have Laura Welcher The Rosetta Project / Long Now Foundation
A great way to freak out a linguist “To be in compliance with best practice recommendations (ahem), your interlinear glossed text needs to be in XML format with morphosyntactic tags that reference the GOLD ontology.”
Reality Check • There’s a difference between ideal best practice (which is still somewhat of a moving target) and a good, sufficient approximation. • Some common practices are far from ideal or sufficient (like saving the dictionary you spent 5 years working on as a Microsoft Word file). • We can easily modify these practices to produce archivable resources that will last. • And this can be done using tools that you already have, and knowledge that is easy to acquire. • Hence the title: Best practice in your back pocket: getting the most out of the tools that you have.
Best Practice • E-MELD project (Electronic Metastructure for Endangered Languages Data) • Goals: • Help preserve endangered languages data • Develop infrastructure for electronic archives • Defining best practice • E-MELD summer workshops http://www.emeld.org • Promoting best practice: • “School of Best Practice” at http://www.emeld.org/school/index.html
Good, Better, Best Practice • The information presented here comes from presentations of the E-MELD team, particularly the following: • Simons and Dry (2006) Good, Better, and Best Practice The Experience of the E-MELD Project http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf
The first consideration: working, presentation and archival formats • The process of creating digital language resources usually involves creating files in different formats: • Working format • Presentation format • Archival format
Working Format • The saved format of whatever program you are working in: • .doc (MS Word) • .xls (Excel) • .fp7 (FileMaker Pro) • This format is what you use for your own convenience and productivity • Typically this format is proprietary • Less typically, people may work in programs whose native format is not proprietary, automatically saving in .txt (plain text), .xml or .html (types of formatted plain text) • A proprietary working file format is not the only format you should have!
Archival Format • A very important format -- this format helps ensure that your resource will last and be usable well into the future • An archival format has LOTS of good qualities (Simons, 2004) • Lossless • Open Standard • Transparent • Supported by multiple vendors
Archival Format: Lossless • Avoid compressed formats that lose content • A good rule-of-thumb is to use uncompressed formats: • Text: .txt, .html, .xml • Images: .tiff, .bmp • Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix), but make sure it is PCM (uncompressed) • Video: .avi (some codecs), .rtv • Most compressed formats lose content, but some are lossless (.zip for text, black and white .gif for images, Apple Lossless (ALAC) for audio, the JPEG 2000 video codec in its lossless mode) -- use with caution!
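The advice to "make sure it is PCM" can be checked mechanically. The following is a minimal sketch, assuming Python's standard `wave` module is available; the function names `is_uncompressed_pcm` and `describe_wav` are illustrative and not part of any archive's tooling.

```python
import wave


def is_uncompressed_pcm(path):
    """Return True if `path` is a WAV file that the wave module reads as PCM."""
    try:
        with wave.open(path, "rb") as wav:
            # The wave module reports 'NONE' as the compression type for PCM.
            return wav.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        # Not a WAV file, or a WAV container with an encoding that the
        # wave module cannot read (for example a compressed codec).
        return False


def describe_wav(path):
    """Collect the basic technical facts worth noting in a metadata record."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
```

A file that fails this check is not necessarily unusable, but it deserves a closer look before it is treated as the archival copy.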
Archival Format: Open • Avoid proprietary formats like .doc, .xls, .fp7 • The company that produces the software may stop supporting the format, rendering your file unreadable • For your archival format, choose a file format that is “open standard” like .xml, .html, .pdf or .rtf • “Open standard” means that the specification of the format is publicly available, and anyone can implement it.
Archival Format: Transparent • Use a file format that is easy to interpret • Example: text files (.txt) • Have common characters like letters, numbers, punctuation • Virtually no formatting (tabs, returns) • Because of the simplicity of this file type, many programs can read it and make use of the data • Other transparent formats: .wav, .aiff can be read by any audio program • Not transparent: .zip, .mp3 (require a special algorithm for interpretation)
Archival Format: Supported • Prefer formats that are widely supported • If more vendors support it, it is less likely to become obsolete • This is another reason to prefer an open standard format to a proprietary one
Presentation Format • Presentation formats are those you choose for convenience and ease of access and display • It is fine for presentation formats to be compressed, so long as you make a lossless archival copy as well • Examples of presentation formats include .pdf files, .mp3 files, .jpg images, MPEG-2 video
So far, so good? • As a responsible linguist creating digital language documentation that will last well into the future you… • Know the difference between a working, presentation, and archival file format • Know what makes a good archival format (LOTS) • Maintain an archival format of your data • Anything beyond this? Yes, a bit more…
Best Practice Digital Resources are… • Preservable in formats that are not vulnerable to decay or obsolescence (see LOTS) • Intelligible so that content is easily understood by future scholars • Accessible so that resources are easily discovered and accessed • They are also interoperable, but this is mostly a concern of archives and services (Simons and Dry, 2006)
Create Preservable Resources • Linguists are responsible for making preservable resources • That is, creating archival formats that follow the principles of LOTS
Create Intelligible Resources • In order to create resources that are intelligible to others, you must document your practices! • Documentation includes: • Your markup practices • The encoding you use • Metadata about your resources • This information should be kept in a file or files in an archival format, and archived along with your resources.
Presentational Markup • Many people use presentational markup, particularly in the working formats like Microsoft Word. • Presentational markup means that aspects of the presentation (like bold, italics, indenting) are themselves meaningful • For example…
Example of Presentational Markup
AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas.
<bold>AS_5.2.1978_audio</bold>
<plain.text>Alice Spear</plain.text>
<italics>“Crane Boy”</italics>
<plain.text>May 2, 1978</plain.text>
<plain.text>Mayetta, Kansas</plain.text>
Presentational Markup • Presentational markup is not recommended. BUT if you do use it, describe all meaningful aspects (e.g. “bold” means head word, “italics” is used for the part of speech)
Descriptive Markup • It is better practice to use descriptive markup, like XML • XML is basically text with “tags” that provide information about what is between the tags • <headword>mnomen</headword> • <gloss>rice</gloss> • Tags can also be used to group information, much like you would group information in a database record, and have a whole set of information in a database
Example of Descriptive Markup
AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas.
<ID>AS_5.2.1978_audio</ID>
<speaker>Alice Spear</speaker>
<description>“Crane Boy”</description>
<recording.date>May 2, 1978</recording.date>
<location>Mayetta, Kansas</location>
Descriptive Markup: XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="archive.xsl"?>
<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <subject.language code="x-sil-POT"/>
    <language code="en"/>
    <format>Analog audio recording on Cassette tape</format>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <description>“Crane Boy” narrative told in Potawatomi and in English</description>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
    <relation>digital audio: AS_5.2.1978_audio.wav, interlinear text: AS_5.2.1978_audio.txt</relation>
    <type.linguistic code="primary_text"/>
    <rights>Some restrictions; contact field linguist</rights>
  </record>
</my.archive>
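One practical payoff of descriptive markup is that any general-purpose XML parser can pull out the pieces by name. The sketch below assumes Python's standard `xml.etree.ElementTree` module and uses a shortened copy of the record from this slide; the `summarize` helper is illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the slide's record, with plain straight quotes.
RECORD_XML = """<?xml version="1.0" encoding="UTF-8"?>
<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
  </record>
</my.archive>
"""


def summarize(record):
    """Pick named fields out of one <record> element."""
    return {
        "identifier": record.findtext("identifier"),
        "speakers": [
            c.text
            for c in record.findall("contributor")
            if c.get("refine") == "speaker"
        ],
        "date": record.find("date").get("code"),
        "location": record.findtext("coverage"),
    }


# Parse from bytes so the encoding declaration in the file is honored.
root = ET.fromstring(RECORD_XML.encode("utf-8"))
summaries = [summarize(r) for r in root.findall("record")]
```

The same lookup would be guesswork with presentational markup, where a program cannot tell which plain-text run is the speaker and which is the location.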
Descriptive Markup: XML • It is a good practice to use standard tags where they are available. • OLAC has a set of tags that you would use for metadata to describe your resources • GOLD has a set of tags used for morphosyntactic description • Otherwise, be sure to document the meaning of the tags that you use • Although some people feel comfortable working in XML, many don’t like to use it as a working format. • Fortunately many common programs now allow you to save your work as an XML file.
The Advantage of XML • Besides creating an archival data file, XML has other advantages • By creating stylesheets, you can give the same XML file different presentation forms • For example…
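As a rough illustration of one file feeding several presentations, the sketch below renders a single record two ways. It is a stand-in written in Python rather than an actual XSL stylesheet, and the function names and output layouts are assumptions made for this example.

```python
import html
import xml.etree.ElementTree as ET

RECORD_XML = """<record>
  <identifier>AS_5.2.1978_audio</identifier>
  <contributor refine="speaker">Alice Spear</contributor>
  <description>"Crane Boy" narrative told in Potawatomi and in English</description>
  <date code="1978-05-02"/>
  <coverage>Mayetta, Kansas</coverage>
</record>"""


def extract(record):
    """Read the fields that both presentations share."""
    return {
        "id": record.findtext("identifier"),
        "speaker": record.findtext("contributor[@refine='speaker']"),
        "description": record.findtext("description"),
        "date": record.find("date").get("code"),
        "place": record.findtext("coverage"),
    }


def to_citation(record):
    """Presentation 1: a one-line, human-readable citation."""
    f = extract(record)
    return "{speaker}. {date}. {description}. {place}. [{id}]".format(**f)


def to_html_row(record):
    """Presentation 2: a table row for a web catalogue page."""
    f = extract(record)
    cells = (f["id"], f["speaker"], f["date"])
    return "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in cells) + "</tr>"


record = ET.fromstring(RECORD_XML)
```

Neither output changes the underlying XML, which remains the archival copy.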
Delimited Text • Another kind of markup that you might find yourself using is delimited text. • Spreadsheet and database programs allow you to export your data as text, delimited by a particular character • Comma separated text (.csv) • Tab separated text (.tab) • To help with intelligibility, create an initial record where the name of each field / cell is given inside the record itself. That way, the names of your fields / cells will be exported and saved along with the rest of your data. • Text data exported this way is good practice, particularly if you are careful about documenting your practices inside your fields / cells (for more on this see following slides).
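The point about an initial record of field names can be sketched in a few lines. This example assumes Python's standard `csv` module and reuses the headword and gloss from the descriptive markup slide; the field names themselves are illustrative.

```python
import csv
import io

FIELDS = ["headword", "gloss"]


def export_tab_delimited(entries):
    """Write entries as tab-delimited text whose first line names the fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()  # the initial record that carries the field names
    writer.writerows(entries)
    return buf.getvalue()


def read_tab_delimited(text):
    """Read the text back; the header line tells the reader what each column is."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


exported = export_tab_delimited([{"headword": "mnomen", "gloss": "rice"}])
```

Because the names travel inside the file, someone opening it decades later does not need the original database to know which column is which.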
Other aspects of markup • Document any special conventions that you use • What do your morpheme boundary markers mean (+ / - / = …any others?) • What glossing conventions do you use? Give the full names of abbreviations (e.g. POS means ‘possessive’, PV means ‘preverb’). • Describe grammatical terms that you use (like ‘aorist’, or ‘preverb’) and what they mean for the language you are describing. (You don’t have to write a grammar -- a sentence or two describing each term is sufficient.) • Also note if you are using standard terminology sets, like Leipzig Glossing Rules, or GOLD terminology
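A small script can flag glossing abbreviations that have not yet been documented. The sketch below assumes that abbreviations are written in capitals (optionally preceded by a person number, as in 1SG); the gloss line used with it is invented purely for illustration and is not real language data.

```python
import re

# The documented abbreviations, taken from the examples on this slide.
ABBREVIATIONS = {"POS": "possessive", "PV": "preverb"}

# Assumption: an abbreviation is optional digits followed by two or more capitals.
ABBREVIATION_PATTERN = re.compile(r"\b[0-9]*[A-Z]{2,}\b")


def undocumented_abbreviations(gloss_line, documented):
    """Return abbreviations used in a gloss line that lack a documented meaning."""
    found = set(ABBREVIATION_PATTERN.findall(gloss_line))
    return sorted(found - set(documented))
```

For the invented gloss line `PV-go POS-house-LOC`, the function reports `LOC`, a reminder to add it to the abbreviation list before archiving.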
Document the Encoding • Identify the character set you are using • Document any non-standard characters • Best practice is to use Unicode
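One way to start documenting special characters is to list every non-ASCII character that appears in a text, with its Unicode code point and name. This sketch assumes the text is already in Unicode and uses Python's standard `unicodedata` module; the function name is illustrative.

```python
import unicodedata


def character_inventory(text):
    """List each distinct non-ASCII character with its code point and Unicode name."""
    chars = sorted({ch for ch in text if ord(ch) > 127})
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in chars
    ]
```

The resulting list can be saved as plain text and archived next to the data as part of the encoding documentation.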
Create Metadata • You will need to create some additional information about your resources • Metadata usually includes information about: • The setting (time, date, participants, location) • The language (ISO 639-3) • Linguistic type (text, grammar, lexicon) and subject • Access restrictions • There are metadata standards for language resources: OLAC and IMDI
OLAC Metadata Elements http://www.language-archives.org/OLAC/olacms.html
Create Metadata • Keep a metadata record for each of your resources. • The records should themselves be in an archival format. This could be: • A text file (good) • Delimited text, exported from a simple database file (good) • An XML file (better) • An OLAC or IMDI formatted XML file (best) • Your archivist may have a preference about metadata formats, and prefer something relatively simple (like a paper form) if the archive will be manually entering the metadata. • Archive this file along with the rest of your resources.
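Moving from the "good" options to the "better" one can be a small step. The sketch below turns a simple set of field/value pairs into a generic XML record using Python's standard `xml.etree.ElementTree`; the tag names are borrowed from the example record earlier in these slides and are not the official OLAC or IMDI schema.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields):
    """Wrap field/value pairs in a generic <record> element and return it as text."""
    record = ET.Element("record")
    for name, value in fields.items():
        # Each field name becomes a tag, so it must be a valid XML name.
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")


xml_text = metadata_to_xml(
    {"identifier": "AS_5.2.1978_audio", "coverage": "Mayetta, Kansas"}
)
```

Reaching the "best" option would mean mapping these fields onto the element names and vocabularies that OLAC or IMDI actually define, ideally with guidance from the receiving archive.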
Make your resources accessible • Archive, archive, archive! (Not just on your own, or your departmental server. Archives are committed to the long-term preservation and availability of your resources.) • Before you leave to do fieldwork, or when you are writing your grant, establish contact with the archive where you intend to deposit your resources • Archivists will • give you guidelines for creating archival files • help you select the best metadata set • give you information about setting access levels • When you return, the first thing to do is send your files, along with the metadata and markup descriptions to the archive • Most archives will then give you an ID number for your resources that you can then cite in your publications
A Community Responsibility • Best practice involves what individual field linguists do, but also how we collectively use and care for these resources • This broader community involves • Other researchers like yourself who create resources • A growing set of interconnected digital language archives that care for, protect, and disseminate your resources • People who develop tools and services to make your resources locatable, searchable, and reusable • Others: linguistics organizations, organizations like OLAC and DELAMAN, funding agencies who promote the work of this community
Unicode • Debbie Anderson “A field linguists’ guide to Unicode” • Michael Appleby “How to use Unicode on your computer”
Field Case Studies: Texts and Databases • Jessica Boynton • “Transcription, Time-Alignment and Annotation” • Naomi Fox • “Using Filemaker Pro to produce archivable language documentation” • Connie Dickinson • “The Tsafiki Text Factory”
Panel Session • Talks are 25 minutes, consecutive. • Please remember or write down your questions! • We will field them in a panel session after the talks.