
A Field Linguist’s Guide to Making Long Lasting Texts and Databases

A Field Linguist’s Guide to Making Long Lasting Texts and Databases. LSA Organized Session January 4, 2007 Anaheim, California. Organized by: Jeff Good and Heidi Johnson Open Language Archives Community (OLAC) Outreach Committee Moderator: Laura Welcher Speakers: Debbie Anderson,


Presentation Transcript


  1. A Field Linguist’s Guide to Making Long Lasting Texts and Databases LSA Organized Session January 4, 2007 Anaheim, California

  2. Organized by: Jeff Good and Heidi Johnson Open Language Archives Community (OLAC) Outreach Committee Moderator: Laura Welcher Speakers: Debbie Anderson, Michael Appleby, Jessica Boynton, Naomi Fox, Connie Dickinson

  3. Presentations from this session will be posted at http://www.language-archives.org/news.html#olac07

  4. Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have Laura Welcher The Rosetta Project / Long Now Foundation

  5. A great way to freak out a linguist “To be in compliance with best practice recommendations (ahem), your interlinear glossed text needs to be in XML format with morphosyntactic tags that reference the GOLD ontology.”

  6. Reality Check • There’s a difference between ideal best practice (which is still somewhat of a moving target) and a good, sufficient approximation of it. • Some common practices are neither ideal nor sufficient (like saving the dictionary you spent 5 years on as a Microsoft Word document). • We can easily modify these practices to produce archivable resources that will last. • And this can be done using tools that you already have, and knowledge that is easy to acquire. • Hence the title: Best practice in your back pocket: getting the most out of the tools you have.

  7. Best Practice • E-MELD project (Electronic Metastructure for Endangered Languages Data) • Goals: • Help preserve endangered languages data • Develop infrastructure for electronic archives • Defining best practice • E-MELD summer workshops http://www.emeld.org • Promoting best practice: • “School of Best Practice” at http://www.emeld.org/school/index.html

  8. Good, Better, Best Practice • The information presented here comes from presentations of the E-MELD team, particularly the following: • Simons and Dry (2006) Good, Better, and Best Practice The Experience of the E-MELD Project http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf

  9. The first consideration: working, presentation and archival formats • The process of creating digital language resources usually involves creating files in different formats: • Working format • Presentation format • Archival format

  10. Working Format • The saved format of whatever program you are working in: • .doc (MS Word) • .xls (Excel) • .fp7 (FileMaker Pro) • This format is what you use for your own convenience and productivity • Typically this format is proprietary • Less typically, people may work in programs whose native format is not proprietary, automatically saving in .txt (plain text), .xml or .html (types of formatted plain text) • A proprietary working file format is not the only format you should have!

  11. Archival Format • A very important format -- this format helps ensure that your resource will last and be usable well into the future • An archival format has LOTS of good qualities (Simons, 2004) • Lossless • Open Standard • Transparent • Supported by multiple vendors

  12. Archival Format: Lossless • Avoid compressed formats that lose content • A good rule-of-thumb is to use uncompressed formats: • Text: .txt, .html, .xml • Images: .tiff, .bmp • Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix) but make sure it is PCM (uncompressed) • Video: .avi (some codecs), .rtv • Most compressed formats lose content, but some are lossless (.zip for text, black and white .gif for images, Apple Lossless / ALAC for audio, jpeg2000 video codec) -- use with caution!
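
The "make sure it is PCM" advice can be checked mechanically. The following is a minimal sketch using Python's standard `wave` module; the function name `is_uncompressed_pcm` and the demo file are illustrative assumptions, not part of the original presentation.

```python
import os
import tempfile
import wave


def is_uncompressed_pcm(path):
    """Return True if Python's wave module reads `path` as uncompressed PCM."""
    try:
        with wave.open(path, "rb") as wav:
            # The wave module reports "NONE" for uncompressed audio.
            return wav.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        # Raised for files that are not WAV or use an encoding wave cannot read.
        return False


# Demonstration: write one second of 16-bit mono silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)

print(is_uncompressed_pcm(demo_path))
```

A check like this only covers .wav files; .aiff and .au would need their own readers.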

  13. Archival Format: Open • Avoid proprietary formats like .doc, .xls, .fp7 • The company that produces the software may stop supporting the format, rendering your file unreadable • For your archival format, choose a file format that is “open standard” like .xml, .html, .pdf or .rtf • “Open standard” means that the specification of the format is publicly available, and anyone can implement it.

  14. Archival Format: Transparent • Use a file format that is easy to interpret • Example: text files (.txt) • Have common characters like letters, numbers, punctuation • Virtually no formatting (tabs, returns) • Because of the simplicity of this file type, many programs can read it and make use of the data • Other transparent formats: .wav, .aiff can be read by any audio program • Not transparent: .zip, .mp3 (require a special algorithm for interpretation)

  15. Archival Format: Supported • Prefer formats that are widely supported • If more vendors support it, it is less likely to become obsolete • This is another reason to prefer an open standard format to a proprietary one

  16. Presentation Format • Presentation formats are those you choose for convenience, ease of access, and display • It is fine for presentation formats to be compressed, so long as you make a lossless archival copy as well • Examples of presentation formats include .pdf files, .mp3 files, .jpg images, MPEG-2 video

  17. So far, so good? • As a responsible linguist creating digital language documentation that will last well into the future you… • Know the difference between a working, presentation, and archival file format • Know what makes a good archival format (LOTS) • Maintain an archival format of your data • Anything beyond this? Yes, a bit more…
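
As a rough illustration of "maintain an archival format of your data", the sketch below lists working-format files that have no archival-format companion. The extension sets come from the slides above; the pairing-by-base-name convention, the function name, and the sample file names are assumptions made for this example.

```python
from pathlib import PurePath

# Extensions taken from the slides; extend these sets for your own project.
WORKING = {".doc", ".xls", ".fp7"}
ARCHIVAL = {".txt", ".xml", ".html", ".tiff", ".bmp", ".wav", ".aiff", ".au"}


def missing_archival_copies(filenames):
    """Return working-format files with no same-named archival-format file."""
    archived_stems = {
        PurePath(name).stem
        for name in filenames
        if PurePath(name).suffix.lower() in ARCHIVAL
    }
    return sorted(
        name
        for name in filenames
        if PurePath(name).suffix.lower() in WORKING
        and PurePath(name).stem not in archived_stems
    )


# Hypothetical project folder listing.
files = ["dictionary.doc", "dictionary.xml", "texts.fp7", "crane_boy.wav"]
print(missing_archival_copies(files))
```

Here `texts.fp7` is reported because no archival file shares its base name, while `dictionary.doc` is covered by `dictionary.xml`.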

  18. Best Practice Digital Resources are… • Preservable in formats that are not vulnerable to decay or obsolescence (see LOTS) • Intelligible so that content is easily understood by future scholars • Accessible so that resources are easily discovered and accessed • They are also interoperable, but this is mostly a concern of archives and services (Simons and Dry, 2006)

  19. Create Preservable Resources • Linguists are responsible for making preservable resources • That is, creating archival formats that follow the principles of LOTS

  20. Create Intelligible Resources • In order to create resources that are intelligible to others, you must document your practices! • Documentation includes: • Your markup practices • The encoding you use • Metadata about your resources • This information should be kept in a file or files in an archival format, and archived along with your resources.

  21. Presentational Markup • Many people use presentational markup, particularly in the working formats like Microsoft Word. • Presentational markup means that aspects of the presentation (like bold, italics, indenting) are themselves meaningful • For example…

  22. Example of Presentational Markup AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas. <bold>AS_5.2.1978_audio</bold> <plain.text>Alice Spear</plain.text> <italics>“Crane Boy”</italics> <plain.text>May 2, 1978</plain.text> <plain.text>Mayetta, Kansas</plain.text>

  23. Presentational Markup • Presentational markup is not recommended. BUT if you do use it, describe all meaningful aspects (e.g. “bold” means head word, “italics” is used for the part of speech)

  24. Descriptive Markup • It is better practice to use descriptive markup, like XML • XML is basically text with “tags” that provide information about what is between the tags • <headword>mnomen</headword> • <gloss>rice</gloss> • Tags can also be used to group information, much as you would group information in a database record, so that a whole set of records forms a database

  25. Example of Descriptive Markup AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas. <ID>AS_5.2.1978_audio</ID> <speaker>Alice Spear</speaker> <description>“Crane Boy”</description> <recording.date>May 2, 1978</recording.date> <location>Mayetta, Kansas</location>
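
One practical benefit of descriptive tags is that software can retrieve each field by name. The sketch below parses the tagged fields from the slide with Python's standard `xml.etree.ElementTree`; the enclosing `<record>` element is an assumption added here, since XML requires a single root element.

```python
import xml.etree.ElementTree as ET

# The tagged fields from the slide, wrapped in a <record> root element
# (the wrapper is an assumption; XML requires a single root).
record_xml = """
<record>
  <ID>AS_5.2.1978_audio</ID>
  <speaker>Alice Spear</speaker>
  <description>“Crane Boy”</description>
  <recording.date>May 2, 1978</recording.date>
  <location>Mayetta, Kansas</location>
</record>
""".strip()

root = ET.fromstring(record_xml)

# Each child element becomes a named field.
fields = {child.tag: child.text for child in root}

print(fields["speaker"])
print(fields["recording.date"])
```

The same lookup is not possible with the presentational version, where a date and a place are both just "plain text".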

  26. Descriptive Markup: XML <?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="archive.xsl"?> <my.archive> <record> <identifier>AS_5.2.1978_audio</identifier> <subject.language code="x-sil-POT"/><language code="en"/> <format>Analog audio recording on Cassette tape</format> <contributor refine="speaker">Alice Spear</contributor> <contributor refine="researcher">Laura Buszard-Welcher</contributor> <description>“Crane Boy” narrative told in Potawatomi and in English</description> <date code="1978-05-02"/> <coverage>Mayetta, Kansas</coverage> <relation>digital audio: AS_5.2.1978_audio.wav, interlinear text: AS_5.2.1978_audio.txt</relation> <type.linguistic code="primary_text"/> <rights>Some restrictions; contact field linguist</rights> </record> </my.archive>
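
Before archiving an XML file it is worth confirming that it is well-formed. A common pitfall is that word processors and slide software replace straight quotes with curly ones, which an XML parser rejects around attribute values. The sketch below is one way to check this with Python's standard library; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


straight = '<date code="1978-05-02"/>'
curly = '<date code=“1978-05-02”/>'   # curly quotes, as a word processor might insert

print(is_well_formed(straight))
print(is_well_formed(curly))
```

Curly quotes remain perfectly legal inside element content, as in the description of “Crane Boy”; only the quotes that delimit attribute values must be straight.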

  27. Descriptive Markup: XML • It is a good practice to use standard tags where they are available. • OLAC has a set of tags that you would use for metadata to describe your resources • GOLD has a set of tags used for morphosyntactic description • Otherwise, be sure to document the meaning of the tags that you use • Although some people feel comfortable working in XML, many don’t like to use it as a working format. • Fortunately many common programs now allow you to save your work as an XML file.

  28. The Advantage of XML • Besides creating an archival data file, XML has other advantages • By creating stylesheets, you can give the same XML file different presentation forms • For example…
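
The idea of one data file with several presentation forms can be sketched without a stylesheet. XSLT is the usual tool for this; the Python below is only a stand-in to illustrate the principle, and the shortened record and function names are assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# A shortened version of the archive record, used only for illustration.
record = ET.fromstring(
    "<record>"
    "<identifier>AS_5.2.1978_audio</identifier>"
    "<description>Crane Boy narrative</description>"
    "<coverage>Mayetta, Kansas</coverage>"
    "</record>"
)


def as_citation(rec):
    """Render the record as one line of plain text."""
    return "{}: {} ({})".format(
        rec.findtext("identifier"),
        rec.findtext("description"),
        rec.findtext("coverage"),
    )


def as_html_row(rec):
    """Render the same record as an HTML table row."""
    cells = "".join(
        "<td>{}</td>".format(html.escape(child.text or "")) for child in rec
    )
    return "<tr>{}</tr>".format(cells)


print(as_citation(record))
print(as_html_row(record))
```

Both outputs are generated from the same source record, so a correction to the data only has to be made once.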

  29. Delimited Text • Another kind of markup that you might find yourself using is delimited text. • Spreadsheet and database programs allow you to export your data as text, delimited by a particular character • Comma separated text (.csv) • Tab separated text (.tab) • To help with intelligibility, create an initial record where the name of each field / cell is given inside the record itself. That way, the names of your fields / cells will be exported and saved along with the rest of your data. • Text data exported this way is good practice, particularly if you are careful about documenting your practices inside your fields / cells (for more on this see following slides).
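
A minimal sketch of tab-delimited export with the field names in the first record, using Python's standard `csv` module. The field names `headword` and `gloss` follow the earlier XML example; the in-memory buffer stands in for a real export file.

```python
import csv
import io

# One lexicon record, reusing the headword and gloss from the XML example.
entries = [{"headword": "mnomen", "gloss": "rice"}]

# Export: the first line written is the header row with the field names.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["headword", "gloss"], delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(entries)
exported = buffer.getvalue()
print(exported)

# Reading the text back: the header row tells the reader what each column means.
rows = list(csv.DictReader(io.StringIO(exported), delimiter="\t"))
print(rows[0]["gloss"])
```

Because the names travel with the data, a future user does not need the original spreadsheet or database program to know which column is which.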

  30. Other aspects of markup • Document any special conventions that you use • What do your morpheme boundary markers mean (+ / - / = …any others?) • What glossing conventions do you use? Give the full names of abbreviations (e.g. POS means ‘possessive’, PV means ‘preverb’). • Describe grammatical terms that you use (like ‘aorist’, or ‘preverb’) and what they mean for the language you are describing. (You don’t have to write a grammar -- a sentence or two describing the term is sufficient.) • Also note if you are using standard terminology sets, like Leipzig Glossing Rules, or GOLD terminology
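
One way to keep an abbreviation key complete is to compare it against the labels that actually occur in your glosses. The sketch below assumes that gloss labels are written as two or more capital letters; the key contains only the two abbreviations from the slide, and the gloss line is made up for the example.

```python
import re

# Abbreviation key; POS and PV come from the slide.
ABBREVIATIONS = {"POS": "possessive", "PV": "preverb"}


def undocumented_abbreviations(gloss_line, key=ABBREVIATIONS):
    """Return upper-case gloss labels in `gloss_line` that are missing from `key`."""
    # Assumption: a label is a capital letter followed by capitals or digits.
    labels = set(re.findall(r"\b[A-Z][A-Z0-9]+\b", gloss_line))
    return sorted(labels - set(key))


# A made-up gloss line; AOR is deliberately left out of the key.
print(undocumented_abbreviations("PV-go-AOR POS-house"))
```

Labels that begin with a digit, such as person-number combinations, would need a broader pattern than the one assumed here.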

  31. Document the Encoding • Identify the character set you are using • Document any non-standard characters • Best practice is to use Unicode
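
If your text is in Unicode, an inventory of the special characters you use can be generated rather than typed by hand. The sketch below uses Python's standard `unicodedata` module; the function name and the sample string are illustrative assumptions.

```python
import unicodedata


def non_ascii_inventory(text):
    """Map each non-ASCII character in `text` to its code point and Unicode name."""
    inventory = {}
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNNAMED")
            inventory[ch] = "U+{:04X} {}".format(ord(ch), name)
    return inventory


# Hypothetical transcription sample containing two phonetic characters.
sample = "ŋa ɛ"
for char, description in non_ascii_inventory(sample).items():
    print(char, description)
```

The resulting list can be saved as a plain text file and archived alongside the data as part of the encoding documentation.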

  32. Create Metadata • You will need to create some additional information about your resources • Metadata usually includes information about: • The setting (time, date, participants, location) • The language (ISO 639-3) • Linguistic type (text, grammar, lexicon) and subject • Access restrictions • There are metadata standards for language resources: OLAC and IMDI

  33. OLAC Metadata Elements http://www.language-archives.org/OLAC/olacms.html

  34. Create Metadata • Keep a metadata record for each of your resources. • The records should themselves be in an archival format. This could be: • A text file (good) • Delimited text, exported from a simple database file (good) • An XML file (better) • An OLAC or IMDI formatted XML file (best) • Your archivist may have a preference about metadata formats, and prefer something relatively simple (like a paper form) if the archive will be manually entering the metadata. • Archive this file along with the rest of your resources.
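
As a sketch of the "XML file (better)" option, the code below writes a simple metadata record with Python's standard library. The element names are borrowed from the earlier archive example; this is plain XML, not a validated OLAC or IMDI record, and the choice of fields is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Values reused from the archive record shown earlier in the presentation.
metadata = {
    "identifier": "AS_5.2.1978_audio",
    "contributor": "Alice Spear",
    "description": "“Crane Boy” narrative told in Potawatomi and in English",
    "coverage": "Mayetta, Kansas",
    "rights": "Some restrictions; contact field linguist",
}

# Build one <record> element with a child element per metadata field.
record = ET.Element("record")
for name, value in metadata.items():
    ET.SubElement(record, name).text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

To save the record as a file, `ET.ElementTree(record).write(path, encoding="utf-8", xml_declaration=True)` writes it with an XML declaration.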

  35. Make your resources accessible • Archive, archive, archive! (Not just on your own, or your departmental server. Archives are committed to the long-term preservation and availability of your resources.) • Before you leave to do fieldwork, or when you are writing your grant, establish contact with the archive where you intend to deposit your resources • Archivists will • give you guidelines for creating archival files • help you select the best metadata set • give you information about setting access levels • When you return, the first thing to do is send your files, along with the metadata and markup descriptions to the archive • Most archives will then give you an ID number for your resources that you can then cite in your publications

  36. A Community Responsibility • Best practice involves what individual field linguists do, but also how we collectively use and care for these resources • This broader community involves • Other researchers like yourself who create resources • A growing set of interconnected digital language archives that care for, protect, and disseminate your resources • People who develop tools and services to make your resources locatable, searchable, and reusable • Others: linguistics organizations, organizations like OLAC and DELAMAN, funding agencies who promote the work of this community

  37. Unicode • Debbie Anderson “A field linguists’ guide to Unicode” • Michael Appleby “How to use Unicode on your computer”

  38. Field Case Studies: Texts and Databases • Jessica Boynton • “Transcription, Time-Alignment and Annotation” • Naomi Fox • “Using Filemaker Pro to produce archivable language documentation” • Connie Dickinson • “The Tsafiki Text Factory”

  39. Panel Session • Talks are 25 minutes, consecutive. • Please remember or write down your questions! • We will field them in a panel session after the talks.
