Shoebox – Starting out and lexical management
Shoebox / Toolbox • What is it? Shoebox is a data management program for language data. It is not a text editor, but nor is it a database management system in the sense usually understood (it is not an implementation of a relational database system). • Where do I get it? Shoebox: http://www.sil.org/computing/shoebox/index.html (cost: US$19.95). Toolbox: http://www.sil.org/computing/toolbox/ (freeware).
Shoebox and Toolbox • Which program / version should I use? If you use a Windows PC, then you should certainly use Toolbox – it has features which are not in Shoebox, such as better Unicode compliance, better XML export and (best of all) support for scrolling with a mouse. • If you work on a Mac, you don’t have a choice – Shoebox works on Mac, but Toolbox doesn’t (except under VirtualPC). And Shoebox runs under OS9. • The application is not officially available for Linux (or UNIX). But Shoebox and Toolbox will actually run under Linux with an add-on (ask Baden).
Why Shoebox? • Advantages • Good functionality • Choice of output possibilities • Portability – simple file formats • Drawbacks • Data input not always easy (but no other application is better!) • Manuals are not always easy to use • The way interlinear text is stored means you need to revise often • Everything is text, only weak data typing is possible
Installing • When you install Shoebox/Toolbox (versions later than 4), the installation creates a folder called “My Shoebox Settings” on your C: drive. • By default, all Shoebox files will be stored here, but you can easily change the setting. • But you have to be careful when moving things! • Look at the sample files for help.
Basic concepts • Project: A project is the work unit in Shoebox. A project file (.prj) is a shell file which holds information about what files are included in the project and what their properties are. • Database type: When you set up a project, you have to define the type of the database files which you want to use in that project – that is, you must specify what fields are included in the file and what the properties of the fields are. (Shortly, we will work through setting up a lexicon database.) • Language encoding: A crucial property which you have to set for each field in a database is the language encoding which will be used. A language encoding includes information about: • List of characters • Case pairs • Sort order • Onscreen presentation • Variables
Language Encoding - 1 • The Help files contain lots of useful information on how to set the properties of a Language Encoding, but nothing about how to choose the characters which you want to use! • You can do this in two ways: • Create a new encoding file, then use the Language Encoding dialogues to work through all the settings you need. (tricky) • Create a new encoding file, then open it in a text editor and manipulate it there. (easier)
Language Encoding - 2 • Case pairs – you have to tell the program which pairs of symbols to treat as alphabetically equivalent, e.g. A = a for sorting and for parsing. • Sort order – you have to tell the program what order you want the alphabet to be in for sorting, e.g. if you use glottal stop, where should it be in the alphabet? • Fonts – you can specify the on-screen characteristics of each language encoding which you use. This is useful to make the screen easier to read. • You can also specify screen characteristics for fields when you define a database type, overriding or modifying the language settings • Neither of these options affects the presentation of data when you export to the Multi Dictionary Formatter (MDF) – MDF uses its own font settings regardless.
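To make the case-pair and sort-order ideas concrete, here is a minimal sketch in Python (not in Shoebox's own settings format). The alphabet is purely hypothetical, including the decision to place the glottal stop (ʔ) after k. The sketch handles single characters only, not digraphs.

```python
# Hypothetical alphabet for illustration: the glottal stop is placed after "k".
# Each entry is a case pair written as upper + lower; "ʔ" has no upper-case form.
SORT_ORDER = ["Aa", "Ee", "Ii", "Kk", "ʔ", "Mm", "Nn", "Oo", "Tt", "Uu"]

# Map every character to its rank; both members of a case pair share one rank.
RANK = {ch: i for i, group in enumerate(SORT_ORDER) for ch in group}


def sort_key(word):
    """Return a key that sorts by the custom alphabet, ignoring case."""
    # Characters that are not in the alphabet sort after everything else.
    return [RANK.get(ch, len(SORT_ORDER)) for ch in word]


words = ["tamu", "ʔama", "Kota", "ana", "mata", "kita"]
print(sorted(words, key=sort_key))
```

Because "K" and "k" share a rank, "Kota" sorts among the k-words rather than before all lower-case items, and the glottal-stop word lands between the k-words and the m-words.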
Language Encoding - 3 • Variables – you have to specify which characters will be included in which sets of variables. The default groupings which are set in the program are: • Everything • Lower case • Upper case • Vowels • Consonants • Nasals • Punctuation • Digits • These variable definitions are used for wildcards in searches, and for specifying some morphological processes in parsing.
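As a rough illustration of how variable sets can drive wildcard searches, the Python sketch below turns a pattern made of variable names into a regular expression. The variable names (V, C, N), their memberships and the pattern notation are all invented for this sketch; they are not Shoebox's own search syntax.

```python
import re

# Hypothetical variable sets; in practice these come from the language encoding.
VARIABLES = {
    "V": "aeiou",    # vowels
    "C": "ptkmnsl",  # consonants
    "N": "mn",       # nasals
}


def pattern_to_regex(pattern):
    """Translate a pattern such as 'CVN' into a compiled regular expression.

    A character that names a variable stands for any member of that set;
    every other character matches itself literally.
    """
    parts = []
    for ch in pattern:
        if ch in VARIABLES:
            parts.append("[" + re.escape(VARIABLES[ch]) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


words = ["tan", "tam", "tak", "sun", "ta"]
cvn = pattern_to_regex("CVN")
print([w for w in words if cvn.fullmatch(w)])
```

Here "tak" is rejected because k is not in the nasal set, and "ta" because it is too short for the pattern.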
Language encoding and data input • Inputting non-ASCII characters is a problem! • One solution is to use a keyboard mapping utility – Tavultesoft Keyman is recommended for Shoebox • One option available in a language encoding is to associate a keyboard mapping with that language • Keyboard definitions are available for Keyman, but if what you want doesn’t exist, you have to make a definition yourself • An alternative, assuming you are using Unicode, is to use UniPad • A Unicode text editor • Keyboards can be made by dragging and dropping • Keyboards are both hard (you type) and soft (click on display on screen)
Database types - 1 • Relational database (e.g. Access) • One field (or a combination of fields) must have data and function as unique identifier • Every field specified in the definition occurs in every record • Every field specified in the definition occurs only once in each record • Non-relational database (Shoebox) • One field specified in the definition must occur in every record as unique identifier – the record marker • Other fields can occur many times in each record
Database types – Markers 1 • Shoebox database files are a special sort of text file: Standard Format Marker files • A new record starts with the occurrence of a record marker field • Each field has the structure: • Marker – ‘\’ character + identifying string • Text content – whatever is stored in the field • Return – indicates end of field • NB – database definitions and language encodings are also SFM files
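Because the format is this simple, a Standard Format file can be read with a few lines of code. The following Python sketch splits text into records at each occurrence of the record marker. The sample data is invented, and treating a line without a leading backslash as a continuation of the previous field is an assumption of the sketch.

```python
def parse_sfm(text, record_marker="lx"):
    """Split Standard Format text into records.

    Each record is a list of (marker, content) pairs, so a marker may occur
    more than once in a record. Anything before the first record marker is
    skipped.
    """
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("\\"):
            # Marker is everything up to the first space; the rest is content.
            marker, _, content = line[1:].partition(" ")
            if marker == record_marker:
                current = []
                records.append(current)
            if current is not None:
                current.append((marker, content.strip()))
        elif current and line.strip():
            # Assumed continuation line: append it to the preceding field.
            marker, content = current[-1]
            current[-1] = (marker, (content + " " + line.strip()).strip())
    return records


# Invented sample data with two records.
SAMPLE = """\\lx kita
\\ps n
\\ge tree
\\ge wood
\\lx mata
\\ps n
\\ge eye
"""

records = parse_sfm(SAMPLE)
print(len(records))
print(records[0])
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves repeated fields and their order, which matters for the hierarchies discussed later.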
Database types – markers 2 • When you define a database type, you define a set of markers • First you must define the record marker • For a lexicon, the head word is a good choice, as this will provide the default sort order for the file • For each marker and its associated field, you can specify various properties.
Marker properties - 1 • Marker – from standard list for MDF, or mnemonic • Name – should be unambiguous, relates to marker • Hierarchy – more to follow on this • Following field – useful if one field will always occur with another one • Language encoding – ensures that needed characters are available for that field • Description – important documentation for other users (or you in a few years!) • Font – you can allow the default font settings which go with the language encoding, or you can override them
Marker properties - 2 • Although Shoebox doesn’t permit any strong data typing, you can do a little bit to make things more secure • You can specify that a field cannot be empty (other than the record marker, which must have data anyway) • You can specify that a field will not contain spaces • Range set – you can specify that a field will only contain one of a set of specified values, useful for e.g. part of speech, semantic domains
Database types – dates • Date stamping – you can include a date field (usually \dt) in your database and enable automatic date stamping • Date stamping happens on insertion of a record and then again whenever a record is edited – if you want to preserve the information about when you first entered a record, this has to be done manually • You have to create a date field before you can enable date stamping
Hierarchies • Hierarchies are used to create structure within records • This feature is especially valuable in a lexicon file which has sub-entries with multiple part of speech and gloss information • Hierarchies are defined for each field in the Markers window of the Database Type dialogue • The predefined MDF_4.0 database type has a complex hierarchy included • A properly defined hierarchy ensures that all relevant information is retrieved in sorts and filters i.e. glosses for all sub-entries rather than just the first gloss entry
Other database properties • There are plenty of other features which can be set for a database • Many of these are not so relevant to lexica – interlinear, jump path etc. • We will return to some of these this afternoon
MDF fields • The full definition of the MDF_4 database type has 103 fields specified • It is unlikely that you will want to use all of these! • There are three possible approaches: • Use the preset and just don’t bother about the fields you don’t use • Eliminate fields from the preset until you have what you want • Create a new database definition from scratch • We’ll work through option 1 here
Entries in the dictionary • The record marker for an MDF file is \lx – the lexeme • This can be a morpheme smaller than a word • Other forms can be included: • A citation form \lc • A phonetic form \ph • Alternative forms: to be listed in a dictionary, these are entered under \va; for interlinear use they are typically entered under \a, which is not a defined field in MDF_4
Sub-entries, sense numbers and homonyms • Homonyms should be used where forms are identical but there is no semantic relationship • Homonyms are identified only by a number in the field \hm • Sub-entries should be used where a word or phrase is derived from the root • Sub-entries are identified by numbers in the field \se • Where a form has multiple senses within the same part of speech, the senses are identified by a number in the field \sn
The hierarchy in entries • The hierarchical structure of entries set up by the various markers is:
Head item
  homonym 1
    pos1
      sense1
      sense2
    pos2
      sense1
    subentry1
      pos1
        sense1
    subentry2
      pos1
        sense1
  homonym 2
    pos1
      sense1
      sense2
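The nesting can be sketched in Python by giving each marker a depth and attaching every field to the most recent field that sits above it. The depth table below is a simplified assumption covering only five markers; it is not the actual MDF_4 hierarchy definition, and the entry is invented.

```python
# Assumed, simplified depths: a field nests under the nearest earlier field
# with a smaller depth. The real MDF_4 hierarchy has many more markers.
LEVEL = {"lx": 0, "se": 1, "ps": 2, "sn": 3, "ge": 4}


def nest(fields):
    """Turn a flat list of (marker, content) pairs into a tree of dicts."""
    root = {"marker": None, "content": None, "children": []}
    stack = [(-1, root)]  # (level, node) for the currently open ancestors
    for marker, content in fields:
        level = LEVEL[marker]
        # Close every open node that is not strictly above this field.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"marker": marker, "content": content, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def show(nodes, depth=0):
    """Return the tree as a list of indented lines."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + "\\" + node["marker"] + " " + node["content"])
        lines.extend(show(node["children"], depth + 1))
    return lines


# Invented entry: two parts of speech on the head word, then one sub-entry.
entry = [
    ("lx", "kita"),
    ("ps", "n"), ("sn", "1"), ("ge", "tree"), ("sn", "2"), ("ge", "wood"),
    ("ps", "v"), ("sn", "1"), ("ge", "climb"),
    ("se", "kita-kita"), ("ps", "n"), ("sn", "1"), ("ge", "forest"),
]

tree = nest(entry)
print("\n".join(show(tree)))
```

With the structure made explicit, a query such as "all glosses for this entry" can walk the whole tree instead of stopping at the first \ge field.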
Word classes • As we just saw, word classes are very important in the hierarchical structure • The field used for this information is \ps • A field is also available for word class names in a second language \pn • The MDF format assumes that you will work with three (or four) languages: • A vernacular language (the object language) • A national language • An international language (probably English) • (a regional language can also be used) • The MDF_4 file recommends use of range sets for these word class fields – this is unrealistic at early stages, because you have to know a lot about a language before you can be confident about listing word classes exhaustively
Glosses and definitions • Single word glosses for use in interlinears can be entered in English (\ge) and the national language (\gn) (\gr is also available) • More extended definitions can be entered in English (\de), national language (\dn) and the vernacular (\dv) (\dr is also available) • Encyclopedic information can be entered in all three (four) languages: \ee, \en, \ev, (\er)
Semantic information • MDF_4 offers a semantic domain field (\sd - English) and also a thesaurus field (\th - vernacular) • For both, use of a range set is recommended, but again this is unrealistic in the early stages of research • It is better to allow categories to be added freely until a good picture is obtained of the semantic domains needed, then move to restricting the possible entries
Examples • MDF_4 has five fields for including example phrases or sentences • \rf – to provide a reference to the example • \xv – vernacular text (i.e. the actual example) • \xe – English text, a free translation • \xn – national language text, a free translation • \xr – regional language text • As Shoebox is a non-relational database, it is possible to use each of these fields several times in one record – you can include as many examples as you like for each entry • There is a hierarchy here: • \rf is under sense number, and allows you to give a reference for each example • \xv is under \rf and over the other \x.. fields, ensuring that the translations for each example stay together
Notes • The MDF_4 definition specifies many fields for notes • All the fields which are defined will be exported in the MDF process • So maybe more important than the distinctions allowed in the preset is a distinction between information which will appear in the dictionary, and information which is for your use only • I recommend creating a notes field which isn’t part of the MDF presets! • \so – a field for the source of the data
Miscellaneous • \bw – borrowed word, for entering the source language • \cf – cross-reference, plus fields for glosses for the referenced item • \mr – morphology, for showing the internal structure of morphologically complex items (note that this may not be desirable for interlinear glossing!) • Various reversal fields – used in making finder lists, can be used if you don’t want the given gloss to be the reversal of an entry
Housekeeping • Date stamping is very valuable – for example it can be useful to be able to sort or filter entries by date • But as noted before, if you want to keep track of both the date of insertion of a record and the date of last edit, you will need two fields and you will have to manually enter the date in the first one • MDF_4 also allows a status field (\st) which is very useful for tracking whether an entry is complete and fully checked, whether it is in the last printed version of a dictionary etc.
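The two-field approach can be sketched as follows in Python, for data processed outside Shoebox. The creation-date marker \dc is hypothetical (choose any marker not already in use), and the day/month-abbreviation/year date format is an assumption that should be checked against your own files.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def stamp(d):
    """Format a date as dd/Mon/yyyy (assumed format)."""
    return f"{d.day:02d}/{MONTHS[d.month - 1]}/{d.year}"


def touch(record, today, created_marker="dc", edited_marker="dt"):
    """Return a copy of `record` with its date fields updated for an edit.

    The edit date is always replaced; the creation date (hypothetical
    marker) is added only if the record does not have one yet.
    """
    out = [(m, c) for m, c in record if m != edited_marker]
    if not any(m == created_marker for m, _ in out):
        out.append((created_marker, stamp(today)))
    out.append((edited_marker, stamp(today)))
    return out


new_record = touch([("lx", "kita")], date(2004, 3, 5))
edited_record = touch(new_record, date(2005, 1, 10))
print(edited_record)
```

After the second call the creation date still reads 05/Mar/2004 while the edit date has moved on, which is the distinction that automatic date stamping alone cannot preserve.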
Other stuff • Obviously we have only looked at a few of the fields which are defined in MDF_4 • It is worth looking through the entire list to see what could be relevant to your needs • Reversal fields are certainly worth investigating • But it is also worth remembering that you can achieve a lot with a reasonably small number of fields
Range sets • Range sets, as previously mentioned, are used to limit the values which can appear in a field • Often, it is not possible to specify a set of values when you start work on a language • When you have some data, Shoebox can automatically create a set of values for you from what is already entered • You must remember to check the “Use a Range Set” box in the Marker properties section of the Database Type dialogue
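The idea of deriving a range set from existing data can be sketched in Python as below. This is an illustration of the principle, not Shoebox's own procedure, and the records are invented. Counting the values as well as listing them makes stray variants easy to spot before the set is fixed.

```python
from collections import Counter


def collect_values(records, marker):
    """Count every non-empty value that occurs in `marker` across all records."""
    return Counter(
        content
        for record in records
        for m, content in record
        if m == marker and content
    )


# Invented records, each a list of (marker, content) pairs.
records = [
    [("lx", "kita"), ("ps", "n")],
    [("lx", "mata"), ("ps", "n")],
    [("lx", "tamu"), ("ps", "v")],
    [("lx", "ana"), ("ps", "N")],  # inconsistent capitalisation
    [("lx", "sun"), ("ps", "")],   # not yet filled in
]

counts = collect_values(records, "ps")
range_set = sorted(counts)
print(range_set)
for value, n in counts.most_common():
    print(value, n)
```

The single occurrence of "N" alongside two of "n" suggests a typing inconsistency to clean up before the values are adopted as the range set.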
Consistency checks • Shoebox can perform some checking of data for you automatically. This happens: • If you choose Consistency Check from the Tools menu • If you specify that data should be checked in an export process • When you move to a new record if you have Check Consistency When Editing enabled on the Tools menu • In any of these cases, the program will check: • That data matches any Data Property settings • That data matches any Range Set settings • That Jump Path destinations are valid links • It is valuable to constrain data as much as possible and to reduce the possibility for entering invalid data
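The first two kinds of check can be approximated in a short Python sketch, which may help when thinking about which constraints to set. The function and parameter names are invented for illustration, the record is made up, and Jump Path checking is not covered.

```python
def check_record(record, range_sets=None, not_empty=(), no_spaces=()):
    """Return a list of human-readable problems found in one record.

    `record` is a list of (marker, content) pairs. `range_sets` maps a
    marker to its set of permitted values.
    """
    range_sets = range_sets or {}
    problems = []
    for marker, content in record:
        if not content:
            if marker in not_empty:
                problems.append(f"\\{marker}: field is empty")
            continue  # nothing further to check on an empty field
        if marker in no_spaces and " " in content:
            problems.append(f"\\{marker}: contains a space")
        if marker in range_sets and content not in range_sets[marker]:
            problems.append(f"\\{marker}: '{content}' is not in the range set")
    return problems


# Invented record with three deliberate problems.
record = [("lx", "kita"), ("ps", "nn"), ("ge", "big tree"), ("sd", "")]
problems = check_record(
    record,
    range_sets={"ps": {"n", "v"}},
    not_empty={"sd"},
    no_spaces={"ge"},
)
for p in problems:
    print(p)
```

Returning a list of messages rather than stopping at the first problem mirrors the way a consistency check is usually worked through: every issue in a record is reported together.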
Export processes • The most important export process when working with a lexicon is the Multi Dictionary Formatter (MDF) • This powerful facility creates fully formatted dictionaries and finder lists from your lexicon file • The results are Rich Text Format files (.rtf) which can be opened and manipulated in most word processing packages (such as Word)
MDF basics • You can choose to export your data to a bilingual dictionary or a trilingual dictionary • If bilingual, you can choose whether the second language is English or the relevant national language • If trilingual, English and the national language are used • (Regional language apparently vanishes at this point)
Other options • Data can be filtered (i.e. only entries which correspond to some criteria are included) • Fields can be excluded • Some formatting can be controlled: • Header and footer material • Whether the total number of entries is printed • Output file can be .rtf or web pages (HTML)
Other export possibilities • You can export all your data as a document in .rtf format, or a text format which Shoebox describes as ‘standard’ • In these exports, you can export the records in the current window, or all records • You can define other export processes for yourself – if you feel brave!
Lexique Pro • Lexique Pro is a freeware tool distributed by SIL via www.lexiquepro.com • It is intended to produce versions of lexica for distribution to people who are not Shoebox users • The program makes a version which is well-formatted for on-screen viewing • It also makes an executable file (.exe) to distribute the lexicon to other people – this will install a run-time version of Lexique Pro and the database extracted from your lexicon onto another person’s computer • You can also export your lexicon as web pages