Databases for Renaissance and Early Modern Sources Session Tutor: Sarah Richardson sarah.richardson@warwick.ac.uk
Using Databases Databases may be used in a number of ways to support your research. Bibliography (see later sessions) For simple lists To analyse complex sources
Overview • Source assessment and data-modelling • The challenge of sources • How will relational databases help? • Source analysis • Database design and creation • Free text databases • Methodological issues
Challenges • Unstructured source material • Missing data • Complications with numbers and dates • Data comes from more than one source
Databases should look like this? • Unique identifier or primary key • Column or field or attribute • Row or record • Field name or attribute name
But what do you do with this? Letter from the Medici Granducal Archive
From Source to Database Frankpledge: Original source from The National Archive translated to the Thame Database
How will relational databases help? • A relational database is a database created with many tables linked together • Each table has a common factor which links it to others in the database • For complex sources a number of tables may be created to deal with different aspects of the data
Relational model • Offences Table: Defendant ID, Case Number, Offence Type, Place of Offence, Date of Offence, Description, Comments • Defendant Table: Defendant ID, First name, Surname, Address, Age, Sex, Occupation Title, Comments • Sentence Table: Defendant ID, Case Number, Verdict, Sentence, Comments • Witnesses Table: Case Number, Witness 1 First name, Witness 1 Surname, Witness 1 Address, Witness 1 Sex, Witness 2 First name, Witness 2 Surname, Witness 2 Address, Witness 2 Sex, Comments • Occupational Categorisation Table: Occupation Title, Occupational Categorisation 1, Occupational Categorisation 2
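A minimal sketch of how two of these tables might be linked, using Python's built-in sqlite3 module. The field names follow the slide; the snake_case column names and the sample rows are invented for illustration only and do not come from any real source.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE defendant (
    defendant_id     INTEGER PRIMARY KEY,   -- unique identifier
    first_name       TEXT,
    surname          TEXT,
    occupation_title TEXT
);
CREATE TABLE offences (
    defendant_id     INTEGER REFERENCES defendant(defendant_id),  -- the common factor
    case_number      INTEGER,
    offence_type     TEXT,
    place_of_offence TEXT
);
""")

# Invented sample rows (assumption: not taken from any archive).
conn.executemany(
    "INSERT INTO defendant VALUES (?, ?, ?, ?)",
    [(1, "John", "Ayres", "labourer"), (2, "Mary", "Morris", "spinster")],
)
conn.executemany(
    "INSERT INTO offences VALUES (?, ?, ?, ?)",
    [(1, 101, "theft", "Thame"), (1, 102, "assault", "Thame"), (2, 103, "theft", "Oxford")],
)

# Join the two tables on the shared Defendant ID.
rows = conn.execute("""
    SELECT d.surname, o.case_number, o.offence_type
    FROM defendant AS d
    JOIN offences AS o ON o.defendant_id = d.defendant_id
    ORDER BY o.case_number
""").fetchall()
print(rows)
```

Each defendant is stored once, however many offences are recorded against them; the join reassembles the full picture when it is needed.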
Source analysis • Data should be broken down into components that group information by object or event. • For example, information relating to a person, an organisation, a document, an object or a building, or to events such as a marriage, a transaction, the making of a will, or an election. • In database terminology these are referred to as entities. • Each entity will form a table in the final database.
Attributes • Once each entity has been identified, list the data associated with each. • For example, the Defendant table has information on the first name, surname, address, age, sex and occupation of each defendant. • This information will produce the fields for each table. • The fields are also known as attributes.
Issues for field types • Size • Calculations • Dates • Currency • Unstructured data • Unique identifiers
Relationships • One-to-one relationships: each record in one table matches only one record in a second table. • One-to-many relationships: a record in the first table may match many in the second, but each record in the second table has only one match in the first. • Many-to-many relationships: a record in either table may match many records in the other.
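A many-to-many relationship is usually handled with a third, linking table. The sketch below assumes a design in which witnesses are stored once and linked to cases; this is an alternative to the "Witness 1 / Witness 2" columns shown earlier, not the design used in the slides. Table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE witness    (witness_id  INTEGER PRIMARY KEY, surname TEXT);
CREATE TABLE court_case (case_number INTEGER PRIMARY KEY, offence_type TEXT);
-- Linking table: one row per (case, witness) pairing.
CREATE TABLE case_witness (
    case_number INTEGER REFERENCES court_case(case_number),
    witness_id  INTEGER REFERENCES witness(witness_id),
    PRIMARY KEY (case_number, witness_id)
);
INSERT INTO witness      VALUES (1, 'Porter'), (2, 'Partridge');
INSERT INTO court_case   VALUES (101, 'theft'), (102, 'assault');
INSERT INTO case_witness VALUES (101, 1), (101, 2), (102, 1);
""")

def witnesses_for(case_number):
    """Surnames of all witnesses in one case (one case, many witnesses)."""
    rows = conn.execute(
        "SELECT w.surname FROM witness AS w "
        "JOIN case_witness AS cw ON cw.witness_id = w.witness_id "
        "WHERE cw.case_number = ? ORDER BY w.surname",
        (case_number,),
    )
    return [surname for (surname,) in rows]

def cases_for(witness_id):
    """Case numbers in which one witness appears (one witness, many cases)."""
    rows = conn.execute(
        "SELECT case_number FROM case_witness WHERE witness_id = ? ORDER BY case_number",
        (witness_id,),
    )
    return [number for (number,) in rows]

print(witnesses_for(101), cases_for(1))
```

The linking table removes the limit of two witnesses per case and avoids blank "Witness 2" fields where only one witness is recorded.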
Data entry tips • Fields may be designated as ‘required’. • Default values may be entered. • Use the tool to allow one of only two options to be entered such as Yes/No, True/False, Guilty/Not Guilty. • ‘Look-up’ tables: a fixed list of values that may be entered into a particular field. • Validation rules. • Automatic generation of unique numbers.
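Most of these tips correspond to constraints that the database itself can enforce. A rough sketch in SQLite via Python follows; the verdict_lookup table, the default text and the validation rule on case numbers are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces look-ups when this is on
conn.executescript("""
-- Look-up table: the fixed list of values allowed in sentence.verdict.
CREATE TABLE verdict_lookup (verdict TEXT PRIMARY KEY);
INSERT INTO verdict_lookup VALUES ('Guilty'), ('Not Guilty');

CREATE TABLE sentence (
    sentence_id INTEGER PRIMARY KEY,                         -- unique number generated automatically
    case_number INTEGER NOT NULL CHECK (case_number > 0),    -- required field plus validation rule
    verdict     TEXT NOT NULL REFERENCES verdict_lookup(verdict),  -- one of only two options
    sentence    TEXT DEFAULT 'not recorded'                  -- default value
);
""")

def try_insert(case_number, verdict):
    """Attempt one insert and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sentence (case_number, verdict) VALUES (?, ?)",
                (case_number, verdict),
            )
        return "accepted"
    except sqlite3.IntegrityError as error:
        return f"rejected: {error}"

results = [
    try_insert(101, "Guilty"),   # valid
    try_insert(None, "Guilty"),  # required field missing
    try_insert(102, "Maybe"),    # value not in the look-up table
    try_insert(-5, "Guilty"),    # fails the validation rule
]
stored = conn.execute("SELECT sentence_id, sentence FROM sentence").fetchall()
print(results, stored)
```

Only the first insert is stored; it receives an automatically generated ID and the default sentence text.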
Free Text Databases • Free text databases search unstructured texts and images provided in digital form. • They work by ‘tagging’ the text in a mark-up language (e.g. HTML, XML, SGML). In the past users had to do this themselves; now most programs will do it for you. • The database may then be searched in a number of ways: full-text searches; wildcard searches with * and ?; Boolean searches (AND, OR and NOT); proximity searches; numeric searches (>, <, >=, <=, <>); date searches; fuzzy searches
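To make three of these search types concrete, here is a small sketch in plain Python. It is not how any particular free text package works internally, and the sample sentence is invented for illustration.

```python
import fnmatch
import re

# Invented sample text (assumption: not a quotation from any source).
TEXT = ("John Smyth of Thame, labourer, was indicted for the theft "
        "of a silver spoon from William Smith.")

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def wildcard(tokens, pattern):
    """Wildcard search: * matches any run of characters, ? exactly one."""
    return sorted({t for t in tokens if fnmatch.fnmatchcase(t, pattern.lower())})

def near(tokens, first, second, distance):
    """Proximity search: are the two words within `distance` words of each other?"""
    positions_first = [i for i, t in enumerate(tokens) if t == first]
    positions_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= distance for i in positions_first for j in positions_second)

tokens = tokenize(TEXT)
vocabulary = set(tokens)

spelling_variants = wildcard(tokens, "sm?th")
# Boolean search: theft AND spoon AND NOT horse
boolean_hit = "theft" in vocabulary and "spoon" in vocabulary and "horse" not in vocabulary
close_together = near(tokens, "theft", "spoon", 4)
print(spelling_variants, boolean_hit, close_together)
```

The wildcard pattern sm?th picks up both spellings of the surname, which is one reason wildcard searching is useful for early modern texts.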
Zotero Zotero is an easy-to-use yet powerful research tool that helps you gather, organize, and analyze sources (citations, full texts, web pages, images, and other objects), and lets you share the results of your research in a variety of ways. It stores author, title, and publication fields and exports that information as formatted references. It also has the ability to interact, tag, and search in advanced ways. http://www.zotero.org/
For anyone who writes with footnotes, Zotero is a fabulous tool. With a click of a mouse, it imports catalogue records from a library database, or JSTOR, or even Amazon, allowing a scholar to create a personal reference database on his desktop. Better still, it permits extensive annotations, keyword tagging, and hyperlinks both to other items in the database and to external materials. Some users know that it can catalogue images, too, pulling metadata from Flickr. If you already run Zotero and need to work with images, try it. The possibilities are mind-bending for those of us who work with visual resources. http://www.zotero.org/
Old Bailey Online http://www.oldbaileyonline.org/
Methodological Issues • Nominal record linkage • Coding • Occupational analysis • Prosopography • Community reconstruction
Nominal Record Linkage • Concerns all historians using data containing names • How do we determine that sources relate to the same person and not another person with the same name? • Particularly difficult for early modern sources where names are not fixed. • Two problems: • The existence of multiple common names. This problem is particularly acute in local communities where certain surnames are dominant. • Variation in spellings.
Solutions • Coding surnames using standardisation schemes, eg SOUNDEX or FISKs • Using multiple passes through the data changing variables each time as the data is matched • Using a combination of computer and manual techniques
SOUNDEX rules • Names with Double Letters: if the surname has any double letters, they should be treated as one letter. • Names with Letters Side-by-Side that have the Same SOUNDEX Code Number: these should be treated as one letter. For example, Jackson or Schmidt. • Names with Prefixes: names with a prefix such as Van or De should be coded twice, with and without the prefix. • Consonant Separators: if a vowel (A, E, I, O, U) separates two consonants that have the same SOUNDEX code, the consonant to the right of the vowel is coded.
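The rules above can be sketched in a few lines of Python. This is an illustrative implementation of standard American Soundex, not a reference one. It assumes the usual letter-to-digit table and the convention that H and W do not separate same-coded consonants, neither of which is reproduced on the slide. Prefix handling (coding with and without Van or De) is left out.

```python
# Standard Soundex digit groups (assumed; not listed on the slide).
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit

def soundex(surname):
    """Return the four-character Soundex code: initial letter plus three digits."""
    letters = [c for c in surname.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W are ignored and do not act as separators
        digit = CODES.get(letter, "")  # vowels and Y have no digit
        if digit and digit != previous:
            code += digit  # double and side-by-side same-coded letters are coded once
        previous = digit   # a vowel resets this, so the next consonant is coded
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros

print(soundex("Jackson"), soundex("Schmidt"), soundex("Eyres"), soundex("Ayres"))
```

Jackson and Schmidt show the side-by-side rule at work. Eyres and Ayres receive different codes because Soundex always keeps the first letter, which is one reason it copes poorly with early modern spelling variants.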
Problems with SOUNDEX • Does not work so well for European names. Works best with names of English origin • Does not work as well with early modern names and spelling variants • One solution for early modern historians is FISK
Four Letter Initial Surname Codes (FISK) • Consists of letters and punctuation marks • Generated from first letter of a surname variant plus up to three further consonants from the surname. • Vowels only used when they are the first letter of the surname • A full stop is used where no second, third or fourth letter is available for use.
If surname variants are deduced to be of the same surname base, these names are considered to form a distinct surname group and the same FISK is allocated • Thus: Eyres is coded as ARS. (group Ayres); Morrice is coded as MRS. (group Morris) • Bowyer is coded with Boyer and Springall with Springold. • Davies and Davidson are placed in one group. ap Howell is included in the group Powell
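A rough sketch of how a four-letter FISK might be generated once the historian has assigned each variant to a surname group. Two details are assumptions inferred from the examples on these slides rather than stated rules: doubled letters are collapsed, and Y counts as a consonant only when it is the last letter of the name. The SURNAME_GROUPS table is a hypothetical stand-in for the historian's own judgement.

```python
VOWELS = set("AEIOU")

# Hypothetical look-up of variant -> surname group, decided by the historian.
SURNAME_GROUPS = {"Eyres": "Ayres", "Morrice": "Morris"}

def fisk(base_surname):
    """Four-character code: first letter plus up to three further consonants."""
    letters = [c for c in base_surname.upper() if c.isalpha()]
    if not letters:
        raise ValueError("surname contains no letters")
    # Assumption: doubled letters count once (Morris -> MRS., not MRRS).
    collapsed = [letters[0]]
    for letter in letters[1:]:
        if letter != collapsed[-1]:
            collapsed.append(letter)
    code = [collapsed[0]]  # a vowel is kept only when it is the first letter
    last = len(collapsed) - 1
    for index in range(1, len(collapsed)):
        letter = collapsed[index]
        if letter in VOWELS:
            continue
        if letter == "Y" and index != last:
            continue  # assumption: Y is a consonant only at the end of the name
        code.append(letter)
        if len(code) == 4:
            break
    return "".join(code).ljust(4, ".")  # full stops fill any unused positions

def fisk_for_variant(variant):
    """Code a variant via its surname group, so all variants share one FISK."""
    return fisk(SURNAME_GROUPS.get(variant, variant))

print(fisk_for_variant("Eyres"), fisk_for_variant("Morrice"), fisk("Porter"), fisk("Bloy"))
```

The grouping step cannot be automated in this way: deciding that Eyres belongs with Ayres, or ap Howell with Powell, remains a matter of historical judgement.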
Five letter FISKs • Used to differentiate between similar but distinct surname groups. • The fifth letter would normally be a distinctive letter from the end of the surname, but any letter could be used, and often a vowel from the start of the surname would be convenient. • To distinguish Partridge from Porter (FISK = PRTR) an additional letter g is added to make the new FISK for Partridge (PRTRG). The code for Porter remains as (PRTR). • To distinguish Bailey from Bloy (FISK = BLY.) an additional letter y is added to make the new FISK for Bailey (BLY.Y). The code for Bloy remains as (BLY.).
Coding • Used to be necessary because databases could not handle large amounts of text • Historians still code because: • data entry may be speeded up by using simple codes, e.g. ‘M’ for married, ‘U’ for unmarried and ‘W’ for widowed, although complicated coding may slow data entry down • coding is a form of close assessment of the data and may lead to the development of categories for ease of analysis • it may facilitate the process of record linkage
Deciding to code • Should coding take place before or after data entry? • Should codes be letters or numbers? Numeric codes tend to produce a higher rate of data-entry error. • Coding schemes should be designed in the light of other classification systems used by historians. • A full code book should be developed as part of the documentation to accompany the database.
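A code book can be kept in machine-readable form so that the same table both documents the codes and checks data entry. A minimal sketch, using the marital status codes mentioned earlier; the function name and the error message are illustrative choices.

```python
# Code book for one field; the letter codes are the ones given in the slides.
MARITAL_STATUS_CODES = {"M": "married", "U": "unmarried", "W": "widowed"}

def decode(code, code_book):
    """Expand a code to its full meaning, refusing anything not in the code book."""
    key = code.strip().upper()
    if key not in code_book:
        raise ValueError(f"unknown code {code!r}: correct the entry or extend the code book")
    return code_book[key]

entered = ["M", "w", " U "]
decoded = [decode(code, MARITAL_STATUS_CODES) for code in entered]
print(decoded)
```

Raising an error on an unknown code, rather than guessing, means that every category in the database is one that appears in the documentation.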
Occupational analysis • A form of post-coding • Assists in analysing fields with numerous values • The most common type is categorisation of occupational information. • Results must be comparable with other research in the field and provide as complete a picture as possible of the status and occupation of the population
Coding schemes • Modern historians use standardised occupational classification systems • Early modern historians often each devise their own schema • A compromise is to use a multi-dimensional approach: each occupation is classified using several different methods. Occasionally individual occupational titles may be isolated where any categorisation would destroy the nuances of work experiences.
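The multi-dimensional approach maps naturally onto a look-up table with one column per scheme, like the Occupational Categorisation Table in the relational model. In the sketch below the occupations and both sets of categories are invented for illustration and do not represent any published classification.

```python
from collections import Counter

# Hypothetical look-up: occupation title -> (categorisation 1, categorisation 2).
OCCUPATION_LOOKUP = {
    "weaver":   ("manufacturing", "craftsman"),
    "tailor":   ("manufacturing", "craftsman"),
    "yeoman":   ("agriculture", "landholder"),
    "labourer": ("agriculture", "wage labourer"),
}

def categorise(titles, scheme):
    """Count occupations under scheme 0 or 1; unlisted titles are kept as they are."""
    counts = Counter()
    for title in titles:
        key = title.strip().lower()
        if key in OCCUPATION_LOOKUP:
            counts[OCCUPATION_LOOKUP[key][scheme]] += 1
        else:
            counts[key] += 1  # isolated title: no category is forced onto it
    return counts

titles = ["weaver", "tailor", "yeoman", "labourer", "midwife"]
by_sector = categorise(titles, 0)
by_status = categorise(titles, 1)
print(by_sector, by_status)
```

Because the original occupational title is never overwritten, the same data can be re-analysed under a different scheme for comparison with other studies.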
Prosopography • Mostly used for the study of elites • The database is created not from a single source but from many, bringing biographical data together • Use relational design to avoid very large, multi-field databases containing many blank fields • Consider issues of nominal record linkage
Community Reconstruction • Concentrates on bringing together all records from one place • Needs careful design • Primary methodological issue is one of record linkage, so documents, place names and individuals may all have their own ID codes