Developing a Digital Library for the Humanities. Gregory Crane (gcrane@tufts.edu) Winnick Family Chair in Technology and Entrepreneurship Professor of Classics Director, Perseus Digital Library Project Http://www.perseus.tufts.edu/About/grc.html. Perseus Digital Library.
Developing a Digital Library for the Humanities • Gregory Crane (gcrane@tufts.edu) • Winnick Family Chair in Technology and Entrepreneurship • Professor of Classics • Director, Perseus Digital Library Project • Http://www.perseus.tufts.edu/About/grc.html
Perseus Digital Library • On-going areas of Development • 1987: DL on Classical Greek Culture • 1993: History of Science • 1996: Began work on Latin and Rome • 1997: Early Modern English • 1999: History and Topography of London • 2000: Ancient Egyptian Giza • 2000: Slavery and the US Civil War
Partner Institutions • Max Planck Institute for the History of Science (Berlin) • Museum of Fine Arts, Boston • Stoa Publishing Consortium • New Variorum Shakespeare Series, Modern Language Association • Special Collections at Tufts, Brandeis, the University of Pennsylvania
On-Going Support • National Endowment for the Humanities(DLI2, Preservation & Access, Education) • National Science Foundation (DLI2) • Fund for the Improvement of Postsecondary Education, Dept of Ed. • Max Planck Society
The Whole greater than the sum • Tufts Health Sciences Database: • An on-line Medical School Curriculum • First iteration: 70% of the value • Second Iteration: 90% • Third Iteration: 130% • “Data” and “system” interact in increasingly dynamic ways.
Persistent value over time & space • How many ages hence / Shall this our lofty scene be acted over, / In states unborn and accents yet unknown? • Cassius in Julius Caesar • How do we structure data for • Contemporary users we can’t directly anticipate? • Systems not yet designed?
Radically New Documents • Reconstructions of Historical Spaces, e.g. • UVA’s Crystal Palace (London) • UCLA’s Rome and VR Lab • Integrating Virtual Spaces with Sources • Museum of Fine Arts, Tombs at Giza • Greek Sculpture • The Streets of 19th Century London
Traditional Docs Rethought • Concordance: “Obsolete” • Bibliographies: databases • Encyclopedias: automatic linking • Lexica and lexicography: • Automatically discovered semantic relations • THEN lexicographic work
Development has two parts • Ultimate end: Radically new docs? • Short term: Electronic Incunabula • New Variorum Shakespeare • Electronic Marlowe • Tallis Street Maps • FIRST we thoroughly analyze what we have • THEN radical redesign emerges
Technology outruns Practice • The 3D Reconstruction/Virtual Space • Cutting edge technology • Still nascent scholarly practices • Mature Document Structures • Textual Notes: 1908 Richard III • Traditional Text Citations: 1887 Commentary
The More Things Stay the same... • “Content” can remain unchanged • “Presentation” is dynamic and flexible • The Dictionary knows what you are reading • Citations —> Bidirectional links • Automatic Linking by keyword • Text and Atlas: Plot sites in a document
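One way to picture "Citations —> Bidirectional links" is as an inverted index: a commentary or lexicon cites passages in one direction, and inverting that map lets each passage list everything that cites it. The sketch below is illustrative only; the identifiers (`commentary-1887:note-12`, `text:1.1`) and the function name `build_backlinks` are hypothetical and not taken from Perseus.

```python
from collections import defaultdict

def build_backlinks(citations):
    """Invert {citing_item: [cited_passage, ...]} into {cited_passage: [citing_item, ...]}."""
    backlinks = defaultdict(list)
    for source, targets in citations.items():
        for target in targets:
            # Record each citing item once per passage, in first-seen order.
            if source not in backlinks[target]:
                backlinks[target].append(source)
    return dict(backlinks)

# Hypothetical forward citations: a commentary note and a dictionary entry.
citations = {
    "commentary-1887:note-12": ["text:1.1", "text:1.5"],
    "lexicon:entry-42": ["text:1.1"],
}
backlinks = build_backlinks(citations)
```

With the inverted map, a reader looking at `text:1.1` can be shown both the commentary note and the lexicon entry that refer to it, while the source files stay unchanged.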
Current Paradigm: DL Diplomacy • Monolithic Systems (e.g., Perseus!) • One way to view each document • Intercommunication via metadata • DL as metadata for “opaque” objects • Major Problems • Renting access, rather than collecting content • All publications become ephemera
Three Strategies • 1) The Editing Problem — • How do real authors create structured docs? • 2) Developing Radically New Docs — • Archimedes DL on Mechanics • MFA Excavations at Giza • 3) Radical Repurposing of Print • Bolles Collection on London
Bolles Collection at Tufts • documenting the history and topography of London and its environs • 35 “full-size” maps • 320 more specialized maps • 400 books (284 linear feet of shelf space) • 1,000 pamphlets • “Paper Hypertexts” • 10,000+ “extra illustrations”
Bolles Electronic Archive • A Testbed for the Perseus Digital Library • “Level 5” TEI Encoded Full Text • Quotes, languages, proper names, dates, money • High-end OCR and Double Keyboarding • OCR ideal for some but not all • Keyboarding much the best — money permitting
Bolles — Initial Texts • Five Million Words now in L5 TEI • Will exceed 10 million by year’s end • Surveys of London History and Topography • Stow, Maitland, Wilkinson, Allen, Thornbury • Commentary on social conditions • Mayhew, Archer, Hollingshead, Booth • Literary works with London as backdrop • Defoe, Dickens, “Sherlock Holmes”
Images • 10,000 Grayscale Images • Mainly engravings of people and places • “opportunistic” metadata (=captions & context) • 2,400 Contemporary Images • Well catalogued and geo-referenced • QTVR Panoramas • 70 Tallis Map “Elevations”
Geospatial Data • Bartholomew 1:5000 Data set for London • Modern data as reference and interchange • Historical maps georeferenced to Bartholomew data • 10 so far (c. 2 hours each) • Urban maps do not easily “line up” • How to create an historical GIS? • GPS Waypoints • As of May 2000, good to within 10 m or better
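The slide does not say how the historical maps were georeferenced. As a minimal sketch of the general idea, assume the simplest case: three control points identified on both the scanned map (pixel coordinates) and the modern reference data (map coordinates) determine an affine transform. All coordinates and function names below are hypothetical, and since urban maps "do not easily line up", real work would likely need many more control points and a more flexible fit.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution

def fit_affine(pixel_pts, map_pts):
    """Fit X = a*x + b*y + c and Y = d*x + e*y + f from three point pairs."""
    m = [[x, y, 1.0] for x, y in pixel_pts]
    a, b, c = solve3(m, [X for X, _ in map_pts])
    d, e, f = solve3(m, [Y for _, Y in map_pts])
    return (a, b, c, d, e, f)

def apply_affine(coeffs, x, y):
    """Map a pixel position on the scan to reference coordinates."""
    a, b, c, d, e, f = coeffs
    return (a * x + b * y + c, d * x + e * y + f)

# Hypothetical control points: scan pixels -> reference coordinates.
pixel_pts = [(0, 0), (100, 0), (0, 100)]
map_pts = [(1000, 2000), (1500, 2000), (1000, 1500)]
coeffs = fit_affine(pixel_pts, map_pts)
x, y = apply_affine(coeffs, 50, 50)
```

Once fitted, any feature located on the historical scan can be expressed in the coordinates of the modern data set, which is what makes the modern data usable "as reference and interchange".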
Feature Extraction • Easy identification: Dates, Money • Known Keywords and Classes • The Getty TGN (1 million places with longitudes/latitudes) • The Bartholomew Gazetteer (10,000) • Indices to Maps (e.g. Cruchley 1826, 4200) • The Index/Abstract of the DNB (30,000+) • Clean-up with rule-based Proper Name classification: Mr NAME; NAME street
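The "easy" features and the rule-based clean-up can be approximated with a few regular expressions. This is a rough sketch under stated assumptions, not the project's actual rules: the patterns, the sample sentence, and the function name `extract_features` are all invented for illustration, and the year pattern only covers four-digit years beginning with 1.

```python
import re

YEAR_RE = re.compile(r"\b1[0-9]{3}\b")                  # e.g. 1826
MONEY_RE = re.compile(r"£\s?\d[\d,]*(?:\.\d+)?")        # e.g. £40 or £1,200
PERSON_RE = re.compile(r"\b(?:Mr|Mrs|Dr)\.? ([A-Z][a-z]+)")  # rule: Mr NAME
STREET_RE = re.compile(r"\b([A-Z][a-z]+) [Ss]treet\b")       # rule: NAME street

def extract_features(text):
    """Return dates, sums of money, and rule-classified proper names."""
    return {
        "dates": YEAR_RE.findall(text),
        "money": MONEY_RE.findall(text),
        "people": sorted(set(PERSON_RE.findall(text))),
        "streets": sorted(set(STREET_RE.findall(text))),
    }

# Hypothetical sentence in which one NAME is both a person and a street.
sample = "In 1826 Mr Coleman paid £40 for a house in Coleman Street."
features = extract_features(sample)
```

The sample shows why the context rules matter: a bare keyword match on "Coleman" cannot tell the person from the street, whereas "Mr NAME" and "NAME street" classify each occurrence separately.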
“Runtime” Links • Runtime links supplement in-file tagging • 1) Where metadata is less precise • Metadata from unedited headers and captions • 2) Where the source does not contain data • If no dates, then scan for them • Use tagging for “high confidence” data • Ideal situation: automated tags, hand-proofed
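The rule "if no dates, then scan for them" amounts to a fallback: prefer tags that are already in the file, and only scan the running text when none exist. The sketch below assumes a passage is available as a small XML fragment with dates marked by a `date` element; the function name `dates_for` and the "tagged"/"scanned" labels are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

YEAR_RE = re.compile(r"\b1[0-9]{3}\b")

def dates_for(passage_xml):
    """Return (date, origin) pairs, preferring in-file tags over a runtime scan."""
    root = ET.fromstring(passage_xml)
    tagged = [d.text for d in root.iter("date") if d.text]
    if tagged:
        # High-confidence data: the file already carries date tags.
        return [(t, "tagged") for t in tagged]
    # No tags present: fall back to scanning the plain text at runtime.
    text = "".join(root.itertext())
    return [(year, "scanned") for year in YEAR_RE.findall(text)]

tagged_result = dates_for("<p>Built in <date>1666</date>.</p>")
scanned_result = dates_for("<p>Rebuilt in 1708 and again in 1820.</p>")
```

Keeping the origin with each result lets a display layer treat tagged dates as reliable and scanned ones as provisional until they have been proofed by hand.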
Strategic Questions • “Editions” a foundation for scholarship • Where does the editor’s job start? • How does editor’s job change? • How do we define “Corpus Editors”? • People with domain expertise in content • Expertise in software and Library systems • Need for scholarly automated processing
Delivering Integrated Data • “Good” and “rough” maps for Cicero’s Letters • Coleman delivers quite useful results • Map locates Coleman Street. • Streets in description of “Portsoken Ward”. • Historical Views of this section of London • Timeline 1: A Linear History • Timeline 2: “Encyclopedic Scatter”
Further Work • Disambiguation, auto-cataloguing, Time/Space • VR Interface: Tallis 1, 2 and Headset • New challenging document types • Geospatial data in Patterson's Journeys • Urban data in Booth and City Directories. • Tallis Map for Oxford Street with overall and more focused directories.
Research Projects • Robert Jacob and VR Interfaces • Figure: Tallis VR Conversion 1. • Figure: Tallis VR Conversion 2. • Figure: Head mounted VR navigation. • Holly Taylor and Cognitive Analysis • Spatial Cognition • Text Comprehension
Conclusions • Baseline Knowledge Environment • Practical and useful • “Corpus Editions” • Midway between editions and library digitization • Requires a new configuration of skills • The “Diplomatic” Federated DL model is weak • Need access to full data for visualizations
Perseus Document Manager • Works with XML • Multiple granularities: sentence, section, chapter • Deals with overlapping doc hierarchies • Combines internal and external metadata • Our metadata is in RDF and can be expressed as XML • Since all data and metadata can be expressed as XML • Well suited to Federated DL Applications
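To illustrate what "multiple granularities" could mean in practice, the sketch below pulls the same XML text out as sentences, sections, or chapters. It is not the Perseus Document Manager's interface: the element names (`chapter`, `section`, `s`), the sample document, and the function `chunks` are assumptions, and the example only handles a single nested hierarchy, not overlapping ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with one strictly nested hierarchy.
DOC = """<text>
  <chapter n="1">
    <section n="1"><s>First sentence.</s> <s>Second sentence.</s></section>
    <section n="2"><s>Third sentence.</s></section>
  </chapter>
</text>"""

def chunks(xml_source, granularity):
    """Return the text of every element with the given tag, whitespace-normalized."""
    root = ET.fromstring(xml_source)
    return [" ".join("".join(el.itertext()).split())
            for el in root.iter(granularity)]

sentences = chunks(DOC, "s")
sections = chunks(DOC, "section")
chapters = chunks(DOC, "chapter")
```

The same source file serves a sentence-level display, a section-level display, and a chapter-level display without being stored three times.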
Scalable DL • SGML/XML need translation for display • Can’t maintain stylesheets for millions of docs • Intelligent display of various DTDs • “Cheaply” acquires XML/SGML docs • Individual Custom Style sheets allowed • Integration of Geo-spatial Data • Multilingual support, feature extraction • Integrated multi-resolution image support
Perseus Document Manager • Short term development: • Collecting new datasets to the Perseus DL • (leveraging Internet 2 investment) • Adding value: e.g., • Sources for the History of Mechanics (Max Planck) • Duke Databank of Documentary Papyri • Books, maps etc. on the City of London • Shakespeare and Early modern English
Perseus Document Manager • Longer Term: Distribution of the System • How best to maintain and expand the system? • Open source? • Commercial Licensing? • Wait for third party to match PDM features?
Automatic Integration • Content Analysis: Various Languages • Time: extracting and visualizing dates • Space: Integrating historical Geographic Data • Names: establishing authority lists • Getty Thesaurus of Geographic Names • Names and Coordinates • Encyclopedias: e.g., Harpers, DNB • Names and Dates
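An authority list of names and coordinates is what lets a system plot the sites mentioned in a document. The sketch below is a toy version of that lookup: the two-entry gazetteer uses rounded, approximate coordinates for illustration and is not drawn from the Getty Thesaurus of Geographic Names, and the function name `plot_sites` is hypothetical.

```python
import re

# Illustrative authority list: name -> (latitude, longitude), rounded values.
GAZETTEER = {
    "London": (51.51, -0.13),
    "Oxford": (51.75, -1.26),
}

def plot_sites(text, gazetteer):
    """Return (offset, name, lat, lon) for every gazetteer name found in the text."""
    sites = []
    for name, (lat, lon) in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            sites.append((match.start(), name, lat, lon))
    return sorted(sites)  # order of appearance in the text

sample = "The coach left London at dawn and reached Oxford by night."
sites = plot_sites(sample, GAZETTEER)
names_in_order = [site[1] for site in sites]
```

A plain keyword match like this would also flag "Oxford" in "Oxford Street", which is one reason disambiguation rules and hand proofing are still needed on top of the authority list.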
Our Research Agenda • Developing self-sustaining models • Publication of documents • Maintenance of software • Exploring Problem Sets in different domains • E.g., sparse data (antiquity) vs. rich (London) • Helping humanists rethink their position • Reaching new audiences • Changing habits
Technology matters: e.g., 19th c. Printing in England • 20th Century Radio/Film/TV: ambiguous • 19th Century Print Technology • 1810: c. 10,000 copies for a successful book • Audience for literature mainly upper class • 1850: hundreds of thousands • Audience vastly expands • Huge numbers read Dickens, etc. • 21st Century Network Technology?
The Future? • Two models: • Reproduce current world in new form • Narrow/expensive distribution • Think about how that world may change • Broader/inexpensive distribution • What happens now sets the stage for … • “talk show” cyber culture? or • a new dispersal of intellectual life?