This article explores the potential for integrating the TEI (Text Encoding Initiative) and CIDOC-CRM (CIDOC Conceptual Reference Model) standards. TEI is a set of recommendations for text encoding, while CIDOC-CRM is an ontology for cultural and natural history. The article discusses the similarities and differences between the two standards and proposes a possible interface for data integration.
TEI, CIDOC-CRM and a Possible Interface between the Two?
Øyvind Eide & Christian-Emil Ore
Unit for Digital Documentation, University of Oslo, Norway
The CIDOC Conceptual Reference Model (cidoc.ics.forth.gr)
• What is the CIDOC CRM?
  • An object-oriented ontology developed by ICOM-CIDOC, 1996-2005
  • Accepted as ISO 21127 in June 2005
  • About 80 classes and 130 properties for cultural and natural history
  • CRM instances can be encoded in many forms: RDBMS, ooDBMS, XML, RDF(S), OWL
• What is the CIDOC CRM for?
  • Intellectual guide to create schemata, formats, profiles
  • Best practice guide
  • A language for analysis of existing sources and models for data integration (mapping)
  • Transportation format for data integration / migration / Internet
• Ongoing activities
  • CRM-Core
  • Harmonisation with the object-oriented version of FRBR (Functional Requirements for Bibliographic Records, IFLA); first version will be published in fall 2006
  • Extension of CRM with a categorical level, e.g. reoccurring events
The CIDOC CRM Top-level Classes relevant for Integration
[Diagram of top-level classes and the relations between them]
• Classes: E55 Types; E41 Appellations; E39 Actors (persons, inst.); E28 Conceptual Objects; E18 Physical Things; E2 Temporal Entities (Events); E52 Time-Spans; E53 Places
• Relation labels: refer to / refine (E55 Types); refer to / identify (E41 Appellations); participate in; affect or refer to; have location; at; within
Motivation: Grey literature in Museums
Original text (text witness)
• Step 1, registration: bibliographical record
• Step 2, reproduction: facsimile
• Step 3, transcription: text with XML mark-up (1. structural mark-up; 2. lemmatization etc.)
• Step 4, content mark-up: text with XML mark-up, with information elements identified and marked up according to a simple information model (DTD)
Museum database: artefacts, excavations, referential information; event/object oriented model (CIDOC-CRM compatible)
Motivation: Grey literature in Museums
Catalogue entry
8. Malayan dagger, taken from pirates of the Indian Oceans. Beautiful handle, graven as a human figure above waistline. Snake winded blade. VII, IX, p, 2. Daa,O., 99. Donated April 11 1856 from Captain Teiste.
Motivation: Grey literature in Museums
Catalogue entry with mark-up
<NRPAR>
  <CATNR NRID="EM8"> 8</CATNR>.
  <ARTIFDATA>
    <PROD><USE><PEOPLE><PLACE> Malayan </PLACE></PEOPLE></USE></PROD>
    <ARTIFACT> dagger </ARTIFACT> ,
    <AQUISITION> taken from <AQUFROM>pirates</AQUFROM> of the Indian Oceans. </AQUISITION>
    <DESCR>Beautiful handle, graven as a human figure above waistline. Snake winded blade. <LIT_REF>VII, IX, p, 2. Daa,O., 99.</LIT_REF></DESCR>
    <AQUISITION> Donated <AQUTIME> April 11 1856 </AQUTIME> from <AQUFROM> Captain Teiste </AQUFROM>. </AQUISITION>
  </ARTIFDATA>
</NRPAR>
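Once the catalogue entry carries content mark-up, the information elements can be pulled out mechanically and handed to a museum database. The following is only a minimal sketch of that step in Python: the function names `text_of` and `extract_record` and the shape of the resulting dictionary are assumptions made for illustration, and the entry is reproduced with the closing tag of DESCR spelled consistently so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Catalogue entry from the slide (DESCR closed with a matching tag).
ENTRY = """<NRPAR>
<CATNR NRID="EM8">8</CATNR>.
<ARTIFDATA><PROD><USE><PEOPLE><PLACE>Malayan</PLACE></PEOPLE></USE></PROD>
<ARTIFACT>dagger</ARTIFACT>,
<AQUISITION>taken from <AQUFROM>pirates</AQUFROM> of the Indian Oceans.</AQUISITION>
<DESCR>Beautiful handle, graven as a human figure above waistline.
Snake winded blade. <LIT_REF>VII, IX, p, 2. Daa,O., 99.</LIT_REF></DESCR>
<AQUISITION>Donated <AQUTIME>April 11 1856</AQUTIME> from
<AQUFROM>Captain Teiste</AQUFROM>.</AQUISITION>
</ARTIFDATA>
</NRPAR>"""


def text_of(element):
    """Return the full text content of an element, whitespace normalised."""
    return " ".join("".join(element.itertext()).split())


def extract_record(xml_string):
    """Collect the marked-up information elements into a plain dictionary.

    The dictionary layout is hypothetical; it is not a schema from the talk.
    """
    root = ET.fromstring(xml_string)
    record = {
        "catalogue_id": root.find("CATNR").get("NRID"),
        "artifact": text_of(root.find(".//ARTIFACT")),
        "origin": text_of(root.find(".//PLACE")),
        "acquisitions": [],
    }
    # One entry may describe several acquisition events.
    for acquisition in root.iter("AQUISITION"):
        source = acquisition.find("AQUFROM")
        time = acquisition.find("AQUTIME")
        record["acquisitions"].append({
            "from": text_of(source) if source is not None else None,
            "time": text_of(time) if time is not None else None,
        })
    return record


record = extract_record(ENTRY)
```

Each AQUISITION element becomes a separate entry, which matches the event-oriented view of the target model: the seizure from the pirates and the donation by Captain Teiste are two distinct events.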
The content of the text expressed in CIDOC-CRM
[Diagram: CRM graph of the excavation report]
• Classes and instances: E31 Document; E55 Type "Archaeological report"; E7 Activity; E55 Type "Archaeological excavation"; E11 Modification "Breaking of the sword"; E22 Man-Made Object "Sword"; E52 Time-Span; E53 Place; E21 Person (actor); E42 Object Identifier "C50435"; E82 Actor Appellation "Dr. Diggey"; E44 Place Appellation "Wasteland"; E50 Date "2005"
• Properties: P2 has type; P70 documents; P9 forms part of; P12 was present at; P4 has time-span; P7 took place at; P14 carried out by; P1 is identified by; P87 is identified by; P78 is identified by
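The graph on this slide can be written down as a set of subject-property-object statements. The sketch below is one possible reading of the diagram, not the authors' data structure: the entity ids (`report`, `excavation`, `breaking` and so on), the `Entity` and `Statement` types, the helper functions, and the exact choice of which node each property connects are all assumptions reconstructed from the class and property labels.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    crm_class: str
    label: Optional[str]


class Statement(NamedTuple):
    subject: str
    prop: str
    obj: str


# Hypothetical ids for the nodes of the diagram.
ENTITIES = {
    "report": Entity("E31 Document", None),
    "report_type": Entity("E55 Type", "Archaeological report"),
    "excavation": Entity("E7 Activity", None),
    "excavation_type": Entity("E55 Type", "Archaeological excavation"),
    "breaking": Entity("E11 Modification", "Breaking of the sword"),
    "sword": Entity("E22 Man-Made Object", "Sword"),
    "timespan": Entity("E52 Time-Span", None),
    "place": Entity("E53 Place", None),
    "person": Entity("E21 Person", None),
    # E42 is the CRM class number for Object Identifier.
    "object_id": Entity("E42 Object Identifier", "C50435"),
    "actor_name": Entity("E82 Actor Appellation", "Dr. Diggey"),
    "place_name": Entity("E44 Place Appellation", "Wasteland"),
    "date": Entity("E50 Date", "2005"),
}

# Assumed edges; the flattened slide does not preserve the arrows.
STATEMENTS = [
    Statement("report", "P2 has type", "report_type"),
    Statement("report", "P70 documents", "excavation"),
    Statement("excavation", "P2 has type", "excavation_type"),
    Statement("breaking", "P9 forms part of", "excavation"),
    Statement("sword", "P12 was present at", "breaking"),
    Statement("excavation", "P4 has time-span", "timespan"),
    Statement("excavation", "P7 took place at", "place"),
    Statement("excavation", "P14 carried out by", "person"),
    Statement("sword", "P1 is identified by", "object_id"),
    Statement("person", "P1 is identified by", "actor_name"),
    Statement("place", "P87 is identified by", "place_name"),
    Statement("timespan", "P78 is identified by", "date"),
]


def objects_of(subject, prop_code):
    """Ids reached from `subject` via the property with this code, e.g. 'P14'."""
    return [s.obj for s in STATEMENTS
            if s.subject == subject and s.prop.split()[0] == prop_code]


def label_of(entity_id):
    """Prefer an attached appellation or identifier, else the entity's own label."""
    for code in ("P1", "P87", "P78"):
        targets = objects_of(entity_id, code)
        if targets:
            return ENTITIES[targets[0]].label
    return ENTITIES[entity_id].label


# Who carried out the activity that the breaking of the sword forms part of?
parent_event = objects_of("breaking", "P9")[0]
excavators = [label_of(p) for p in objects_of(parent_event, "P14")]
```

The final query walks from the modification event to its parent activity and on to the actor, which is the kind of question the event-oriented model is meant to answer across integrated sources.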
TEI - where did it come from?
• Originally, a research project within the humanities
• Founded in 1987-88
• Sponsored by three professional associations
• Funded 1990-1994 by US NEH, EU LE Programme et al.
• Major influences
  • digital libraries and text collections
  • language corpora
  • scholarly datasets
• International consortium established June 1999 (see http://www.tei-c.org/)
Acc. to L. Burnard
Goals of the TEI
• better interchange and integration of scholarly data
• support for all texts, in all languages, from all periods
• guidance for the perplexed: what to encode; hence, a user-driven codification of existing best practice
• assistance for the specialist: how to encode; hence, a loose framework into which unpredictable extensions can be fitted
• These apparently incompatible goals result in a highly flexible, modular environment
Acc. to L. Burnard
TEI Deliverables
• A set of recommendations for text encoding, covering both generic text structures and some highly specific areas, based on (but not limited by) existing practice
• A very large collection of element definitions (400) with associated declarations for various schema languages
• A modular system for creating personalized schemas or DTDs from the foregoing
• For the full picture see http://www.tei-c.org/TEI/Guidelines/
Acc. to L. Burnard
Legacy of the TEI
• a way of looking at what ‘text’ really is
• a codification of current scholarly practice
• (crucially) a set of shared assumptions about the digital agenda:
  • focus on content and function (rather than presentation)
  • identify generic solutions (rather than application-specific ones)
Acc. to L. Burnard
TEI - the header
• Elements for detailed bibliographic description:
  • File description
    • Title statement
    • Edition statement
    • Extent statement
    • Publication statement
    • Series statement
    • Notes
    • Source description
      • bibliographic elements
      • (Manuscript description)
  • Encoding description
  • Profile description
  • Revision description
• Mapping to other metadata standards
  • MARC: discussed
  • Dublin Core: unfinished
TEI additional element sets
• Base Tag Set for Verse
• Performance Texts
• Transcription of Speech
• Print Dictionaries
• Manuscript description
• Linking and alignment; analysis
• Feature structures
• Certainty; physical transcription; textual criticism
• Names and dates
• Graphs, networks and trees
• Graphics, figures and tables
• Language Corpora
• Representation of non-standard characters and glyphs
• Feature System Declaration
Some “ontological” elements in TEI: Events
• History: groups elements describing the full history of a manuscript or manuscript part.
• Origin: contains any descriptive or other information concerning the origin of a manuscript or manuscript part.
• CustEvent: describes a single event during the custodial history of a manuscript.
• Provenance: contains any descriptive or other information concerning a single identifiable episode during the history of a manuscript or manuscript part, after its creation but before its acquisition.
• Acquisition: contains any descriptive or other information concerning the process by which a manuscript or manuscript part entered the holding institution.
Some “ontological” elements in TEI: Events, time appellations
• Event: any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication, e.g. “ceiling collapses” during a recorded interview.
• persEvent: contains a description of a particular event of significance in the life of a person.
• Birth, death: contains information about a person's birth/death, such as its date and place.
• Date: contains a date in any format.
• Occasion: a temporal expression (either a date or a time) given in terms of a named occasion such as a holiday, a named time of day, or some notable event.
Some “ontological” elements in TEI: Actors and appellations
• Person: provides information about an identifiable individual, for example a participant in a language interaction, or a person referred to in a historical source.
• Hand: used in the header to define each distinct scribe or handwriting style.
• Author: in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
• Name: (name, proper noun) contains a proper noun or noun phrase.
Some “ontological” elements in TEI: Person example (from P5 guidelines)
<person xml:id="Ovi01" sex="1" role="poet">
  <persName xml:lang="en">Ovid</persName>
  <persName xml:lang="la">Publius Ovidius Naso</persName>
  <birth date="-0044-03-20"> 20 March 43 BC
    <placeName><settlement type="city">Sulmona</settlement><country reg="IT">Italy</country></placeName>
  </birth>
  <death notBefore="17" notAfter="18"> 17 or 18 AD
    <placeName><settlement type="city">Tomis (Constanta)</settlement><country reg="RO">Romania</country></placeName>
  </death>
</person>
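The person element already contains event-like information, so it can be read as a small set of CRM-style statements. The sketch below illustrates what such a reading might look like; it is not a mapping proposed in the talk. The function name `person_to_crm`, the generated event ids, the `is a` pseudo-property and the choice of E67 Birth, E69 Death, P98 brought into life and P100 was death of are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree uses for the predefined xml: prefix.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

PERSON = """<person xml:id="Ovi01" sex="1" role="poet">
<persName xml:lang="en">Ovid</persName>
<persName xml:lang="la">Publius Ovidius Naso</persName>
<birth date="-0044-03-20">20 March 43 BC
<placeName><settlement type="city">Sulmona</settlement>
<country reg="IT">Italy</country></placeName></birth>
<death notBefore="17" notAfter="18">17 or 18 AD
<placeName><settlement type="city">Tomis (Constanta)</settlement>
<country reg="RO">Romania</country></placeName></death>
</person>"""

# Assumed correspondence: TEI element -> (CRM event class, linking property).
EVENT_MAPPING = {
    "birth": ("E67 Birth", "P98 brought into life"),
    "death": ("E69 Death", "P100 was death of"),
}


def person_to_crm(person_xml):
    """Turn a TEI person element into (subject, property, object) triples."""
    person = ET.fromstring(person_xml)
    person_id = person.get(XML_NS + "id")
    triples = [(person_id, "is a", "E21 Person")]
    for name in person.findall("persName"):
        triples.append((person_id, "P1 is identified by", name.text))
    for tag, (crm_class, prop) in EVENT_MAPPING.items():
        event = person.find(tag)
        if event is None:
            continue
        event_id = f"{person_id}_{tag}"  # hypothetical id scheme
        triples.append((event_id, "is a", crm_class))
        triples.append((event_id, prop, person_id))
        settlement = event.find("placeName/settlement")
        if settlement is not None:
            triples.append((event_id, "P7 took place at", settlement.text))
    return triples


triples = person_to_crm(PERSON)
```

Birth and death become events in their own right, each with its own place, rather than attributes of the person. That shift from attribute to event is the main difference between the two models that a mapping has to bridge.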
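A simple extension of the TEI DTD
<!-- The slide gives no attribute defaults; #REQUIRED is assumed here to make the declarations valid. -->
The root CIDOC-CRM element
<!ELEMENT crm (crmClass*, crmProperty*)>
<!ATTLIST crm id ID #REQUIRED>
The class element
<!ELEMENT crmClass (#PCDATA)>
<!ATTLIST crmClass id ID #REQUIRED className CDATA #REQUIRED>
The property element
<!ELEMENT crmProperty EMPTY>
<!ATTLIST crmProperty id ID #REQUIRED propName CDATA #REQUIRED from IDREF #REQUIRED to IDREF #REQUIRED>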
The text expressed with a TEI mark-up
<p id="p1">The <rs id="e1">excavation in <name type="place" id="n1">Wasteland</name></rs> in <date id="d1">2005</date> was performed by <name type="person" id="n2">Dr. Diggey</name>. He had the misfortune of <rs id="e2">breaking <rs id="o1">the beautiful sword <rs id="o_id1">(C50435)</rs></rs> into 30 pieces</rs>.</p>
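The ids in the TEI paragraph give the crm extension elements something to point at. The sketch below shows one way the two could be combined, under several assumptions that are not on the slides: the function name `build_crm`, the hand-made class assignments, the `crm_` id prefix, and the convention that each crmClass element holds the id of the text element it interprets as its character data.

```python
import xml.etree.ElementTree as ET

TEI_P = """<p id="p1">The <rs id="e1">excavation in
<name type="place" id="n1">Wasteland</name></rs> in <date id="d1">2005</date>
was performed by <name type="person" id="n2">Dr. Diggey</name>. He had the
misfortune of <rs id="e2">breaking <rs id="o1">the beautiful sword
<rs id="o_id1">(C50435)</rs></rs> into 30 pieces</rs>.</p>"""

# Hypothetical annotator decisions: text element id -> CRM class.
CLASS_OF = {
    "e1": "E7 Activity",
    "e2": "E11 Modification",
    "o1": "E22 Man-Made Object",
}

# Hypothetical statements between those instances: (from, property, to).
PROPERTIES = [
    ("e2", "P9 forms part of", "e1"),
    ("o1", "P12 was present at", "e2"),
]


def build_crm(tei_root, class_of, properties):
    """Build a crm element with crmClass children followed by crmProperty children."""
    known_ids = {el.get("id") for el in tei_root.iter() if el.get("id")}
    crm = ET.Element("crm", id="crm1")
    for text_id, class_name in class_of.items():
        if text_id not in known_ids:
            raise ValueError(f"no element with id {text_id!r} in the text")
        node = ET.SubElement(crm, "crmClass",
                             id=f"crm_{text_id}", className=class_name)
        node.text = text_id  # assumed link back to the marked-up passage
    for number, (source, prop, target) in enumerate(properties, start=1):
        if source not in class_of or target not in class_of:
            raise ValueError(f"property {prop!r} refers to an undeclared instance")
        # 'from' is a Python keyword, so the attributes are passed as a dict.
        ET.SubElement(crm, "crmProperty", {
            "id": f"crm_p{number}",
            "propName": prop,
            "from": f"crm_{source}",
            "to": f"crm_{target}",
        })
    return crm


crm = build_crm(ET.fromstring(TEI_P), CLASS_OF, PROPERTIES)
crm_xml = ET.tostring(crm, encoding="unicode")
```

Emitting all crmClass elements before the crmProperty elements keeps the output in the order that the content model `(crmClass*, crmProperty*)` expects.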
CRM-Core – a dtd for encoding information [suggested by CRM-SIG]
Encoding the information in CRM Core (Factoids)
<CRM_Core>
  <Category>E31 Document</Category>
  <Classification>Archaeological report</Classification>
  <Identification>Wasteland excavation 2005 report</Identification>
  <Event>
    <Role_in_Event>P70_documents</Role_in_Event>
    <Identification>Wasteland_2005_excavation</Identification>
    <Event_Type>E7_Activity</Event_Type>
    <Participant>Dr. Diggey</Participant>
    <Participant_Type>excavator</Participant_Type>
    <Thing_Present>C50435 sword</Thing_Present>
    <Date>2005</Date><Place>Wasteland</Place>
  </Event>
  <Event>
    <Role_in_Event>P70_documents</Role_in_Event>
    <Identification>damage_to_artifact_C50435</Identification>
    <Event_Type>E11_Modification</Event_Type>
    <Participant>Dr. Diggey</Participant>
    <Participant_Type>excavator</Participant_Type>
    <Thing_Present>C50435 sword</Thing_Present>
    <RelatedEvent>
      <Role_in_Event>P9_forms_part_of</Role_in_Event>
      <Identification>Wasteland_2005_excavation</Identification>
    </RelatedEvent>
  </Event>
</CRM_Core>
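A CRM Core record of this kind is flat enough to be read with a few lines of code. The following sketch extracts the events of the document record and the part-of links between them; the function name `events_of` and the dictionary layout are assumptions for illustration and are not part of the CRM Core proposal.

```python
import xml.etree.ElementTree as ET

RECORD = """<CRM_Core>
<Category>E31 Document</Category>
<Classification>Archaeological report</Classification>
<Identification>Wasteland excavation 2005 report</Identification>
<Event>
  <Role_in_Event>P70_documents</Role_in_Event>
  <Identification>Wasteland_2005_excavation</Identification>
  <Event_Type>E7_Activity</Event_Type>
  <Participant>Dr. Diggey</Participant>
  <Participant_Type>excavator</Participant_Type>
  <Thing_Present>C50435 sword</Thing_Present>
  <Date>2005</Date><Place>Wasteland</Place>
</Event>
<Event>
  <Role_in_Event>P70_documents</Role_in_Event>
  <Identification>damage_to_artifact_C50435</Identification>
  <Event_Type>E11_Modification</Event_Type>
  <Participant>Dr. Diggey</Participant>
  <Participant_Type>excavator</Participant_Type>
  <Thing_Present>C50435 sword</Thing_Present>
  <RelatedEvent>
    <Role_in_Event>P9_forms_part_of</Role_in_Event>
    <Identification>Wasteland_2005_excavation</Identification>
  </RelatedEvent>
</Event>
</CRM_Core>"""


def events_of(record_xml):
    """List the events of a CRM Core record as plain dictionaries."""
    root = ET.fromstring(record_xml)
    events = []
    for event in root.findall("Event"):
        events.append({
            "id": event.findtext("Identification"),
            "type": event.findtext("Event_Type"),
            "role": event.findtext("Role_in_Event"),
            "participants": [p.text for p in event.findall("Participant")],
            "things": [t.text for t in event.findall("Thing_Present")],
            "date": event.findtext("Date"),    # None when absent
            "place": event.findtext("Place"),  # None when absent
            # Ids of the events this one forms part of.
            "part_of": [
                related.findtext("Identification")
                for related in event.findall("RelatedEvent")
                if related.findtext("Role_in_Event") == "P9_forms_part_of"
            ],
        })
    return events


events = events_of(RECORD)
```

Because related events are referenced by their Identification string, records from different sources can be joined on that value, provided the identifiers are assigned consistently.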
Encoding the information in CRM Core (Factoids)
<CRM_Core>
  <Category>E21 Person</Category>
  <Classification>archaeologist</Classification>
  <Identification>Dr. Diggey</Identification>
  <Event>
    <Role_in_Event>P14 carried out by</Role_in_Event>
    <Identification>damage_to_artifact_C50435</Identification>
    <Event_Type>E11 Modification</Event_Type>
    <Participant_Type>excavator</Participant_Type>
    <Thing_Present>C50435 sword</Thing_Present>
  </Event>
</CRM_Core>
<CRM_Core>
  <Category>E82 Actor appellation</Category>
  <Classification>formal name</Classification>
  <Identification>mention of name</Identification>
  <Relation>
    <To>Wasteland_excavation_2005_report#n2</To>
    <Relation_Type> <referred_to_by/> </Relation_Type>
  </Relation>
</CRM_Core>
Conclusions and further work
• Possible now
  • TEI extended with an RDF-like CIDOC-CRM
  • TEI extended with CRM-Core records
• Future
  • Make a mapping from TEI elements to CRM
  • Make a mapping from the TEI header into ooFRBR
  • Create an extension of the TEI definition
  • Write guidelines for CIDOC-CRM encoding of information in TEI documents
  • Convince the TEI users