THE DIATHESIS NEWSPAPER DIGITIZATION SUITE

Martin Doerr, Georgios Markakis, Maria Theodoridou
Centre for Cultural Informatics, Institute of Computer Science, Foundation of Research and Technology
Heraklion, Crete, Greece

About DIATHESIS
DIATHESIS is a newspaper digitization suite whose primary purpose is the digitization, classification and dissemination of archival newspaper material. It was originally used to digitize the newspaper collection (1890-1960) of the Vikelaia Municipal Library in Heraklion, Crete, and has since evolved into an independent digitization suite. It has also been used in other projects (Filekpedeytiki Etairia, Athens, Greece: the "AYGHI" newspaper).

The Problem
Historical newspapers are one of the most significant sources of information for researchers, owing to the wealth of information they provide on every aspect of everyday political, social and intellectual life. Access to this type of archival material is usually obstructed by the following factors:
• To protect the archival material from potential damage, some archives prohibit access to the largest part of their collection.
• Direct contact with the original archival material constitutes a potential health hazard (due to dust and fungi).
• The lack of indexes to newspapers, combined with the vast amount of information they contain, makes research a very time-consuming task.
Many archives have adopted digitization of newspapers as a straightforward method of dealing with the above problems. Digitized material is easier to preserve and much easier to distribute via the Web. However, converting archival material into a digital image format (e.g. JPEG, TIFF, PDF or DJVU) does not by itself solve the problem of rapid access. Digitization is inadequate unless it also provides the means of accessing the digitized material in a timely and accurate manner (also known as the searchability issue).

Current State of the Art in Newspaper Digitization Practices
Currently there are three main approaches for rendering newspaper archival material searchable:
• The Physical Features Based Approach.
• The OCR Based Full Text Indexing Approach.
• The Conceptual Classification (Ontology Based) Approach.

The Physical Features Based Classification Approach
Newspapers are classified using a basic set of metadata regarding physical features of the original material (issue number, date of publication, newspaper name, number of pages etc.).
Advantages:
• Simple to implement.
Disadvantages:
• The end user is unable to conduct full-text searches at the article or issue level.
• The final outcome of the digitization effort resembles a browsing mechanism more than a search mechanism.
• There is no explicitly defined conceptual structure of the archive.
Institutions:
• Anno: Austrian newspapers online project (http://deposit.ddb.de/online/exil/exil.htm).
• "Exilpresse digital. Deutsche Exilzeitschriften 1933-1945" project (http://deposit.ddb.de/online/exil/exil.htm).
• Denmark: Digitaliserede danske aviser 1759-1865 (http://www.statsbiblioteket.dk).

The OCR Based Full Text Indexing Approach
Automatic digitization approaches that make use of OCR analysis of digitized newspapers. Full text indexing techniques are currently considered the state of the art in newspaper digitization, mainly for the following reasons:
• Creating a searchable full-text index via OCR is a much faster process than creating metadata manually.
• Separation of searchability and readability.
• Searches can be conducted at the page, issue or article level.
• The search is conducted via keywords, in a manner familiar to the average user of contemporary Web search engines.
• Efficient content dissemination over the Web.
Disadvantages:
• Well-known precision/recall issues.
• Newspaper archives are not as chaotic as the Web.
• Searching in OCR based information retrieval systems is conceptually blind.
• The import process is a computationally expensive procedure.

The OCR Based Full Text Indexing Approach
Institutions adopting this approach:
• British Library online newspaper archive (http://www.uk.olivesoftware.com/).
• The Brooklyn Daily Eagle online (http://www.brooklynpubliclibrary.org/eagle/).
• Northern New York historical newspapers (http://news.nnyln.net/).
• Utah Digital Newspapers (http://www.lib.utah.edu/digital/unews/).
• Historical newspapers in Washington (http://www.secstate.wa.gov/history/newspapersname.aspx).
• To mention just a few…
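
To make the OCR based full text indexing approach concrete, the following is a minimal sketch of an inverted index over OCR text with boolean AND keyword search. It is not the indexing code of any of the systems listed above; the article identifiers, sample texts and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Map each token to the set of article ids whose OCR text contains it."""
    index = defaultdict(set)
    for article_id, ocr_text in articles.items():
        for token in tokenize(ocr_text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every query token (boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output keyed by article id (assumption, not real data).
articles = {
    "1945-02-12/a1": "Crimea conference ends in Yalta",
    "1945-02-13/a4": "Harbour works resume in Heraklion",
    "1945-02-14/a2": "Yalta agreement published",
}
index = build_index(articles)
print(search(index, "yalta"))
print(search(index, "crimean"))
```

The second query returns nothing, because the index only matches the exact tokens produced by OCR. This is one way to picture the "conceptually blind" limitation listed among the disadvantages.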

The Conceptual Classification Approach
The conceptual classification approach overcomes many of the above weaknesses by enabling the user to perform a knowledge engineering task upon the already digitized material via the use of ontologies.
An ontology: "the specification of one's conceptualization of a knowledge domain".
Advantages:
• Ontologies are used to express a specific conceptual view over the digitized material.
• The use of top level ontologies guarantees, to a certain extent, semantic interoperability among different archives.
• The user may classify a document with concepts that do not appear in the document itself.
Disadvantages:
• Given the density of information in a newspaper, production of metadata is a notoriously time-consuming task (the knowledge engineering bottleneck).
• It is almost impossible to manually define, in a timely manner, all the semantic relations or entities contained even in a single article.
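
The third advantage above (classifying a document with concepts that do not occur in its text) can be sketched as a search over a thesaurus hierarchy. This is an illustrative sketch only: the hierarchy, the term names and the article annotations are hypothetical and are not taken from the DIATHESIS thesauri.

```python
# Hypothetical broader-term -> narrower-terms hierarchy (assumption).
NARROWER = {
    "Politics": ["International Relations", "Elections"],
    "International Relations": ["Conferences", "Treaties"],
}

# Hypothetical manual annotations: article id -> assigned thesaurus terms.
ANNOTATIONS = {
    "a1": {"Conferences"},
    "a2": {"Elections"},
    "a3": {"Sports"},
}


def expand(term, narrower):
    """Return the term together with every term below it in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def search_by_concept(term, narrower, annotations):
    """Return ids of articles annotated with the term or any narrower term."""
    wanted = expand(term, narrower)
    return {aid for aid, terms in annotations.items() if terms & wanted}


print(search_by_concept("Politics", NARROWER, ANNOTATIONS))
```

A query for the broad concept "Politics" retrieves articles annotated only with narrower terms, even if the word "politics" never appears in their text.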

The DIATHESIS Approach: a hybrid approach
The system attempts to implement a realistic conceptual classification approach by combining the best elements of the three approaches mentioned above:
• It permits searches at the newspaper issue level (newspaper name, issue number, publication date), as in the physical features based approach.
• It permits searches at the article level via full text queries, as in the OCR based full text indexing approach.
• It permits searches at the article level via the semantic relationships assigned to each segment.
• It permits searches that combine all of the above elements.
The system does not attempt to create a complete semantic structure that includes all the semantic relationships and entities (Actors, Places) described in the text. Instead, it focuses on creating a coherent semantic backbone that can easily be enriched with further semantic relations. DIATHESIS uses the CIDOC CRM as its underlying ontology.
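
The combination of the three kinds of search criteria can be sketched as follows. This is a minimal illustration under stated assumptions, not the DIATHESIS query engine: the `Article` record, its field names, the sample data and the naive substring matching are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    newspaper: str          # physical feature of the issue
    issue_number: int       # physical feature of the issue
    published: date         # physical feature of the issue
    text: str               # OCR full text of the article
    subjects: set = field(default_factory=set)  # assigned thesaurus terms


def search(articles, newspaper=None, date_from=None, date_to=None,
           words=(), subject=None):
    """Return articles matching every criterion that was supplied."""
    hits = []
    for article in articles:
        if newspaper is not None and article.newspaper != newspaper:
            continue
        if date_from is not None and article.published < date_from:
            continue
        if date_to is not None and article.published > date_to:
            continue
        text = article.text.lower()
        # Naive substring matching stands in for a real full text index.
        if not all(word.lower() in text for word in words):
            continue
        if subject is not None and subject not in article.subjects:
            continue
        hits.append(article)
    return hits


# Hypothetical sample collection (assumption, not real data).
collection = [
    Article("Example Gazette", 101, date(1945, 2, 12),
            "Crimea conference ends in Yalta", {"Conferences"}),
    Article("Example Gazette", 102, date(1945, 2, 13),
            "Harbour works resume", {"Public Works"}),
    Article("Other Daily", 7, date(1950, 5, 1),
            "Conference on harbour trade", {"Conferences"}),
]

combined = search(collection, newspaper="Example Gazette",
                  words=["conference"], subject="Conferences")
print([a.issue_number for a in combined])
```

Each criterion narrows the result set independently, so a query may use issue metadata alone, full text alone, subject terms alone, or any combination.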

Aims of DIATHESIS
• To render the digitized newspapers searchable at the document and article level.
• To exploit OCR technology in order to enable full text search in a newspaper collection.
• To combine full text search with user-defined metadata based search at the document and article level, in order to enhance the overall precision of the system.
• To provide visualization facilities and an ergonomic interface for:
  • The timely completion of metadata according to a set of predefined thesauri hierarchies.
  • The browsing of the digitized newspaper collection given a set of predefined thesauri hierarchies.
• To deal with issues of semantic interoperability of digitized material (conformance to international standards).
• To create a robust semantic backbone that will allow the full implementation of the CIDOC CRM model.

About CIDOC
What is the CIDOC Conceptual Reference Model?
• An object oriented ontology of about 80 classes and 130 properties for cultural and natural history.
• CRM instances can be encoded in many forms: RDBMS, ooDBMS, XML, RDF(S), OWL.
• Accepted as ISO-21127 in June 2005.
The CRM:
• Is not a metadata standard.
• Is meant to become our language for semantic interoperability.
• Is a Conceptual Reference Model for analyzing and designing cultural information systems.
• Is limited to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation.
• Does not define the terminology used to document these data structures.
• Does not say what cultural institutions should document.
• Aims to explain the logic of what they actually do document.

An Example Hierarchy: E70 Stuff (Thing)

CIDOC Example (1): Modeling an Activity
[Figure. Nodes: E7 Activity "Crimea Conference", E39 Actor (three nodes), E53 Place, E52 Time-Span, E31 Document "Yalta Agreement", E65 Creation Event, E38 Image. Other labels: "February 1945", "7012124". Properties: P7 took place at, P11 participated in, P14 performed, P67 is referred to by, P81 ongoing throughout, P82 at some time within, P86 falls within, P94 has created.]

CIDOC Example (2): Describing a composite artifact

CIDOC-CRM DIATHESIS implementation: Issue/Segments Relationships
[Figure. Node types: E31.Document (Newspaper Issue), E73.Information_Object (Newspaper Page), E7.Activity (Newspaper Segment), E7.Activity (News). Properties: P106F.is_composed_of, P67F.refers_to.]

CIDOC-CRM DIATHESIS implementation: Issue Physical Features
E31.Document (Newspaper Issue):
• P102F.has_title: E35.Title (Newspaper Title)
• P92B.was_brought_into_existence_by: E63.Beginning_of_Existence (Newspaper Publication Date)
• P43F.has_dimension: E54.Dimension (Number of pages)
• P67F.refers_to: E7.Activity (News)

CIDOC-CRM DIATHESIS implementation: Activity References
E31.Document (Newspaper Issue) P67F.refers_to E7.Activity (News). Properties shown for the activity:
• P4F.has_time-span: E2.Temporal_Entity
• P14F.carried_out_by: E39.Actor (literal)
• P16F.used_specific_object: E70.Stuff (literal)
• P7F.took_place_at: E53.Place (literal)
• P2F.has_type: E55.Type (literal)
• P3F.has_note: (Article Full text)
Additional diagram label: SIS-TMS Controlled Vocabulary.
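
One possible way to picture the issue and activity relationships above is as a list of subject/property/value triples. The sketch below is an assumption about encoding, not the DIATHESIS storage format: the identifiers (`issue:1`, `news:1`), the literal values and the helper functions are hypothetical; only the property names come from the slides.

```python
# Hypothetical triples for one newspaper issue and one news activity.
# Property names follow the slides; ids and literal values are invented.
triples = [
    ("issue:1", "P102F.has_title", "Example Gazette"),
    ("issue:1", "P92B.was_brought_into_existence_by", "1945-02-12"),
    ("issue:1", "P43F.has_dimension", "4 pages"),
    ("issue:1", "P67F.refers_to", "news:1"),
    ("news:1", "P4F.has_time-span", "February 1945"),
    ("news:1", "P14F.carried_out_by", "Allied leaders"),
    ("news:1", "P7F.took_place_at", "Yalta"),
    ("news:1", "P2F.has_type", "Conferences"),
    ("news:1", "P3F.has_note", "Crimea conference ends in Yalta"),
]


def objects(triples, subject, predicate):
    """Return all values linked to a subject through the given property."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def issues_about_place(triples, place):
    """Return ids of issues that refer to an activity located at the place."""
    activities = {s for s, p, o in triples
                  if p == "P7F.took_place_at" and o == place}
    return {s for s, p, o in triples
            if p == "P67F.refers_to" and o in activities}


print(objects(triples, "issue:1", "P102F.has_title"))
print(issues_about_place(triples, "Yalta"))
```

Following P67F.refers_to from an issue to its activities, and then the activity properties, is what allows a search by place, actor or type to return newspaper issues.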

Thesauri Hierarchies

CIDOC based newspaper annotation
[Figure: the CIDOC CRM core ontology; integration by factual relations; real world nodes (KOS); documents in digital libraries. Example labels: Donald Johanson, Discovery of Lucy, Johanson's Expedition, Lucy, Hadar, Ethiopia, Benaki Museum.]

The System Architecture: Software Components
[Diagram labels: Client Side, Server Side, Apache Tomcat Application Server, Newspaper Digitization Suite, Diathesis Administrator, Diathesis Annotation Mechanism, DIATHESIS Web Search, Database, SIS-TMS Thesaurus Management System.]

The System Architecture: Workflow View

The user interface
FEATURES:
• Fully Web based.
• Simple to use / easy to learn.
• Intelligent upload / download mechanism.
• Workflow control.
• Data loss prevention mechanism (temporary local storage and data recovery).
• Flexible and ergonomic completion of metadata fields.
• Automatic highlighting of keywords in OCR text (Actors, Places).
• Use of SVG thesauri hierarchies for the timely completion of vocabulary reserved metadata fields.
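
The automatic highlighting of keywords in OCR text could work roughly as sketched below. This is an assumption about the technique (regular expression matching against a list of known actor and place names), not a description of the actual DIATHESIS interface code; the function name, the bracket markers and the sample sentence are hypothetical.

```python
import re


def highlight(text, terms, open_mark="[", close_mark="]"):
    """Wrap every whole-word occurrence of a known term in marker strings."""
    if not terms:
        return text
    # Try longer terms first so that multi-word names win over their parts.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_mark + m.group(1) + close_mark, text)


# Hypothetical vocabulary of actors and places (assumption).
known_terms = ["Churchill", "Yalta"]
print(highlight("Churchill met Roosevelt in Yalta.", known_terms))
```

In a Web interface the markers would typically be replaced by markup that the browser renders as a highlight.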

The user interface
[Diagram labels: DIATHESIS; End User Search Mechanism; Administrator; Annotation Mechanism; Search for Subjects; Search for Issues; Usage Stats; Mass Import; System Configuration.]

Demonstration: Annotation Interface

Demonstration: End User Search Mechanism

Future Directions
• Enrich the metadata creation process with information extraction techniques.
• Expand the suite with complementary deep semantic annotation capabilities (Semantic Wiki).
Phases:
• PHASE 1: Material preprocessing phase.
• PHASE 2: Shallow semantic annotation (metadata production phase).
• PHASE 3: Deep semantic annotation (full CIDOC implementation phase).
[Diagram labels: DIATHESIS, Semantic Wiki, Information Extraction Techniques.]

Conclusions
• OCR technology is currently a popular approach in newspaper digitization. However, it cannot deal with a plethora of issues.
• Deep semantic annotation via Semantic Web technologies is a promising future trend. The CIDOC CRM provides the theoretical means to achieve this; the problem is how to implement it. Creating the deep semantic relationships that exist within even a single newspaper issue is a time-consuming, and therefore expensive, task.
• The DIATHESIS digitization suite encapsulates a digitization strategy towards the creation of a vast semantic network of factual relationships between CIDOC entities, while effectively dealing with the following issues:
  • Digitization and storage of newspaper material.
  • Rendering digitized material searchable at the issue and article level via metadata, thesauri hierarchies and full text queries.
  • Creating a semantic backbone that can be used by future implementations.
• The next step: link the DIATHESIS semantic backbone with a Semantic Wiki.

Thank You! geomark@ics.forth.gr martin@ics.forth.gr
