1. Reinhard Altenhöner. Changing preservations tasks for the German National Library: Some insights and preliminary remarks IFLA International Newspaper Conference 2010 at IGNCA, New Delhi in India during 26th February to 28th February, 2010 "Digital Preservation and Access to news and views”.
1 Reinhard Altenhöner
Changing preservations tasks for the German National Library: Some insights and preliminary remarks
IFLA International Newspaper Conference 2010 at IGNCA, New Delhi, India, 26th to 28th February 2010: "Digital Preservation and Access to news and views"
| IFLA2010. Newspaper section | 2010-02-26
2 ToC
• Starting situation / setting
• Digital preservation in DNB
• Practical example: E-papers
3 DNB: Our task: Collecting and archiving, providing permanent access
• Publications issued in Germany since 1913
• Since June 22, 2006: online / net publications are covered by the new law
• Newspapers as well: ca. 450 newspapers (this means selection!) are microfilmed every day
• About 9,000 datasets in the central database
• Some years ago we started brainstorming on alternatives to this microfilm (MF) approach:
  • Collecting e-papers from the web
  • Archiving of print files
  • Cooperation with media / clipping agencies
4 Characteristics of online newspapers
• Frequent update processes
• Dedicated publication workflow: database, content management system, presentation on the fly
• Web 2.0 facilities for comments, blogging & tagging
• Multiple ways of embedding advertisements
• Complex navigation and search functions
• Harvesting is extremely difficult: some experiments (e.g. on newsletters), but no running workflow
5 „kopal“
• Co-operative development of a long-term digital information archive
• Start in 2004
• Task: development of a standardized long-term preservation solution, to facilitate corresponding solutions for other libraries / industries
• Basis: DIAS (Digital Information and Archiving System) of the Royal Dutch Library, condensed and extended with peripheral open-source components
• Enhancement for cooperative usage
• Development of a universal object scheme
• Hosting outside the library (remote access)
6 kopal: cooperation
• GWDG: Hosting
• IBM: Archiving software
• SUB: Ingest/access software
• DNB: Ingest/access software
• Common task: Preservation planning
7 kopal: Structure & concept
[Diagram labels: DNB (Frankfurt), local software, Account 1; SUB Göttingen, local software, Account 2; Partners nn; DIAS by IBM at GWDG (Göttingen)]
8 Universal Object Format
• Submission Information Package: packaging object
• Mets.xml (METS 1.4): Header, dmdSec, amdSec, File Section, Structural Map
• LMER 1.2: Long-term preservation Metadata for Electronic Resources
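The five METS sections named on this slide can be sketched as an empty XML skeleton. This is only an illustration of the container layout, not the actual Universal Object Format: the function name `build_mets_skeleton`, the `ID` values and the example object identifier are assumptions, and the comment about where LMER metadata would sit is likewise an assumption.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Return the namespace-qualified name of a METS element."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(object_id: str) -> ET.Element:
    """Build an empty METS document with the five sections from the slide.

    Hypothetical helper: IDs and structure are illustrative only.
    """
    root = ET.Element(q("mets"), {"OBJID": object_id})
    ET.SubElement(root, q("metsHdr"))                   # Header
    ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})    # descriptive metadata
    ET.SubElement(root, q("amdSec"), {"ID": "AMD1"})    # administrative metadata (assumed home of LMER)
    file_sec = ET.SubElement(root, q("fileSec"))        # File Section
    ET.SubElement(file_sec, q("fileGrp"))
    ET.SubElement(root, q("structMap"))                 # Structural Map
    return root


if __name__ == "__main__":
    skeleton = build_mets_skeleton("urn:example:object-1")
    print(ET.tostring(skeleton, encoding="unicode"))
```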
9 koLibRI
[Diagram labels: Online-Archivist, Machine Interface, koLibRI, Administration Interface]
10 kopal preservation strategy
• Migrate the object with URN xxx into new format yyy
• Migrate all objects
  • of format xxx and/or
  • that were ingested before a certain date and/or
  • that are larger than zzz MB
  into new format xyz (e.g. from TIFF to PNG)
• Implementation of emulation view paths
• No restriction on file size or file format / type: all known and unknown file formats are accepted (text, pictures, video, audio, executables, etc.)
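The "and/or" selection of migration candidates could look roughly like the following sketch. It is not kopal or DIAS code: the `ArchivedObject` record, the `needs_migration` function and the `match_all` switch are hypothetical names introduced here to show how the three criteria (format, ingest date, size) might be combined.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedObject:
    """Hypothetical minimal record of an archived object."""
    urn: str
    file_format: str
    ingest_date: date
    size_mb: float


def needs_migration(
    obj: ArchivedObject,
    source_format: Optional[str] = None,
    ingested_before: Optional[date] = None,
    larger_than_mb: Optional[float] = None,
    match_all: bool = True,
) -> bool:
    """Check an object against the supplied criteria.

    match_all=True combines the criteria with "and", False with "or".
    With no criteria at all, nothing is selected.
    """
    checks = []
    if source_format is not None:
        checks.append(obj.file_format == source_format)
    if ingested_before is not None:
        checks.append(obj.ingest_date < ingested_before)
    if larger_than_mb is not None:
        checks.append(obj.size_mb > larger_than_mb)
    if not checks:
        return False
    return all(checks) if match_all else any(checks)


if __name__ == "__main__":
    objects = [
        ArchivedObject("urn:example:1", "TIFF", date(2005, 3, 1), 120.0),
        ArchivedObject("urn:example:2", "TIFF", date(2009, 6, 1), 5.0),
        ArchivedObject("urn:example:3", "PDF", date(2004, 9, 1), 300.0),
    ]
    # Select old TIFF files as candidates for a TIFF-to-PNG migration.
    candidates = [
        o.urn for o in objects
        if needs_migration(o, source_format="TIFF", ingested_before=date(2008, 1, 1))
    ]
    print(candidates)
```

Keeping the selection separate from the conversion itself means the same rule can first be run as a dry run to see how many objects a migration would touch.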
11 Digital newspapers in DNB
• Some results (collections) from digitisation projects
  • Simple graphics data
  • Access in a dedicated system
  • Including full-text OCR & access
• Online newspapers: some pre-studies on objects like „Spiegel“, but no running workflow
• Concentration on e-papers
12 Digitisation results in DNB 1
13 Digitisation results in DNB 2
14 E-papers in DNB
Preliminary thoughts: Requirements
• Structured, normalised metadata set: article/photo – issue – newspaper
• Persistent identification of each unique object, linkage between them, citable
• Added information on author / title at the article level is useful but not strictly necessary
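One way to picture the article – issue – newspaper hierarchy with persistent identifiers and links between the levels is the sketch below. All class names, field names and the `urn:example:` identifiers are assumptions for illustration; the slide does not specify the actual metadata schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Article:
    pid: str                      # persistent identifier of the article
    issue_pid: str                # link up to the containing issue
    title: Optional[str] = None   # useful but not required
    author: Optional[str] = None  # useful but not required


@dataclass
class Issue:
    pid: str
    newspaper_pid: str            # link up to the newspaper
    issue_date: str
    articles: List[Article] = field(default_factory=list)


@dataclass
class Newspaper:
    pid: str
    title: str
    issues: List[Issue] = field(default_factory=list)


def build_index(newspaper: Newspaper) -> Dict[str, object]:
    """Map every persistent identifier to its object so each one is resolvable."""
    index: Dict[str, object] = {newspaper.pid: newspaper}
    for issue in newspaper.issues:
        index[issue.pid] = issue
        for article in issue.articles:
            index[article.pid] = article
    return index


def citation_chain(index: Dict[str, object], article_pid: str) -> List[str]:
    """Follow the links from an article up to its newspaper."""
    article = index[article_pid]
    issue = index[article.issue_pid]
    return [issue.newspaper_pid, issue.pid, article.pid]


if __name__ == "__main__":
    art = Article(pid="urn:example:np1:2010-02-26:a1", issue_pid="urn:example:np1:2010-02-26")
    iss = Issue(pid="urn:example:np1:2010-02-26", newspaper_pid="urn:example:np1",
                issue_date="2010-02-26", articles=[art])
    paper = Newspaper(pid="urn:example:np1", title="Example Daily", issues=[iss])
    print(citation_chain(build_index(paper), art.pid))
```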
15 E-paper requirements
• Quantity:
  • One newspaper: ca. 150 articles per day / 900 a week / 47,000 per year
  • 21,150,000 per year for ca. 450 newspapers
• Start modestly
• Retrodigitisation (collection started in 1913) will extend this to more than 1 billion articles
Challenge in terms of resources and technical capacities
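The quantities on this slide can be retraced with a few lines of arithmetic. The six publishing days per week are inferred from 900 / 150, and the figure of 450 newspapers is taken from slide 3. The retrodigitisation line assumes the current yearly volume held constant back to 1913, which is a deliberately crude assumption used only to show the order of magnitude.

```python
ARTICLES_PER_DAY = 150       # from the slide
ISSUES_PER_WEEK = 6          # assumption: inferred from 900 articles a week
WEEKS_PER_YEAR = 52
NEWSPAPERS = 450             # from slide 3: ca. 450 newspapers are microfilmed
ROUNDED_PER_YEAR = 47_000    # yearly figure as rounded on the slide

per_week = ARTICLES_PER_DAY * ISSUES_PER_WEEK   # 900
per_year_exact = per_week * WEEKS_PER_YEAR      # 46,800, rounded to 47,000 on the slide
all_papers_per_year = ROUNDED_PER_YEAR * NEWSPAPERS

# Crude estimate: constant yearly volume from 1913 up to 2010 (assumption).
years_of_collection = 2010 - 1913
retro_estimate = all_papers_per_year * years_of_collection

print(f"per week:            {per_week:,}")
print(f"per year (exact):    {per_year_exact:,}")
print(f"all papers per year: {all_papers_per_year:,}")
print(f"retro estimate:      {retro_estimate:,}")
```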
16 E-paper project (recently started)
• In cooperation with a vendor after a tender procedure
• Ca. 20 important newspapers, starting with two
• Metadata should be delivered in ONIX
• Harvesting interface: OAI-PMH
• All data delivered in an XML file
• Integrated digital preservation in the kopal environment
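Harvesting over OAI-PMH typically means issuing `ListRecords` requests and following resumption tokens until the list is complete. The sketch below only builds the request URL and parses a response; it performs no network access. The base URL, the `onix` metadata prefix and the sample response are hypothetical, since the slide does not describe the vendor's actual endpoint.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(
    base_url: str,
    metadata_prefix: str,
    from_date: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build a ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument, so when one is
    given it replaces metadataPrefix and from.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return the record identifiers and the next resumption token, if any."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical response, shortened to the parts the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:article-1</identifier></header></record>
    <record><header><identifier>oai:example.org:article-2</identifier></header></record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai", "onix", from_date="2010-02-26"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would call `list_records_url` again with the returned token until `parse_list_records` yields no further token.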
17 XML record for e-Papers
18 E-paper & access
• Principal question for access: integration in the portal environment or a dedicated (independent) search area
• Advanced requirements for segmentation of text
• Direct link between portal (metadata) and text
• Navigation / browsing within the object, direct access to single chapters / pages
• Zooming, scrolling
• Integrated full-text search
• Print and store facilities
• DRM, IDM
19 Reuse of results from the CONTENTUS project
[Diagram: knowledge model with the layers CORE, MANTLE and SHELL. A core object (e.g. a film: information about actors, director, producers, music, sequence, year of production; short description of the picture or video sequence; what is in the film; rights) is connected by semantic relations to related books, music scores, films, songs, internet links and news, each with year of printing, editions, authors and summary. Further labels: Knowledge base; Professionals (media archives); Automated (learning); End-user (Wikipedia); Text, Person, Speaker 1, Speaker 2, Title, Image, Logo, Face. Process steps 1 to 6 as labelled: digitisation, automated optimisation, content analysis, semantic multimedia search, open knowledge networks.]
20 Data processing
• Automated page segmentation (headlines, images, tables)
• OCR + entity recognition
• Full-text search
• Semantic search interface
Based on:
• Intellectually approved authority files
• Statistical data analysis
21 Integrated search and retrieval Our solution currently
22 Next step: Integrated E-papers
23 Integrated E-paper „ZEIT“ 1
24 Integrated E-paper „ZEIT“ 2 Provision of free texts
25 Integrated E-paper „ZEIT“ 3
26 Reinhard Altenhöner mailto:r.altenhoener@d-nb.de http://www.d-nb.de