Digital Archiving – A Workflow
K P Raghuraman
National Centre for Science Information
Indian Institute of Science, Bangalore
NAMASKARA
Acknowledgements
• Organizers
• Mr. Francis Jayakant
• Mr. Filbert Minj
• Friends who supported me in the effort
• Internet
Archives and Publication Cell, IISc
Digital Archiving
• What is a digital archive?
• A documented information and storage system
• Holds permanent, fixed data for a long time (?) in a structured and easily accessible way
• Employs an information architecture configured to assure trustworthiness and long-term retention
Digital Archiving – Need
• A practical task: keeping documents intact for future use
• Improved access to information resources, with preservation and dissemination as required
• Any time, anywhere
Digital Archiving – Benefits
• Digitisation contributes to the conservation of physical resources
• Enables effective sharing of information and contributes to knowledge flow
• Unlocks information that was previously difficult to access in paper form
• Digital surrogates reduce wear and tear on originals and can be made more legible
• Reduces the need to handle originals
• Access to information can be restricted when it is provided remotely
• Provides a customizable user interface for a collaborative working environment
• Faster response to queries and questions
• Saves paper costs and the time spent finding information
Digital Archiving – Advantages
• Improved searching mechanisms: metadata search, full-text search, Boolean search
• Supports simultaneous searching, in a standardised form, across a range of resource categories
• Information, rather than media, can be collated to support a query, regardless of the original source material type
• Space saving: the contents of about 3000 kg of paper could be stored on one DVD
• Data can be recombined for manipulation and compressed for various applications
Digital Archiving – Technology and Process
• The digital record is a mirror image of the original analogue or paper-based file in terms of: page layout and number of pages; handwritten text, graphics and logos; colour of the original document
• These images are then rendered into the desired format (e.g. PDF) for archiving, printing and distribution
• Creation of metadata, used for search and indexing
• Additional metadata providing contextual information: who uses the records, how they will be used, when they will be used
• Access codes to prevent unauthorized access
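A minimal sketch of what a metadata record for one archived document might look like, with a simple keyword search over it. The field names, the sample values and the `matches` helper are assumptions made for illustration; the slides do not prescribe a metadata schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ArchiveRecord:
    """Hypothetical metadata record; every field name is illustrative."""
    identifier: str
    title: str
    page_count: int
    file_format: str
    keywords: list = field(default_factory=list)
    intended_users: str = ""          # contextual metadata: who uses the record
    access_level: str = "restricted"  # stands in for an access code


def matches(record: ArchiveRecord, term: str) -> bool:
    """Case-insensitive search over the title and keywords."""
    term = term.lower()
    searchable = [record.title, *record.keywords]
    return any(term in text.lower() for text in searchable)


# Made-up sample data, not taken from any real collection.
record = ArchiveRecord(
    identifier="sample_00001",
    title="Annual Report (sample)",
    page_count=24,
    file_format="PDF",
    keywords=["report", "administration"],
    intended_users="researchers",
)

print(json.dumps(asdict(record), indent=2))
print(matches(record, "annual"))
```

Storing the record as JSON keeps it readable without special software, which suits the long-term retention goal described earlier.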
Digitisation
• Crude definition: scan and save
• Is it just scan and save?
• Is there a workflow?
• Are there guidelines for the whole process?
Digitisation - Definition
• Converting written and printed information into electronic form
• Creating a computerised version of a printed analogue
• Contents: text, image, audio or a combination of these (multimedia)
Objective of Digitisation
• Create content for databases
• Facilitate access
• Preservation
• Dissemination of information resources
Digitisation Process Output
• Electronic document
• Tagged Image File Format (TIFF)
• Portable Document Format (PDF)
• Useful for hosting information on the intranet
• Platform independent
• PDF readers are available as free downloads
Digitisation - Objects and Process
• Objects: image, text, audio, video
• A scanner captures images
• Software analyses the images and creates text and images
• Software converters convert raw audio and raw video to standard digital formats
Digitisation - Issues
• Hardware: computer, scanner
• Software: PC-to-scanner communication software (TWAIN compliant); image processing (Photoshop, Macromedia Fireworks, etc.); OCR (Optical Character Recognition) to convert scanned text material into editable text (ABBYY, OmniPage)
• Suitable policy: a consistent quality threshold for scanned images; an appropriate image format (TIFF, JPEG, etc.); an appropriate file naming scheme
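One possible file naming scheme, sketched under assumptions: the pattern (collection code, zero-padded item number, zero-padded page number) and the function name `scan_filename` are hypothetical, not a scheme given in the slides. Zero-padding makes alphabetical order match page order.

```python
def scan_filename(collection: str, item: int, page: int, ext: str = "tif") -> str:
    """Build a sortable file name such as 'iisc_00042_p0007.tif'."""
    return f"{collection.lower()}_{item:05d}_p{page:04d}.{ext}"


# Names for pages 1, 2 and 10 of a hypothetical item 42.
names = [scan_filename("IISC", 42, page) for page in (10, 2, 1)]
for name in sorted(names):
    print(name)
```

Without padding, "p10" would sort before "p2"; with it, a plain directory listing shows the pages in reading order.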
Scanners
• Flatbed scanners: normal desktop scanners
• Sheet-fed scanners: same as above, but the document moves and the scan head is stationary
• Handheld scanners: used to capture text; the size of a pen
• Drum scanners: used in the publishing industry
• Planetary scanners: used for scanning books
Types of Images
• 1-bit black and white: each pixel is either black or white; used for printed text or line graphics; unsuitable for images
• 8-bit grey scale: 256 shades of grey; black and white photographs and non-color documents
• 8-bit color: 256 colors; low-quality images
• 24-bit color: 16.8 million shades of color; ideal for archival-quality images and color photo printing
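The counts above follow from the bit depth: a pixel stored in n bits can take 2^n distinct values. A short check, with the dictionary labels chosen here only for display:

```python
# Bits per pixel for each image type listed on the slide.
BIT_DEPTHS = {
    "1-bit black and white": 1,
    "8-bit grey scale": 8,
    "8-bit color": 8,
    "24-bit color": 24,
}


def distinct_values(bits: int) -> int:
    """Number of different values one pixel can hold at this bit depth."""
    return 2 ** bits


for label, bits in BIT_DEPTHS.items():
    print(f"{label}: {distinct_values(bits):,} possible values per pixel")
```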
Resolution
• Measured in dots per inch (dpi)
• The higher the dpi, the larger the file
Image - Size
• Image size is measured in pixels
• Image size varies with the scanning resolution
• Changing the image size is called resampling
• On screen, each screen pixel displays one image pixel, which can have any RGB value
• Typical screen sizes: 800 x 600 pixels on a 14” monitor, 1024 x 768 pixels on a 16” monitor
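A rough sketch of how resolution drives pixel dimensions and file size. It assumes an uncompressed image with no header or row padding, so the byte counts are approximations; the function names are illustrative.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel width and height of a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bits_per_pixel: int) -> int:
    """Approximate size of the raw pixel data, ignoring headers and padding."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel // 8


# An 8 x 10 inch page in 24-bit color at two common resolutions.
for dpi in (300, 600):
    size = uncompressed_bytes(8, 10, dpi, 24)
    print(f"{dpi} dpi: {pixel_dimensions(8, 10, dpi)} pixels, "
          f"about {size / 1_000_000:.1f} MB uncompressed")
```

Doubling the dpi doubles both the width and the height in pixels, so the raw data grows fourfold.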
Image – File Formats
Some standard image formats:
• TIFF: Tagged Image File Format
• JPEG: Joint Photographic Experts Group
• DjVu: pronounced "déjà vu" (a free file format)
• GIF: Graphics Interchange Format
• PNG: Portable Network Graphics
TIFF
• Multiple images and data in the same file
• Tags in the file header (information on size, compression)
• Lossless format, useful for archival images
• Platform independent
• Useful for future modification: can be edited without compression loss
• Disadvantage: file size is very large
JPEG
• A strong format for web images and for printing images
• Can produce high-quality images
• Offers a range of compression levels
• Well suited to online viewing
• Disadvantage: lossy compression format
GIF
• Very old format
• Lossless compression format
• Needs less storage space
• Strong candidate for graphic art and drawings
• Disadvantage: limited to 256 colors
DjVu
• File format for saving scanned images, especially those containing text
• Advanced technology for separating text and images into layers
• High-quality, readable images stored in minimal space: useful for the web
• Progressive loading: useful for the web
• Format used by the Million Book Project
PNG
• A newer format, created to improve on GIF
• Supports 24-bit color and greyscale
• Supports a variety of transparency options
• Lossless data compression
• Disadvantage: being newer, it is not supported by old software
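What "lossless" means can be shown with a round trip: compress, decompress, and compare byte for byte. PNG uses DEFLATE compression, which Python exposes through the standard `zlib` module. The pixel data below is synthetic (a mostly white page with a dark band), made up for the illustration; this is a sketch of the principle, not a PNG encoder.

```python
import zlib

# Synthetic 8-bit greyscale rows: 90 white pixels, then 10 black, repeated.
pixels = bytes([255] * 90 + [0] * 10) * 100

compressed = zlib.compress(pixels, 9)   # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(f"original:   {len(pixels)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"identical after round trip: {restored == pixels}")
```

A lossy format such as JPEG would not pass the final comparison, which is why lossless formats are preferred for archival copies.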
File Formats - Audio
• WAV: Microsoft/IBM audio file format; lossless storage method, so files are large
• MP3 (MPEG-1 Audio Layer 3): popular digital audio encoding; lossy compression, so files are smaller; can still give a good reproduction of the original
• RealAudio (.ram): a variety of audio codecs, from low-bitrate to high-fidelity formats; streaming audio format
File Formats - Video
• MPEG-21
• Defines a "Rights Expression Language" standard
• A means of sharing digital rights, permissions and restrictions for content, from content creator to consumer
• XML-based
• Can communicate machine-readable license information in a "ubiquitous, unambiguous and secure" manner
• The main objective of MPEG-21 is to define the technology needed to support users in exchanging, accessing, consuming, trading or manipulating Digital Items in an efficient and transparent way
OCR - Optical Character Recognition
• Goal: recreate text and other elements, such as tables and layout, so that they can be edited in popular word processors
• Requirement: a scanner and text conversion (OCR) software
• Technology: examines patterns of dots, recognizes them, and writes them out as alphabetic characters and numbers
OCR - Process
• The scanner or camera produces a TIFF image
• The software cleans noise from the image and starts recognizing patterns
• Recognized patterns are output as letters and numbers
• Unrecognized patterns are kept as images
Widely used settings
• 24-bit color
• 600 dpi (300 or 400 dpi is popular for text)
• TIFF Revision 6.0, uncompressed or with LZW compression (PNG is becoming popular)
• Photographs should be scanned at twice their size
• B&W photographs in grey scale
• Text can use the same settings and can be stored as PDF or DjVu
Popular Practices Followed
• Preservation masters are created first; they should be uncompressed to retain archival integrity and are kept for long-term storage
• Compressed web files are then created as surrogates for the repository or website
Specific File Formats
OCR - Accuracy
• Depends on the color of the paper
• Characters should be reasonably well formed
• The font should be one of the popular ones
• 99% accuracy can be achieved with bleached white paper, 10 pt characters, 1.5 line spacing and computer printouts
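The slide does not say how accuracy is measured. One common convention, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference text, divided by the reference length. The function names and sample strings are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters the OCR got right, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Made-up example: the OCR misreads one letter 'l' as the digit '1'.
print(f"{character_accuracy('Digital Archiving', 'Digita1 Archiving'):.1%}")
```

At 99% character accuracy a page of about 2000 characters still carries roughly 20 errors, so proofreading remains part of the workflow.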
OCR - Issues
• Dealing with archival material
• Old text printed in the hand-press period
• Gothic and exotic fonts
• Yellowed paper
• Characters are often broken and poorly formed due to age and environmental factors
Best Practice
• First scan and store as TIFF files
• Run OCR on the TIFF files
• Depending on the application and size, convert to PDF or any other format
• Depending on OCR accuracy, use either the TIFF images or the OCR text for the PDF
OCR – Software
• ABBYY FineReader: very popular
• OmniPage: high-end OCR tool
• Readiris (IRIS): a competitor to ABBYY and OmniPage
• MODI (Microsoft Office Document Imaging): introduced with Office XP; exports to Word
The camera produces a raw, uncorrected color photo of each page
The software cleans up the image and saves it as a high-resolution TIFF image. Using OCR, it can then be converted to editable text
Summary
• Digitization is a process
• A large number of analogue items, such as images, text, audio and video, are captured in digital form
• Understand the variables and tasks in the process: the methods of capturing images and the conversion process performed
Summary
• Document the workflow
• This will give a life history for each digitized item
• Helps create consistency and reliability
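A minimal sketch of a "life history" for one digitized item: every workflow step is appended to a log together with the settings used. The class name, step names and settings are assumptions for illustration; the slides do not define a log format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ItemHistory:
    """Hypothetical per-item log of the digitization workflow."""
    item_id: str
    events: list = field(default_factory=list)

    def log(self, step: str, **settings) -> None:
        """Record one workflow step with a UTC timestamp and its settings."""
        self.events.append({
            "step": step,
            "time": datetime.now(timezone.utc).isoformat(),
            "settings": settings,
        })


history = ItemHistory("sample_00001")
history.log("scan", dpi=600, bit_depth=24, file_format="TIFF")
history.log("ocr", language="English")
history.log("derive", file_format="PDF")

for event in history.events:
    print(event["time"], event["step"], event["settings"])
```

Recording the settings with every step is what makes later runs consistent: anyone repeating the work can see how each earlier item was produced.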
New Definition
• Is this the end of digitization? Are we through with the work?
• As in every other job, sustainability and maintenance are necessary here too
Long term maintenance
• Technology is changing rapidly
• Obstacles that may need to be overcome: lack of general awareness of how such resources may be exploited effectively for scholarly purposes; lack of relevant IT skills and/or analytical methods; lack of appropriate user support
Strategies for preserving data
• Preserving the data together with the hardware and software platforms on which they were originally made accessible
• Refreshing data by copying them periodically onto new storage media
• Migrating data through changing technical regimes by rendering them into appropriate standard interchange formats
• Emulating the look and feel of the original data on successive generations of hardware and software platforms
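Refreshing only helps if the copy is identical to the original. A sketch of a refresh step that verifies the copy with a SHA-256 checksum follows; the checksum check is an addition assumed here, not something the slides specify, and the paths in the demonstration are temporary stand-ins for old and new storage media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in 1 MiB chunks so large scans fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a file to new media and confirm the copy matches the original."""
    checksum = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    if sha256_of(destination) != checksum:
        raise IOError(f"copy of {source} does not match the original")
    return checksum


# Demonstration with temporary directories standing in for storage media.
with tempfile.TemporaryDirectory() as tmp:
    old_copy = Path(tmp) / "old_media" / "page_0001.tif"
    old_copy.parent.mkdir()
    old_copy.write_bytes(b"stand-in for scanned image data")
    new_copy = Path(tmp) / "new_media" / "page_0001.tif"
    print("verified checksum:", refresh(old_copy, new_copy))
```

Keeping the returned checksum alongside the file lets the same comparison be repeated at every later refresh.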
Points to ponder
• Unlike paper, parchment and other traditional recording media, electronic systems and their data are not durable.
• Digital materials have very different preservation requirements from analogue materials, which may last for many decades when stored in optimal environmental conditions.
• Electronic data and files also need other systems to make them readable or usable. This innate dependency makes the files themselves very fragile: a problem in any of the supporting components can render the information useless.
• It is not enough to physically preserve the storage medium or present the bitstream. Without the corresponding tools to decode and present the bitstream, a future user will be met with gibberish.
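A small illustration of the last point, using text encoding as a stand-in for any decoding tool: the stored bytes are preserved perfectly, yet reading them with the wrong decoder produces garbled text. The sample string is arbitrary.

```python
text = "déjà vu"

# The preserved bitstream: the text encoded as UTF-8 bytes.
stored = text.encode("utf-8")

# Decoded with the matching tool, the original text comes back.
correct = stored.decode("utf-8")

# Decoded with the wrong tool (Latin-1), the same bytes turn into gibberish.
wrong = stored.decode("latin-1")

print("bytes kept intact:", stored)
print("right decoder:    ", correct)
print("wrong decoder:    ", repr(wrong))
```

Nothing was lost at the bit level in either case; only the knowledge of how to interpret the bits differs, which is why format documentation has to be preserved with the data.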
Digitization – Next Step
• Will mean preservation of materials that are ‘born digital’
• Migration: electronic data are transferred from one data format to another
• Emulation: attempts to use current and future technologies to emulate the tools and logic used when the records and files were originally created
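Migration can be sketched with two plain-text formats. The example below moves a small catalogue from CSV to JSON using only the standard library; the choice of formats, the function name and the sample rows are assumptions made for illustration, and a real migration would also need to check that nothing was lost.

```python
import csv
import io
import json

# Made-up catalogue data in the "old" format.
legacy_csv = (
    "id,title,year\n"
    "1,Annual Report (sample),1975\n"
    "2,Convocation Address (sample),1982\n"
)


def migrate_csv_to_json(csv_text: str) -> str:
    """Re-express every CSV row as a JSON object keyed by the column names."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2, ensure_ascii=False)


migrated = migrate_csv_to_json(legacy_csv)
print(migrated)

# Basic completeness check: the same number of records before and after.
print(len(json.loads(migrated)) == len(legacy_csv.splitlines()) - 1)
```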
Informative web sites
• Irish Virtual Research Library and Archive - Project Workbook: http://www.ucd.ie/ivrla/workbook/wdigpreservation.html
• The Arts and Humanities Data Service (AHDS), a UK national service aiding the discovery, creation and preservation of digital resources in and for research, teaching and learning in the arts and humanities: http://ahds.ac.uk/about/publications/index.htm