Ian H. Witten, New Zealand Digital Library Project, Computer Science Department, Waikato University, New Zealand. http://greenstone.org. Browsing around a digital library. Greenstone: Open source system for creating and delivering digital library collections
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • (Using the Collector) • The power of open source
What we wanted Greenstone turns a ragtag menagerie of documents in various formats into an easy-to-use collection that can run on a standalone laptop in a Ugandan village’s information center ALA 2002
What we wanted • “Collections” of digital material • Individualized, depending on metadata etc • Up to several Gb of text … • … + associated images, movies, whatever • Fully searchable • Served on WWW, or published on CD-ROM • Multi-platform (Unix + all Windows) • Multi-format documents • Multi-lingual: documents and interfaces • Multimedia • Metadata: standard and non-standard
Collections: on the Web nzdl.org (demo, not service)
Greenstone collections: on CD-ROM UN and NGOs, e.g. • UNESCO • Global Help Project • United Nations University • World Health Organization • Pan American Health Organization
Kataayi Multipurpose Cooperative, rural Uganda (20 km from Masaka)
Humanity Development Library Example for sustainable development and basic human needs 160,000 pages 30,000 images 1230 books 340 kg US$20,000 CD-ROM US$6 Win3.1x(!)/95/98/NT Stand-alone and intranet server Web browser user interface Global Help Project, Antwerp (+ UN agencies)
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source
Collection of pictures (pictures of text) Alexander Turnbull Library, NZ
Voice (and pictures) Hamilton Public Library
Chinese documents (pictures of text) + Chinese interface Peking University Library
Chinese (Chinese & English interfaces) Classic Chinese literature
Arabic (Arabic & English interfaces) Famous mosques
French UNESCO, Paris
Spanish PAHO, WHO
Russian collection from Mari El Republic http://gov.mari.ru/gsdl
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source
Hierarchical document model • Metadata specified at any level Title metadata
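As a rough illustration of metadata attached at any level of a document hierarchy, the Python sketch below models a book as nested sections, each carrying its own metadata. The `Section` class, the `title_path` helper, and the sample titles are hypothetical names made up for this example; they are not Greenstone's internal document format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    """One node in a document hierarchy (book, chapter, subsection, ...)."""
    metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Section"] = field(default_factory=list)
    # Back-pointer to the enclosing section; excluded from repr/compare
    # so the parent/child cycle cannot cause infinite recursion.
    parent: Optional["Section"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Section") -> "Section":
        """Attach a child section and return it."""
        child.parent = self
        self.children.append(child)
        return child


def title_path(section: Section, sep: str = ": ") -> str:
    """Join the Title metadata of a section and its ancestors, outermost first.

    Levels without a Title are skipped.
    """
    titles = []
    node: Optional[Section] = section
    while node is not None:
        title = node.metadata.get("Title")
        if title:
            titles.append(title)
        node = node.parent
    return sep.join(reversed(titles))


# Made-up sample data: metadata at book and chapter level, none on the subsection.
book = Section(metadata={"Title": "Example Book", "Subject": "Agriculture"})
chapter = book.add(Section(metadata={"Title": "Chapter 1"}))
part = chapter.add(Section(text="Some body text"))

print(title_path(chapter))
```

Walking up the parent chain is one simple way to show a section in the context of the book it belongs to, for example in a list of search results.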
Searching and browsing • Searching • Metadata-based browsing: Subject, Title, Publisher (Dublin Core); “HowTo” (ad hoc)
Multiple search indexes text metadata
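To make the idea of separate text and metadata indexes concrete, here is a minimal Python sketch that builds one inverted index per field and searches whichever index the user picks. It is only an assumption-laden illustration: the function names, the index names `text` and `Title`, and the sample documents are invented here, and this is not how Greenstone's own indexer works internally.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_indexes(docs: List[dict]) -> Dict[str, Dict[str, Set[str]]]:
    """Build one inverted index per field: index name -> term -> set of doc ids."""
    indexes: Dict[str, Dict[str, Set[str]]] = {
        "text": defaultdict(set),   # full text of each document
        "Title": defaultdict(set),  # Title metadata only
    }
    for doc in docs:
        for name, index in indexes.items():
            for term in tokenize(doc.get(name, "")):
                index[term].add(doc["id"])
    return indexes


def search(indexes: Dict[str, Dict[str, Set[str]]], index_name: str, query: str) -> Set[str]:
    """Return ids of documents that contain every query term in the chosen index."""
    postings = [indexes[index_name].get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Made-up sample documents.
docs = [
    {"id": "d1", "Title": "Water supply", "text": "Building a well for clean water"},
    {"id": "d2", "Title": "Growing vegetables", "text": "Water the garden every morning"},
]
indexes = build_indexes(docs)
print(sorted(search(indexes, "text", "water")))
print(sorted(search(indexes, "Title", "water")))
```

The same query gives different results depending on the index chosen, which is the point of offering more than one.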
Collection-dependent metadata
Browsing using classifiers AZList classifier (Title metadata)
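The grouping that an alphabetic browser performs can be sketched in a few lines of Python. This is a simplified stand-in written for this example, not the AZList classifier's actual code: the `az_list` function, the `#` bucket for non-alphabetic titles, and the sample titles are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def az_list(docs: List[dict], metadata: str = "Title") -> Dict[str, List[str]]:
    """Group documents into A-Z buckets by the first letter of a metadata value."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        value = doc.get(metadata, "").strip()
        if not value:
            continue  # documents without this metadata are left out
        first = value[0].upper()
        key = first if first.isalpha() else "#"  # assumed bucket for digits etc.
        buckets[key].append(value)
    # Sort buckets by letter and entries case-insensitively within each bucket.
    return {key: sorted(values, key=str.lower) for key, values in sorted(buckets.items())}


# Made-up sample titles.
docs = [
    {"Title": "Wells"},
    {"Title": "aquaculture basics"},
    {"Title": "Water supply"},
    {"Subject": "no title here"},
]
print(az_list(docs))
```

Passing a different `metadata` name (for instance an author field) gives an alphabetic browser over that field instead.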
Metadata extraction plugins Acronym extraction plugin
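One simple way to picture acronym extraction is a heuristic that looks for a capitalised acronym in parentheses and checks it against the initials of the words just before it. The Python sketch below does exactly that. It is a deliberately naive illustration under that assumption, with a hypothetical `extract_acronyms` function, and should not be read as the algorithm the Greenstone plugin uses.

```python
import re
from typing import Dict


def extract_acronyms(text: str) -> Dict[str, str]:
    """Find 'Long Form Name (LFN)' patterns and map each acronym to its expansion."""
    found: Dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        # Accept only if there is one preceding word per letter and initials match.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found


sample = (
    "The World Health Organization (WHO) and the "
    "Pan American Health Organization (PAHO) publish health material."
)
print(extract_acronyms(sample))
```

A real extractor would need to cope with lower-case joining words ("of", "for"), acronyms defined before their expansion, and so on; this sketch skips all of that.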
Phrase hierarchy extraction + thesaurus browsing
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source
Collection configuration file
The file specifies: • name, icon, etc • description • email of creator • search indexes • plugins • classifiers • how to format documents, query results, classifiers
creator sjboddie@cs.waikato.ac.nz
maintainer sjboddie@cs.waikato.ac.nz
public true
beta true
indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify HDLList metadata=Title
classify Hierarchy hfile=org.txt metadata=Organization sort=Title
classify List metadata=Howto
format SearchVList "<td valign=top>[link][icon][/link]</td> <td>{If}{[parent(All': '):Title],[parent(All': '):Title]: } [link][Title][/link]</td>"
format CL4VList "<br>[link][Howto][/link]"
format DocumentImages true
format DocumentText "<h3>[Title]</h3>\\n\\n<p>[Text]"
collectionmeta collectionname "greenstone demo"
collectionmeta collectionextra "This is a demonstration collection for the Greenstone digital library software.\nIt contains a small subset (11 books) of the Humanity Development Library"
collectionmeta iconcollectionsmall "/gsdl/collect/demo/images/demosm.gif"
collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
collectionmeta .section:Title "section titles"
collectionmeta .document:text "entire books"
collectionmeta .section:text "chapters"
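Because the configuration file is line-oriented (a directive followed by arguments), its structure is easy to inspect with a small script. The Python sketch below groups the directives shown on this slide into indexes, plugins, classifiers, format strings and collection metadata. The `parse_collect_cfg` function and the shape of the returned dictionary are assumptions made for this example; this is not Greenstone's own parser, and it only handles `key=value` classifier options as they appear above.

```python
import shlex
from typing import Any, Dict


def parse_collect_cfg(text: str) -> Dict[str, Any]:
    """Group the lines of a collection configuration file by directive."""
    config: Dict[str, Any] = {
        "indexes": [],         # e.g. section:text
        "plugins": [],         # one list of tokens per plugin line
        "classifiers": [],     # (name, {option: value}) pairs
        "formats": {},         # format name -> format string
        "collectionmeta": {},  # metadata name -> value
        "other": {},           # any directive not handled above
    }
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # shlex keeps double-quoted strings together as a single token.
        tokens = shlex.split(line)
        key, args = tokens[0], tokens[1:]
        if key == "indexes":
            config["indexes"].extend(args)
        elif key == "plugin":
            config["plugins"].append(args)
        elif key == "classify":
            options = dict(arg.split("=", 1) for arg in args[1:] if "=" in arg)
            config["classifiers"].append((args[0], options))
        elif key == "format":
            config["formats"][args[0]] = " ".join(args[1:])
        elif key == "collectionmeta":
            config["collectionmeta"][args[0]] = " ".join(args[1:])
        else:
            config["other"][key] = args
    return config


# A shortened copy of the configuration shown above.
sample = """
creator sjboddie@cs.waikato.ac.nz
public true
indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify List metadata=Howto
format CL4VList "<br>[link][Howto][/link]"
collectionmeta collectionname "greenstone demo"
"""
cfg = parse_collect_cfg(sample)
print(cfg["indexes"])
print(cfg["classifiers"])
```

A script like this can be handy for checking, say, which indexes or classifiers a collection declares before rebuilding it.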
Alter configuration (task: what to change, then the line to add)
• Add full-text index of titles: additional indexes line: indexes document:Title
• ... or authors: … need author metadata: indexes document:Creator
• Add alphabetic author browser: add classifier line: classify AZList -metadata Creator
• Include Word documents: add plugin line: plugin WordPlug
• Include PDF documents: (same): plugin PDFPlug
• Separate index for each language: add languages line: languages en fr es
• Extract acronyms and add list: plugin option: plugin PDFPlug -extract_acronyms
• Import OAI metadata: add plugin line: plugin OAIPlug
• Extract phrase hierarchy and add browser: add classifier line: classify phind
• Alter the format of any of the above: add format string: format …
• Restrict collection’s interface langs: add format string: format PreferenceLangs en|fr|es
• Change default interface language: edit site config file: cgiarg shortname=1 argdefault=fr
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source
The pen is mightier than the sword! Building and distributing collections carries responsibilities … legal … social … ethical … Be aware of the power of information and use it wisely. Collector = software “wizard” for building new collections
Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source
The power of open source: Greenstone uses …
• Ghostscript: Interpreter for Adobe Postscript documents (Postscript plugin)
• Kea: Keyphrase extraction program (to generate metadata)
• pdftohtml: Converter for PDF documents (PDF plugin)
• rtftohtml: Converter for RTF documents (RTF plugin)
• TextCat: Detects languages and document encodings
• wvWare: Converter for Word documents (Word plugin)
• Xlhtml: Converter for Excel/Powerpoint documents (plugins)
• XML::Parser: Parses XML documents, used to read and write Greenstone’s internal XML document format
and …
• MG: Creates compressed full-text indexes and performs searches
• GDBM: Database used for metadata etc
• wget: Downloading pages from the Web when creating collections
• YAZ: Client and server implementation of Z39.50
• Stemmer: English language stemmer
• GCC: C/C++ compiler
• CVS: Version control system
• Perl: Used for plugins etc
• Apache: Web server used by many Greenstone installations
Greenstone DL software
Access
• Accessible via any Web browser
• Server runs on Windows and Unix
• Collections can be published on CD-ROM
Searching/browsing
• Full-text and fielded search
• Flexible browsing facilities
• Metadata-based (Dublin Core)
• Collection-specific
• Hierarchical phrase browsing supported
• Creates all access structures automatically
Extensible
• Plugins: new document, metadata formats
• Classifiers: new metadata browsers
Multilingual
• Documents and interfaces
• Chinese, Arabic, Maori, Russian etc (+ European)
• Multimedia: video, audio collections exist
Distributed
• CORBA protocol allows remote access
• Z39.50 server/client for backwards compatibility
What you see, you can get!
• Open-source software: free, extensible