Bielefeld Academic Search Engine (BASE): an End-user Oriented Institutional Repository Search Service
Dirk Pieper/Friedrich Summann, Bielefeld UL
Overview:
Part 1:
• Institutional Repository Servers
• BASE: concept and content
• Creating a special view on institutional repository server collections
• Demo: BASE user-interface and further visions
Part 2:
• OAI dataflow, BASE dataflow
• Repository information in registries
• OAI harvesting problems
• Further developments of BASE
Institutional Repository Servers:
• Definition: “A digital collection capturing and preserving the intellectual output of a single or multi-university community.” (Raym Crow, http://www.arl.org.sparc/IR/ir.html)
• IR servers also exist outside the university community
• IR servers appear as simple web sites, database systems with an OAI interface, …
BASE: concept and content
• BASE uses Fast Data Search
• BASE contains intellectually selected resources, with a focus on OAI servers but also including web-crawled content
• BASE displays result lists as bibliographic data and full-text hits
• The BASE frontend is written in PHP using the search API of Fast Data Search
• BASE offers sorting, search refinement and a search history
BASE: concept and content
[Architecture diagram. Labels: connectors (web crawler, file traverser); document processing pipeline; index files; filter; query & result processing pipeline; search API; search; tuning, administration and debugging.]
BASE: concept and content
• At present 2.7 million documents in 189 collections, 15 of which contain web-crawled data
Special view on IR server collections
• Collections are listed in a configuration file:
    [ftubirmingham]
    url = "http://eprints.bham.ac.uk/"
    desc_de = "The Univ. of Birmingham: Eprints Archive"
    desc_en = "The Univ. of Birmingham: Eprints Archive"
    descdd_de = "Birmingham Univ."
    descdd_en = "Birmingham Univ."
• Collections can be clustered for the user interface, e.g. “Institutional Repositories Europe” consists of [ftubarcelona], [ftubath], [ftubristol], [ftuhelsinki], …
• Parametric search is possible
• The frontend is ready for multiple views (independent views with their own configuration and layouts on the same backend)
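The collection entry on this slide has an INI-like shape, so a minimal sketch of reading it could look as follows. The slides say the real frontend is PHP; this Python version is only an illustration. The names `load_collections`, `label` and `CLUSTERS` are hypothetical, and how BASE actually stores its clusters is not stated in the slides.

```python
import configparser

# Sample entry copied from the slide.
SAMPLE = '''
[ftubirmingham]
url = "http://eprints.bham.ac.uk/"
desc_de = "The Univ. of Birmingham: Eprints Archive"
desc_en = "The Univ. of Birmingham: Eprints Archive"
descdd_de = "Birmingham Univ."
descdd_en = "Birmingham Univ."
'''

# Hypothetical cluster table; the slide only names the cluster and its members.
CLUSTERS = {
    "Institutional Repositories Europe": [
        "ftubarcelona", "ftubath", "ftubristol", "ftuhelsinki",
    ],
}


def load_collections(text):
    """Parse the collection file into {collection_id: {key: value}}."""
    # Only "=" is treated as a delimiter, so colons inside values are safe.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    return {
        section: {key: value.strip('"') for key, value in parser[section].items()}
        for section in parser.sections()
    }


def label(collections, collection_id, lang="en", short=False):
    """Return the long (desc_*) or short (descdd_*) label for one language."""
    prefix = "descdd" if short else "desc"
    return collections[collection_id][f"{prefix}_{lang}"]


collections = load_collections(SAMPLE)
print(label(collections, "ftubirmingham", "en"))
print(label(collections, "ftubirmingham", "de", short=True))
```

Keeping one section per collection means a separate view only needs its own list of section names, which matches the multi-view idea on the slide.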
Vision: search in Google Scholar
Try your search on Google Scholar ...
Vision: check citations in Google Scholar
Check citations (citing articles) in Google Scholar ...
OAI dataflow at Bielefeld UL
[Dataflow diagram. Labels: OAI data harvesting; dissertations, monographs (full text); articles (full text); PubMed, Euclid, ArXiv, CiteSeer, Citebase, DOAJ articles; all resources (texts, images, video, references, ...); OPAC; article database; BASE; internal index (FAST).]
BASE dataflow
[Dataflow diagram: database records, web pages, OAI data harvesting → pre-processing → processing → internal index (FAST) → user interface (PHP).]
Repository information in registries
• Openarchives.org (383)
• Eprints Registry (607)
• Univ. of Illinois Registry (1000)
• DSpace Registry (28)
• Directory of Open Archive Repositories (324)
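OAI-compliant univ. repositories in BASE
[World map with repository counts. Labelled regions: USA 76, Canada 13, South America 2, Africa 2, India 3, Australia 11, New Zealand 1. The remaining counts on the map (4, 3, 18, 33, 3, 14, 55, 2, 6, 3, 1, 12, 7, 12, 16, 2) have no recoverable country labels.]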
Tools for the Harvesting Environment
• Open Source Harvester (FS Consulting, Perl with modifications)
• XML Validator and Repairer (Bielefeld UL, based on Perl XML modules)
• OAI Harvest Watcher (Bielefeld UL, Perl)
• OAI Resource Updater (Bielefeld UL, Perl)
• OAI Registry Watcher (Bielefeld UL, Perl)
OAI harvesting challenges
• Repositories do not respond or return error messages
• Links to the document do not work
• The XML file is not well-formed
• Data contain only references without any full text
• Access to the full text is restricted
• Field content varies
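The "not well-formed" case above can be caught before indexing with a simple parse attempt. The following is a minimal sketch in Python, not the Perl-based XML Validator and Repairer used at Bielefeld UL; the function name `is_well_formed` and the sample records are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Illustrative records only; the namespace prefix must be declared to parse.
GOOD = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Talk</dc:title></record>"
)
BAD = "<record><title>Talk</record>"  # closing tag for <title> is missing

for name, text in (("good", GOOD), ("bad", BAD)):
    print(name, is_well_formed(text))
```

A check like this only detects the problem; repairing a broken record or handling unresponsive repositories needs separate logic.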
OAI Harvesting: Problems in Practice 1
    <source>http://xxx.xxx.uni-xxxxx.de/publications/ ELibD905_diplom_allnoch.pdf</source>
    <dc:creator>Barry Wellman,Jeffrey Boase,Kakuko Miyata</dc:creator>
    <dc:subject>Barry Wellman,Jeffrey Boase,Kakuko Miyata The Mobile-izing ....</dc:subject>
    <dc:title>Talk P. Bruzzone</dc:title>
    <dc:creator>Bruzzone </dc:creator>
    <dc:creator>Pierluigi</dc:creator>
    <dc:date>2004-07-05</dc:date>
    <dc:type>Review </dc:type>
    <dc:identifier>http://www.rbej.com/content/2/1/52 </dc:identifier>
    Reproductive Biology and Endocrinology 2004, 2:52 doi:10.1186/1477-7827-2-52
OAI Harvesting: Problems in Practice 2 - Variations of <dc:language>
    EN: 9910
    ENG: 771
    En: 566
    Eng: 1
    English: 24084
    English (United States): 63
    English and Greek: 1
    English and Russian: 1
    English/Japanese: 1
    English; Russian: 1
    English=en: 1
    Translation into English: 2
    en: 1279115
    en-CA: 865
    en-US: 3
    en-es: 5
    en-us: 8
    en;: 2
    en_UK: 618
    en_US: 18456
    eng: 186787
    eng : 92
    eng + dut: 2
    eng;: 17
    eng; fre; ger;: 141
    ....
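One way to deal with such variation is to map the raw values onto a small set of codes before indexing. The sketch below is an assumption about how that could be done, not a description of the BASE pre-processing; the `ALIASES` table and `normalize_language` are hypothetical and cover only values that appear in the list above.

```python
import re

# Hypothetical lookup table: raw tokens mapped to ISO 639-1 codes.
ALIASES = {
    "en": "en", "eng": "en", "english": "en",
    "fre": "fr", "ger": "de", "dut": "nl",
    "greek": "el", "russian": "ru", "japanese": "ja",
}

# Matches values such as "en_US", "en-CA" or "en_UK" (language plus region).
REGION_SUFFIX = re.compile(r"^([a-z]{2,3})[-_][a-z]{2}$")


def normalize_language(raw):
    """Return a sorted list of language codes recognised in a raw value."""
    codes = set()
    # Split multi-language values on ; / + = , and the word "and".
    for part in re.split(r"[;/+=,]|\band\b", raw.lower()):
        token = part.strip()
        token = re.sub(r"\s*\(.*\)$", "", token)  # drop "(United States)"
        match = REGION_SUFFIX.match(token)
        if match:
            token = match.group(1)  # keep the language, drop the region
        if token in ALIASES:
            codes.add(ALIASES[token])
    return codes and sorted(codes) or []


for value in ("EN", "en_US", "English (United States)", "eng; fre; ger;"):
    print(repr(value), normalize_language(value))
```

Values the table does not recognise, such as "Translation into English", come back as an empty list and would need manual review. A value like "en-es" is ambiguous (region or second language) and is treated here as a region suffix.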
Some Rules from Harvesting Practice
• Standard repository software is great - for OAI harvesting as well
• Small collections – small problems
• Getting the related fulltext is complicated
• Libraries produce better metadata
• Writing e-mails helps - sometimes
• Data aggregation may produce problems
Further Developments: BASE Interfaces
• Search form (working)
• HTTP calls (working)
• Web Service (in development)
• Federated Search (Vascoda) (in discussion)
Local Integration: Search Form
    <form action="http://www.base-search.net/index.php" method="post" accept-charset="UTF-8">
      <input maxlength="512" name="q" type="text" size="50" />
      <input value="Search!" type="submit" />
      <input value="all" name="s" type="hidden" />
    </form>
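The same request that the form submits can be prepared from a script. The sketch below only builds the POST request from the form's fields (`q`, and the hidden `s` with value `all`) and does not send it. The function name `build_search_request` is hypothetical, the meaning of `s` is not explained in the slides, and the endpoint may no longer accept this call.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form target taken from the slide.
BASE_SEARCH_URL = "http://www.base-search.net/index.php"


def build_search_request(query, scope="all"):
    """Build (but do not send) the POST request the search form would submit."""
    # The form limits the query field to 512 characters.
    body = urlencode({"q": query[:512], "s": scope}).encode("utf-8")
    return Request(
        BASE_SEARCH_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
    )


request = build_search_request("open access")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would perform the actual call; that step is left out so the example runs without network access.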