Current research information as part of digital libraries and the heterogeneity problem. Integrated searches in the context of databases with different content analyses . CRIS2002, Kassel. Jürgen Krause University of Koblenz-Landau and Social Science Information Centre (IZ-Bonn)
Current research information as part of digital libraries and the heterogeneity problem. Integrated searches in the context of databases with different content analyses. CRIS2002, Kassel. Jürgen Krause, University of Koblenz-Landau and Social Science Information Centre (IZ-Bonn), Lennéstr. 30, 53113 Bonn, Germany, mailto:Krause@bonn.iz-soz.de
“Scientists are increasingly using search engines to locate research of interest; some rarely use libraries, locating research articles primarily online ... About 85% of users use search engines to locate information.” (Lawrence/Giles, 1999: 107) • “It doesn’t matter what you want to know, there are people in the Internet who already have this knowledge and want to help you” (Hahn, 1999: 107)
Weizenbaum 2000/2001, Germany • “The internet is a large dunghill ...” (German original: „Das Internet ist ein großer Misthaufen ...“) • Nov. 2000: “Gutenbergs Folgen” congress, Mainz • May 2001: specialist seminar, Hamburg • www.heise.de/newsticker/data/wst-03.05.01-001/
WWW today: a) “Worst case” of the ambiguity problem • Out of the estimated 800 million pages on around three million servers, only 6% relate to the fields of science and education (by comparison: 1.5% relate to pornography) (NEC 1999; NEC 2000: a thousand million pages) • The “worst case” for the ambiguity problem • No reasonable results can be obtained without additional conceptual components
Summary and consequences: When used for specialist information retrieval (IR), general WWW search engines run counter to nearly every criterion that actually permits a successful search based on IR knowledge. This involves all the main components of an IR system: the database and its selection, the use of research logic, and user expectations. Users should develop the best possible research strategy from their knowledge of these aspects, which is impossible with WWW search engines.
Nevertheless, WWW search engines have one advantage over current specialist databases: embedded in an enormous volume of irrelevant data is data which is not found in specialist databases and which may be of value to experts. It is therefore not possible simply to fall back on the recommendation to narrow the search down to the original specialist databases. New ways have to be found to make research that includes WWW sources more satisfactory than it is at present with general WWW search engines.
Conceptual gaps • In addition to technological integration: • Different stages of content analysis of textual data: • an intelligently indexed term in a library classification • …… • automatic full text indexing in fully unstructured data pools • Descriptor A in one such system: wide range of meanings
Research Projects IZ Bonn • ViBSoz: “Social Science Virtual Library”, Virtual Library Project of the German Research Association (DFG) • CARMEN: “Content Analysis, Retrieval and Metadata: Effective Networking”, special support program of the German Ministry of Education and Research (BMBF) • ELVIRA: “Electronic Retrieval and Analysis System for Industrial Associations”, funded by the German Federal Ministry of Economics and Technology • ETB: “The European Schools Treasury Browser”, funded by the European Commission
Metadata • U.S. Bureau of the Census: Integrated Information Solutions – The future of Census Bureau data access and dissemination, Sept. 1999. Working paper • “Recent surveys of Census Bureau customers show that two out of three use multiple data sets. ... If we continue to saddle data users with the burden of putting data from disparate sources into digestible forms, we do it at the risk of our own peril.” (p. 2) • “Solutions of these issues ... will revolve around the further development of standards, metadata ...” (p. 3) • “IIS will help minimize data user burden, data uncertainty and maximize data quality and usefulness through the use of metadata” (p. 2)
DIN SICT paper: German position • “Strategie für die Standardisierung der Informations- und Kommunikationstechnik (ICT)” (Strategy for the Standardization of Information and Communication Technology; DIN Berlin 2002, draft) • “... It is ... necessary to find a new concept relating to the still existing demand for consistency retention and interoperability. This concept can be described by means of the following premise: standardization must be considered in terms of the remaining heterogeneity. Only joint interaction between intellectual and automatic processes for the treatment of heterogeneity and standardization will produce a solution strategy which also ensures, under present-day marginal conditions, usable consistency and interoperability conditions.” (translation from German)
CARMEN and ViBSoz: Coping with heterogeneity by transfer components • Diagram labels: Documents, Metadata, Retrieval • Coping with heterogeneity: • Cross-concordances • Statistical transformation and neural networks • Deductive methods • Extract metadata from various document formats algorithmically
Mathematics – Physics: MSC and PACS • Statistical: PACS 62.30.+d “Mechanical and elastic waves; vibrations” linked to MSC 74S15 “Boundary element methods” • Intellectual: PACS 62.: not connected
Simple search • Search term: Dominanz (dominance) • Number of relevant hits: 16 • Example: semantic-pragmatic relation (G. Binder)
Extended search • Transfer terms: Dominanz (dominance), Messen (measuring or trade fairs), Mongolei (Mongolia), Nichtregierungsorganisation (non-governmental organization), Flugzeug (aircraft), Datenaustausch (data exchange), Kommunikationsraum (communication space), Kommunikationstechnologie (communication technology), Medienpädagogik (media education) • Number of additional relevant hits: 7 • Share of additional relevant hits among the additional hits: 50% • Example hit: “Members of the association wom@n travelled to the UN Women's Conference in Beijing. On the journey through Mongolia and the desert ...” (G. Binder)
Standard method: one-step transformation • question • non-differentiated handling of vagueness • document term sets A, B, C
Two-step transformation • question • V1: handling of vagueness between questions and terms • V2/V3: bilateral handling of vagueness between the document term sets A, B, C
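[Diagram: a broker between the document term sets A, B and C. Labels: query term “Jugendarbeitslosigkeit” (youth unemployment); Thesaurus A: SWD, USB Köln; Thesaurus B: IZ Thesaurus (IZ Soz.), “Jugendlicher + Arbeitslosigkeit” (adolescent + unemployment); Thesaurus C. Results returned through the broker: X from A, Y from B, Z from C.]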
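The broker idea shown in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ViBSoz or CARMEN implementation: the names `transfer`, `broker`, `A_TO_B` and `A_TO_C`, the weights, the threshold, and the entry for Thesaurus C are all hypothetical, and the sketch assumes that the bilateral modules lead from Thesaurus A to B and from A to C.

```python
from typing import Dict, List, Tuple

# A bilateral transfer module: term of A -> weighted terms of one target vocabulary.
TransferTable = Dict[str, List[Tuple[str, float]]]

# Hypothetical module A -> B, using the example term from the slide.
A_TO_B: TransferTable = {
    "Jugendarbeitslosigkeit": [("Jugendlicher", 1.0), ("Arbeitslosigkeit", 1.0)],
}
# Hypothetical module A -> C; the content of Thesaurus C is an assumption.
A_TO_C: TransferTable = {
    "Jugendarbeitslosigkeit": [("youth unemployment", 1.0)],
}


def transfer(terms: List[str], table: TransferTable, threshold: float = 0.5) -> List[str]:
    """Translate terms of A into one target vocabulary (one bilateral step)."""
    translated: List[str] = []
    for term in terms:
        # Terms without a known mapping are passed through unchanged.
        for target, weight in table.get(term, [(term, 1.0)]):
            if weight >= threshold and target not in translated:
                translated.append(target)
    return translated


def broker(query_terms: List[str], modules: Dict[str, TransferTable]) -> Dict[str, List[str]]:
    """Phrase one query, given in terms of A, in the vocabulary of every database."""
    queries = {"A": list(query_terms)}
    for name, table in modules.items():
        queries[name] = transfer(query_terms, table)
    return queries


queries = broker(["Jugendarbeitslosigkeit"], {"B": A_TO_B, "C": A_TO_C})
for database, terms in queries.items():
    print(database, terms)
```

Keeping one table per pair of vocabularies mirrors the bilateral handling of vagueness: each module can be built and tuned separately (intellectually as a cross-concordance, or statistically) without touching the others.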
Statistical and neural network transformation • Co-occurrence-based similarity • In ViBSoz: statistical crosswalk between two different thesauri (SWD as a universal thesaurus and the IZ thesaurus for the social sciences as a special thesaurus) • In ELVIRA: between a thesaurus for data and free text terms • Transformation networks • USB Thesaurus to the IZ Thesaurus • the USB Thesaurus or IZ Thesaurus to the IZ • [Plot residue: axes Precision and Recall; series labels: LSI and transformation network, statistical methods] • Fig. 3: Transformation network USB Thesaurus to IZ Thesaurus (Fig. 7-12 from Mandl 2000:206)
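A co-occurrence-based crosswalk can be illustrated as follows. The slide does not state which similarity measure the projects use, so this minimal sketch assumes the simplest one: the conditional probability that a term of the target thesaurus is assigned to a document, given that a term of the source thesaurus is assigned to it. It also assumes a set of documents indexed with both thesauri. The function name `cooccurrence_crosswalk`, the parameter `min_prob` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# One document indexed twice: (terms from thesaurus A, terms from thesaurus B).
IndexedDoc = Tuple[List[str], List[str]]


def cooccurrence_crosswalk(
    docs: Iterable[IndexedDoc], min_prob: float = 0.5
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each term a of A to the terms b of B with P(b | a) >= min_prob."""
    term_count: Counter = Counter()   # number of documents carrying a
    pair_count: Counter = Counter()   # number of documents carrying a and b
    for terms_a, terms_b in docs:
        for a in set(terms_a):
            term_count[a] += 1
            for b in set(terms_b):
                pair_count[(a, b)] += 1

    crosswalk: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (a, b), n in pair_count.items():
        prob = n / term_count[a]
        if prob >= min_prob:
            crosswalk[a].append((b, prob))
    # Strongest associations first; ties are ordered alphabetically.
    for a in crosswalk:
        crosswalk[a].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(crosswalk)


# Invented sample data, for illustration only.
docs = [
    (["Jugendarbeitslosigkeit"], ["Jugendlicher", "Arbeitslosigkeit"]),
    (["Jugendarbeitslosigkeit"], ["Jugendlicher", "Arbeitslosigkeit", "Arbeitsmarkt"]),
    (["Arbeitsmarkt"], ["Arbeitsmarkt"]),
]
crosswalk = cooccurrence_crosswalk(docs, min_prob=0.6)
print(crosswalk)
```

The threshold trades recall against precision: a lower `min_prob` yields more transfer terms per query term, and with them more additional hits, of which a smaller share tends to be relevant.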
Conclusion: Today's search engines do not adequately solve the problem of a worldwide search for relevant documents and data in a special scientific community. They represent only an incomplete, albeit valuable, first step. Users want to interlink literature and research project databases with the catalogues of virtual libraries, the WWW homepages of science institutions, and fact sources, e.g. data archives with their survey data. This integration should not be performed only on a technical level or solely with intellectually created links, as is the case at present. A key role is played by automatic transfer between the different content analysis methods and standardizations of the document sets to be integrated. Based on the initial empirical results of different IZ projects, the proposed strategy appears to be highly promising: vagueness is not treated in one undifferentiated step between the query and all documents, but in a cognitively plausible way by individual bilateral modules.