*** Text information storage and retrieval and the CDS/ISIS program Paul NIEUWENHUYSEN pnieuwen@vub.ac.be University Library, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium
*** What is a database? • A database is a collection of similar data records stored in a common file (or collection of files).
*** Software type = information retrieval software • Software for information storage and retrieval (ISR software) • Text(-oriented) database management systems (Text-DBMS) • Text information management systems (TIMS) • Document retrieval systems • Document management systems
*** Information retrieval: via a database to the user Information content Linear file Inverted file Database Search engine User Search interface
*** Information retrieval: the basic processes in search systems Information problem Text documents Representation Representation Query Indexed documents Evaluation and feedback Comparison Retrieved documents
*** Information retrieval systems: many components make up a system • Any retrieval system is built up of many more or less independent components. • These components can be modified to increase the quality of the results more or less independently.
*** Information retrieval systems: important components • the information content • system to describe formal aspects of information items • system to describe the subjects of information items • concrete descriptions of information items = application of the information description systems used • information storage and retrieval computer program(s) • computer system used for retrieval • type of medium or information carrier used for distribution
*** Information retrieval systems: the information content • The information content is the information that is created or gathered by the producer. • The information content is independent of software and of distribution media. • The information content is input into the retrieval system using • a system (rules) to describe the formal aspects • a system (rules) to describe the contents (classification, thesaurus,...)
*** Information retrieval systems: media used for distribution • Hard copy (for information retrieval systems only in the broad sense) • Print • Microfiche • For computers: (for information retrieval systems stricto sensu) • Magnetic tape • Floppy disk; optical disk (CD-ROM, CD-i, Photo-CD,...) • Online
*** Information retrieval systems: the computer program The information retrieval program consists of several modules, including: • The module that allows the creation of the inverted file(s) = index file(s) = dictionary file(s). • The search engine provides the search features and power that allow the inverted file(s) to be searched. • The interface between the system and the user determines how they (can) interact to search the database (using menus and/or icons and/or templates and/or commands).
*** What determines the results of a search in a retrieval system? • the information retrieval system ( = contents + system) • the user of the retrieval system and the search strategy applied to the system Result of a search
*** Characteristics / definition of structured text-information • The text information is structured (files, records, fields, sub-fields, links/relations among records,...). • The length of records and fields can be “long”. • Some fields are multi-valued, i.e. they occur more than once.
*** Layered structure of a database Database File Records Fields Characters + in many systems: relations / links between records
*** Structure of a bibliographic file Record No. 1 Title Author 1: name + first name Author 2: ... Source Descriptor 1 Descriptor 2 ... Record No. 2 Sub-fields Repeated fields
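The layout on this slide can be pictured as a nested data structure. The following is a minimal Python sketch, not the storage format of CDS/ISIS or any other program: the field names (`title`, `authors`, `source`, `descriptors`) and all sample values are assumptions invented for illustration.

```python
# Hypothetical record layout; field names and sample values are illustrative only.
record_1 = {
    "title": "An example title",
    # Repeated field: one occurrence per author.
    # Each occurrence has two sub-fields: name and first name.
    "authors": [
        {"name": "Author-One", "first_name": "A."},
        {"name": "Author-Two", "first_name": "B."},
    ],
    "source": "An example journal, 1990",
    # Repeated field without sub-fields.
    "descriptors": ["information retrieval", "databases"],
}

# A file is a sequence of records that share the same structure.
bibliographic_file = [record_1]


def author_names(record):
    """Return the 'name' sub-field of every occurrence of the author field."""
    return [occurrence["name"] for occurrence in record["authors"]]


print(author_names(record_1))
```

Modelling repeated fields as lists and sub-fields as small dictionaries keeps each occurrence and each sub-field separately addressable, which is what later makes field-level and sub-field-level indexing possible.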
*** Thesaurus: description • Thesaurus = • system to control a vocabulary + • the contents of this vocabulary • Thesaurus program = program to create, manage, modify and/or search a thesaurus using a computer
*** Thesaurus relations • BT (= Broader Term): term(s) with broader meaning • NT (= Narrower Term): term(s) with narrower meaning • RT (= Related Term): other term(s) • UF (= Use For): synonym(s)
*** Thesaurus applications • To find/choose index terms to add these to items, when terms are taken from a controlled vocabulary • To find more and/or better terms to search a database (to increase recall and precision) • To find more and/or better terms during writing • To understand the meaning of a term, by inspecting • the scope note of the term and/or • the relations with other terms
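The use of a thesaurus to find more search terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini-thesaurus, its terms, and the function name `expand` are invented here, and real thesaurus programs offer far more (scope notes, reciprocal relation checks, multiple hierarchy levels).

```python
# Hypothetical mini-thesaurus; terms and relations are invented for illustration.
thesaurus = {
    "databases": {
        "BT": ["information systems"],
        "NT": ["text databases"],
        "RT": ["information retrieval"],
        "UF": ["data bases"],
    },
    "text databases": {"BT": ["databases"], "NT": [], "RT": [], "UF": []},
}


def expand(term, relations=("NT", "UF")):
    """Return the term plus the terms linked to it by the given relations."""
    entry = thesaurus.get(term, {})
    expanded = [term]
    for relation in relations:
        for linked in entry.get(relation, []):
            if linked not in expanded:
                expanded.append(linked)
    return expanded


print(expand("databases"))
print(expand("databases", relations=("BT",)))
```

Adding narrower terms and synonyms to a query is one common way to try to raise recall; moving to a broader term widens the search further, usually at some cost in precision.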
*** Database systems: why study this subject briefly? • To achieve a better understanding of the inner workings of the external information retrieval systems that you use, so that you can exploit these more efficiently • To be able to evaluate the quality of database systems you are confronted with, so that you can • make better choices among available systems, • offer constructive suggestions to the manager, • ...
**- Database systems: why study this subject in detail? To acquire the knowledge and skills to create / set up / manage your own local database system on a computer
*** Database systems: definition A database (management) system is a program or set of programs, providing a means by which a user can easily store and retrieve data in the form of “databases”.
**- Information retrieval software: related terms • Software for information storage and retrieval (ISR software) • Text(-oriented) database management systems (Text-DBMS) • Text information management systems (TIMS) • Document retrieval systems • Document management systems
**- Information retrieval software: applications (Part 1) • Documentation centres: Documents • Archives: Archived documents • Libraries: Books / Documents • Museums: Objects / Books / ... • Medical files: Patients’ histories • Marketing departments: Clients / Potential clients • Schools: Courses / Teachers • Bibliographic databases: Publications / ...
**- Information retrieval software: applications (Part 2) • Meeting calendars: Meetings = conferences • Product information: Product descriptions • Laboratories: Recipes • Personal documentation: Documents • Patent office: Patents • Co-operating information networks: Documents / Persons / Institutes / Events / ... • ...
**- Cataloguing: hard copy versus computer-based • Hard copy • “Input”, i.e. cataloguing, on cards directly determines the “output”, i.e. the format of the data on the card as presented to the user • Summarized: INPUT=OUTPUT • Computer-based • Input in the database in fields allows later output in various formats for presentation • Summarized: 1. INPUT, 2. various OUTPUTs
*** Text-information management systems: characteristics and definition The information in the database is text oriented. Therefore, several features are required: • ability to store relatively long blocks of text • ability to retrieve items in which specific words or terms occur anywhere
**- Text-information management: from free-form to structure • Free-form text information without structure • Text database with information structured in files, records, fields, sub-fields, with links/relations among records,... (Ideally, each field is repeatable = can be multi-valued = can occur more than once in each record.)
*** Text-information management: types of software • Software type: Word processing software. Features: Must be learnt anyway. Slow sequential searching. • Software type: Free-form or structured text information database software. Features: Additional software to be purchased and learnt. Fast searching via index(es).
**- Advantages of structured text-retrieval versus X-base systems Feature (Text-retrieval / X-base systems) • Many long fields, forming long records (Yes / No) • Repeatable fields (Yes / No) • Subfields (Yes / No) • Variable field lengths (Yes / No) • Fast searching any word in all fields (Yes / No) • Thesaurus to help searching (Yes / No)
*** Hierarchy in the use of a database Database structure Input / Editing Searching / Output
*** Functions of database management software • Input / edit using keyboard or batch input • Indexing of the database(s) • Browse / Search / Select / Retrieve data from database • Output (Sort / Display / Print to file / Print to paper) + • Export / Import
*** !? Question !? Task !? Problem !? Which advantages does a computer-based document management system offer?
*** Advantages of a document system on computer, for the user(s) • Access to information is easier. • Access to information is faster. • Online access is possible even when the centre is closed. • Online access is possible from a distance. • Integration in search module with data on loan status. • More elements of the records can serve as search terms. • Combinations of search terms can be used. • Results / selections can be stored as computer files.
**- The CDS/ISIS text database management program • Software to create and manage local, in-house databases with primarily structured text as contents (NOT numbers, graphics, sound,...) • Versions available for • Mainframes (IBM) • Minicomputers (Digital VAX) • Microcomputers (DOS)
*-- Micro-CDS/ISIS: original main menu on the display
*-- CDS/ISIS database definition services: display menu
*-- CDS/ISIS database definition table: display of an example
*-- CDS/ISIS manual data entry, editing / input services: display menu
**- Batch input / Import • Is batch input possible? • Is a format conversion program included or available? • ...
**- Activities related to indexing (Activity / Who does it? / Concrete action) • Intellectual, human indexing / Database producer, Thesaurus producer / Attribute subject terms to records • Develop an automatic indexing method / Database producer, Software features / Making an index method file • Automatic indexing / Computer with program / Making inverted file(s)
**- Indexes in books and databases: a comparison • Book (index is printed): Index_term_1 page x1, y1, z1,... Index_term_2 page x2, y2, z2,... ... • Database (index is invisible): Index_term_1 record nr. x1 / field type nr. x1 / field occurrence x1 / position x1; record nr. y1 / field type nr. y1 / field occurrence y1 / position y1; ... Index_term_2 record nr. x2 / field type nr. x2 / field occurrence x2 / position x2; record nr. y2 / field type nr. y2 / field occurrence y2 / position y2; ... ...
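The database side of this comparison can be illustrated with a small Python sketch that records, for every index term, the record number, field type, field occurrence and word position. It is a simplified illustration, not the file format of CDS/ISIS or any other program; the sample records, the field tags and the name `build_index` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical records: record number -> {field tag -> list of occurrences}.
records = {
    1: {"title": ["text retrieval systems"], "descriptor": ["retrieval", "indexing"]},
    2: {"title": ["indexing of text"]},
}


def build_index(records):
    """Map each word to a list of (record, field, occurrence, position) postings."""
    index = defaultdict(list)
    for record_nr, fields in records.items():
        for field_tag, occurrences in fields.items():
            for occurrence_nr, text in enumerate(occurrences, start=1):
                for position, word in enumerate(text.split(), start=1):
                    index[word].append((record_nr, field_tag, occurrence_nr, position))
    return dict(index)


index = build_index(records)
for term in sorted(index):
    print(term, index[term])
```

Each posting carries four values per word occurrence, where a book index stores only a page number per term.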
**- Index in a text retrieval system (such as CDS/ISIS) Terminology: Index = Inverted file = Dictionary database dictionary on display database complete inverted file
**- Methods of inverted file creation → Word indexing (+) Simple / automatic / no indication required (-) Loss of word context (+) A field structure is not required → Phrase indexing (-) Indication of phrases during input is required (+) Richer than separate words (+) A field structure is not required → Field indexing (+) Simple / automatic / no indication required (+) Context is better preserved (-) A field structure is required
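The three methods differ only in how index terms are extracted from the text. The Python sketch below shows one possible reading of each; the function names are hypothetical, and the use of angle brackets to mark phrases during input is an assumption chosen for the example.

```python
import re


def word_index_terms(text):
    """Word indexing: every word becomes a term."""
    return re.findall(r"\w+", text.lower())


def phrase_index_terms(text):
    """Phrase indexing: only phrases marked as <...> during input become terms."""
    return [phrase.lower() for phrase in re.findall(r"<([^<>]+)>", text)]


def field_index_terms(field_value):
    """Field indexing: the whole field value becomes one term."""
    return [field_value.strip().lower()]


abstract = "A study of <information retrieval> and <inverted files>"
descriptor = "Information retrieval"

print(word_index_terms("Inverted files"))
print(phrase_index_terms(abstract))
print(field_index_terms(descriptor))
```

Word indexing needs no markup but splits "inverted files" into two unrelated terms, whereas phrase and field indexing keep the words together at the price of marking phrases or defining fields beforehand.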
*-- CDS/ISIS inverted file services: display menu
**- Automatic indexing (file inversion) • Word indexing? with proximity indexing? • Field indexing? • Sub-field indexing? • Phrase indexing? → Maximum length of index entry? → List of stopwords available? → Immediately after input or in batch? (Slow down...?) → Indexing speed? → Adding prefixes/tags possible? → Modification of indexing possible? Possible? Obligatory?
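Three of the questions above (stopword list, maximum length of an index entry, prefixes) can be made concrete with a short Python sketch. All settings are hypothetical values chosen for the illustration and do not describe the limits of any particular program.

```python
# All three settings are hypothetical values chosen for this sketch.
STOPWORDS = {"a", "and", "of", "the"}
MAX_ENTRY_LENGTH = 10
PREFIX = "TI="


def index_entries(text, prefix=""):
    """Return index entries for a text: stopwords dropped, entries truncated."""
    entries = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue  # stopwords never reach the inverted file
        # The prefix counts towards the maximum entry length.
        entries.append((prefix + word)[:MAX_ENTRY_LENGTH])
    return entries


print(index_entries("The storage and retrieval of information"))
print(index_entries("The storage of text", prefix=PREFIX))
```

In this sketch a long word such as "information" is cut to the maximum entry length, so two different long words sharing the same first characters would collapse into one index entry. This is one reason to ask about the limit when evaluating software.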
**- !? Question !? Task !? Problem !? Why can the index of a database be so large in comparison with the size of the database?
*-- CDS/ISIS information retrieval services: display menu
*-- CDS/ISIS information retrieval: example of a dictionary on the display
**- Output from a database to various “devices” • to video display • to printer • to computer file (“printing” to a file)
*-- CDS/ISIS output (sorting and printing) services: display menu
**- Formatting of data within each record in output • Independent of output device: • Determine the sequence of the fields in each record. • Omit specific fields from each record. • Add field names or tags to the fields in each record. • Indicate the search term(s) in each record. • Dependent on output device: • Specify character formats in each (sub)field: typeface + size + bold/italic/underline
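The device-independent options listed on this slide can be sketched as a small Python formatter. This is an illustration only, not the CDS/ISIS formatting language: the record, its field tags, the function `format_record` and the use of asterisks to mark a search term are all assumptions made for the example.

```python
# Hypothetical record and field tags, invented for illustration.
record = {
    "title": "Text retrieval systems",
    "author": "Author-One, A.",
    "source": "An example journal, 1990",
    "notes": "Internal remark",
}


def format_record(record, sequence, labels=False, highlight=None):
    """Format one record: fields in 'sequence' only, optional labels and highlighting."""
    lines = []
    for tag in sequence:
        if tag not in record:
            continue  # fields absent from the record are skipped
        value = record[tag]
        if highlight:
            # Mark the search term so the user sees why the record was retrieved.
            value = value.replace(highlight, "*" + highlight + "*")
        lines.append((tag.upper() + ": " if labels else "") + value)
    return "\n".join(lines)


print(format_record(record, ["author", "title"]))
print(format_record(record, ["title", "source"], labels=True, highlight="retrieval"))
```

The `sequence` argument both orders the fields and omits every field it does not name (here `notes`), so one stored record can be presented in several ways without being re-entered.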