Michael Middleton • QUT School of Information Systems, Brisbane, Australia • m.middleton@qut.edu.au • for STIMULATE 5, Vrije Universiteit Brussel, Brussels, Belgium, July 2005 • Controlled vocabularies: Thesauri and information retrieval
Introduction • Context ….. History • Vocabulary principles • Thesaurus software • Thesaurus building …. application • Thesaurus evaluation • The future
Context: Information life cycle • [Diagram labels: create; organise to maintain; distribute; dispose; store; use; reuse; maintain; recall]
Context: Information management Domains • Operational • Analytical • Strategic
Context: indexing • Producing representations of records or documents that constitute a finding aid to the records in a database or to part of a document • Assigned indexing • Derived indexing
Indexer qualities • The ‘Art’ of assigned indexing: • Empathy • Meticulousness • Consistency • General knowledge • Patience
Indexing guidelines • Conceptual analysis and assigning • Aboutness • Elements of the document to consider • Exhaustivity • Specificity • Index what is in the item • Co-ordination
Assigned index representations • Alphabetical Subject • Classified • Alphabetical • Notation • Chain
Indexing exercise • How consistent is database indexing? • Example: the same paper in multiple databases: Middleton, M. Skills expectations of library graduates. http://eprints.qut.edu.au/archive/00000094/ • Index it yourself • Compare your indexing with others • Compare the indexing in ERIC and INSPEC
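One way to put a number on the comparison in this exercise is a Hooper-style consistency score: the terms two indexers share, divided by all distinct terms either of them used. The sketch below assumes simple case-insensitive matching, and the two term lists are hypothetical, not the actual ERIC or INSPEC records.

```python
def indexing_consistency(terms_a, terms_b):
    """Hooper-style consistency: shared terms / all distinct terms used."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty indexings agree trivially
    return len(a & b) / len(a | b)


# Hypothetical term sets for the same paper (illustration only).
indexer_one = ["Library education", "Job skills", "Graduates", "Employer attitudes"]
indexer_two = ["Library education", "Employment qualifications", "Graduates"]

print(indexing_consistency(indexer_one, indexer_two))
```

A score of 1.0 means identical term sets; 0.0 means no terms in common. Exact matching is a strict test, since two indexers may choose near-synonyms that a thesaurus would have merged.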
Context: metadata • Agent • Document description • Responsibility • Administrative • Provenance • Connections • Conditions of use
Context: metadata • Content • Topic (application of vocabulary control) • Coverage • Role
Controlled vocabulary • Thesaurus • A controlled vocabulary of terms in natural language that are designed for post-coordination • Classification scheme • A scheme for organisation by categories in a systematic manner; this may involve grouping by subject, function or other criteria, or determining document naming conventions • Often involves notation
Purpose • Indexing by translating diverse natural language to consistent terminology • Establishing relationships among terms • Information retrieval improving precision and recall
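The first purpose, translating diverse natural language into consistent terminology, can be sketched as a lookup from entry terms to preferred terms. This is a minimal illustration, not a description of any particular system: the `USE` table and `PREFERRED` set are assumptions, and "movie cameras" is a hypothetical entry term added for the example.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred term.
USE = {
    "land cameras": "VIEW CAMERAS",
    "movie cameras": "MOVING PICTURE CAMERAS",  # hypothetical entry term
}
PREFERRED = {"VIEW CAMERAS", "MOVING PICTURE CAMERAS", "STILL CAMERAS"}


def to_preferred(term):
    """Return the preferred term for an indexer's or searcher's wording."""
    key = " ".join(term.split()).lower()  # normalise spacing and case
    if key in USE:
        return USE[key]
    if key.upper() in PREFERRED:
        return key.upper()
    return None  # not in the vocabulary: a candidate for review


print(to_preferred("Land  Cameras"))
print(to_preferred("still cameras"))
print(to_preferred("webcams"))
```

Returning `None` for unknown wording, rather than guessing, keeps the vocabulary controlled: unmatched terms can be queued for a thesaurus editor.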
History • Bibliographic databases • Many applications; a list of associated online thesauri and classification schemes is at http://sky.fit.qut.edu.au/~middletm/cont_voc.html • Standards • ISO 2788; ISO 5964 • ANSI Z39.19
Thesaurus principles • Term relationships • Continuing evolution • Internally consistent hierarchies to support database searching
The Thesaurus • The vocabulary of a controlled indexing language formally organised so that the a priori relationships between concepts are made explicit. • A thesaurus is an example of metadata
Thesaurus extract (ISO sample)
35 mm CAMERAS
    BT MINIATURE CAMERAS
CAMERAS
    BT OPTICAL EQUIPMENT
    NT MOVING PICTURE CAMERAS
       STEREO CAMERAS
       STILL CAMERAS
       UNDERWATER CAMERAS
    RT PHOTOGRAPHY
CINE CAMERAS
    BT MOVING PICTURE CAMERAS
    NT UNDERWATER CINE CAMERAS
    RT CINEMA
CINEMA
    RT CINE CAMERAS
DIVING
    RT UNDERWATER CAMERAS
INSTANT PICTURE CAMERAS
    SN Cameras which produce a finished print directly
    BT STILL CAMERAS
Land cameras
    USE VIEW CAMERAS
MICROSCOPES
    BT OPTICAL EQUIPMENT
MINIATURE CAMERAS
    BT STILL CAMERAS
    NT 35 mm CAMERAS
MOVING PICTURE CAMERAS
    BT CAMERAS
    NT CINE CAMERAS
       TELEVISION CAMERAS
OPTICAL EQUIPMENT
    NT CAMERAS
       MICROSCOPES
PHOTOGRAPHY
    RT CAMERAS
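An extract like this maps naturally onto a dictionary of entries, and one routine check a thesaurus editor might run is that every BT/NT/RT link has its reciprocal. The sketch below transcribes a few entries from the extract; the dictionary layout and the function name `missing_reciprocals` are assumptions for illustration, not a format defined by the standard.

```python
# A few entries transcribed from the extract: term -> {relation: [targets]}.
THESAURUS = {
    "CAMERAS": {
        "BT": ["OPTICAL EQUIPMENT"],
        "NT": ["MOVING PICTURE CAMERAS", "STEREO CAMERAS",
               "STILL CAMERAS", "UNDERWATER CAMERAS"],
        "RT": ["PHOTOGRAPHY"],
    },
    "OPTICAL EQUIPMENT": {"NT": ["CAMERAS", "MICROSCOPES"]},
    "MICROSCOPES": {"BT": ["OPTICAL EQUIPMENT"]},
    "PHOTOGRAPHY": {"RT": ["CAMERAS"]},
    "MOVING PICTURE CAMERAS": {
        "BT": ["CAMERAS"],
        "NT": ["CINE CAMERAS", "TELEVISION CAMERAS"],
    },
    "CINE CAMERAS": {
        "BT": ["MOVING PICTURE CAMERAS"],
        "NT": ["UNDERWATER CINE CAMERAS"],
        "RT": ["CINEMA"],
    },
    "CINEMA": {"RT": ["CINE CAMERAS"]},
}

# BT and NT are inverses of each other; RT is symmetric.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


def missing_reciprocals(thesaurus):
    """Return (term, relation, target) links whose inverse link is absent.

    Targets without an entry of their own in this sample are skipped.
    """
    problems = []
    for term, relations in thesaurus.items():
        for relation, targets in relations.items():
            for target in targets:
                if target not in thesaurus:
                    continue
                inverse = RECIPROCAL[relation]
                if term not in thesaurus[target].get(inverse, []):
                    problems.append((term, relation, target))
    return problems


print(missing_reciprocals(THESAURUS))
```

For the entries transcribed here the check finds nothing to report, which is what internal consistency of the hierarchy requires.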
Standardising the Vocabulary • Types of entities & forms of terms • Singular vs plural • Homonyms • Choice of terms • Scope notes and history notes
Compound terms • Terms should be factored into simpler elements to improve user’s understanding. • Semantic factoring • Syntactic factoring
Semantic Relationships • Equivalence • Establishing relationships between preferred (postable) and non-preferred (non-postable) terms • Hierarchical • Establishing relationships between subordinate and superordinate terms. These may be distinguished as: • Generic • Whole-part • Instance • Associative • Establishing relationships between terms that are mentally associated, but not equivalent or hierarchical
… but, the functions thesaurus • Whereas • agenda papers might have • broader term: documents • In a functions thesaurus • agenda papers might have • broader term: meetings
Applying a functional thesaurus • Top Term: PERSONNEL • Scope Notes: The function of managing all employees …… • Related Terms: COMPENSATION, ESTABLISHMENT, INDUSTRIAL RELATIONS, etc. • Narrower Terms: ALLOWANCES, APPEALS (Decisions), APPOINTMENT, ARRANGEMENTS, AUTHORISATION, COMMITTEES, COMPLIANCE, etc. • Use For Terms: Employees, Public Servants, Staff
Thesaurus Display • Alphabetical hierarchies • One level above and below entry term • Complete hierarchy for each term or separate TT display • Permuted term lists • Combination with classification notation • Graphic Displays
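A permuted term list gives every word of a multiword term its own access point, so a user who looks under CINE still finds UNDERWATER CINE CAMERAS. The sketch below is one simple way to generate such a list; the function name and the three sample terms (taken from the camera extract) are choices made for this illustration.

```python
def permuted_list(terms):
    """Return sorted (access word, full term) pairs for a permuted display."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word.upper(), term))
    return sorted(entries)


sample_terms = ["MOVING PICTURE CAMERAS", "STILL CAMERAS", "UNDERWATER CINE CAMERAS"]
for access_word, term in permuted_list(sample_terms):
    print(f"{access_word:<12} {term}")
```

A production display would usually skip stop words and numerals as access points; that filtering is left out here to keep the sketch short.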
Applying a thesaurus Download Term Tree from http://www.termtree.com.au Free trial download from
Thesaurus software • Assigned • Integrated database • Deriving terminology
Thesaurus software - assigned • Terms are assigned by vocabulary specialists in an independent database • a.k.a.™ (Synercon Management Consulting) • MultiTes • OpenCyc • SuperTHES: from THESmain/THESshow for mono-/multilingual thesauri • Term Tree 2000 • WebChoir • Wordmap
Thesaurus software – integrated database • Terms are assigned by specialists; the thesaurus works like an active data dictionary to control the database • BASIS • InMagic Bibliotech PRO • BRS/Search • STAR
Thesaurus software for deriving terminology • Terms are created automatically from text • Entrieva: SemioTagger™, SemioMap™ and SemioSkyline™ for viewing • Intology: taxonomy builder • Verity: Thematic Mapping • Autonomy: taxonomy generation & categorization
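To give a feel for what "created automatically from text" can mean, the sketch below counts adjacent word pairs as candidate terms. It is a deliberately naive illustration and makes no claim about how any of the products listed above work; the stop-word list, function name and sample text are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for",
             "is", "are", "with", "by", "on"}


def candidate_terms(text, top_n=5):
    """Return the most frequent two-word phrases that contain no stop word."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for first, second in zip(words, words[1:]):
        if first in STOPWORDS or second in STOPWORDS:
            continue
        counts[f"{first} {second}"] += 1
    return counts.most_common(top_n)


sample = ("Controlled vocabularies support information retrieval. "
          "Information retrieval improves when controlled vocabularies "
          "are applied consistently.")
print(candidate_terms(sample, top_n=2))
```

Candidates produced this way still need human review: frequency alone says nothing about which phrase should be preferred or how terms relate to each other.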
Thesaurus Building - 1 • Users • Define • Identify needs • Define Thesaurus range & depth • Raw vocabulary building • Identify sources • Collect and record terms
Thesaurus Building - 2 • Vocabulary organisation • Cluster terms • Establish relationships using symbols • Maintenance
Business application • Not long term collaborative efforts of classification specialists • Instead, adapt to business changes • Not just descriptions of present business processes • Instead, reflect strategic planning, competitors • Not necessarily a single taxonomy • Instead, multiple overlapping taxonomies
Content management • Describe content as it’s being created rather than classify after creation • User-needs orientation
Integrating taxonomies • Accurate reporting • Exchange of data • Assist resource discovery • Information retrieval
Thesaurus evaluation • Qualities • Information retrieval evaluation
Thesaurus Qualities • Scope and features description • Display forms • Correctness of hierarchies • Use of scope, history and qualification • Adherence to standards • Syndetic measures • Connectedness • Accessibility
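The syndetic measures can be approximated with simple counts over the cross-reference structure. The sketch below takes one common reading, stated here as an assumption: connectedness is the share of terms that take part in at least one link, and accessibility is the average number of other terms that point to a term. The sample links come from the camera extract, and TRIPODS is a hypothetical unlinked term added to show the effect of an orphan.

```python
def connectedness_ratio(links, all_terms):
    """Share of terms that appear in at least one relationship."""
    if not all_terms:
        return 0.0
    connected = set()
    for source, targets in links.items():
        if targets:
            connected.add(source)
            connected.update(targets)
    return len(connected & set(all_terms)) / len(all_terms)


def mean_accessibility(links, all_terms):
    """Average number of distinct terms that lead to each term."""
    if not all_terms:
        return 0.0
    incoming = {term: set() for term in all_terms}
    for source, targets in links.items():
        for target in targets:
            if target in incoming and target != source:
                incoming[target].add(source)
    return sum(len(sources) for sources in incoming.values()) / len(all_terms)


# Links (BT, NT and RT merged) from the camera extract.
links = {
    "CAMERAS": ["OPTICAL EQUIPMENT", "PHOTOGRAPHY"],
    "OPTICAL EQUIPMENT": ["CAMERAS", "MICROSCOPES"],
    "MICROSCOPES": ["OPTICAL EQUIPMENT"],
    "PHOTOGRAPHY": ["CAMERAS"],
}
# TRIPODS is a hypothetical orphan term with no relationships.
terms = ["CAMERAS", "OPTICAL EQUIPMENT", "MICROSCOPES", "PHOTOGRAPHY", "TRIPODS"]

print(connectedness_ratio(links, terms))
print(mean_accessibility(links, terms))
```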
Thesauri & Retrieval evaluation • Cranfield experiments & since • Recall and precision • Influence on indexing • Conceptual analysis • Translation failure • Omissions • Exhaustivity/Specificity • Syntax and ‘false drops’ • Maintenance costs
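Recall and precision are straightforward to compute once the set of relevant documents for a query is known. A minimal sketch with hypothetical document identifiers:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all items retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search result and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
print(recall_precision(retrieved, relevant))
```

In this example two of the five relevant documents were found (recall 0.4) and two of the four retrieved documents were relevant (precision 0.5). Retrieved items that are not relevant, such as d1 and d3, are the "false drops" mentioned above.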
Post-controlled vocabularies • Use of a ‘Hedge’ of terms to represent a broad concept, eg: • ‘psychological aspects of..........’ • ‘........in Australia’ • ‘....review items on.....’
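A hedge can be stored as a named list of free-text alternatives and OR-ed into a query at search time. The sketch below assumes a generic Boolean syntax with AND, OR and double-quoted phrases; the contents of the "in Australia" hedge are hypothetical, and a real one would be much longer.

```python
# Hypothetical hedge: free-text alternatives standing in for a broad concept.
HEDGES = {
    "in Australia": ["Australia", "Australian", "Queensland",
                     "New South Wales", "Victoria", "Tasmania"],
}


def apply_hedge(topic, hedge_name, hedges=HEDGES):
    """Combine a topic with a stored hedge as a Boolean query string."""
    alternatives = " OR ".join(
        f'"{term}"' if " " in term else term for term in hedges[hedge_name]
    )
    return f"({topic}) AND ({alternatives})"


print(apply_hedge("thesauri", "in Australia"))
```

Because the hedge is applied at search time, it can be revised without re-indexing any documents, which is the main attraction of post-control.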
Still to come …… Research areas • Metathesauri • Super – interlinked vocabularies (e.g. NLM) • Semantic Web • Enhancing word association with usage statistics like links (e.g. THESUS)
Review • Controlled vocabulary types • Software support • Business processes • Website • http://sky.fit.qut.edu.au/~middletm/cont_voc.html • (about to move to database driven site – redirection will be applied)