Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004 Lecture 21: Facetted Classification SIMS 202: Information Organization and Retrieval
Agenda • Facetted Classification • Traditional vs. Facetted Classification • Designing Facetted Classifications • Thesaurus Design • Assignment 6 • Discussion Questions • Action Items for Next Time
Controlled Vocabularies • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
Hierarchical Classification • Each category is successively broken down into smaller and smaller subdivisions • No item occurs in more than one subdivision • Each level divided out by a “character of division” (also known as a feature) • Example: • Distinguish “Literature” based on: • Language • Genre • Time Period Slide author: Marti Hearst
Hierarchical Classification (tree diagram) • Literature • Divided by language: English, French, Spanish, ... • Each language divided by genre: Prose, Poetry, Drama, ... • Each genre divided by time period: 16th, 17th, 18th, 19th Slide author: Marti Hearst
Labeled Categories for Hierarchical Classification • LITERATURE • 100 English Literature • 110 English Prose • 111 English Prose 16th Century • 112 English Prose 17th Century • 113 English Prose 18th Century • ... • 120 English Poetry • 121 English Poetry 16th Century • 122 English Poetry 17th Century • ... • 130 English Drama • 131 English Drama 16th Century • ... • 200 French Literature Slide author: Marti Hearst
Faceted Categories • Mutually exclusive • Non-overlapping, distinct categories • Relational • Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations • Composable • Combined using grammars of order and relation to form compound descriptions
Faceted Classification Along With Labeled Categories • Facet A Language: a English, b French, c Spanish • Facet B Genre: a Prose, b Poetry, c Drama • Facet C Period: a 16th Century, b 17th Century, c 18th Century, d 19th Century • Compound descriptions: • Aa English Literature • AaBa English Prose • AaBaCa English Prose 16th Century • AbBbCd French Poetry 19th Century • BcCd Drama 19th Century Slide author: Marti Hearst
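The notation on this slide can be read mechanically: each pair of letters is a facet code followed by a focus code. The sketch below is one possible way to decode and encode such notations. It assumes the three facet tables shown on the slide and a fixed citation order A, B, C; the function names `decode` and `encode` are illustrative and not part of the lecture.

```python
# Facet tables copied from the slide: a capital letter names the facet,
# a lowercase letter names the focus within that facet.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Expand a notation such as 'AbBbCd' into {facet name: focus label}."""
    if len(notation) % 2 != 0:
        raise ValueError(f"notation must be facet/focus pairs: {notation!r}")
    description = {}
    for i in range(0, len(notation), 2):
        facet_code, focus_code = notation[i], notation[i + 1]
        name, foci = FACETS[facet_code]
        description[name] = foci[focus_code]
    return description


def encode(**foci_by_facet):
    """Build a notation from keyword arguments such as language='French'.

    Facets that are not mentioned are simply left out of the notation.
    """
    parts = []
    for facet_code, (name, foci) in FACETS.items():  # assumed order A, B, C
        wanted = foci_by_facet.get(name.lower())
        if wanted is None:
            continue
        matches = [code for code, label in foci.items() if label == wanted]
        if not matches:
            raise ValueError(f"{wanted!r} is not a focus of {name}")
        parts.append(facet_code + matches[0])
    return "".join(parts)


print(decode("AbBbCd"))
print(encode(language="English", genre="Prose", period="16th Century"))
```

Because facets may be omitted, the same small tables describe broad classes (one facet) as well as narrow ones (all three facets).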
Ranganathan • PMEST Facets • P(ersonality) • WHO: Types of things • M(atter) • WHAT: Constituent materials • E(nergy) • HOW: Action or activity terms • S(pace) • WHERE: Where things occur • T(ime) • WHEN: When things occur
“Classical” Facet Analysis • Entity • Kind • Part • Property • Material • Process • Operation • Patient • Product • By-Product • Agent • Space • Time
“Classical” Facet Analysis • What is being done? Entity, Kind, Product, By-Product • What are its parts? Part • What are its properties? Property, Material • How is this achieved? Process • By what means? Operation • By whom? Agent, Patient • Where? Space • When? Time
“Classical” Facet Analysis • Nouns: Entity, Kind, Part, Patient, Product, By-Product, Agent • Adjectives: Property, Material • Intransitive Verb: Process • Transitive Verb: Operation • Adverb: Space, Time
Semantic and Syntactic Relationships • Semantic relationships • Is-A (thing/kind, genus/species): Mammals > Primates > Humans • Has-Parts: Human > Head > Eyes • Syntactic relationships • Compounds: Wheat + harvesting = “wheat harvesting” • Object + operation = operation on object
Faceted Classification • Clearly distinguishes between semantic relationships and syntactic relationships • Semantic relationships • Within a facet • Containment relations • Syntactic relationships • Across facets • Combinatoric relations • Have a “syntax” for syntactic combination of semantic terms
Power of Facet Combinations • The syntactic relations of faceted classifications enable a small controlled vocabulary to produce • Many, many structured descriptions • Complex, but formally structured descriptions using nested compound descriptions • Descriptions for things we do not have words for
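To make the combinatoric claim concrete, the sketch below counts how many compound descriptions a small vocabulary can produce. It reuses the Language, Genre, and Period facets from the earlier literature example and assumes that a description takes at most one focus per facet and that any facet may be left unspecified; nested compound descriptions are not modeled.

```python
from itertools import product

# Ten controlled terms in total, grouped into three facets.
facets = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}


def count_terms(facets):
    """Size of the controlled vocabulary (number of foci across all facets)."""
    return sum(len(foci) for foci in facets.values())


def all_descriptions(facets, allow_unspecified=True):
    """Yield every compound description; None marks a facet left unspecified."""
    choices = []
    for foci in facets.values():
        choices.append(([None] if allow_unspecified else []) + list(foci))
    for combo in product(*choices):
        # Skip the empty description in which no facet is specified.
        if any(focus is not None for focus in combo):
            yield dict(zip(facets, combo))


print(count_terms(facets))                                   # 10
print(len(list(all_descriptions(facets, False))))            # 36
print(len(list(all_descriptions(facets))))                   # 79
```

Ten terms yield 3 x 3 x 4 = 36 fully specified descriptions, and (3+1) x (3+1) x (4+1) - 1 = 79 when facets may be omitted. Adding one focus to any facet multiplies, rather than adds to, the number of possible descriptions.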
Example: Objects • Red Plastic Glass • Blue Paper Straw
Project Team Facetted Classifications • Team 007 • Personality: Straw, Glass • Operation: Drinking, Slurping, Sipping • Material: Plastic, Paper • Color: Blue, Red • Team ARTery • Color • Size • Material • Weight • Shape • Radius/Circumference • Density • Volume/Capacity • Function/Use • Hardness/Softness • Yin/Yang
Project Team Facetted Classifications • Team Culture Feed • Color: Red, Blue • Material: Plastic, Paper • Use: Drink from, Drink with • Dimensions: Circumference, Height, Diameter • Team Picture Portal • Color: Red, Blue • Material: Paper, Plastic • Use: Containment, Transport • Shape: Torus, Planar • # Holes: 0, 1
Project Team Facetted Classifications • Team F.U.N. • Shape • Color • Material • Rigidity • Function: Container, Conduit • Locale • Weight • Size • Team MNM • Functionality: What it does, What you can do with it • Physical Properties: Color, Shape, Material
Project Team Facetted Classifications • Team pillBox • Function: Container, Conduit • Form • Shape: Cylinder • Composition: Paper, Plastic • Color: Blue, Red • Size: Tall and skinny, Short and fat • Team iTour • Color: Red, Blue • State: Solid, Non-porous, Flexible • Material: Plastic, Paper • Geometry: Cylindrical, Hollow • Function: Container, Drinking, Sucking, Blowing
Example: Objects • Gray Metal Glass • Two Yellow Plastic Straws
Example: Objects • Facets: Function • Form (Shape, Material, Color) • Number • Sample description: • Function: Drinking • Form • Shape: Cylinder • Material: Plastic • Color: Red • Number: 1
Faceted Classification Design • Collect examples that need to be classified • Identify candidates for facets and subfacets • Test classification scheme on examples for facet orthogonality • Order foci within facets • Explicate grammar for ordering and combining facets and subfacets • Test classification scheme on examples for combinatoric power • Extend foci for comprehensiveness where applicable • Create new facets and subfacets where needed • Test classification scheme on new examples, especially boundary cases • Iterate and refine throughout
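The testing steps in this process can be partly mechanized. The sketch below is one possible check, assuming that "orthogonality" means no focus is listed under more than one facet and that each example takes foci only from the facet they belong to. The scheme borrows Function, Material, and Color foci from the team classifications; `shared_foci` and `check_example` are hypothetical helper names.

```python
# A small trial scheme: facet name -> list of foci.
scheme = {
    "Function": ["Container", "Conduit"],
    "Material": ["Plastic", "Paper"],
    "Color": ["Red", "Blue"],
}


def shared_foci(scheme):
    """Return {focus: [facets]} for foci listed under more than one facet."""
    seen = {}
    for facet, foci in scheme.items():
        for focus in foci:
            seen.setdefault(focus, []).append(facet)
    return {focus: names for focus, names in seen.items() if len(names) > 1}


def check_example(scheme, description):
    """Return a list of problems found when classifying one example."""
    problems = []
    for facet, value in description.items():
        if facet not in scheme:
            problems.append(f"unknown facet: {facet}")
        elif value not in scheme[facet]:
            problems.append(f"{value!r} is not a focus of {facet}")
    return problems


# A blue paper straw fits the scheme; a gray object exposes a missing focus.
print(check_example(scheme, {"Function": "Conduit", "Material": "Paper", "Color": "Blue"}))
print(check_example(scheme, {"Color": "Gray"}))
print(shared_foci(scheme))
```

Problems reported for new examples point to the later steps of the process: extend the foci, or create a new facet.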
Facet Guidelines • Terms on the same level in the ontology should be of the same level and type • Facets, subfacets, and foci should have a discernible order • Use of capitalization and singular/plural forms should be uniform • Example: • Sports • Team Sports • Baseball • Football • Basketball • Solo Sports • Marathon Running
Ordering Foci (“Array”) • Simple to complex • (Locomotions: walk, run, jump, skip, hurdle, cartwheel) • Common/popular to uncommon/unpopular • (Vegetarian Pizza Toppings: mushroom, onion, olive, artichoke, pineapple, pine nuts) • Spatial, geographical, or geometric • (Southwestern States: California, Nevada, Arizona, New Mexico) • Chronological, historical, or evolutionary • (Dinosaur Eras: Triassic, Jurassic, Cretaceous) • Canonical (pre-established order) • (Playground Counting: Eenie, Meenie, Miney, Mo) • Alphabetical • (Boys’ Names: Al, Bob, Chuck, David, Ed, Frank, George, Harry) • Size • (T-Shirts: Small, Medium, Large, XL, XXL)
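Most of these orderings cannot be derived from the terms themselves, so they have to be recorded explicitly. A minimal sketch, assuming each array either carries a pre-established order or falls back to alphabetical order; `order_foci` is an illustrative name.

```python
def order_foci(foci, canonical=None):
    """Sort foci by a pre-established order if one is given, else alphabetically."""
    if canonical is None:
        return sorted(foci, key=str.casefold)
    rank = {focus: i for i, focus in enumerate(canonical)}
    # Foci missing from the canonical list are placed last, alphabetically.
    return sorted(foci, key=lambda f: (rank.get(f, len(rank)), f.casefold()))


sizes = ["Large", "XL", "Small", "Medium", "XXL"]
print(order_foci(sizes))
# ['Large', 'Medium', 'Small', 'XL', 'XXL']  (alphabetical is wrong for sizes)
print(order_foci(sizes, canonical=["Small", "Medium", "Large", "XL", "XXL"]))
# ['Small', 'Medium', 'Large', 'XL', 'XXL']
```

The same function covers the chronological and spatial cases: the designer supplies the order once, and every display of the array then follows it.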
Why Develop a Thesaurus? • To provide a conceptual structure or “space” for a body of information • To make it possible to adequately describe the topical content of information resources at an appropriate level of generality or specificity • To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
Why Develop a Thesaurus? • To provide vocabulary (or terminological) control • When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with
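Vocabulary control of this kind amounts to a many-to-one mapping from entry terms to a preferred term. The sketch below shows one possible data structure, assuming simple case and whitespace normalization; the `Thesaurus` class and the automobile terms are invented for illustration and do not come from any real thesaurus.

```python
class Thesaurus:
    """Maps any known entry term to the preferred term for its concept."""

    def __init__(self):
        self._preferred = {}  # normalized entry term -> preferred term

    @staticmethod
    def _normalize(term):
        # Lowercase and collapse runs of whitespace.
        return " ".join(term.lower().split())

    def add_concept(self, preferred, variants=()):
        """Register a preferred term together with its lead-in variants."""
        for term in (preferred, *variants):
            self._preferred[self._normalize(term)] = preferred

    def lookup(self, term):
        """Return the preferred term, or None if the term is unknown."""
        return self._preferred.get(self._normalize(term))


thesaurus = Thesaurus()
thesaurus.add_concept("Automobiles", variants=["cars", "motor cars"])

print(thesaurus.lookup("Motor  Cars"))   # Automobiles
print(thesaurus.lookup("automobiles"))   # Automobiles
print(thesaurus.lookup("trucks"))        # None
```

Whichever term the indexer or searcher starts with, both arrive at the same descriptor, which is what makes the index and the query comparable.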
Preliminary Considerations • What is used now? • Continue using an existing thesaurus? • Ad hoc modification of existing thesaurus? • Develop a new well-structured thesaurus? • What is the scope and complexity of the subject field? • What kind of retrieval objects or data will be dealt with? • How exhaustive and specific is the desired description of objects?
Preliminary Considerations • The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus • It is better to plan for a larger and more comprehensive system than for a smaller system that will rapidly become inadequate as the database grows • Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists
Development of a Thesaurus • Term selection • Merging and development of concept classes • Definition of broad subject fields and subfields • Development of classificatory structure • Review, testing, application, revision
Flow of Work in Thesaurus Construction (flowchart, based on Soergel, pp 327-333) • Collecting terms: Select Sources • Assign codes • Select Terms • Record Selected Terms • Sort Terms • Merge identical Terms • Merge Terms in Same Concept class • Structuring: Define Broad Subject Fields • Sort Terms into Broad Subject Fields • Define Subfields within one Subject Field • Work out detailed structure of the Subject Field • Select Preferred Terms • All Subfields of Broad Subject finished? (No: next subfield; Yes: continue) • All Broad Subjects finished? (No: next subject field; Yes: continue) • Review: Improve Class Structure • Print Classified Index and review • Discuss with Experts and Users • Select descriptors and checklist items • Revise as needed • Many Modifications? (Yes: repeat the review; No: continue) • Assign Notation • Produce Full Thesaurus and Check references • Review and Test
1. Term Selection • Select sources for the collection of terms • Prearranged Sources • Open-ended Sources • Assign codes to each source • Selection of terms • For part of pre-arranged and for all open-ended sources • Enter terms into database with all information
1.1 Kinds of Sources • Prearranged Sources • Existing descriptor lists, classification schemes, thesauri • This includes universal schemes like DDC or LCSH • Nomenclatures of single disciplines • Treatises on the terminology of a field • Encyclopedias, lexica, dictionaries and glossaries • Tables of contents of textbooks and handbooks • Indexes of journals or abstracting journals • Indexes of other publications in the field
1.1 Kinds of Sources • Open-ended sources • Lists of search requests or interest profiles • Description of projects/activities to be served by the information retrieval system • Discussion with specialists in the field • Sample of documents in the field • Ask users why and how these documents relate to the field • Have documents indexed by experts in the field • Lists of titles of documents in the field • Abstracts and reviews of documents • Your own knowledge
Selection of Sources • Prearranged sources require less effort in gathering the material, and may already indicate relationships between terms and concepts as well as relationships among terms • Open-ended sources can reflect current terminology and may provide more complete coverage • Choose a set of sources that are current, as complete as possible, and considered authoritative
Selection of Sources • Each selected source is assigned an ID for tracking its use in the development of the thesaurus • Useful when making decisions about which terms to prefer • Useful for backtracking when questions arise (where did this come from?)
Selection of Terms • Terms can be transferred directly from prearranged sources to the recording medium (cards or database) • You have to decide whether to include only selected terms and references or to take the whole source
Selection of Terms • In open-ended sources you read through the source and pick out terms (i.e., words and phrases) that might be useful in retrieval or as references to other terms • Alternatively, use keyword and phrase extraction software to create lists of terms and select from those • Transfer selected terms to the recording medium (cards or database)
2. Merging and Development of Concept Classes • Sort Term DB into alphabetical order • First Round • Merge information for identical terms, possibly pulling info from additional sources • Second Round • Merge synonyms or terms in the same concept class
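The two merging rounds can be sketched as follows. This is an illustration under assumptions, not Soergel's procedure: each record is taken to be a (term, source ID) pair, identical terms are matched after lowercasing, and the synonym groupings for the second round are supplied by the thesaurus editor. The terms and source IDs are invented.

```python
from collections import defaultdict


def merge_identical(records):
    """Round 1: merge records with identical terms; result is in alphabetical order."""
    merged = defaultdict(set)
    for term, source_id in records:
        merged[term.strip().lower()].add(source_id)
    return {term: merged[term] for term in sorted(merged)}


def merge_concept_classes(terms, concept_classes):
    """Round 2: fold synonyms into one entry per concept class.

    concept_classes maps a preferred term to the synonyms chosen by the editor.
    """
    to_preferred = {syn: pref
                    for pref, synonyms in concept_classes.items()
                    for syn in synonyms}
    result = defaultdict(lambda: {"sources": set(), "variants": set()})
    for term, sources in terms.items():
        preferred = to_preferred.get(term, term)
        entry = result[preferred]
        entry["sources"] |= sources
        if term != preferred:
            entry["variants"].add(term)
    return dict(result)


records = [("Cars", "S1"), ("cars", "S2"), ("automobiles", "S1"), ("bicycles", "S3")]
round1 = merge_identical(records)
round2 = merge_concept_classes(round1, {"automobiles": ["cars"]})
print(list(round1))            # ['automobiles', 'bicycles', 'cars']
print(round2["automobiles"])
```

Keeping the source IDs attached to each merged entry preserves the ability to trace where a term came from when preferred terms are chosen later.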
3. Definition of Broad Subject Fields and Subfields • Define broad subject fields and sort terms into these broad fields • Define subfields within each broad field and sort terms into these subfields • Work out the detailed structure • Select preferred terms • Merge information for terms in the same concept class • Repeat these steps • For each subfield within a broad field • And for each broad field • Until all terms have been consolidated and preferred terms selected
4. Development of Classificatory Structure • Produce preliminary version of classified index and update the working database • Improve classificatory structure • Reality check • Produce and distribute a version of the classified index • Distribute to users/experts
5. Final Stages • Review • Testing • Application • Revision
Review • Discuss classified index with users/experts • Select descriptors and checklist descriptors • Assign notational symbols • Produce main thesaurus and indexes
Review (cont.) • Check cross references and insert where needed • Produce test version • Test by indexing • Modify as needed • Produce production version
Testing a Thesaurus • Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus) • Test retrieval using sample questions and see how effectively the thesaurus maps them to the appropriate descriptors
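A simple way to quantify gaps is to run sample terms from new documents or questions against the entry vocabulary and list those that reach no descriptor. The sketch below assumes the entry vocabulary is a plain mapping from entry term to descriptor and matches terms case-insensitively; the function names and sample terms are hypothetical.

```python
def find_gaps(entry_vocabulary, sample_terms):
    """Return the sample terms that do not map to any descriptor."""
    known = {term.lower() for term in entry_vocabulary}
    return [term for term in sample_terms if term.lower() not in known]


def coverage(entry_vocabulary, sample_terms):
    """Fraction of sample terms that map to a descriptor."""
    if not sample_terms:
        raise ValueError("need at least one sample term")
    gaps = find_gaps(entry_vocabulary, sample_terms)
    return 1 - len(gaps) / len(sample_terms)


vocabulary = {"cars": "Automobiles", "automobiles": "Automobiles"}
samples = ["Cars", "trucks", "automobiles", "vans"]
print(find_gaps(vocabulary, samples))   # ['trucks', 'vans']
print(coverage(vocabulary, samples))    # 0.5
```

The list of gaps feeds directly into the "modify as needed" step: each unmapped term is either added as a lead-in to an existing descriptor or signals a missing concept.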
Art and Architecture Thesaurus • http://orange.sims.berkeley.edu/cgi-bin/flamenco/aa/Flamenco