
CS 430: Information Discovery



Presentation Transcript


  1. CS 430: Information Discovery Lecture 16 Thesauruses and Gazetteers

  2. Shared Work!!! Some programs for Assignment 2 had sections of identical code! This is not acceptable. 1. If you incorporate code from other sources, it must be acknowledged. 2. If you work with a colleague: (a) You must write your own assignment. (b) You should acknowledge the joint preparation. IF YOU HAVE NOT FOLLOWED THESE PRINCIPLES, CONTACT ME DIRECTLY.

  3. Course Administration Midterm examination • Wednesday, October 31, 7:30 to 9:00 • Room: To be announced. • Three questions based on readings and lectures • Open book Sample examination • See the Notices page on the course web site for last year's midterm and a set of PowerPoint slides that discuss the solutions. (This examination had four questions.)

  4. Examination Suggested items to bring: • Textbook • Copies of lecture slides • Discussion class readings. The examination will cover only material from the lectures and the discussion classes. The objective is to reward people who regularly attend class and prepare thoroughly for the discussion sections.

  5. Course Administration Syllabus changes Because of a National Science Foundation meeting that was rescheduled after September 11: • Lecture on December 4 is cancelled. • Topics for other lectures have been reordered. • There are no changes in the readings or assignments.

  6. Lexicon and Thesaurus
     Lexicon contains information about words, their morphological variants, and their grammatical usage.
     Thesaurus relates words by meaning:
       ship, vessel, sail; craft, navy, marine, fleet, flotilla
       book, writing, work, volume, tome, tract, codex
       search, discovery, detection, find, revelation
     (From Roget's Thesaurus, 1911)

  7. Thesaurus in Information Retrieval
     Use of a thesaurus in indexing (precoordination)
     A. Manual
     Used to guide a human indexer in assigning standard terms and associations.
       computer-aided instruction
         see also education
         UF teaching machines
         BT educational computing
         TT computer applications
         RT education
         RT teaching
     From: INSPEC Thesaurus
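
The INSPEC record above can be read as a small data structure. The following is a minimal sketch, assuming the usual readings of the codes (UF = used for, BT = broader term, TT = top term, RT = related term); the class name, field names, and the `suggest` helper are illustrative and not part of INSPEC.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One record of a controlled vocabulary (field names are illustrative)."""
    term: str                                            # preferred (standard) term
    used_for: list[str] = field(default_factory=list)    # UF: non-preferred synonyms
    broader: list[str] = field(default_factory=list)     # BT: broader terms
    top: list[str] = field(default_factory=list)         # TT: top term of the hierarchy
    related: list[str] = field(default_factory=list)     # RT: related terms


# The record shown on the slide.
ENTRY = ThesaurusEntry(
    term="computer-aided instruction",
    used_for=["teaching machines"],
    broader=["educational computing"],
    top=["computer applications"],
    related=["education", "teaching"],
)


def build_lookup(entries):
    """Map every preferred and non-preferred term (lowercased) to its entry."""
    lookup = {}
    for entry in entries:
        lookup[entry.term.lower()] = entry
        for synonym in entry.used_for:
            lookup[synonym.lower()] = entry
    return lookup


def suggest(candidate, lookup):
    """Return the standard term for a candidate, or None if it is not listed."""
    entry = lookup.get(candidate.lower())
    return entry.term if entry else None


LOOKUP = build_lookup([ENTRY])
print(suggest("teaching machines", LOOKUP))
```

An indexer who types "teaching machines" is steered to the standard term "computer-aided instruction", and the entry's `broader` and `related` lists supply the associations to consider.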

  8. Thesaurus in Information Retrieval
     Use of a thesaurus in indexing (precoordination)
     B. Automatic
     Divide terms into thesaurus classes. Replace similar terms by a thesaurus class.
       408  dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
       409  blast-cooled, heat-flow, heat-transfer
       410  anneal, strain
     From: Salton and McGill
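
Replacing terms by a thesaurus class can be sketched in a few lines. This is an illustration only, not the procedure from Salton and McGill: the table holds a handful of terms from the slide, and the `class:NNN` label format and the function name are assumptions.

```python
# A few term-to-class assignments taken from the slide (not the full classes).
THESAURUS_CLASSES = {
    "dislocation": 408,
    "blast-cooled": 409,
    "heat-flow": 409,
    "heat-transfer": 409,
    "anneal": 410,
    "strain": 410,
}


def to_class_terms(tokens):
    """Replace each token that belongs to a thesaurus class by a class label.

    Tokens that are not in the thesaurus are passed through unchanged.
    """
    return [
        f"class:{THESAURUS_CLASSES[t]}" if t in THESAURUS_CLASSES else t
        for t in tokens
    ]


doc_a = ["heat-flow", "measurement"]
doc_b = ["heat-transfer", "measurement"]
print(to_class_terms(doc_a))
print(to_class_terms(doc_b))
```

After replacement both documents carry the same index term for class 409, so a query using either original word matches both.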

  9. Desirable Properties for Information Retrieval • Thesaurus is specific to a subject area. Contains only terms of interest for identification within that subject area. • Ambiguous terms are coded only for the senses important for that field. • Target is that each thesaurus class should include terms of moderate frequency. Ideally the classes should have similar frequency.

  10. Art and Architecture Thesaurus • Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture. • Almost 120,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures. • Used by archives, museums, and libraries to describe items in their collections. • Used to search for materials. • Used by computer programs, for information retrieval, and natural language processing. • A project of the J. Paul Getty Trust

  11. Art and Architecture Thesaurus • Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism. • Concept: a cluster of terms, one of which is established as the preferred term, or descriptor. • Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.

  12. Art and Architecture Thesaurus: Sample Record
      Record ID: 198841
      Descriptor: rhyta
      Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.
      Hierarchy:
        Containers [TQ]
        ...<containers by function or context>
        ...........<culinary containers>
        ...................<containers for serving and consuming food>

  13. Art and Architecture Thesaurus: Sample Record (continued)
      Terms:
        rhyta
        rhyton (alternate, singular)
        protomai
        protome
        rhea
        rheon
        rheons
      Related concepts:
        stirrup cups
        sturzbechers
        drinking vessels
        ceremonial vessels

  14. MeSH -- Medical Subject Headings • Controlled vocabulary for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE. • About 19,000 primary subject headings • Thesaurus of 110,000 chemical terms. • Total vocabulary over 300,000 terms. • National Library of Medicine provides MeSH subject headings for each of the 400,000 articles that it indexes every year. • "MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts."

  15. MeSH -- Medical Subject Headings MeSH hierarchy: • general terms, e.g., anatomy, organisms, diseases, biological sciences; • anatomy is divided into sixteen topics, e.g., body regions and musculoskeletal system; • body regions is divided into sections, e.g., abdomen, axilla, back, etc.

  16. Example of MeSH hierarchy
      Biological Sciences [G]
        Biological Sciences [G01] +
        Health Occupations [G02] +
        Environment and Public Health [G03] +
        Biological Phenomena, Cell Phenomena, and Immunity [G04] +
        Genetics [G05] +
        Biochemical Phenomena, Metabolism, and Nutrition [G06] +
        Physiological Processes [G07] +
        Reproductive and Urinary Physiology [G08] +
        Circulatory and Respiratory Physiology [G09] +
        Digestive, Oral, and Skin Physiology [G10] +
        Musculoskeletal, Neural, and Ocular Physiology [G11] +
        Chemical and Pharmacologic Phenomena [G12] +

  17. Example of MeSH hierarchy (continued)
      Physiological Processes [G07]
        Adaptation, Physiological [G07.062] +
        Aging [G07.168] +
        Body Constitution [G07.265] +
        Body Temperature [G07.315]
          Body Temperature Regulation [G07.315.232] +
          Skin Temperature [G07.315.753]
        Chronobiology [G07.450] +
        Electrophysiology [G07.453] +
        Fluid Shifts [G07.503]
        Growth and Embryonic Development [G07.553] +
        Homeostasis [G07.621] +
        Tensile Strength [G07.900]
        Tropism [G07.950] +

  18. Example of MeSH hierarchy (continued)
      MeSH Heading: Body Temperature
      Tree Number: E01.370.600.120
      Tree Number: G07.315
      Entry Term: Organ Temperature
      See Also: Fever
      See Also: Thermography
      See Also: Thermometers
      Allowable Qualifiers: DE GE IM PH RE
      Unique ID: D001831
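
The tree numbers in these examples encode the hierarchy directly: each dot-separated segment adds one level, so G07.315.232 sits under G07.315, which sits under G07. The sketch below relies only on that pattern as it appears on the slides; the function names are illustrative, and the single-letter category (here G) is not returned as an ancestor.

```python
def ancestors(tree_number):
    """Return the ancestor tree numbers, most general first.

    Assumes each dot-separated segment adds one level of the hierarchy.
    """
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(tree_number, ancestor):
    """True if tree_number lies strictly below ancestor in the tree."""
    return tree_number.startswith(ancestor + ".")


# Headings copied from the slides, keyed by tree number.
HEADINGS = {
    "G07": "Physiological Processes",
    "G07.315": "Body Temperature",
    "G07.315.232": "Body Temperature Regulation",
    "G07.315.753": "Skin Temperature",
}

for number in ancestors("G07.315.232"):
    print(number, HEADINGS[number])
```

A search on Body Temperature can be widened to its narrower headings by selecting every heading whose tree number passes `is_descendant(number, "G07.315")`. A heading with several tree numbers, such as Body Temperature itself, has a separate chain of ancestors for each one.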

  19. Observations about Manually Maintained Thesauruses • Permit a very rich structure of relationships • Most effective when the user of the search system is skilled in the discipline and trained in the use of the thesaurus (e.g., medical librarian) • Need continual updating as a field develops new terminology • Expensive to create and maintain

  20. Gazetteers
      The Alexandria Digital Library (ADL): geolibrary at University of California at Santa Barbara where a primary attribute of objects is location on Earth (e.g., map, satellite photograph).
      Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.
      Gazetteer: list of geographic names, with geographic locations and other descriptive information.
      Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)

  21. Alexandria Thesaurus: Example
      canals
      A feature type category for places such as the Erie Canal.
      Used for: The category canals is used instead of any of the following.
        canal bends
        canalized streams
        ditch mouths
        ditches
        drainage canals
        drainage ditches
        ... more ...
      Broader Terms: Canals is a sub-type of hydrographic structures.

  22. Alexandria Thesaurus: Example (continued)
      canals (continued)
      Related Terms: The following is a list of other categories related to canals (non-hierarchical relationships).
        channels
        locks
        transportation features
        tunnels
      Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. » Definition of canals.

  23. Use of a Gazetteer • Answers the "Where is" question; for example, "Where is Santa Barbara?" • Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects. • Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
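
Two of these uses, the "Where is" lookup and the search for feature types inside a box drawn on a map, can be sketched as follows. This is not the Alexandria implementation: the class names, function names, and all gazetteer entries are invented for illustration, and footprints are simplified to single points in decimal degrees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    feature_type: str
    lat: float   # decimal degrees, north positive
    lon: float   # decimal degrees, east positive


@dataclass(frozen=True)
class Box:
    south: float
    west: float
    north: float
    east: float

    def contains(self, f: Feature) -> bool:
        """True if the feature's point footprint lies inside the box."""
        return self.south <= f.lat <= self.north and self.west <= f.lon <= self.east


# Hypothetical entries; names and coordinates are invented.
GAZETTEER = [
    Feature("Example School", "school", 34.42, -119.70),
    Feature("Example Hospital", "hospital", 34.43, -119.72),
    Feature("Example Lake", "lake", 36.00, -118.00),
]


def where_is(name):
    """Translate a geographic name into its location(s)."""
    return [(f.lat, f.lon) for f in GAZETTEER if f.name == name]


def features_in(box, feature_type):
    """Find the names of all features of one type inside a user-drawn box."""
    return [
        f.name for f in GAZETTEER
        if f.feature_type == feature_type and box.contains(f)
    ]


drawn = Box(south=34.0, west=-120.0, north=35.0, east=-119.0)
print(where_is("Example Lake"))
print(features_in(drawn, "school"))
```

The simple comparison in `Box.contains` does not handle a box that crosses the 180° meridian; a real gazetteer would also match against box and polygon footprints, not only points.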

  24. Alexandria Gazetteer: Example from a search on "Tulsa"
      Feature name                             State  County  Type     Latitude  Longitude
      Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
      Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
      Tulsa County                             OK     Tulsa   civil    360600N   0955400W
      Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
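
The latitude and longitude values in this example appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Under that assumption, a small conversion to signed decimal degrees might look like this; the function name is illustrative.

```python
def dms_to_decimal(value):
    """Convert 'DDMMSSN' or 'DDDMMSSW' to signed decimal degrees.

    Assumes the last character is the hemisphere (N, S, E, W) and the
    digits before it are degrees, two-digit minutes, two-digit seconds.
    South and west are returned as negative numbers.
    """
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unrecognized hemisphere in {value!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"unrecognized coordinate: {value!r}")
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


# The populated place "Tulsa" from the table.
lat = dms_to_decimal("360914N")
lon = dms_to_decimal("0955933W")
print(round(lat, 4), round(lon, 4))
```

For the first row this gives roughly 36.1539 degrees north and 95.9925 degrees west, which can then be compared directly with decimal footprints of collection objects.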

  25. Challenges for the Alexandria Gazetteer Content standard: A standard conceptual schema for gazetteer information. Feature types: A type scheme to categorize individual features that is rich in term variants and extensible. Temporal aspects: Geographic names and attributes change through time. "Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).

  26. Challenges for the Alexandria Gazetteer (continued) Quality aspects: (a) Indicate the accuracy of latitude and longitude data. (b) Ensure that the reported coordinates agree with the other elements of the description. Spatial extents: (a) Points do not represent the extent of geographic locations and are therefore only minimally useful. (b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
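
The bounding-box problem can be reproduced with a toy shape. The sketch below uses a hypothetical L-shaped region in arbitrary planar units as a stand-in for a state outline (it is not real coordinate data) and compares a bounding-box test with a standard ray-casting point-in-polygon test.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                           # crossing lies to the right
                inside = not inside
    return inside


# Hypothetical L-shaped region; vertices listed in order around the boundary.
REGION = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]

neighbor = (3, 3)   # lies in the notch of the L, outside the region
print(in_box(neighbor, bounding_box(REGION)), in_polygon(neighbor, REGION))
```

The point in the notch passes the box test but fails the polygon test, which is the same effect as a Nevada location falling inside the bounding box for California. Boxes remain useful as a cheap first filter before a more exact footprint comparison.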

  27. Examples of Gazetteers
      Alexandria Digital Library: Linda L. Hill, James Frew, and Qi Zheng, "Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library," D-Lib Magazine, 5(1), January 1999. http://www.dlib.org/dlib/january99/hill/01hill.html
      Getty Thesaurus of Geographic Names: http://www.getty.edu/research/tools/vocabulary/tgn/
