Indexing & retrieval
Approaches to indexing • Key word indexing • Concept indexing • Social indexing • Non-text indexing
Keyword indexing (1) Entity-oriented - draw terms from entity itself, e.g. from the title "How to succeed in graduate school" Advantages: • Quick • Inexpensive • No vocabulary lag • Multiple access points • Accuracy • No intellectual effort needed
Keyword indexing (2) Disadvantages: • No control over synonyms, near synonyms • No control over homographs
Keyword indexing (3) Disadvantages: • Dependent on authors for informative and accurate titles Examples: "Artificial metalloenzymes based on the biotin−avidin technology: enantioselective catalysis and beyond"; "The golden peaches of Samarkand"
Keyword indexing (4) Disadvantages: • No control over word forms Communicating in the library or Communications in libraries
Keyword indexing (5) Disadvantages: • No cross reference structure
Historical key word indexing methodologies • Uniterm cards • Edge-notched cards • Optical coincidence cards • Key word in context (KWIC) • Spatial indexing
Pre- versus post-coordinate indexing (Mortimer Taube) Pre-coordinate: China—Folklore, China—History, China—Politics, France—Folklore, France—History, France—Politics, Germany—Folklore, Germany—History, Germany—Politics, Russia—Folklore, Russia—History, Russia—Politics (12 terms) Post-coordinate: China, France, Germany, Russia, Folklore, History, Politics (7 terms)
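The term counts on this slide can be reproduced with a short sketch. It assumes the pre-coordinate headings are simply every place combined with every topic; the variable names are illustrative, and the headings are joined with "--" for convenience.

```python
from itertools import product

places = ["China", "France", "Germany", "Russia"]
topics = ["Folklore", "History", "Politics"]

# Pre-coordinate: one heading for every place/topic combination.
pre_coordinate = [f"{place}--{topic}" for place, topic in product(places, topics)]

# Post-coordinate: each concept is listed once and combined at search time.
post_coordinate = places + topics

print(len(pre_coordinate), "pre-coordinate headings")   # 4 x 3 = 12
print(len(post_coordinate), "post-coordinate terms")    # 4 + 3 = 7
```

The gap widens as the lists grow, because the pre-coordinate list grows with the product of the two lists while the post-coordinate list grows with their sum.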
Post-coordinate index searching "History of France" → France * History • Two sets of documents: France, History • Boolean AND search yields intersection of the two sets: France AND History
Advantages to Taube's system • No need to develop a list of authorized terms: terms are pulled from the documents themselves • No need to articulate rules of punctuation for representing complex concepts (France—History) • No need to delineate citation order (France—History v. History—France) • No need to formulate rules for subheadings ("May subdivide geog.")
Uniterm cards One card per term Document no. 102: "Arrest statistics of the Arizona State Police" Card "state": 31 102 53 24 75 96 107 68 49 70 34 95 117 59 115 147 109 Card "police": 11 102 23 85 96 87 68 49 60 91 115 107 79
Searching with uniterm cards Query: looking for documents about state police Card "state": 31 102 53 24 75 96 107 68 49 70 34 95 117 59 115 147 109 Card "police": 11 102 23 85 96 87 68 49 60 91 115 107 79 Documents on both cards include: 102 Arrest statistics of the Arizona State Police. 107 A short history of the Wisconsin State Police. 115 The modern police state.
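The card comparison above amounts to a set intersection. A minimal sketch, using the document numbers from the two cards on the slide; the function name `search` is an illustrative choice:

```python
# Document numbers copied from the "state" and "police" uniterm cards.
uniterm_cards = {
    "state": [31, 102, 53, 24, 75, 96, 107, 68, 49, 70, 34, 95, 117, 59, 115, 147, 109],
    "police": [11, 102, 23, 85, 96, 87, 68, 49, 60, 91, 115, 107, 79],
}

def search(cards, *terms):
    """Return the document numbers that appear on every requested card."""
    postings = [set(cards[term]) for term in terms]
    return sorted(set.intersection(*postings))

hits = search(uniterm_cards, "state", "police")
print(hits)  # [49, 68, 96, 102, 107, 115]
```

Document 115 ("The modern police state") is retrieved because both words occur in its title, even though it is not about a state police force. Matching on words alone cannot tell the two cases apart.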
Edge-notched cards One card per bibliographic item Edge positions labelled: pet-care, bears, pterodactyls Card: Whirdeaux, Ima. Caring for your pet pterodactyl / by Ima Whirdeaux. Call no. Q54321 .W45 Card: Turner, Paige. Caring for your pet grizzly / by Paige Turner. Call no. Q12345 .T8
Pyramid coding for edge-notched cards Coding the year 1947* [Figure: the two digits coded with 20 dots (two rows of digits 0-9) and with 10 dots (two pyramid arrangements of the digits 0-9)] *They hadn't heard of the Y2K problem yet.
Optical coincidence cards Pre-printed cards with numbers for entire database [Figure: card labelled "fleas", pre-printed with the numbers 0 to 99]
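One way to picture optical coincidence searching is as a bitwise AND: each term card has a hole at the position of every document indexed under that term, and light passes through a stack only where all cards are punched. The sketch below models a card as an integer bitmask. The document numbers and the second term ("dogs") are made up for illustration.

```python
def punch(doc_numbers):
    """Build a card as a bitmask with one 'hole' per document number."""
    card = 0
    for number in doc_numbers:
        card |= 1 << number
    return card

def coincide(*cards):
    """Stack the cards and list the positions where every card has a hole."""
    light = cards[0]
    for card in cards[1:]:
        light &= card
    return [pos for pos in range(light.bit_length()) if (light >> pos) & 1]

fleas = punch([3, 17, 42, 88])   # hypothetical document numbers
dogs = punch([5, 17, 42, 60])    # hypothetical second term card

print(coincide(fleas, dogs))  # [17, 42]
```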
Key Word in Context (KWIC) Index
Doc 15 title: "A comparison of OCLC and WLN hit rates for monographs and an analysis of the types of records retrieved"
Stop words in the title (such as "of" and "and") are not indexed.
CONTEXT                     KEY WORDS                    POINTER
 ttems of remote users: an  analysis of the types of     15
 hit rates for monograph/A  comparison of OCLC and WLN   15
comparison of OCLC and WLN  hit rates for monographs and /  15
OCLC and WLN hit rates for  monographs and an analysi/   15
onographs/ A comparison of  OCLC and WLN hit rates for   15
arison of OCLC and WLN hit  rates for monographs and /   15
n analysis of the types of  records retrieved. A com/    15
 s of the types of records  retrieved. A comparison /    15
phs and an analysis of the  types of records retrieve/   15
  A comparison of OCLC and  WLN hit rates for monogra/   15
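A KWIC listing like the one above can be generated mechanically. The sketch below is a simplified version: the stop word list and the 25-character window are assumptions, and it does not wrap the end of the title around to the front the way the slide's listing does.

```python
STOP_WORDS = {"a", "an", "and", "for", "of", "the"}  # assumed stop list

def kwic_entries(title, doc_id, width=25):
    """Return (key word, left context, right context, pointer) tuples,
    sorted by key word, skipping stop words."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip('.,;:"').lower()
        if key in STOP_WORDS:
            continue
        left = " ".join(words[:i])[-width:]   # text leading up to the key word
        right = " ".join(words[i:])[:width]   # key word and what follows it
        entries.append((key, left, right, doc_id))
    return sorted(entries)

title = ("A comparison of OCLC and WLN hit rates for monographs "
         "and an analysis of the types of records retrieved")
entries = kwic_entries(title, 15)
for key, left, right, pointer in entries:
    print(f"{left:>25}  {right:<25}  {pointer}")
```

Sorting on the lowercased key word gives the same alphabetical order of entries as the listing on the slide.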
Key Word Out of Context (KWOC) Index aardvark 101 baggage 123 banyan 128, 159, 179 coconut 955, 654 driving 196, 488, 788 elementary 455, 785 elephant 128, 465, 783 garage 678, 398 hardware 849, 483, 399 meter 768 nadir 877 noxious 112 opium 289 opus 985, 159, 849 people 629, 458 quark 137, 492 radar 968, 295 radio 430, 206, 749 stereo 294, 837, 873 television 745, 727, 883 ultraviolet 958, 774 zebra 276
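A KWOC index is essentially an alphabetical list of key words, each followed by the numbers of the documents it came from. A minimal sketch, reusing the three titles from the uniterm example; the stop word list is an assumption:

```python
from collections import defaultdict

STOP_WORDS = {"a", "of", "the"}  # assumed stop list

titles = {
    102: "Arrest statistics of the Arizona State Police",
    107: "A short history of the Wisconsin State Police",
    115: "The modern police state",
}

def kwoc_index(titles):
    """Map each key word to the sorted document numbers whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in titles.items():
        for word in title.lower().split():
            word = word.strip(".,")
            if word and word not in STOP_WORDS:
                index[word].add(doc_id)
    return {term: sorted(docs) for term, docs in sorted(index.items())}

index = kwoc_index(titles)
for term, docs in index.items():
    print(term, ", ".join(map(str, docs)))
```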
Vector space model (VSM) Each document represented by a vector [Figure: vector for the document entitled "Assistive technology for libraries", plotted on the axes "assistive", "technology" and "libraries"]
Vector space model matching Similarity between query and document vectors [Figure: vectors for document 1, document 2 and the query, plotted on the axes "assistive", "technology" and "libraries"]
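The slide does not say which similarity measure is used. Cosine similarity is a common choice, and the sketch below assumes it. Vectors are given on the three axes (assistive, technology, libraries); document 2 and the query are made-up examples.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Axes: (assistive, technology, libraries)
doc1 = (1, 1, 1)    # "Assistive technology for libraries"
doc2 = (0, 0, 1)    # hypothetical document that only mentions libraries
query = (1, 1, 0)   # hypothetical query: "assistive technology"

for name, vector in (("doc1", doc1), ("doc2", doc2)):
    print(name, round(cosine(query, vector), 3))
```

Here document 1 scores about 0.816 and document 2 scores 0, so document 1 would be ranked first for this query.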
VSM term weighting Assign high weights to terms that appear frequently in the document but infrequently in the database
Query: "I'm looking for articles about assistive technology for the blind."
Term         Freq. w/in document   No. of documents with term
conclusion   low                   high
information  high                  high
blind        high                  low
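One common way to express this weighting is tf-idf: term frequency multiplied by the log of the inverse document frequency. The sketch below assumes that formulation. The counts are invented to match the low/high pattern on the slide and are not taken from any real collection.

```python
import math

def tf_idf(term_freq, doc_freq, n_docs):
    """Weight grows with frequency in the document and shrinks as more documents contain the term."""
    return term_freq * math.log(n_docs / doc_freq)

N_DOCS = 1000  # hypothetical database size

# term: (frequency within the document, number of documents containing the term)
stats = {
    "conclusion": (1, 900),    # low in document, high in database
    "information": (8, 900),   # high in document, high in database
    "blind": (8, 20),          # high in document, low in database
}

weights = {term: tf_idf(tf, df, N_DOCS) for term, (tf, df) in stats.items()}
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:<12} {weight:.2f}")
```

With these numbers "blind" gets by far the highest weight, which is the behavior the slide describes.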
VSM refinements Adding semantic and syntactical parsing. • Bill is going to the store to make a purchase. • Bill is going to purchase the store. • Bill is going to store his purchase.
Concept indexing • Rather than pulling terms from documents, assign concept identifier (e.g. France—History) to documents dealing with history of France • Requires intellectual effort • Takes more time than key word indexing so less economical • Avoids problems of false coordination and synonymy through use of vocabulary control
Vocabulary control (1) One indexing term or phrase to represent a concept • Unidentified flying objects not flying saucers • Point user to correct term with "use" reference • Reduces number of searches needed to find items about a particular topic
Vocabulary control (2) One form of a word to represent the concept • Dictionaries not dictionary
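The "use" references in vocabulary control (1) and the preferred word form in (2) can both be pictured as a lookup from an entry term to the authorized term. A minimal sketch; the table holds only the two examples from the slides, and the names `USE_REFERENCES` and `authorized_term` are illustrative.

```python
# Entry term (lowercased) -> authorized indexing term.
USE_REFERENCES = {
    "flying saucers": "Unidentified flying objects",
    "dictionary": "Dictionaries",
}

def authorized_term(entry):
    """Follow a 'use' reference if one exists; otherwise return the entry unchanged."""
    return USE_REFERENCES.get(entry.strip().lower(), entry)

print(authorized_term("Flying saucers"))  # Unidentified flying objects
print(authorized_term("dictionary"))      # Dictionaries
```

Because every variant resolves to one term, a single search on the authorized term finds items that would otherwise be scattered across several synonyms or word forms.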
Vocabulary control (3) One usage of a homographic term • Fault (geologic) not fault (responsibility for error) • Usage identified through scope note • Consistency among indexers as well as one indexer over time • Helps user to avoid false drops
Vocabulary control (4) Syndetic structure • Broader terms • Narrower terms • Related terms (see also) • User can negotiate structure to find most appropriate term, as well as identify additional related terms of potential use in finding relevant documents
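A syndetic structure can be sketched as a record per term with links to broader, narrower and related terms. In the example below only "Fault (geologic)" comes from the slides. The scope note wording and all linked terms are hypothetical placeholders, and `Term` and `suggestions` are illustrative names.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    scope_note: str = ""
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)

thesaurus = {
    "Fault (geologic)": Term(
        label="Fault (geologic)",
        scope_note="Fractures in rock; not responsibility for error.",  # hypothetical wording
        broader=["Geologic structures"],    # hypothetical BT
        narrower=["Strike-slip faults"],    # hypothetical NT
        related=["Earthquakes"],            # hypothetical RT (see also)
    ),
}

def suggestions(thesaurus, label):
    """List the terms a user could move to from the given term."""
    term = thesaurus[label]
    return term.broader + term.narrower + term.related

print(suggestions(thesaurus, "Fault (geologic)"))
```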
Social network indexing • Tags • Tag clouds • User-created tags providing access to library resources
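A tag cloud usually displays each tag at a size that reflects how often it has been applied. The sketch below assumes simple linear scaling between a minimum and maximum font size. The tag names and counts are made up for illustration.

```python
def tag_cloud(counts, min_size=10, max_size=32):
    """Map each tag to a font size scaled linearly by how often it was used."""
    lowest, highest = min(counts.values()), max(counts.values())
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        tag: round(min_size + (count - lowest) * (max_size - min_size) / span)
        for tag, count in sorted(counts.items())
    }

# Hypothetical usage counts for three tags.
counts = {"architecture": 40, "medieval": 10, "river": 25}
sizes = tag_cloud(counts)
print(sizes)  # {'architecture': 32, 'medieval': 10, 'river': 21}
```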
flickr http://www.flickr.com/
Tags architecture Bohemian South Country Czech Republic Europe European historical medieval old Old Town Other Keywords River Snow town Vltava Tags
Tags (177,583 photos)
http://www.delicious.com/mauicclibrary technology The economic case for open access in academic publishing Portable software for USB drives CU Researcher Finds 10,000-Year-Old Hunting Weapon in Melting Ice Patch
University of Pennsylvania http://www.library.upenn.edu/
Adding a PennTag Add to PennTags