Seamless Searching of Numeric and Textual Resources
Funded by a National Library Leadership Grant from the Institute of Museum and Library Services
Michael Buckland, Aitao Chen, Fredric Gey and Ray Larson
Friday Afternoon Seminar, Feb 14, 2003
http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf
From numbers to texts: An article found using the keywords “Import” and “Vietnam” as query. Iritani, Evelyn. "Normalizing ties to Vietnam important steps for U.S. firms; California stands to profit handsomely when barriers fall to trade with fast-growing country." Los Angeles Times v114 (July 12, 1995):D1.
From text to numbers: Topic of interest: imports of beef to the United States from Britain.
"U.S. bans import of most European meat". Los Angeles Times v116, n14 (Dec 14, 1997):A22. (On fear of mad cow disease.)
"Ban on cattle and sheep is extended to all Europe." New York Times v147, sec1 (Dec 14, 1997):16(N), 42(L). (The U.S. Agriculture Department responds to threat of 'Mad Cow' disease.)
The sources at http://govinfo.kerr.orst.edu/import/import.html show no reported edible beef imports from the United Kingdom.
Seamless Search Project Goals:
• Phase I: The development and demonstration of a library gateway that supports searching both text and socio-economic numeric databases.
• Phase II: The demonstration of a library gateway supporting searches between text and numeric databases.
Data Sets to Create Entry Vocabulary Indexes: MELVYL MARC Files
A sample training record extracted from a MARC record. Field 245 holds the book title; the 650 fields hold the LC Subject Headings.
<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>
Number of MARC records in the training data set: ~4,246,000.
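To make the training-record format concrete, here is a minimal Python sketch that pulls the record ID, title, and subject headings out of the sample record. It is an illustration, not the project's code: the function name `parse_training_record` is hypothetical, and because numeric tags such as `<245>` are not legal XML element names, the sketch assumes they can be renamed to `<f245>` and so on before parsing.

```python
import re
import xml.etree.ElementTree as ET

# Sample record from the slide (with the last <650> properly closed).
SAMPLE = """<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>"""


def parse_training_record(text):
    """Return (record_id, title, subject_headings) for one training record.

    Assumption of this sketch: numeric tags are rewritten to f001, f245,
    f650 so that a standard XML parser accepts them.
    """
    xml_safe = re.sub(r"<(/?)(\d{3})>", r"<\1f\2>", text)
    root = ET.fromstring(xml_safe)
    record_id = root.findtext("f001").strip()
    title = root.findtext("f245/a").strip()
    # Drop the trailing period that cataloging punctuation adds to headings.
    subjects = [a.text.strip().rstrip(".") for a in root.findall("f650/a")]
    return record_id, title, subjects


record_id, title, subjects = parse_training_record(SAMPLE)
print(record_id, subjects)
```

Each parsed record yields one (title words, subject headings) pair, which is the raw material for the word-to-LCSH statistics.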
Statistical association of title words and LCSH
[Figure: title words (attitude, baby, behavior, child, development, infant, infancy, psychology) linked through document IDs (doc1 to doc5) to LCSHs (Infant development, Infant psychology, Operant conditioning, Parent and child, Psychology).]
Word to LCSH Entry Vocabulary Index (EVI)
List of the LCSHs that are most closely associated, statistically, with the query word: alcoholism.
Rank  LCSH                               Weight
1     alcoholism                         7470.46
2     alcoholic                          1745.23
3     alcohol                            709.26
4     alcoholism and employment          318.26
5     drug abuse                         257.75
6     alcohol, ethyl                     235.13
7     drinking of alcoholic beverages    151.46
8     substance abuse                    146.04
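The slides do not say which statistic produces these weights. As a hedged illustration only, the sketch below assumes a log-likelihood ratio (G²) computed over a 2x2 word/heading co-occurrence table, a common choice for this kind of association measure. The function names `llr` and `build_evi` and the toy records are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def _xlogx(x):
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: records with both word and heading, k12: word only,
    k21: heading only, k22: neither.
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + _xlogx(n))


def build_evi(records):
    """records: iterable of (title_words, headings) pairs.

    Returns {word: {heading: weight}}, keeping only pairs that co-occur
    more often than independence would predict.
    """
    n = 0
    word_df, head_df, pair_df = Counter(), Counter(), Counter()
    for words, heads in records:
        n += 1
        words, heads = set(words), set(heads)
        word_df.update(words)
        head_df.update(heads)
        pair_df.update(product(words, heads))
    evi = {}
    for (w, h), k11 in pair_df.items():
        k12 = word_df[w] - k11
        k21 = head_df[h] - k11
        k22 = n - k11 - k12 - k21
        if k11 * n > word_df[w] * head_df[h]:  # observed > expected
            evi.setdefault(w, {})[h] = llr(k11, k12, k21, k22)
    return evi


# Toy training data (made up for illustration).
TOY_RECORDS = [
    (["infant", "behavior"], ["Infant psychology"]),
    (["infancy", "conditioning"], ["Infant psychology", "Operant conditioning"]),
    (["conditioning", "reinforcement"], ["Operant conditioning"]),
    (["child", "parent"], ["Parent and child"]),
]
evi = build_evi(TOY_RECORDS)
print(evi["conditioning"])
```

Sorting a word's headings by descending weight gives a ranked list of the same shape as the one on the slide.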
Words to LCSH Entry Vocabulary Index (EVI)
List of LCSHs that are most closely associated, statistically, with the German query word: Wirtschaftspolitik.
Rank  LCSH                 Weight
1     economic policy      756.90
2     german (west)        645.02
3     switzerland          97.70
4     regional planning    96.39
5     economics            92.14
Note: The top-ranked LCSH “economic policy” happens to be the English translation of the German word “Wirtschaftspolitik”.
Words to LCSH Entry Vocabulary Index (EVI)
List of LCSHs that are most closely associated, statistically, with the phrase peanut butter as a query.
Rank  LCSH                       Weight
1     peanut                     1343.90
2     cookery (peanut butter)    429.61
3     cookery (peanuts)          423.47
4     peanut industry            359.57
5     peanut butter              316.23
6     butter                     309.36
7     schulz, charles m          277.30
8     cookery                    197.08
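The slides do not describe how a multi-word query such as "peanut butter" is scored. One simple possibility, shown here purely as an assumption, is to sum each heading's per-word weights. The function `rank_headings` and the weights in `TOY_EVI` are invented for illustration and are not the values from the slide.

```python
from collections import defaultdict


def rank_headings(query, evi, top_k=5):
    """Rank headings for a free-form query by summing per-word weights.

    evi maps word -> {heading: weight}. Summation is an assumption of this
    sketch, not a documented detail of the project.
    """
    scores = defaultdict(float)
    for word in query.lower().split():
        for heading, weight in evi.get(word, {}).items():
            scores[heading] += weight
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical per-word weights.
TOY_EVI = {
    "peanut": {"peanut": 9.0, "cookery (peanuts)": 4.0, "peanut butter": 2.0},
    "butter": {"butter": 5.0, "peanut butter": 3.0, "cookery": 1.0},
}

for heading, score in rank_headings("peanut butter", TOY_EVI):
    print(f"{heading}\t{score:.2f}")
```

Under this scheme a heading strongly tied to just one query word can outrank the heading that matches the whole phrase, which is consistent with "peanut" appearing above "peanut butter" in the slide's list.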
Word to LCSH Entry Vocabulary Index (EVI)
List of LCSHs that are most closely associated, statistically, with the query: Vietnam War.
Rank  LCSH                               Weight
1     world war, 1939-1945               16430.62
2     vietnamese conflict, 1961-1975     15388.68
3     united states                      13989.66
4     world war, 1914-1918               8055.60
5     vietnam                            6523.90
Note: “Vietnam War” is not an established (authorized) LCSH. The established LCSH is “Vietnamese conflict”.
LCSH to Words Entry Vocabulary Index
List of words that are most closely associated, statistically, with the Library of Congress Subject Heading: Alcoholism.
Rank  Word          Weight
1     alcohol       13471.94
2     alcoholism    11715.56
3     abuse         3708.09
4     drug          3467.22
5     drink         2563.53
6     alcoholic     2534.91
7     treatment     2349.03
8     prevention    1263.94
9     problem       1148.03
10    addiction     886.81
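EVI-based Access to MELVYL
[Figure: architecture diagram with numbered steps 1 to 7. Components: Web browser; httpd Web server with CGI programs for EVI access and gateway access; EVI; HTTP/Z39.50 gateway; MELVYL Z39.50 server; other Z39.50 servers. Labels on the browser side: free-form query, ranked list of LCSHs, search results, full MARC record. Protocols shown: HTTP and Z39.50.]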
Counting California Database (http://countingcalifornia.cdlib.org/)
• A collection of some 3,000 numeric tables.
• Organized into 16 topics and 184 subtopics.
Sample topics:
• Banking, Finance and Insurance
• Elections
• Population and Demographics
• Social Services and Public Assistance
Sample subtopics under Agriculture and Natural Resources:
• Farms and Farming
• Fishing
• Forestry and Lumber
• Minerals
Enhanced Access to Counting California Database
• Conventional probabilistic retrieval of numeric tables using table captions, mapping the query to the text of the captions.
• Access to numeric tables through the words-to-subtopic entry vocabulary index.
A sample record created from http://countingcalifornia.cdlib.org:
<table>
<topic> education </topic>
<subtopic> libraries </subtopic>
<caption>STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98</caption>
</table>
Probabilistic Access to Counting California Database
A search for the query "public libraries in California" returns a ranked list of table captions.
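The slides do not specify the probabilistic model used to rank captions. The sketch below substitutes plain BM25 scoring, a different but widely used probabilistic ranking function, just to show the shape of caption-based retrieval. Only the first caption comes from the slides; the other two captions and the names `tokenize` and `bm25_rank` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, captions, k1=1.2, b=0.75):
    """Return captions with a positive BM25 score, best match first."""
    if not captions:
        return []
    docs = [tokenize(c) for c in captions]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for caption, doc in zip(captions, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, caption))
    ranked = sorted(scored, key=lambda sc: -sc[0])
    return [caption for score, caption in ranked if score > 0]


CAPTIONS = [
    # From the slide:
    "STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98",
    # Hypothetical captions for illustration:
    "PUBLIC LIBRARIES BY COUNTY, CALIFORNIA",
    "COMMERCIAL FISHING LANDINGS BY PORT",
]

for caption in bm25_rank("public libraries in California", CAPTIONS):
    print(caption)
```

Because this sketch does no stemming, "libraries" in the query does not match "LIBRARY" in the slide's caption, which then ranks only on "california". Gaps of this kind are one reason to pair caption search with an entry vocabulary index.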
EVI-based Access to Counting California Database
Ranked list of subtopics that are most closely associated, statistically, with the query: personal/individual income tax.
Rank  Subtopic                                Weight
1     income                                  542.53
2     government earnings and tax revenues    251.71
3     property tax                            156.67
4     property tax                            74.58
5     personal income tax                     59.99
Traverse Searching Between Online Catalogs and Numeric Databases
[Figure: flow diagram with numbered steps 1 to 11. Catalog side: search interface 1, EVI, LCSH, online catalog, search results, MARC record. Numeric side: search interface 2, numeric database, captions, numeric table. A new query links the two sides.]
Extract from MARC as a query.
Any caption can become a query.
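How the extract is turned into a query is not spelled out in the slides. As one hedged possibility, the sketch below builds a keyword query from a record's subject headings (field 650) and title (field 245), dropping common stopwords. The function `marc_to_query`, the stopword list, and the `max_terms` cutoff are all assumptions of this example.

```python
import re

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "of", "in", "the", "under", "to", "for", "on"}


def marc_to_query(title, subjects, max_terms=8):
    """Build a keyword query from a MARC title and its subject headings.

    Subject-heading words come first because they are controlled vocabulary;
    duplicate words and stopwords are skipped.
    """
    terms = []
    for text in list(subjects) + [title]:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS and token not in terms:
                terms.append(token)
    return " ".join(terms[:max_terms])


query = marc_to_query(
    "A study of operant conditioning under delayed reinforcement in early infancy",
    ["Infant psychology.", "Operant conditioning."],
)
print(query)
```

The resulting string can be submitted to the numeric-database search interface as a new query. A table caption can be passed through the same function, with an empty subject list, to search the catalog in the other direction.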
Final Report on “Seamless Searching of Numeric and Textual Resources” Project, 1999-2002.
http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf
Two sequels:
• Adding search by place: “Going Places in the Catalog: Improved Geographic Access,” funded by a National Library Leadership Project from the Institute of Museum and Library Services, 2002-2004.
• Multilingual Search Across Multiple Genres: Proposal submitted Feb 13, 2003!