Multi-genre Search Using Common Geography

Multi-genre Search Using Common Geography. ICPSR OR Meeting, Ann Arbor Michigan, October 20, 2007 Fredric C. Gey UC Data Archive & Technical Assistance. University of California, Berkeley http://ucdata.berkeley.edu/gey.html ( gey@berkeley.edu )


Presentation Transcript


  1. Multi-genre Search Using Common Geography
     • ICPSR OR Meeting, Ann Arbor, Michigan, October 20, 2007
     • Fredric C. Gey
     • UC Data Archive & Technical Assistance, University of California, Berkeley
     • http://ucdata.berkeley.edu/gey.html (gey@berkeley.edu)
     • Institute for Museum and Library Services grants:
       • Seamless search of textual and numeric databases (1999-2002)
       • Going places in the catalog: Improved Geographic Access (2002-2004)
       • What, Where, When and Why: support for the learner (2004-2006)
       • Bringing Lives to Light: Biography in Context (2006-2008)
       • The last two within the Electronic Cultural Atlas Initiative, International and Area Studies
     • Colleagues: Michael Buckland, Ray Larson, Kim Carl, Jeanette Zerneke, and a host of students including Vivien Petras

  2. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Current Search Technology (multiple independent searches without search aids)
     [Diagram labels: QUERY; resource types: Patents, Bibliography, Full Text, Numeric Statistical Databases, Maps and other Geospatial data, Music and other media]

  3. Searching Statistical Information: One problem statement
     • Numeric statistical information is often thinly documented or documented with specialized vocabularies
       • Census: welfare --> public assistance
       • Foreign trade harmonized commodity classification: computer --> digital adp machine
       • Standard Industrial Classification (SIC codes): automobile --> motor vehicle
     • In search, the user’s ordinary language term is unlikely to match the limited technical vocabulary used to document the statistical resource
     • Can we remedy this situation?

  4. Searching Statistical Information: Evidence-Poor Data
     • Compared to textual databases, numeric statistical information and its terminology are both evidence-poor and highly technical
     • Statistical databases lack the rich set of textual clues that can identify items of numeric information
     • However, if we can find a textual resource associated with the numeric data, we may be able to mine the text to improve numeric search
     • This is possible for some numeric classification schemes

  5. Searching Statistical Information: Economic Classification Codes
     • Standard Industrial Classification (SIC) and North American Industry Classification System (NAICS) codes have been used to index trade magazines
     • This provides a textual resource of hundreds of thousands of documents and millions of words associated with the numeric data
     • Thus mappings can be made between the words in the magazine abstracts and the classifications
     • User queries can be matched against the words and phrases most closely associated with a particular numeric data classification
     • A ranked list of classifications can be displayed to the user in order to improve the search
     • Harmonized commodity classifications can be searched using SIC codes as a search proxy
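
The word-to-classification mapping described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training pairs, the classification labels, and the function names are hypothetical, and the scoring (the share of a word's abstracts that carry each classification) is a simple stand-in, not the weighting actually used in the Entry Vocabulary work.

```python
import re
from collections import Counter, defaultdict

# Hypothetical training data: (abstract text, classification label) pairs,
# standing in for trade-magazine abstracts indexed with SIC/NAICS codes.
TRAINING = [
    ("lobster and crab landings fell in Maine fisheries", "shellfish"),
    ("lobster prices rise as shellfish harvest shrinks", "shellfish"),
    ("automobile makers expand car assembly plants", "motor vehicles"),
    ("new car and truck sales lift automobile suppliers", "motor vehicles"),
    ("computer makers ship faster desktop machines", "electronic computers"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_entry_vocabulary(pairs):
    """Count, for each word, how many abstracts of each classification contain it."""
    word_to_codes = defaultdict(Counter)
    for text, code in pairs:
        for word in set(tokenize(text)):
            word_to_codes[word][code] += 1
    return word_to_codes


def rank_classifications(query, word_to_codes):
    """Rank classifications by summed relative frequency over the query words.

    Assumption: relative frequency is a placeholder association measure.
    """
    scores = Counter()
    for word in tokenize(query):
        counts = word_to_codes.get(word)
        if not counts:
            continue  # word never seen in the training abstracts
        total = sum(counts.values())
        for code, n in counts.items():
            scores[code] += n / total
    return scores.most_common()


vocabulary = build_entry_vocabulary(TRAINING)
print(rank_classifications("lobster", vocabulary))
```

With this toy data, a query for an ordinary-language word such as "lobster" leads to the classification label that indexes the abstracts mentioning it, which is the behaviour the slide describes; a real system would need far more text and a more careful association statistic.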

  6. SEARCHING UNFAMILIAR METADATA: PROBLEM STATEMENT
     • Numerous databases are indexed by structured metadata classifications
     • Classification schemes are highly specialized
     • As digital libraries multiply in size and diversity, there is a need for search engines for non-specialists
     • Search engines should translate from ordinary language to specialized classifications
     • Example: in the U.S. Import-Export Database, a search for “computer” returns “No result found”

  7. SEARCHING UNFAMILIAR METADATA: U.S. STANDARD INDUSTRIAL CLASSIFICATION SYSTEM
     • U.S. Standard Industrial Classification System (SIC)
       • Used to classify and aggregate industrial activity in the U.S.
       • Codes defined by the Office of Management and Budget
     • County Business Patterns reports annual employment, payroll, and firm size by county and SIC code
     • Example: in the U.S. SIC system, a search for “Lobster” returns “Nothing found”

  8. SEARCHING UNFAMILIAR METADATA: ENTRY VOCABULARIES CONSTRUCTED
     • Ordinary language to U.S. Patent Classification
     • Ordinary language to INSPEC thesaurus terms
     • Ordinary language to Library of Congress classification codes
     • Ordinary language to Standard Industrial Classification system
     • Ordinary language to NAICS, with link to the 1997 Economic Census
     • Example: in the U.S. SIC classification, for “Lobster” try “Shellfish”

  9. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Enhanced Search (augmented with Entry Vocabulary Module (EVM) technology)
     [Diagram labels: QUERY, QUERYplus; entry vocabulary modules EVMp, EVMs, EVMt, EVMg, EVMm; resource types: Patents, Bibliography, Full Text, Numeric Statistical Databases, Maps and other Geospatial data, Music and other media]

  10. SEARCHING UNFAMILIAR METADATA: CONCLUSIONS
     • Entry Vocabulary Technology proved valuable in multiple applications:
       • Searching complex classification schemes
       • Searching numeric data
     • BUT WE FOUND: NUMERIC DATA IS INTERTWINED WITH PLACE
       • NAICS prototype: need to specify place to retrieve data
     • In addition: BOOKS ARE ALSO USUALLY ABOUT PLACE (e.g. History of Tulare County, California)
     • --> NEED UNIFIED SEARCH OF PLACE BETWEEN GENRES

  11. Exogenous Research and Development
     • 1996-2000 National Science Foundation grant on methods for text retrieval and document ranking, which turned into Cross-Language Information Retrieval research and participation in evaluations:
       • TREC: English-Chinese 1999-2000, English-Arabic 2001-2002
       • NTCIR 1999-2007: Asian language retrieval (Chinese-Japanese-Korean)
       • CLEF 2000-2007: European language search and question-answering (English, German, Portuguese, Russian, Spanish)
     • 1999-2003 DARPA grant “Translingual Information Management Using Domain Ontologies”, including the Hindi surprise language exercise in 2003
     • CDL’s Counting California project (2000-present), unifying access to statistical data about California
     • 2005-2007 organization of GeoCLEF: evaluation of Geographic Information Retrieval from multilingual textual sources

  12. UNIFIED SEARCH OF PLACE BETWEEN GENRES
     • Existing geographic search and display mechanisms for statistical data on the web:
       • University of Virginia (GeoStat Center Historical Census Browser)
       • Syracuse University (Paul Bern)
     • We created an additional one, interfaced to the ECAI Time Map software, with a link to Counting California
     • Exogenous research created a prototype geographic search of news stories in time and space
       • Hindi news stories (BBC news in Hindi)
       • Russian news stories (Izvestia)
     • These prototypes interface to either text or numeric data, but not both genres

  13. UNIFIED SEARCH OF PLACE BETWEEN GENRES (2)
     • The prototypes above interface to either text or numeric data
     • Ray Larson has a search interface for world library catalogs
       • Uses the Z39.50 distributed search protocol
       • Interfaced to the California Digital Library MELVYL catalog
     • We connected the two to produce demonstration interfaces to:
       • California counties, cities and towns (2000 census)
       • US states and counties (1790-1960 census)
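
The idea of using place as the common key between genres can be illustrated with a small Python sketch. Everything here is assumed for illustration: the two dictionaries are in-memory stand-ins for a numeric data service and a library catalog, the entries are placeholders, and the function names are hypothetical. The demonstration interfaces on this slide queried live services (the catalog side over Z39.50), which this sketch does not attempt.

```python
def place_key(raw):
    """Normalize 'Name, State' into a case- and whitespace-insensitive key."""
    name, _, state = raw.partition(",")
    return (" ".join(name.split()).casefold(), " ".join(state.split()).casefold())


# Hypothetical stand-in for a numeric (census) data source, keyed by place.
NUMERIC_DATA = {
    ("tulare county", "california"): ["2000 census county table (placeholder)"],
}

# Hypothetical stand-in for a library catalog, keyed by the same place key.
CATALOG = {
    ("tulare county", "california"): ["History of Tulare County, California"],
}


def search_by_place(raw_place):
    """Return numeric and bibliographic hits that share one normalized place."""
    key = place_key(raw_place)
    return {
        "place": key,
        "numeric": NUMERIC_DATA.get(key, []),
        "catalog": CATALOG.get(key, []),
    }


result = search_by_place("Tulare County, California")
print(result["numeric"], result["catalog"])
```

The design point is that both genres are reached through one shared place key, so a single place selection returns statistical tables and catalog records together; in practice a gazetteer would be needed to reconcile variant and historical place names.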

  14. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Gateway Search between Multiple Information Types
     [Diagram labels: QUERY, QUERYplus; entry vocabulary modules EVMs, EVMp, EVMt, EVMg, EVMm; resource types: Patents, Bibliography, Full Text, Numeric Statistical Databases, Maps and other Geospatial data, Music and other media]

  15. HETEROGENEOUS DIGITAL INFORMATION SEARCH: Direct Mappings and Search Between Multiple Information Types
     [Diagram labels: QUERY, QUERYplus; entry vocabulary modules EVMs, EVMp, EVMt, EVMg, EVMm; resource types: Patents, Bibliography, Full Text, Numeric Statistical Databases, Maps and other Geospatial data, Music and other media]

  16. New directions in research: Biography markup and search (2006-2008 IMLS grant)
     • To develop tools for editors, archivists and compilers of historical papers
       • Emma Goldman papers
     • To develop display in time/space to facilitate historical discovery, i.e. who lived there at the same time and what important events occurred there
     • To visualize biography as an ordered sequence of 4-tuple events (activity, time, place, other-people), developing biographical markup standards
     • Congressional Biography: automatic markup of place, date, time-range
       <biog source="cong_dict" page_start="19" page_end="19">
       <name> ADAMS, JOHN QUINCY. </name>
       <text> Born in Braintree, now Quincy, Mass., July 11, 1767. When ten years of age, he accompanied his father to France; and when fifteen, was private secretary to the American Minister in Russia. He was graduated at Harvard University in 1787 ; studied law in Newburyport, and settled in Boston. From 1794 to 1801 he was American Minister to Holland, England, Sweden, and Prussia. He was a Senator in Congress from 1803 to 1808 ; </text>
       </biog>
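
As a rough illustration of the 4-tuple event idea and of a first step toward date markup, the following Python sketch parses a shortened copy of the biography record and pulls out four-digit years. The `Event` field names, the `extract_years` helper, and the year pattern are assumptions made for this example; they are not the project's markup standard or its tagging method.

```python
import re
import xml.etree.ElementTree as ET
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    """One biographical event as a 4-tuple (field names are assumed)."""
    activity: str
    time: str
    place: str
    other_people: Tuple[str, ...]


# Shortened copy of the record shown on the slide.
BIOG = """<biog source="cong_dict" page_start="19" page_end="19">
<name> ADAMS, JOHN QUINCY. </name>
<text> Born in Braintree, now Quincy, Mass., July 11, 1767. He was graduated
at Harvard University in 1787 ; studied law in Newburyport, and settled in
Boston. From 1794 to 1801 he was American Minister to Holland, England,
Sweden, and Prussia. </text>
</biog>"""


def extract_years(xml_text):
    """Return the subject's name and every 17th-20th century year in the text."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name", default="").strip()
    text = root.findtext("text", default="")
    years = [int(y) for y in re.findall(r"\b(1[6-9]\d{2})\b", text)]
    return name, years


name, years = extract_years(BIOG)
# A hand-built event taken from the first sentence of the record.
birth = Event("born", "1767-07-11", "Braintree, now Quincy, Mass.", ())
print(name, years, birth)
```

Finding years is the easy part; attaching each date to an activity, a place, and other people, as the 4-tuple requires, is the harder markup problem the project describes.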

  17. New directions in research (continued): Ireland and Irish Studies (2007-2009 NEH grant)
     • Collaboration with the Center for Digitization, Queen’s University Belfast
       • Digitizing more than 5 million pages of Irish historical and cultural studies
     • To develop display in time/space to facilitate historical discovery
     • Recognition of place names in old English and Gaelic
       • Using Hogan’s Onomasticon Goedelicum locorum et tribuum Hiberniae et Scotiae: An index, with identifications, to the Gaelic names of places and tribes (Edmund Hogan, SJ, 1909), a kind of concordance of Irish documents by place

  18. REFERENCES
     • M. Buckland and L. Lancaster, 2004. “Combining Place, Time, and Topic.” D-Lib Magazine, May 2004, Volume 10, Number 5. http://www.dlib.org/dlib/may04/buckland/05buckland.html
     • M. Buckland, A. Chen, F. Gey and R. Larson, 2006. “Search Across Different Media: Numeric Data Sets and Text Files.” Information Technology and Libraries, December 2006, pp. 181-189.
     • http://ecai.org/imls2004/imls4w/
     • M. Buckland, A. Chen, F. Gey, R. Larson, R. Mostern and V. Petras, 2007. “Geographic Search: Catalogs, Gazetteers, and Maps.” College & Research Libraries (forthcoming, Sept 2007).
     • Emma Goldman papers (http://sunsite.berkeley.edu/Goldman/)
     • http://www.ucc.ie:8080/cocoon/doi/locus (onomasticon)