Developing a Metadata Infrastructure for Information Access: What, Where, When and Who? Prof. Ray R. Larson, University of California, Berkeley, School of Information
Overview • Metadata as Infrastructure • What, Where, When and Who? • What are Entry Vocabulary Indexes? • Notion of an EVI • How are EVIs Built • Time Period Directories • Mining Metadata for new metadata • 4W Demo • New Project: Bringing Lives to Light
Metadata as Infrastructure • The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?
Metadata as Infrastructure • The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who. • The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.
What? Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents. Two kinds of mapping in every search: • Documents are assigned to topic categories, e.g. Dewey • Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers. Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.
‘What’ searches involve mapping to controlled vocabularies Thesaurus/ Ontology Texts
Building a Search Term Recommender Start with a collection of documents.
Index Classify and index with controlled vocabulary Or use a pre-indexed collection.
Problem: Controlled vocabularies can be difficult for people to use. • For “Wirtschaftspolitik”, in the Library of Congress subject index, use “Economic Policy”. • “pass mtr veh spark ign eng” = “Automobile”.
Solution: Entry Level Vocabulary Indexes. Index EVI
“What” and Entry Vocabulary Indexes • EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…
Building and Searching EVIs
Searching: • User has a question but is unfamiliar with the domain he wants to search. • User selects a subject domain of interest (domains to select from: Engineering, Medicine, Biology, Social science, etc.). • Has an Entry Vocabulary Module been built? YES: use an existing EVI. NO: build one. • Map user’s query to a ranked list of controlled vocabulary terms. • User selects search terms from the ranked list of terms returned by the EVI.
Building an Entry Vocabulary Module (EVI): • Source: Internet DB indexed with a controlled vocabulary. • Download a set of training data. • Part of speech tagging (for noun phrases). • Extract terms (words and noun phrases) from titles and abstracts. • Build associations between extracted terms & controlled vocabularies.
Technical Details: Building an Entry Vocabulary Module (EVI) • Source: Internet DB indexed with a controlled vocabulary. • Download a set of training data. • Part of speech tagging (for noun phrases). • Extract terms (words and noun phrases) from titles and abstracts. • Build associations between extracted terms & controlled vocabularies.
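The extraction and association-building steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the training records, the stopword list, and the function names `extract_terms` and `cooccurrence_counts` are all hypothetical, and simple word tokenization stands in for part-of-speech tagging and noun-phrase extraction.

```python
import re
from collections import Counter

# Hypothetical training records: (title/abstract text, assigned controlled-vocabulary terms).
TRAINING = [
    ("Fuel economy of the modern automobile", ["pass mtr veh spark ign eng"]),
    ("Automobile engine emissions", ["pass mtr veh spark ign eng"]),
    ("National economic policy after the war", ["Economic policy"]),
]

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"of", "the", "after", "a", "an", "and", "in"}


def extract_terms(text):
    """Return the set of lowercase word tokens, minus stopwords.

    A stand-in for POS tagging and noun-phrase extraction.
    """
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def cooccurrence_counts(records):
    """Count documents per term, per class, and per (term, class) pair."""
    term_df, class_df, pair_df = Counter(), Counter(), Counter()
    for text, classes in records:
        terms = extract_terms(text)
        unique_classes = set(classes)
        term_df.update(terms)
        class_df.update(unique_classes)
        for term in terms:
            for cls in unique_classes:
                pair_df[(term, cls)] += 1
    return term_df, class_df, pair_df


term_df, class_df, pair_df = cooccurrence_counts(TRAINING)
```

These document counts are the raw material for any term-to-class association score; the total number of records is simply `len(TRAINING)`.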
Association Measure
          C     ¬C
   t      a     b
  ¬t      c     d
where t is the occurrence of a term and C is the occurrence of a class in the training set.
Association Measure • Maximum likelihood ratio (viz. Dunning)
\[ W(C,t) = 2\,[\log L(p_1,a,a+b) + \log L(p_2,c,c+d) - \log L(p,a,a+b) - \log L(p,c,c+d)] \]
where
\[ \log L(p,k,n) = k\log(p) + (n-k)\log(1-p) \]
and
\[ p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d} \]
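A minimal sketch of this measure in Python, assuming the four contingency counts a, b, c, d are already available and that each row of the table is non-empty (a + b > 0 and c + d > 0). The function names are illustrative, and the convention 0·log(0) = 0 is an added assumption to keep the computation defined when a cell is zero.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p), taking 0*log(0) as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def association_weight(a, b, c, d):
    """Likelihood-ratio weight W(C, t) from the 2x2 contingency counts.

    a: documents with term t and class C      b: term t without class C
    c: class C without term t                 d: neither
    Assumes a + b > 0 and c + d > 0.
    """
    p1 = a / (a + b)                # rate of C among documents containing t
    p2 = c / (c + d)                # rate of C among documents lacking t
    p = (a + c) / (a + b + c + d)   # overall rate of C
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )
```

The weight is zero when the term and class are independent (p1 = p2 = p) and grows as the two rates diverge. It is also large when a term is strongly associated with the absence of a class, so a ranking built on it may additionally want to check that p1 > p2.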
Alternatively • Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
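One way to picture this alternative is a small TF-IDF and cosine-similarity ranking over per-class "evidence documents". The sketch below is only an illustration under assumptions: the evidence texts, the particular IDF variant (log(N/df) + 1), and the function names are hypothetical, and any standard IR ranking function could be substituted.

```python
import math
from collections import Counter

# Hypothetical evidence documents: terms observed with each class in training data.
EVIDENCE = {
    "Automobiles": "automobile car engine fuel automobile",
    "Economic policy": "economy policy fiscal government",
    "Internal combustion engines": "engine combustion fuel spark",
}


def tfidf_vectors(docs):
    """Build a TF-IDF vector per class and return (vectors, idf)."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n_docs = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n_docs / d) + 1.0 for term, d in df.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in c.items()}
        for name, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_classes(query, docs):
    """Return class names ordered by similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(query.lower().split())
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), name) for name, vec in vectors.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

The top-ranked classes returned by `rank_classes` can then be offered to the user as suggested controlled-vocabulary terms or appended to the query for expansion.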
Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil digital library resources, via statistical association.
EVI example • User query: “Automobile” • EVI 1 gives index term: “pass mtr veh spark ign eng” • EVI 2 gives index term: “automobiles” OR “internal combustion engines”
But why stop there? Index EVI
“Which EVI do I use?” Index EVI Index EVI Index EVI Index
EVI to EVIs Index EVI EVI2 Index EVI Index EVI Index
In Arabic Chinese Greek Japanese Korean Russian Tamil Find Plutonium Why not treat language the same way?
Support for the Learner with a Query
Facet    Vocabulary                                   Displays
WHAT     Thesaurus, e.g. LCSH                         Cross-references
WHERE    Gazetteer                                    Map
WHEN     Period directory                             Timeline
WHO      Biographical dictionary, e.g. Who’s Who      Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
It is also difficult to move between different media forms Thesaurus/ Ontology Texts EVI Numeric datasets
Searching across data types • Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results
But texts associated with numeric data can be mapped as well… Thesaurus/ Ontology Texts EVI EVI captions Numeric datasets
But there are also geographic dependencies… Thesaurus/ Ontology Texts EVI EVI Maps/ Geo Data captions Numeric datasets
WHERE: Place names are problematic… • Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . . • Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar. • Name changes: Bombay → Mumbai. • Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields. • Anachronisms: No Germany before 1870 • Vague names, e.g. Midwest, Silicon Valley • Unstable boundaries: 19th century Poland; Balkans; USSR • Use a gazetteer!
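To make the "use a gazetteer" point concrete, here is a minimal sketch of a gazetteer lookup that folds variant spellings onto one entry and exposes homographs for disambiguation. The record layout, identifiers, and function names are hypothetical, and the coordinates are approximate values included only for illustration.

```python
import unicodedata

# Hypothetical gazetteer records; coordinates are approximate.
GAZETTEER = [
    {"id": "cluj", "preferred": "Cluj", "variants": ["Klausenburg", "Kolozsvar"],
     "country": "Romania", "lat": 46.77, "lon": 23.59},
    {"id": "mumbai", "preferred": "Mumbai", "variants": ["Bombay"],
     "country": "India", "lat": 19.08, "lon": 72.88},
    {"id": "vienna-at", "preferred": "Vienna", "variants": ["Wien"],
     "country": "Austria", "lat": 48.21, "lon": 16.37},
    {"id": "vienna-va", "preferred": "Vienna", "variants": [],
     "country": "USA", "lat": 38.90, "lon": -77.26},
]


def normalize(name):
    """Strip diacritics and case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def build_index(entries):
    """Map every normalized preferred or variant name to its gazetteer entries."""
    index = {}
    for entry in entries:
        for name in [entry["preferred"], *entry["variants"]]:
            index.setdefault(normalize(name), []).append(entry)
    return index


def lookup(name, index, country=None):
    """Return all entries matching a name, optionally narrowed by country."""
    hits = index.get(normalize(name), [])
    if country is not None:
        hits = [e for e in hits if e["country"] == country]
    return hits


INDEX = build_index(GAZETTEER)
```

A lookup for "Vienna" returns both candidates, which is the point: the gazetteer does not guess, it surfaces the homographs so that context (country, map position, time) can pick one.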
WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map. Timebar
Zoom on map. Click on place for a list of records. Click on record to display text.
So geographic search becomes part of the infrastructure Thesaurus/ Ontology Texts EVI Maps/ Geo Data Gazetteers captions Numeric datasets
WHEN: Search by time is also weakly supported… • Calendars are the standard for time • But people use the names of events to refer to time periods • Named time periods resemble place names in being: • Unstable: European War, Great War, First World War • Multiple: Second World War, Great Patriotic War • Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc. • Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region
Linking vocabularies: WHAT, WHERE, WHEN
• Library subject headings: Topic – Geographic subdivision – Chronological subdivision
• Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When
• Time Period Directory: Period name – Type – Time markers (Calendar) – Where
Vocabularies are the key! Want: Kung-fu movies? Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.
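A time period directory entry of the shape listed above (period name, type, time markers, where) might be modeled as in the following sketch. The `Period` class, its field names, and `find_periods` are illustrative assumptions rather than the project's schema; the sample entries use the ambiguous "Civil War" example with commonly cited date ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    name: str    # period name
    kind: str    # type of period
    start: int   # time marker: first calendar year
    end: int     # time marker: last calendar year
    where: str   # place the period applies to


# Hypothetical directory entries illustrating an ambiguous period name.
DIRECTORY = [
    Period("Civil War", "war", 1642, 1651, "England"),
    Period("Civil War", "war", 1861, 1865, "USA"),
    Period("Civil War", "war", 1936, 1939, "Spain"),
]


def find_periods(directory, name=None, where=None, year=None):
    """Filter periods by name, place, and/or a calendar year they contain."""
    results = []
    for period in directory:
        if name is not None and period.name.casefold() != name.casefold():
            continue
        if where is not None and period.where != where:
            continue
        if year is not None and not (period.start <= year <= period.end):
            continue
        results.append(period)
    return results
```

Searching by name alone returns three candidates; adding either a place or a year resolves the ambiguity, which mirrors how the gazetteer's "When" and the directory's "Where" link the two vocabularies.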
Time period directories link via the place (or time) Thesaurus/ Ontology Texts EVI Maps/ Geo Data Gazetteers captions Numeric datasets Time Period Directory Time lines, Chronologies
WHEN: Time Period Directory Timeline Link to Catalog Link to Wikipedia
Life events metadata • WHAT: Actions, e.g. prisoner • WHERE: Places, e.g. Holstein • WHEN: Times, e.g. 1261-1262 • WHO: People, e.g. Margaret Sambiria • Need external links • WHO: Biographical Dictionary; complex relationships
Any document, object, or performance: connect it with its context and other resources.
Facet    Vocabulary                                   Displays
WHAT     Thesaurus, e.g. LCSH                         Cross-references
WHERE    Gazetteer                                    Map
WHEN     Period directory                             Timeline
WHO      Biographical dictionary, e.g. Who’s Who      Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
Entry Vocabulary Index suggests correct LCSH with different spelling
Zooming in to South Asia Select Restricting time frame
Berkeley Natural History Museums BBC Ethnologue Wikipedia CIA Factbook More information about the country of India…
Historical events – linked to Library catalog & Wikipedia: none available for this time period
ECAI Cultural Atlases: presenting history in its geographical & chronological contexts