Slide 1:Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries
Slide 2:Overview
What are Entry Vocabulary Indexes?
EVI Research at Berkeley
Notion of an EVI
How are EVIs Built
Berkeley Multilingual EVI
Technology components
Database
Examples of operation
Ongoing research
Slide 3:Entry Vocabulary Index Research Projects at Berkeley
DARPA Information Management Program
“Search Support for Unfamiliar Metadata Vocabularies”
Institute for Museum and Library Services
“Seamless Searching of Numeric and Textual Resources”
DARPA TIDES program
“Translingual Information Management Using Domain Ontologies”
NSF/NASA/DARPA: DLI-2 (IDL)
“Discovery and Use of Textual, Numeric and Spatial Data”
Slide 4:The IMLS project:
Slide 5:TIDES Project
Translingual Information Detection, Extraction and Summarization
Building EVIs to map across languages
Using same notion with training data in different languages
Using Library of Congress Subject Headings from the CDL MELVYL database
Slide 6:What is an Entry Vocabulary Index?
EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
Slide 17:Background on Online Library Catalogs
Library catalogs have been automated at a furious pace worldwide since the late ’70s
Library objects (books, maps, pictures) in 400+ languages
Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
Objects have been classified by subject by librarians
Library of Congress Subject Heading (Islamic Fundamentalism)
Library of Congress Classification (BP60, BP63, KF27)
Dewey Decimal Classification (297.2, 306.6, 320.5)
International standard (MARC) for catalog metadata
Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50
Slide 18:What can libraries and their catalogs provide?
Millions of sentences in multiple languages
Sentences with topical content identified from 150,000 Library of Congress Subject Headings
Transfer point (interlingua) between English topics and words in other languages
Can be used to create:
Bilingual dictionaries
Query expansion in cross-language information retrieval
Slide 19:Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”
Slide 20:Our Training Set and Prototype
University of California/CDL MELVYL catalog
Private copy, 10 million+ records (5 million non-English)
Records in over 100 languages
Obtained in MARC database standard format
Foreign language titles use Library of Congress transliteration (Romanization) standard
Prototype search software maps between English and Arabic, Chinese, French, German, Italian, Japanese, Russian, and Spanish
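The training set described above pairs title words with librarian-assigned subject headings, so the raw material for an EVI is a set of word/heading co-occurrence counts. The following is a minimal sketch of that counting step, not the Berkeley implementation. It assumes each catalog record has already been reduced to a pair of (title words, subject headings); the function name `cooccurrence_tables` and the record layout are hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_tables(records):
    """Build a 2x2 contingency table for every (word, heading) pair seen together.

    `records` is an iterable of (title_words, headings) pairs (assumed layout).
    Each table is (k11, k12, k21, k22):
      k11 = records containing both the word and the heading
      k12 = records containing the word but not the heading
      k21 = records containing the heading but not the word
      k22 = records containing neither
    """
    word_n, head_n, pair_n = Counter(), Counter(), Counter()
    total = 0
    for words, headings in records:
        total += 1
        ws, hs = set(words), set(headings)  # count each item once per record
        word_n.update(ws)
        head_n.update(hs)
        pair_n.update(product(ws, hs))

    tables = {}
    for (w, h), k11 in pair_n.items():
        k12 = word_n[w] - k11
        k21 = head_n[h] - k11
        k22 = total - k11 - k12 - k21
        tables[(w, h)] = (k11, k12, k21, k22)
    return tables
```

Tables of this shape are the usual input to an association measure such as a likelihood ratio.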
Slide 21:Technical Details
Slide 22:Association Measure
Slide 23:Association Measure
Maximum likelihood ratio
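The formula on this slide did not survive in the transcript. As an illustration only, the sketch below assumes the widely used log-likelihood ratio (G²) statistic over a 2x2 contingency table of word/heading co-occurrence counts; whether the Berkeley prototype uses exactly this form is not confirmed by the slides. The function and parameter names are hypothetical.

```python
import math


def _xlogx(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table (assumed form of the association measure).

    k11 = records with both word and heading, k12 = word only,
    k21 = heading only, k22 = neither.
    Equivalent to 2 * sum(k_ij * ln(k_ij * N / (row_i * col_j))).
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)
```

A score near zero means the word and the heading occur together about as often as chance predicts; larger scores indicate a stronger departure from independence, so candidate words for a heading can be ranked by this value.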
Slide 24:Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages
Slide 25:Non-English words can be mapped to English subject headings
Slide 26:Examples
Slide 27:Catalog Languages vs. FBIS Languages
(University of California online catalog, 10 million records)
Approx. language distribution (Berkeley # sentences, FBIS est. # lines source)
Slide 28:Future Research
Add content from other online library catalogs
RLIN (>30M records, >900K Chinese, >250K Arabic)
COPAC [UK] (9M records, 40K Arabic)
Transliteration and back-transliteration for scripted languages
Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
Further evaluation (TREC, CLEF, NTCIR and local analysis)
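The phrase-mapping item above mentions identifying bigrams and trigrams in target languages using mutual information. A minimal sketch of the bigram case follows, assuming pointwise mutual information over already tokenized titles. The function name `bigram_pmi` and the `min_count` threshold are hypothetical, and trigrams are not covered here.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    `sentences` is a list of token lists. Pairs seen fewer than `min_count`
    times are skipped, since PMI is unreliable for rare events.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs with high scores co-occur far more often than their individual frequencies predict and are candidates for treatment as single phrases.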
Slide 29:Prototype available
http://otlet.sims.berkeley.edu/mulevm2.html