Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries

Presentation Transcript


    Slide 1:Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries

    Slide 2:Overview
      - What are Entry Vocabulary Indexes?
      - EVI Research at Berkeley
      - Notion of an EVI
      - How are EVIs Built
      - Berkeley Multilingual EVI
      - Technology components
      - Database
      - Examples of operation
      - Ongoing research

    Slide 3:Entry Vocabulary Index Research Projects at Berkeley
      - DARPA Information Management Program: “Search Support for Unfamiliar Metadata Vocabularies”
      - Institute of Museum and Library Services: “Seamless Searching of Numeric and Textual Resources”
      - DARPA TIDES program: “Translingual Information Management Using Domain Ontologies”
      - NSF/NASA/DARPA DLI-2 (IDL): “Discovery and Use of Textual, Numeric and Spatial Data”

    Slide 4:The IMLS project:

    Slide 5:TIDES Project
      - Translingual Information Detection, Extraction and Summarization
      - Building EVIs to map across languages
      - Using same notion with training data in different languages
      - Using Library of Congress Subject Headings from the CDL MELVYL database

    Slide 6:What is an Entry Vocabulary Index? EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

    Slide 17:Background on Online Library Catalogs
      - Library catalogs have been automated at a furious pace worldwide since the late ’70s
      - Library objects (books, maps, pictures) in 400+ languages
      - Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
      - Objects have been classified by subject by librarians
        - Library of Congress Subject Heading (Islamic Fundamentalism)
        - Library of Congress Classification (BP60, BP63, KF27)
        - Dewey Decimal Classification (297.2, 306.6, 320.5)
      - International standard (MARC) for catalog metadata
      - Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50

    Slide 18:What can libraries and their catalogs provide?
      - Millions of sentences in multiple languages
      - Sentences with topical content identified from 150,000 Library of Congress Subject Headings
      - Transfer point (interlingua) between English topics and words in other languages
      - Can be used to create:
        - Bilingual dictionaries
        - Query expansion in cross-language information retrieval
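
The idea of using subject headings as a transfer point can be sketched as a co-occurrence harvest: every title word in a record is paired with every English heading assigned to that record. This is only an illustration under stated assumptions. The record layout, the `harvest_pairs` name, and the three sample records are hypothetical and are not taken from the MELVYL data or the Berkeley software.

```python
from collections import Counter

# Hypothetical, hand-made records. Each has a language code, a romanized
# title, and the subject headings a cataloger assigned to it.
RECORDS = [
    {"lang": "ger", "title": "Islamischer Fundamentalismus heute",
     "subjects": ["Islamic Fundamentalism"]},
    {"lang": "fre", "title": "Le fondamentalisme islamique",
     "subjects": ["Islamic Fundamentalism"]},
    {"lang": "ger", "title": "Geschichte der Stadt Berlin",
     "subjects": ["Berlin (Germany) -- History"]},
]


def harvest_pairs(records):
    """Count co-occurrences of (language, title word, heading) triples."""
    pair_counts = Counter()
    for record in records:
        # Use a set so a word repeated in one title is counted once per record.
        words = set(record["title"].lower().split())
        for heading in record["subjects"]:
            for word in words:
                pair_counts[(record["lang"], word, heading)] += 1
    return pair_counts


counts = harvest_pairs(RECORDS)
```

Counts like these are the raw input to an association measure; on their own they do not separate topical words from frequent function words such as "le" or "der".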

    Slide 19:Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”

    Slide 20:Our Training Set and Prototype
      - University of California/CDL MELVYL catalog
      - Private copy, 10 million+ records (5 million non-English)
      - Records in over 100 languages
      - Obtained in MARC database standard format
      - Foreign language titles use Library of Congress transliteration (Romanization) standard
      - Prototype search software maps from/to English and Arabic, Chinese, French, German, Italian, Japanese, Russian, Spanish

    Slide 21:Technical Details

    Slide 22:Association Measure

    Slide 23:Association Measure Maximum Likelihood ratio
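
The formula shown on this slide did not survive in the transcript. As a rough sketch, assuming the measure is a Dunning-style log-likelihood ratio (the G² statistic) over a 2x2 table of record counts for one word and one subject heading, it could be computed as below. The function name and the cell layout are assumptions made for illustration.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of record counts.

    k11: records containing both the word and the heading
    k12: records containing the word but not the heading
    k21: records containing the heading but not the word
    k22: records containing neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with zero count contributes nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A score near zero means the word and the heading occur together about as often as chance predicts. A large score only says the two are not independent, so a ranking step would also need to check that the joint count `k11` is above its expected value.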

    Slide 24:Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages

    Slide 25:Non-English words can be mapped to English subject headings
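
A lookup in this direction can be sketched as a ranking over precomputed word-heading association scores. Everything below is hypothetical: the scores are made up, the headings other than "Islamic Fundamentalism" are illustrative strings, and summing the scores of several query words is one simple combination rule chosen for the sketch, not necessarily what the prototype does.

```python
from collections import defaultdict

# Hypothetical association scores: (word, heading) -> score.
SCORES = {
    ("fundamentalismus", "Islamic Fundamentalism"): 41.2,
    ("fundamentalismus", "Religious fundamentalism"): 18.7,
    ("islam", "Islamic Fundamentalism"): 12.5,
    ("islam", "Islam -- History"): 30.1,
}


def rank_headings(query_words, scores, top_n=3):
    """Return up to top_n (heading, total score) pairs, best first."""
    wanted = set(query_words)
    totals = defaultdict(float)
    for (word, heading), score in scores.items():
        if word in wanted:
            totals[heading] += score  # assumed rule: add scores across words
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


ranking = rank_headings(["fundamentalismus", "islam"], SCORES)
```

Swapping the key order in `SCORES` gives the opposite direction, from an English heading to its most closely associated words in another language.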

    Slide 26:Examples

    Slide 27:Catalog Languages vs. FBIS Languages (University of California online catalog, 10 million records). Approx. language distribution (Berkeley # sentences, FBIS est. # lines source)

    Slide 28:Future Research
      - Add content from other online library catalogs
        - RLIN (>30M records, >900K Chinese, >250K Arabic)
        - COPAC [UK] (9M records, 40K Arabic)
      - Transliteration and back-transliteration for scripted languages
      - Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
      - Further evaluation (TREC, CLEF, NTCIR and local analysis)
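
Bigram identification by mutual information can be illustrated with pointwise mutual information (PMI) over adjacent title words. This is a minimal sketch under assumptions: whitespace tokenization, a hypothetical `bigram_pmi` helper, and a `min_count` cutoff added because PMI is unreliable for rare pairs. The slide does not say which variant of mutual information is intended.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (base 2)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are noisy
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs with a high score occur together far more often than their individual frequencies predict, which makes them candidates for treatment as a single phrase when mapping to subject headings.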

    Slide 29:Prototype available http://otlet.sims.berkeley.edu/mulevm2.html
