eHumanities Research: New Ways to Access USC Shoah Foundation Archive

Explore how the Center for Visual History Malach serves eHumanities by providing access to the USC VHI Archive with innovative technology. Learn about the activities, access methods, and future plans for accessing testimonies of Holocaust survivors.

Presentation Transcript


  1. Language Technology Research Serving eHumanities: New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach • Jan Hajič • Institute of Formal and Applied Linguistics • Computer Science School • Charles University in Prague, Czech Republic • malach@knih.mff.cuni.cz | http://www.malach-centrum.cz

  2. From Testimonies to Flexible Access • The USC VHI Archive • Testimonies of Holocaust survivors • Center for Visual History Malach • Access Point to the USC Archive • Activities of CVHM • Access using New Technology • Fulltext (transcript) Search • Cross-lingual Access • Thesaurus Translation • Status and Future Plans

  3. Center for Visual History Malach • Access Point to the USC VHI’s Archive, http://www.usc.edu/vhi

  4. Contents of the Archive • Testimonies recorded in the 1990s

  5. Recording the Testimonies • Visual History Foundation • California, USA (Universal Studios) • 1990s • Analog video recording technology, 30 minute tapes • Teams of 3 people (moderator, video, audio) • Volume • 56 countries, over 105,000 hours of video • 32 languages, ~52,000 testimonies • Half of them in English

  6. Archiving the Testimonies • Digitization • 100s of terabytes of data (NTSC/PAL quality) • Cataloguing (indexing) • Thesaurus (55,000 keywords) • hierarchical, timeline, places • Goal: • Access (search) • Material for projects

  7. Access (Search) • Search by keywords • At 1-minute segments, beginning of topic • Search by particular people, relations • Filter search by • Language spoken • Country of survivor • Experience (survivor/liberator/...) • Not possible: “fulltext” search • Video access: locally available, or on order • Player: usual controls, also by segment, search within video

  8. Access Points • Internet: only limited access so far • Throughput (technical limitations), legal & ethical issues, ... • → Access Points • ~30 worldwide (USA; EU: Berlin, Budapest, Prague, Warsaw; secondary access available) • 2 - 20% of full archive locally • Fast “Internet2” connection • Additional Services • Search and view: standard Internet browser

  9. Center for Visual History Malach • Charles University in Prague, est. 2009, coordinator: Jakub Mlynář

  10. Center for Visual History Malach • Supported by Charles University • Faculty of Mathematics and Physics, CS School • CS School Library & Institute of Formal and Applied Linguistics • Part of LINDAT-Clarin, Language Data Infrastructure • Clarin ERIC – Pan-European network of LTH Centers • 12 workplaces, AV technology, materials • Technology (by Inst. of Formal and Applied Linguistics): • 1 Gbit network locally, dataserver (for video cache) • 2000 testimonies locally (all Czech, Slovak, Polish, many in English) • Geant connection, 5-10 min. for 30 min. video from USC

  11. Center for Visual History Malach: Activities • Seminars • Anniversary seminar (January) • Seminars for students, teachers • Also: foreign visitors (Ukraine – summer 2012) • Workshops • Co-organization of Raoul Wallenberg 100th Anniversary workshop, Nov. 2012 • w/Czech Parliament, Jewish Museum in Prague, Embassies • Tutorials • Using the Archive, How-To-...; Research on Language Technology (with Institute of Formal and Applied Linguistics)

  12. Center for Visual History Malach: Activities • Newsletter Web:

  13. Center for Visual History Malach: Visitors • [Chart: visitors by category, mid-2010 – fall 2012] • Students • Teachers, Researchers • Journalists, Writers, Filmmakers • Other (personal reasons, etc.)

  14. Why “Malach”? • Technology and UI Research Project • 2002-2007 • Multilingual Access to Large Audio arCHives • malach – “angel” in Hebrew • Support: NSF (National Science Foundation) • Visual History Foundation (predecessor of SFI/USC) • IBM Research, Yorktown Heights, NY, USA • Johns Hopkins Univ., Baltimore, MD, USA • Univ. of Maryland, College Park, MD, USA • Charles University in Prague, CZ (IFAL MFF UK) • Univ. of West Bohemia, Pilsen, CZ (Dept. of Cybernetics)

  15. Research in the Malach Project • Research in the area of • Automatic Speech Recognition (of the testimonies) • English, Czech, Slovak, Russian, Polish, Hungarian • Automatic Translation of Thesaurus • Keyword translation • Czech, English • Cross-lingual Audio/Voice Search • Part of the world-wide CLEF 2006, 2007 competition • User interfaces → current VHA search interface

  16. Automatic Speech Recognition • Core “Front-end” Technology • Current State-of-the-Art: 95% in controlled conditions • Problems: • English: non-native speakers (virtually all 26,000!) • Czech: colloquial speech • All: emotions, elderly people, imperfect recording • Technology issues: not enough in-domain texts • Some improvement reached by 2007

  17. The AMALACH Project • Applied research project, 2012-2015 • Implement and integrate (some) MALACH project results • Czech National Cultural Heritage Funding • Partners: Charles Univ., Univ. of West Bohemia (and USC) • Selling point: improved access for local (Czech) researchers • USC Archive: 558 Czech-language testimonies • only a fraction (~ 12%) of 4613 Czech survivors! • Rest: mostly English spoken • Also: 12500 segments containing keyword “Czech” • Solution: cross-lingual fulltext-like search • Needs speech recognition, automatic translation, thesaurus

  18. Cross-lingual Search Scheme • Archive transcript & query translation • [Diagram] Archive processing (offline): the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, ..., Z (segments 1 to N in each) → translation to E → archive transcript in E • User query processing: query in A → translation to E → query in E → monolingual search
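
The scheme on slide 18 can be illustrated with a minimal Python sketch. It is not the project's implementation: the tiny dictionaries stand in for a real machine translation system, the segment list stands in for ASR output, and all names (`TO_ENGLISH`, `build_index`, `search`, the segment identifiers) are hypothetical. It also assumes that "E" on the slide means English.

```python
from collections import defaultdict

# Hypothetical toy dictionaries standing in for a real machine translation system.
TO_ENGLISH = {
    "cs": {"vlak": "train", "škola": "school", "rodina": "family"},
    "pl": {"pociąg": "train", "szkoła": "school", "rodzina": "family"},
}


def translate_to_english(text, lang):
    """Word-by-word lookup; unknown words pass through unchanged."""
    words = text.lower().split()
    if lang == "en":
        return words
    table = TO_ENGLISH[lang]
    return [table.get(w, w) for w in words]


def build_index(segments):
    """Offline step: translate each transcript segment to English and index it."""
    index = defaultdict(set)
    for seg_id, lang, transcript in segments:
        for word in translate_to_english(transcript, lang):
            index[word].add(seg_id)
    return index


def search(index, query, query_lang):
    """Query-time step: translate the query to English, then search monolingually."""
    words = translate_to_english(query, query_lang)
    if not words:
        return set()
    # A segment matches only if it contains every query word.
    return set.intersection(*(index.get(w, set()) for w in words))


# Made-up segments standing in for ASR output: (segment id, language, transcript).
SEGMENTS = [
    ("cs-001", "cs", "rodina vlak"),
    ("pl-001", "pl", "szkoła rodzina"),
    ("en-001", "en", "the train left"),
]
INDEX = build_index(SEGMENTS)
```

With this setup, `search(INDEX, "vlak", "cs")` finds both the Czech and the English segment that mention a train, because the query and the archive meet in one shared language. Only the query is translated at search time; the archive is translated once, offline.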

  19. Phonetic and Word Search (monolingual) • Automatic Speech Recognition (Univ. of WB) • [Diagram labels: Automatic Speech Recognition; Word and Phonetic Lattice; Transcript Database; Search System; result identifiers VHF04106-0047.18, VHF04167-0146.32, VHF05103-0192.98, ...]
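
One reason to search a lattice rather than a single best transcript is that a word the recognizer ranked second can still be found. The sketch below illustrates this with a heavily simplified structure: a list of time slots per segment, each holding alternative words with made-up scores. This is closer to a confusion network than to a full lattice, and it is not the University of West Bohemia system. All names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    segment_id: str
    score: float


# Hypothetical simplified "lattice": per segment, a list of time slots,
# each mapping alternative words to made-up posterior probabilities.
LATTICES = {
    "seg-A": [{"we": 0.9, "he": 0.1}, {"left": 0.6, "lived": 0.4},
              {"prague": 0.8, "prog": 0.2}],
    "seg-B": [{"in": 1.0}, {"prague": 0.3, "plague": 0.7}],
    "seg-C": [{"the": 1.0}, {"train": 0.95, "rain": 0.05}],
}


def search_word(lattices, word, min_score=0.0):
    """Return segments where `word` appears among the alternatives,
    ranked by the best score the word reaches in that segment."""
    hits = []
    for seg_id, slots in lattices.items():
        best = max((slot.get(word, 0.0) for slot in slots), default=0.0)
        if best > min_score:
            hits.append(Hit(seg_id, best))
    return sorted(hits, key=lambda h: h.score, reverse=True)
```

Here a 1-best transcript of `seg-B` would read "plague", so a plain text search for "prague" would miss it, while `search_word(LATTICES, "prague")` still returns it with a lower score. Raising `min_score` trades recall for precision.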

  20. Machine Translation • State-of-the-Art • Cf. Google (currently best for most language pairs) • Still imperfect (applications need varying levels of quality) • Machine translation of speech transcripts • Big challenge: VERY noisy input • Speech recognition errors • Ungrammatical, non-native, emotional language • Good news • Used in search only (will probably never be shown to users)

  21. Statistical Machine Translation Technology • The idea (1940s/1990s) - imagine this: • Translation by the reverse process: “decoding” • Probabilistic model of the translation process • And probabilistic model of the target language • Probabilities learned from (human) translations • [Diagram labels: Czech text; English text; “Coding”]
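
The "decoding" idea on slide 21 is commonly written as the noisy-channel formulation. As a sketch, assume translation from Czech into English, with $c$ the Czech input and $e$ a candidate English sentence (the symbols are chosen here and do not appear on the slide):

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid c)
        \;=\; \arg\max_{e} \frac{P(c \mid e)\, P(e)}{P(c)}
        \;=\; \arg\max_{e} P(c \mid e)\, P(e)
```

Here $P(c \mid e)$ plays the role of the probabilistic model of the translation process and $P(e)$ the probabilistic model of the target language. $P(c)$ can be dropped because it does not depend on $e$.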

  22. Speech and Language Technology in Search • [Diagram, same labels as on slide 18] Archive processing (offline): the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, ..., Z (segments 1 to N in each) → translation to E → archive transcript in E • User query processing: query in A → translation to E → query in E → monolingual search

  23. Status and Future Plans • Czech testimonies • Monolingual Fulltext Search System operational • in CVHM, users can use both VHA and the UWB UI • English speech recognition of the testimonies • Work has started: data preparation ongoing • Translation to Czech • Thesaurus: manually (high quality necessary) • Will be used in the current interface as well • Data: work ongoing, data preparation • “Lattice” translation experiments underway • Cross-lingual search: work starts in 2013

  24. Thank you! • VHI http://www.usc.edu/vhi • Institute of Formal and Applied Linguistics http://ufal.mff.cuni.cz • Center for Visual History Malach http://malach-centrum.cz • Dept. of Cybernetics, Univ. of West Bohemia, Pilsen, CZ http://www.kky.zcu.cz • The project “Malach” http://malach.umiacs.umd.edu

  25. Closing • Presented at Preserving Survivors’ Memories • Digital Testimony Collections about Nazi Persecution • History, Education and Media • Wednesday, Nov 21, 2012, 11:00, Section A • http://www.preserving-survivors-memories.org
