Explore how the Center for Visual History Malach serves eHumanities by providing access to the USC VHI Archive with innovative technology. Learn about the activities, access methods, and future plans for accessing testimonies of Holocaust survivors.
Language Technology Research Serving eHumanities: New Ways of Accessing the USC Shoah Foundation Archive in the Center for Visual History Malach • Jan Hajič • Institute of Formal and Applied Linguistics, Computer Science School, Charles University in Prague, Czech Republic • malach@knih.mff.cuni.cz | http://www.malach-centrum.cz
From Testimonies to Flexible Access • The USC VHI Archive • Testimonies of Holocaust survivors • Center for Visual History Malach • Access Point to the USC Archive • Activities of CVHM • Access using New Technology • Fulltext (transcript) Search • Cross-lingual Access • Thesaurus Translation • Status and Future Plans
Center for Visual History Malach • Access Point to the USC VHI’s Archive, http://www.usc.edu/vhi
Contents of the Archive • Testimonies recorded in the 1990s
Recording the Testimonies • Visual History Foundation • California, USA (Universal Studios) • 1990s • Analog video recording technology, 30 minute tapes • Teams of 3 people (moderator, video, audio) • Volume • 56 countries, over 105,000 hours of video • 32 languages, ~52,000 testimonies • Half of them in English
Archiving the Testimonies • Digitization • 100s of terabytes of data (NTSC/PAL quality) • Cataloguing (indexing) • Thesaurus (55,000 keywords) • hierarchical, timeline, places • Goal: • Access (search) • Material for projects
Access (Search) • Search by keywords • At 1-minute segments, beginning of topic • Search by particular people, relations • Filter search by • Language spoken • Country of survivor • Experience (survivor/liberator/...) • Not possible: “fulltext” search • Video access: locally available, or on order • Player: usual controls, also by segment, search within video
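The keyword search with filters described on this slide can be illustrated with a minimal Python sketch. It is not the actual VHA search implementation: the `Segment` record, the `search_segments` function, and all sample keywords and testimony IDs are hypothetical, chosen only to mirror the slide (1-minute segments tagged with keywords, filtered by language, country, and experience).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One 1-minute segment of a testimony (hypothetical record layout)."""
    testimony_id: str
    minute: int            # offset of the segment within the testimony
    keywords: frozenset    # thesaurus keywords assigned by the indexers
    language: str          # language spoken
    country: str           # country of the interviewee
    experience: str        # e.g. "survivor" or "liberator"


def search_segments(segments, keyword, language=None, country=None, experience=None):
    """Return segments tagged with `keyword` that pass every filter that is set."""
    hits = []
    for seg in segments:
        if keyword not in seg.keywords:
            continue
        if language is not None and seg.language != language:
            continue
        if country is not None and seg.country != country:
            continue
        if experience is not None and seg.experience != experience:
            continue
        hits.append(seg)
    return hits


# Made-up sample data, for illustration only.
archive = [
    Segment("T001", 12, frozenset({"deportation", "family"}), "Czech", "Czechoslovakia", "survivor"),
    Segment("T001", 13, frozenset({"camp"}), "Czech", "Czechoslovakia", "survivor"),
    Segment("T002", 5, frozenset({"deportation"}), "English", "Poland", "survivor"),
    Segment("T003", 40, frozenset({"liberation", "camp"}), "English", "USA", "liberator"),
]

hits = search_segments(archive, "deportation", language="Czech")
print([(h.testimony_id, h.minute) for h in hits])
```

Because only assigned keywords are matched, a word that was spoken but never indexed cannot be found, which is the "fulltext" limitation the slide points out.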
Access Points • Internet: only limited access so far • Throughput (technical limitations), legal & ethical issues, ... • → Access Points • ~30 worldwide (USA; EU: Berlin, Budapest, Prague, Warsaw; secondary access available) • 2 - 20% of full archive locally • Fast “Internet2” connection • Additional Services • Search and view: standard Internet browser
Center for Visual History Malach • Charles University in Prague, est. 2009, coordinator: Jakub Mlynář
Center for Visual History Malach • Supported by Charles University • Faculty of Mathematics and Physics, CS School • CS School Library & Institute of Formal and Applied Linguistics • Part of LINDAT-Clarin, Language Data Infrastructure • Clarin ERIC – Pan-European network of LTH Centers • 12 workplaces, AV technology, materials • Technology (by Inst. of Formal and Applied Linguistics): • 1 Gbit network locally, dataserver (for video cache) • 2000 testimonies locally (all Czech, Slovak, Polish, many in English) • Geant connection, 5-10 min. for 30 min. video from USC
Center for Visual History Malach: Activities • Seminars • Anniversary seminar (January) • Seminars for students, teachers • Also: foreign visitors (Ukraine – summer 2012) • Workshops • Co-organization of Raoul Wallenberg 100th Anniversary workshop, Nov. 2012 • w/Czech Parliament, Jewish Museum in Prague, Embassies • Tutorials • Using the Archive, How-To-...; Research on Language Technology (with Institute of Formal and Applied Linguistics)
Center for Visual History Malach: Activities • Newsletter Web:
Center for Visual History Malach: Visitors • [Chart: visitors by category, mid-2010 – fall 2012] • Students • Teachers, Researchers • Journalists, Writers, Filmmakers • Other (personal reasons, etc.)
Why “Malach”? • Technology and UI Research Project • 2002-2007 • Multilingual Access to Large Audio arCHives • malach – “angel” in Hebrew • Support: NSF (National Science Foundation) • Visual History Foundation (predecessor of SFI/USC) • IBM Research, Yorktown Heights, NY, USA • Johns Hopkins Univ., Baltimore, MD, USA • Univ. of Maryland, College Park, MD, USA • Charles University in Prague, CZ (IFAL MFF UK) • Univ. of West Bohemia, Pilsen, CZ (Dept. of Cybernetics)
Research in the Malach Project • Research in the area of • Automatic Speech Recognition (of the testimonies) • English, Czech, Slovak, Russian, Polish, Hungarian • Automatic Translation of Thesaurus • Keyword translation • Czech, English • Cross-lingual Audio/Voice Search • Part of the world-wide CLEF 2006, 2007 competition • User interfaces → current VHA search interface
Automatic Speech Recognition • Core “Front-end” Technology • Current State-of-the-Art: 95% accuracy in controlled conditions • Problems: • English: non-native speakers (virtually all 26,000!) • Czech: colloquial speech • All: emotions, elderly people, imperfect recording • Technology issues: not enough in-domain texts • Some improvement achieved by 2007
The AMALACH Project • Applied research project, 2012-2015 • Implement and integrate (some) MALACH project results • Czech National Cultural Heritage Funding • Partners: Charles Univ., Univ. of West Bohemia (and USC) • Selling point: improved access for local (Czech) researchers • USC Archive: 558 Czech-language testimonies • only a fraction (~12%) of 4,613 Czech survivors! • Rest: mostly speaking English • Also: 12,500 segments containing keyword “Czech” • Solution: cross-lingual fulltext-like search • Needs speech recognition, automatic translation, thesaurus
Cross-lingual Search Scheme • Archive transcript & query translation • [Diagram] Offline: the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, …, Z (segments 1 … N in each) → translation to E → archive transcript in E • User query processing: query in A → translation to E → query in E → monolingual search over the archive transcript in E • (Mar. 7, 2012, UFAL Intro)
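The scheme on this slide (translate the archive transcripts to a common language E offline, translate the user's query to E, then search monolingually) can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the word-for-word lexicon `TO_PIVOT` stands in for real machine translation, E is assumed to be English, the transcripts are typed by hand rather than produced by speech recognition, and all names and segment IDs are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lexicons standing in for real machine translation into E.
TO_PIVOT = {
    "cs": {"rodina": "family", "vlak": "train", "škola": "school"},
    "en": {},  # assumed pivot language E: words pass through unchanged
}


def translate_to_pivot(text, lang):
    """Word-for-word 'translation' into E; unknown words are kept as they are."""
    lexicon = TO_PIVOT.get(lang, {})
    return [lexicon.get(word, word) for word in text.lower().split()]


def build_index(segments):
    """Offline step: translate every transcribed segment to E and index its words."""
    index = defaultdict(set)
    for seg_id, lang, transcript in segments:
        for word in translate_to_pivot(transcript, lang):
            index[word].add(seg_id)
    return index


def search(index, query, query_lang):
    """Query step: translate the query to E, then run a monolingual AND-search."""
    words = translate_to_pivot(query, query_lang)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up transcripts; in the real scheme these would come from ASR.
segments = [
    ("t1-s1", "cs", "rodina jela vlak"),
    ("t2-s1", "en", "my family went to school"),
    ("t3-s1", "cs", "škola byla zavřená"),
]
index = build_index(segments)
print(sorted(search(index, "rodina", "cs")))
```

Translating the archive once, offline, keeps the query-time work to a single short translation plus an ordinary monolingual lookup; the price is that translation errors in the transcripts are baked into the index.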
Phonetic and Word Search (monolingual) • Automatic Speech Recognition (Univ. of WB) • [Diagram components: Automatic Speech Recognition, Word and Phonetic Lattice, Transcript Database, Search System] • Sample search hits: VHF04106-0047.18, VHF04167-0146.32, VHF05103-0192.98, …
Machine Translation • State-of-the-Art • Cf. Google (currently best for most language pairs) • Still imperfect (applications need varying levels of quality) • Machine translation of speech transcripts • Big challenge: VERY noisy input: • Speech recognition errors • Ungrammatical, non-native, emotional language • Good news • Used in search only (will probably never be shown to users)
Statistical Machine Translation Technology • The idea (1940s/1990s) - imagine this: • Translation by the reverse process: “decoding” • Probabilistic model of the translation process • And probabilistic model of the target language • Probabilities learned from (human) translations • [Diagram labels: Czech text, English text, “Coding”]
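The "decoding" idea on this slide is usually written as the noisy-channel formulation of statistical machine translation. As a sketch, assume the task is translating a Czech sentence $c$ into an English sentence $e$ (the direction and the symbols are assumptions, not taken from the slide):

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid c)
        \;=\; \arg\max_{e} \frac{P(c \mid e)\, P(e)}{P(c)}
        \;=\; \arg\max_{e} P(c \mid e)\, P(e)
```

Here $P(c \mid e)$ is the probabilistic model of the translation ("coding") process, $P(e)$ is the probabilistic model of the target language, and $P(c)$ can be dropped because it does not depend on $e$. Both models are estimated from data, the translation model from human translations.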
Speech and Language Technology in Search • [Diagram, same scheme as on the “Cross-lingual Search Scheme” slide] Offline: the archive (all audio) → ASR in multiple languages → transcripts in languages A, B, C, …, Z → translation to E → archive transcript in E • User query processing: query in A → translation to E → query in E → monolingual search
Status and Future Plans • Czech testimonies • Monolingual Fulltext Search System operational • in CVHM, users can use both VHA and the UWB UI • English speech recognition of the testimonies • Work has started: data preparation ongoing • Translation to Czech • Thesaurus: manually (high quality necessary) • Will be used in the current interface as well • Data: work ongoing, data preparation • “Lattice” translation experiments underway • Cross-lingual search: work starts in 2013
Thank you! • VHI http://www.usc.edu/vhi • Institute of Formal and Applied Linguistics http://ufal.mff.cuni.cz • Center for Visual History Malach http://malach-centrum.cz • Dept. of Cybernetics, Univ. of West Bohemia, Pilsen, CZ http://www.kky.zcu.cz • The project “Malach” http://malach.umiacs.umd.edu
Closing • Presented at “Preserving Survivors’ Memories: Digital Testimony Collections about Nazi Persecution: History, Education and Media” • Wednesday, Nov 21, 2012, 11:00, Section A • http://www.preserving-survivors-memories.org