Synthesized Audio Descriptions

Hironobu Takagi, Chieko Asakawa, IBM Research – Tokyo

IBM History of Accessibility
• 1960s: Talking Typewriter
• 1975: 1403 Braille Printer
• 1984: Talking 3270 Terminal
• 1988: ScreenReader/DOS
• 1990: VoiceType™
• 1994: Screen Magnifier™/2
• 1997: Home Page Reader
• 1998: ViaVoice®
• 1999: Home Page Reader (Japanese, Italian, French, German, Spanish, US English, UK English)
• 2000: Accessibility Center
• 2004: aDesigner
• 2007: aiBrowser for Multimedia
• 2007: Eclipse Accessibility Tools Framework
• 2008: Social Accessibility
• 2009: ARIA (Accessible Rich Internet Application)

Status of Audio Descriptions in Japan
• Movies (2008; from NPO Media Access Support Center): 0.9% of Japanese movies had audio descriptions; 12.0% had captions.
• TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%.
• TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%.
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)

Captions and Audio Descriptions for TV Programs based on data from MIC and NICT

Problems: Workload and Cost
• Recording an audio description calls for a skilled narrator and a good recording environment.
• Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.
[Figure: workload of audio descriptions (recording plus transcribing) compared with captions (transcribing only)]

History of Text-to-speech Engines
• 1983: DecTalk
• 1985: IBM
• 1996: ProTalker (IBM)
• 2004: Super Voice (IBM)
• 2008: Emotional TTS (IBM)

Possible Reduction of Workload
[Figure: workload of current audio descriptions (recording plus transcribing) compared with synthesized audio descriptions, where the recording workload is reduced by synthesis and the transcribing workload by tool support]

Acceptance Ratio (United States)
• Method: online survey
• Participants: 236 (39 low-vision, 197 blind)
• Genre: education and documentary
• Voice quality: human and TTS (Heather)
[Figure: stacked bars for Set 1 to Set 4 showing the share of answers rated Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, and Comfortable]
Consistently, 70% to 80% of participants answered better than neutral.
Research and development of standard specifications for audio descriptions and captions for people with visual and hearing impairments

Video Accessibility Project: Goals
• Prove feasibility of text-based audio descriptions via user studies.
• Work with professional teams for audio descriptions
  • Japan: IBM with CAP and content from NHK
  • U.S.: WGBH
• Create an open source platform for audio descriptions and captions
  • Authoring tools and players
  • Captions and text-based audio descriptions
  • Based on Eclipse.org Accessibility Tools Framework (ACTF)
• Contribute to standardization of Internet media accessibility
  • Focus on “missing markups” in the existing standards.
  • Maintain neutrality for existing standards.
  • HTML5 is the primary target.
Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)

Thank you!

ACTF Script Editor • Authoring tool, specialized for audio descriptions. • Flexible to import and export various formats. • Planned for release as open source in March.
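As a rough illustration of what exporting a description script to a common text format could look like, here is a minimal Python sketch that serializes timed descriptions as SRT cues. The slide does not list the formats the ACTF Script Editor supports; SRT is chosen only because it appears elsewhere in this deck. The `Description` class and both function names are hypothetical and are not part of ACTF.

```python
from dataclasses import dataclass


@dataclass
class Description:
    """One audio description entry (hypothetical structure, not ACTF's)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(entries: list[Description]) -> str:
    """Serialize entries as SRT cues, numbered from 1 in start-time order."""
    blocks = []
    ordered = sorted(entries, key=lambda e: e.start)
    for index, entry in enumerate(ordered, start=1):
        timing = f"{srt_timestamp(entry.start)} --> {srt_timestamp(entry.end)}"
        blocks.append(f"{index}\n{timing}\n{entry.text}")
    return "\n\n".join(blocks) + "\n"
```

Because the script stays plain text, the same entries could in principle be written out in other timed-text formats by swapping only the serializer.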

Audio Guides for Museums and the Stage
• Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visitors with visual impairments but to help everyone learn about the exhibits.)
  • Examples of audio guide providers:
    • National Museum of Nature and Science, Tokyo
    • The National Museum of Western Art
    • Hiroshima Museum of Art
    • Osaka Museum of Natural History
    • Tokyo Museum of Fire Department
    • Shimane Museum of Ancient Izumo
  • Almost every museum in Japan provides an audio guide.
  • Audio guide equipment is generally purpose-built by the manufacturer and loaded with prerecorded voice. A newer approach uses the NINTENDO DS, with the content downloaded to it at the museum.
• The stage: Audio guides are provided mainly by small drama groups.
  • Examples of audio guide providers:
    • The drama group "Bakkari-Bakkari" provides an audio guide once per performance period.
    • A drama group in the city of Kawasaki, Kanagawa Pref.
    • The drama group "DORA"
  • Captions are provided by, for example, SHIKI THEATRE COMPANY. Very few large-scale theatre productions provide audio guides.

Laws and Regulations
• 1993: Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons
• 1997: MIC defined a goal to “provide captions to all TV programs by 1997”
• 1998: Broadcast Law, Article 3-2 (4)
  • Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.
• 2007: Signed the “Convention on the Rights of Persons with Disabilities”
• 2010: New JIS (Japanese Industrial Standard) for Web Accessibility
  • Technical guidelines are fully harmonized with WCAG 2.0

ACTF aiBrowser
1. Direct audio control
  • Allow users to increase or lower the volume, stop or play, and control audio speed by using simple keyboard commands.
2. User interface simplification
  • Structurally simplify interfaces by converting dynamic visual interfaces into static text-based interfaces
  • Dynamically add alternative texts to images and buttons
3. Audio descriptions with text
  • Infrastructure to provide video descriptions at low cost

Status of Audio Descriptions in Japan
• Movies (2008; from NPO Media Access Support Center): 0.9% of Japanese movies had audio descriptions; 12.0% had captions.
• TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%.
• TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%.
• Internet, captions: 0.2% of video content in the Open Courseware project had captions (2 among 1,474).
• Internet, audio descriptions: 0.0%. A check of popular video sharing services and educational online videos found no videos with audio descriptions (except for videos prepared as examples of audio descriptions). Source: team investigation.
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)

Analysis of Standards and Possible Focus
Layer of markups (vocabulary lists) for text-based audio descriptions:
• Association with video contents, multilingual, etc.: Mozilla <itext>, W3C SMIL, etc.
• Index structure for video (scenes, chapters, etc.): each video format has its own specifications (DVD, MPEG, etc.)
• Personalization
• Unique for audio descriptions (extended, audio control, block, etc.): FOCUS AREA!
• Voice styles and emotional expressions: W3C SSML, W3C Emotion ML, etc.
• Description (textual information): SRT, W3C TT DFXP
• Addressing (timing): flexible addressing
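To make the voice-style layer concrete, the following Python sketch wraps one description text in a minimal SSML document. It is only an illustration under stated assumptions: the function name `to_ssml` and its parameters are hypothetical, the language tag is fixed to `en-US` for simplicity, and nothing here reflects markup actually proposed by the project. It uses only the standard SSML elements `<speak>`, `<prosody>`, and `<break>`.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def to_ssml(text: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap description text in a minimal SSML document.

    rate is passed to <prosody rate="...">; a positive pause_ms adds a
    leading <break> so the description starts after a short silence.
    """
    parts = [f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">']
    if pause_ms > 0:
        parts.append(f'<break time="{pause_ms}ms"/>')
    # escape() protects &, <, and > in the description text.
    parts.append(f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>")
    parts.append("</speak>")
    return "".join(parts)
```

For example, `to_ssml("She opens the door.", rate="fast")` yields a document that a synthesizer supporting SSML could speak at a faster rate, which is one way a tight gap between dialogues might be handled.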

2nd study: Level of Description
[Figure: rate of correct answers for each level of description, heard once or twice]
Using the extended description and listening twice both improved comprehension.

Difficulties in Online Videos
• News
• Entertainment
• E-Learning
• Consumer-Generated Videos
• Historical Videos
Now is the time to create a new technical framework for audio descriptions!

Prior Projects
• e-Inclusion project in Canada supported by Canadian Heritage
  • CRIM (Centre de recherche informatique de Montréal)
  • Four-year project completed this year
  • Authoring tool and playback tool
• LiveDescribe by Ryerson University
  • Community-based authoring system
  • Authoring tool and playback tool
• NHK Research
  • Prototyped and tested TTS-based audio descriptions
• aiBrowser
  • Developed by IBM Research and contributed to Eclipse.org
  • Audio descriptions with Flash, QuickTime, and Windows Media Player
• Other trials
  • HTML5 + Live Region demo (Firefox team)
  • WebShake
    • Japanese online caption provider prototyped with TTS-based audio descriptions.
  • ACAV, etc.

Method | Distribution | Voice quality | Authoring cost | System cost
Human voice (current model) | Audio | High | High | High
Pre-recorded synthesized audio | Audio | Low* | Low | High**
Server-side synthesizer | Text | Low | Low* | High
Client-side synthesizer | Text | Lowest | Low | Low***
[Diagram: flexibility and synthesis path (human narrator or synthesizer, text or audio) for each method]
* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.

Experimental Results (Japan)
• 1st study (Sep 2009)
  • 3 blind or visually impaired participants
  • Face-to-face, one-to-one sessions
  • Focused on the voice quality, level of description, and speech speed
• 2nd study (Feb 2010)
  • 24 blind or visually impaired participants
  • Face-to-face, small group sessions
  • Consisted of 4 sub-studies for long-term listening, expressive voices, describer expertise, and level of description

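Current Status of Captions and Audio Descriptions in Japan
• Movies (Japanese films released in 2008; from NPO Media Access Support Center materials): 0.9% were provided with audio descriptions; 12.0% were provided with captions.
• Broadcasting, captions: share of captioned broadcast hours in total broadcast hours in fiscal 2008 (*1): NHK General 49.4%, Tokyo commercial broadcasters 42.3%.
• Broadcasting, audio descriptions: share of described broadcasts on terrestrial channels of the Tokyo key stations in fiscal 2008 (*2): NHK General 5.6%, Tokyo commercial broadcasters 0.4%.
• Internet, captions: 0.2% of OpenCourseWare (educational content) videos had captions (2 of 1,417).
• Internet, audio descriptions: 0.0%. A sampling survey of major video distribution sites and educational content found no videos with audio descriptions. Source: the project's own survey.
*1: Ministry of Internal Affairs and Communications press release, "Results of Captioned Broadcasting, etc. in Fiscal 2008"
*2: NICT (National Institute of Information and Communications Technology) materials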

1st study: Results
• The descriptions greatly improved the user experience regardless of the voice quality.
• The participants’ comments indicated that modern TTS was almost comparable to a human voice, though the human voice was still preferred.

2nd study: Sub-studies
• Long-term listening
  • Assess if TTS-based descriptions are acceptable for listening to full-length programs
  • Target videos: cartoon (comedy), drama (tragedy), documentary
• Expressive voices
  • Determine if the expressive TTS improves the user experience
  • Target videos: cartoon (comedy), drama (tragedy)
• Describer expertise
  • Assess how the describer expertise affects understanding
  • Target video: public service announcement (warning about fraud)
• Level of description
  • Assess how the level of description and repetitive listening affect understanding
  • Target video: instructional program (how to fold and store clothing)

2nd study

2nd study: Long-term Listening
[Figure: effectiveness scores for each video category]
• TTS-based descriptions were generally acceptable for full-length programs.
• In the comments, the documentary film received the highest evaluation, but that was not clear from the effectiveness scores.

2nd study: Describer Expertise
[Figure: frequency of effectiveness scores (1 to 5) for Expert (Normal), Expert (Extended), Novice (Normal), and Novice (Extended)]
• Novice (Normal) was not preferred (score: 3.0).
• Novice (Extended) was comparable (score: 4.3) to expert descriptions (score: 4.3 for normal, 4.6 for extended).

Typical Client-side TTS Setting
[Diagram: components are Online Video, Website, Script Editor, Video Player, Audio Description Script, and Metadata Repository; arrows are labeled Refer, Browse, Post, and Fetch]
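One way to picture the player side of this setting is the Python sketch below. It assumes, as a reading of the diagram rather than a documented design, that the video player fetches a text script from the metadata repository and hands each description to a local synthesizer when playback reaches its start time. `Cue`, `DescriptionPlayer`, `fetch_script`, the dictionary standing in for the repository, and the `speak` callback standing in for a TTS engine are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cue:
    start: float  # seconds into the video at which the description begins
    text: str


def fetch_script(repository: dict[str, list[Cue]], video_url: str) -> list[Cue]:
    """Look up the description script for a video; empty if none was posted."""
    return repository.get(video_url, [])


class DescriptionPlayer:
    """Speaks each cue once when playback time reaches its start time."""

    def __init__(self, cues: list[Cue], speak: Callable[[str], None]) -> None:
        self._cues = sorted(cues, key=lambda c: c.start)
        self._speak = speak  # stands in for a client-side TTS engine
        self._next = 0       # index of the next cue not yet spoken

    def on_time_update(self, position: float) -> None:
        """Call periodically with the current playback position in seconds."""
        while (self._next < len(self._cues)
               and self._cues[self._next].start <= position):
            self._speak(self._cues[self._next].text)
            self._next += 1

    def seek(self, position: float) -> None:
        """After a seek, skip cues that start before the new position."""
        self._next = sum(1 for c in self._cues if c.start < position)
```

In this sketch only text crosses the network; the synthesis happens at the client, which is why client-side software support is needed.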

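Related Standards and Laws
• W3C Web Content Accessibility Guidelines 2.0 (Recommendation, December 2008)
  • 1.2.5 Audio Description (Prerecorded) (Level AA)
  • 1.2.7 Extended Audio Description (Prerecorded) (Level AAA)
• Japan: revised Copyright Act (enacted June 2009, in force January 1, 2010)
• Japan: JIS X 8341-3:2010 (publication expected around June 2010)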
