Privacy Protection for Life-log Video
Jayashri Chaudhari
November 27, 2007
Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40507
Outline • Motivation and Background • Proposed Life-Log System • Privacy Protection Methodology • Face Detection and Blocking • Voice Segmentation and Distortion • Experimental Results • Segmentation Algorithm Analysis • Audio Distortion Analysis • Conclusions
What is a Life-Log System? “A system that records everything, at every moment and everywhere you go” • Applications include • Law enforcement • Police questioning • Tourism • Medical questioning • Journalism • Existing systems/work • “MyLifeBits” project at Microsoft Research • “WearCam” project by Steve Mann at the University of Toronto • “Cylon Systems” (http://cylonsystems.com), a portable body-worn surveillance system from the UK
Technical Challenges • Security and Privacy • Information management and storage • Information Retrieval • Knowledge Discovery • Human Computer Interface
Why Privacy Protection? • Privacy is a fundamental right of every citizen • Emerging technologies threaten this right • There are no clear, uniform rules and regulations governing video recording • People are resistant to technologies like life-logging • Without tackling these issues, the deployment of such emerging technologies is impossible
Research Contributions • Practical audio-visual privacy protection scheme for life-log systems • Performance measurement (audio) on • Privacy protection • Usability
Proposed Life-log System “A system that protects the audiovisual privacy of the persons captured by a portable video recording device”
Privacy Protection Scheme • Design Objectives • Privacy: hide the identity of the subjects being captured • Privacy versus usefulness: the recording should convey sufficient information to remain useful • The goal is √ usefulness with √ privacy, rather than × usefulness / √ privacy or √ usefulness / × privacy
Design Objectives • Anonymity or Ambiguity • The scheme should make the identity of the recorded subjects ambiguous • Every individual should look and sound identical • Reduces correlation attacks • Speed • The protection scheme should work in real time • Interview Scenario • The producer speaks with a single subject in a relatively quiet room
Privacy Protection Scheme Overview • Audio path: Audio Segmentation → Audio Distortion • Video path: Face Detection and Blocking • Both streams: Synchronization & Multiplexing → Storage • S: Subject (the person being recorded) • P: Producer (the person using the system)
Voice Segmentation and Distortion • Compute the windowed power Pk • If Pk < TS: Statek = Statek-1 (silence: hold the previous state, Subject or Producer) • Else if Pk < TP: Statek = Subject • Else: Statek = Producer • Subject speech is pitch-shifted before storage • We use the PitchSOLA time-domain pitch-shifting method* *“DAFX: Digital Audio Effects” by Udo Zölzer et al.
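The two-threshold state machine above can be sketched as follows. The threshold values are hypothetical (the slides do not give concrete numbers); the assumption, per the interview scenario, is that the producer wears the microphone and is therefore the loudest source.

```python
import numpy as np

def segment_audio(x, win=1024, t_s=0.001, t_p=0.01):
    """Label each analysis window as silence, subject, or producer.

    t_s (silence threshold T_S) and t_p (producer threshold T_P) are
    hypothetical values; producer speech is near-field and loudest.
    """
    states, prev = [], "silence"
    for i in range(0, len(x) - win + 1, win):
        p_k = float(np.mean(x[i:i + win] ** 2))  # windowed power P_k
        if p_k < t_s:
            state = prev            # below T_S: hold State_{k-1}
        elif p_k < t_p:
            state = "subject"       # moderate power: far-field subject
        else:
            state = "producer"      # high power: near-field producer
        states.append(state)
        prev = state
    return states
```

Only windows labeled "subject" would then be routed to the pitch-shifting stage.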
Pitch Shifting Algorithm (Synchronous Overlap and Add) • Step 1: Time stretching by a factor of α, using a window of size N and step size Sa • Input grains x1(n), x2(n), … are re-placed at synthesis hop α·Sa • Each grain is shifted by the lag Km of maximum correlation and cross-faded (mixing) to reduce discontinuity in phase and pitch • Step 2: Re-sampling by a factor of 1/α to change the pitch
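A minimal sketch of the two SOLA steps, assuming a correlation search range `seek` (not specified on the slide) and linear-interpolation resampling; a production implementation would follow the PitchSOLA reference in DAFX:

```python
import numpy as np

def pitch_shift_sola(x, alpha, N=1024, Sa=256, seek=64):
    """Pitch-shift x by factor alpha: SOLA time stretch, then resample."""
    Ss = int(round(alpha * Sa))               # synthesis hop = alpha * Sa
    out = np.zeros(int(len(x) * (alpha + seek / Sa)) + 2 * N)
    out[:N] = x[:N]
    out_pos, in_pos, L = Ss, Sa, N // 4       # L: overlap/cross-fade length
    while in_pos + N < len(x):
        grain = x[in_pos:in_pos + N]
        # Step 1b: find lag K_m of maximum correlation in the overlap
        # region; prefer the smallest |k| among equal correlations.
        best_k, best_c = 0, -np.inf
        for k in sorted(range(-min(seek, out_pos), seek + 1), key=abs):
            c = float(np.dot(out[out_pos + k:out_pos + k + L], grain[:L]))
            if c > best_c:
                best_c, best_k = c, k
        p = out_pos + best_k
        fade = np.linspace(0.0, 1.0, L)       # cross-fade (mixing)
        out[p:p + L] = out[p:p + L] * (1 - fade) + grain[:L] * fade
        out[p + L:p + N] = grain[L:]
        out_pos, in_pos = p + Ss, in_pos + Sa
    stretched = out[:out_pos]
    # Step 2: re-sample by 1/alpha -> roughly original duration, new pitch
    idx = np.arange(0, len(stretched) - 1, alpha)
    return np.interp(idx, np.arange(len(stretched)), stretched)
```

Time stretching preserves pitch while changing duration; the final resampling restores the duration and scales the pitch by α.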
Face Detection and Blocking • Camera → Face Detection → Face Tracking → Subject Selection → Selective Blocking • The audio segmentation results (producer talking vs. subject talking) drive the selective blocking • Face detection is based on Viola & Jones (2001)
Initial Experiments1 • Analysis of the segmentation algorithm • Analysis of the audio distortion algorithm • 1) Accuracy in hiding identity • 2) Usability after distortion 1: Chaudhari J., S.-C. Cheung, and M. V. Venkatesh. Privacy protection for life-log video. In IEEE Signal Processing Society SAFE 2007: Workshop on Signal Processing Applications for Public Security and Forensics, 2007.
Segmentation Experiment • Experimental Data • Interview scenario in a quiet meeting room • Three interview recordings, each about 1 minute 30 seconds long • Ground truth marks Subject (S) / Producer (P) speech and silence transitions
Comparison With CMU Segmentation Algorithm CMU audio segmentation algorithm1 used as benchmark 1:Matthew A. Seigler, Uday Jain, Bhiksha Raj, and Richard M. Stern. Automatic segmentation, classification and clustering of broadcast news audio. In Proceedings of the Ninth Spoken Language Systems Technology Workshop, Harriman, New York, 1997.
Speaker Identification Experiment • Experimental Data • 11 test subjects, 2 voice samples from each subject • One voice sample is used for training and the other for testing • Public-domain speaker recognition software • Script 1 (Train): trains the speaker recognition software • Script 2 (Test): tests the performance of audio distortion in hiding identity
Speaker Identification Results Distortion 1: (N=2048, Sa=256, α =1.5) Distortion 2: (N=2048, Sa=300, α =1.1) Distortion 3: (N=1024, Sa=128, α =1.5)
Usability Experiments • Experimental Data • 8 subjects, 2 voice samples from each subject • One voice sample is kept undistorted and the other is distorted • Manual transcription by 5 human testers • Transcription 1: undistorted voice (1.wav) • Transcription 2: distorted voice (2.wav), with unrecognized words marked
Usability after Distortion • Word Error Rate (WER): standard measure of word recognition error for speech recognition systems • WER = (S + D + I) / N • S = # substitutions, D = # deletions, I = # insertions, N = # words in the reference sample • Tool used: NIST SCLITE
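The WER formula amounts to a word-level edit distance between reference and hypothesis; a minimal sketch (SCLITE performs the alignment with additional options such as case handling and detailed scoring reports):

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn the first i ref words into first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])   # substitution
            d[i][j] = min(sub,
                          d[i - 1][j] + 1,                   # deletion
                          d[i][j - 1] + 1)                   # insertion
    return d[len(r)][len(h)] / len(r)
```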
Extended Experiments • Data set • TIMIT (Texas Instruments / Massachusetts Institute of Technology) speech corpus • Experimental Setup • Allowable range of alpha (α): 0.2–2.0 • Five alpha values (α = 0.5, 0.75, 1, 1.25, 1.40) • Increased scope of experiments • “Subjective Experiments”: human testers assess privacy and usability • Privacy experiments (speaker identification)
Experimental Setup • TIMIT corpus: 630 speakers, 10 audio clips per speaker • Our experiments: 30 speakers, 5 audio clips per speaker • Five sets: Set A (α=1), Set B (α=0.5), Set C (α=0.75), Set D (α=1.25), Set E (α=1.40) • Total of 30 audio clips in each set • The audio clips from each set are re-divided into five groups (1–5) • Each group consists of 6 audio clips randomly selected from each set • Each group was assigned to three testers, who were asked to perform 3 tasks
Subjective Experiments • Task 1: Transcribe the audio clips in the assigned group • Purpose: determine the usability of the recording after distortion • Results • Metric: WER for each transcription by each tester • The average WER for each clip over the 3 testers gives the WER for that speaker at the given alpha (α) value
Average WER per speaker for each alpha value
Average WER per Set [bar chart over Sets A–E; values 14.2, 14.4, 15.3, 22.4, and 100]
Statistical Analysis • Z-test calculations • Null hypothesis: the average WER does not change from Set A (before distortion) after distortion, for a given value of the pitch-scaling parameter (alpha) • H0: p1 = p2 (null hypothesis); Ha: p1 ≠ p2 • Z-test parameters and results (tables)
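The slides list only the hypotheses; a standard two-proportion z-test of H0: p1 = p2 can be sketched as follows, where the counts x1/n1 and x2/n2 are placeholders for the word-error counts underlying each WER (the slides' actual test parameters are in tables not reproduced here):

```python
from math import sqrt, erf

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                   # pooled estimate under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p1 - p2) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))     # standard normal CDF at |z|
    return z, 2 * (1 - phi)                     # z statistic, two-sided p-value
```

Rejecting H0 at the usual 5% level corresponds to a p-value below 0.05.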
Subjective Experiments • Task 2: Identify the number of distinct voices in each subset of the assigned group • Purpose: estimate the ambiguity created by pitch shifting • Results: (table)
Subjective Experiments • Task 3: For each clip from the subset of Set A (the original, undistorted speech set), identify a clip in the other subsets in which the same speaker may be speaking • Purpose: qualitatively measure the assurance of privacy protection achieved by distortion • Results: none of the speakers from Set A were identified in the other, distorted sets (100% recognition error rate)
Privacy Experiments • Speaker Identification Experiments • ASR tools (LIA_SpkDet and ALIZE)1 by the LIA lab at the University of Avignon • Speaker verification tool • GMM-UBM (Gaussian Mixture Model – Universal Background Model) • Single speaker-independent background model • Decision: likelihood ratio between the target speaker model and the UBM, compared against a threshold 1: Bonastre, J.-F., Wild, F., ALIZE: a free, open tool for speaker recognition, http://www.lia.univ-avignon.fr/heberges/ALIZE/
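The GMM-UBM decision scores a test utterance by the log-likelihood ratio Λ(X) = log p(X | target) − log p(X | UBM). A toy sketch with diagonal-covariance Gaussians (the real system uses 2048-component GMMs trained and scored by the ALIZE/LIA_RAL tools):

```python
import numpy as np

def log_gauss(x, mu, var):
    """Per-frame log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def gmm_loglik(X, weights, mus, vars_):
    """Average per-frame log-likelihood of frames X under a diagonal GMM."""
    # log-sum-exp over mixture components for each frame
    comp = np.stack([np.log(w) + log_gauss(X, m, v)
                     for w, m, v in zip(weights, mus, vars_)])
    m = comp.max(axis=0)
    return float(np.mean(m + np.log(np.exp(comp - m).sum(axis=0))))

def llr_score(X, target, ubm):
    """GMM-UBM decision statistic: log p(X|target) - log p(X|UBM)."""
    return gmm_loglik(X, *target) - gmm_loglik(X, *ubm)
```

A score above a tuned threshold accepts the target-speaker hypothesis; ranking speakers by this score gives the "average rank" reported in the privacy results.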
LIA_RAL SpkDet Pipeline • Front processing • Feature extraction (SPRO4 tool): 32 coefficients = 16 LFCC + 16 derivative coefficients • Silence frame removal (EnergyDetector) • Parameter normalization (NormFeat, warping) • World modeling (TrainWorld): 2 GMMs of 2048 components each, 1 male and 1 female • Target speaker modeling (TrainTarget): Bayesian adaptation (MAP) of the world model • Speaker detection (ComputeTest) on the feature vectors
Experimental Setup • World model • Number of male speakers = 325 • Number of female speakers = 135 • Target speaker models • Number of male test clips = 20 • Number of female test clips = 10 • Two sets of experiments • Same Model: world model and individual speaker models trained on distorted speech with the corresponding alpha • Cross Model: world model and individual speaker models trained on undistorted speech
Privacy Results • Numbers in the table are the average rank of the true speaker of the test clips for the corresponding alpha value • Conclusions • Cross Model: distorted speech, no matter what alpha value is used, is very different from the original speech • Same Model: Set B and Set C do not provide adequate protection, as the rank is still very near the top
Conclusions • Proposed a real-time implementation of voice distortion and face blocking for privacy protection in life-log video • Analysis of audio segmentation • Analysis of audio distortion for usability • Analysis of audio distortion for privacy protection
Acknowledgment • Prof. Samson Cheung • People at the Center for Visualization and Virtual Environments • Prof. Donohue and Prof. Zhang Thank you!
Voice Distortion • Voice identity • Vocal tract (formants): filters • Vocal cords (pitch): excitation source • Different ways to distort audio • Random mixture: makes the recording useless • Voice transformation: more complex, not suitable for real-time applications • Pitch shifting: changes the pitch of the voice while keeping the recording useful; simple, with low complexity • PitchSOLA time-domain pitch-shifting method* *“DAFX: Digital Audio Effects” by Udo Zölzer et al.
Cross Model • World model and individual speaker models (training set: undistorted speech) • Same Model • World model and individual speaker models (training set: distorted speech)