UltraGesture: Fine-Grained Gesture Sensing and Recognition
Kang Ling†, Haipeng Dai†, Yuntang Liu†, and Alex X. Liu†‡
Nanjing University†, Michigan State University‡
SECON'18, June 12th, 2018
Outline
• Motivation
• Doppler vs. FMCW vs. CIR
• Solution
• Evaluation
Motivation: (a) AR/VR require new UI technology; (b) smartwatches with tiny input screens; (c) scenarios where touch input is inconvenient.
Related works: (a) RF-IDraw, SIGCOMM'14; (b) Kinect; (c) WiSee, MobiCom'13; (d) WiFinger, UbiComp'16.
Related works (ultrasound): (a) DopLink, UbiComp'13; (b) SoundWave, CHI'12; (c) AudioGest, UbiComp'16; (d) LLAP, MobiCom'16; (e) FingerIO, CHI'16.
Outline
• Motivation
• Doppler vs. FMCW vs. CIR
• Solution
• Evaluation
Principle
Ultrasound-based gesture recognition relies on either speed estimation or distance estimation.
Doppler Effect
With Fs = 48 kHz sampling and a 2048-point FFT, the frequency resolution is 48000/2048 ≈ 23.43 Hz. At a 20 kHz carrier with c = 340 m/s, this corresponds to a speed resolution of only about 0.398 m/s.
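For reference, a worked form of this bound, assuming the one-way Doppler relation Δf = f·v/c (which reproduces the slide's 0.398 m/s figure; a round-trip factor of 2 would halve it):

```latex
\Delta f_{\min} = \frac{F_s}{N_{\mathrm{FFT}}} = \frac{48000}{2048} \approx 23.43~\mathrm{Hz},
\qquad
\Delta v_{\min} = \frac{c\,\Delta f_{\min}}{f} = \frac{340 \times 23.43}{20000} \approx 0.398~\mathrm{m/s}
```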
FMCW
With a 2048-sample chirp of bandwidth B = 4 kHz and c = 340 m/s, the range resolution is c/(2B) = 4.25 cm.
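The corresponding FMCW bound is the standard range-resolution formula, which matches the 4.25 cm figure:

```latex
\Delta d_{\min} = \frac{c}{2B} = \frac{340}{2 \times 4000}~\mathrm{m} = 4.25~\mathrm{cm}
```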
Channel Impulse Response - CIR
The received sound signal can be divided into three components in the time domain: direct sound, early reflections, and late reverberation.
Channel Impulse Response - CIR
We use a Least Squares (LS) method to estimate the CIR, giving a theoretical distance resolution of 7.08 mm.
Linear system model: x(t) → h(t) → y(t)
Time resolution: 1/Fs = 1/48000 ≈ 0.02 ms
Distance resolution: c/Fs = 340/48000 ≈ 7.08 mm
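In discrete time, the linear-system model above is a convolution, and the LS estimate has the standard closed form (notation here is assumed, not taken from the slides):

```latex
y[n] = \sum_{k=0}^{L-1} h[k]\,x[n-k] + w[n]
\;\Longrightarrow\;
\mathbf{y} = \mathbf{X}\mathbf{h} + \mathbf{w},
\qquad
\hat{\mathbf{h}} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
```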
Doppler vs. FMCW vs. CIR
[Comparison figure; the static part of the CIR is labeled "direct sound + static reflections"]
Outline
• Motivation
• Doppler vs. FMCW vs. CIR
• Solution
• Evaluation
System overview Our system includes a training stage and a recognition stage. We propose to use a CNN model to recognize the dCIR images.
CIR - Barker Code
We choose the Barker code as our baseband data because of its ideal autocorrelation property.
(a) Barker code 13; (b) Barker code autocorrelation; (c) sent baseband S[n]; (d) received baseband R[n]
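A quick numerical check of that autocorrelation property (a minimal sketch, not the paper's code):

```python
import numpy as np

# Barker-13 sequence: its aperiodic autocorrelation has a peak of 13
# and every sidelobe has magnitude <= 1.
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1])

acf = np.correlate(barker13, barker13, mode="full")
print(acf.max())                      # 13, at zero lag
print(np.abs(acf[acf != 13]).max())   # 1, the worst sidelobe
```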
CIR - up-down conversion
The baseband signal is modulated to 20 kHz to keep it inaudible. A passband filter is used to avoid frequency leakage.
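A sketch of the up-conversion step, assuming a cosine carrier and a Butterworth band-pass filter (the filter order and the 18–22 kHz cutoffs are illustrative assumptions, not the paper's values):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 48_000   # sampling rate (Hz)
FC = 20_000   # carrier frequency (Hz)

def upconvert(baseband: np.ndarray) -> np.ndarray:
    """Modulate a real baseband signal onto the 20 kHz carrier,
    then band-pass it to suppress frequency leakage."""
    t = np.arange(len(baseband)) / FS
    passband = baseband * np.cos(2 * np.pi * FC * t)
    b, a = butter(4, [18_000, 22_000], btype="bandpass", fs=FS)
    return filtfilt(b, a, passband)
```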
CIR - channel estimation
We estimate the CIR in each frame (10 ms) through a Least-Squares (LS) method.
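A minimal per-frame LS estimate, assuming the transmitted training signal x and the received frame y are both known; the default tap count L = 140 matches the dCIR indexes used later, but the function and variable names are illustrative:

```python
import numpy as np
from scipy.linalg import toeplitz

def estimate_cir(x: np.ndarray, y: np.ndarray, L: int = 140) -> np.ndarray:
    """Least-squares CIR estimate with L taps, given the known
    input x and the received frame y (same length)."""
    # Convolution matrix: X[n, k] = x[n - k], so that (X @ h)[n]
    # is the convolution sum over the L channel taps.
    first_row = np.zeros(L)
    first_row[0] = x[0]
    X = toeplitz(x, first_row)                 # shape (N, L)
    h, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X h - y||^2
    return h
```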
CIR - gesture image
Assembling the CIRs along time yields a CIR image. Subtracting each frame's CIR from the previous frame's reveals the moving parts; we call the result a dCIR image.
dCIR - gesture detection
We use the variance of the dCIR samples over time to detect gestures.
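A sketch covering both steps — frame differencing to form the dCIR, then variance-based detection (the threshold value is a placeholder assumption):

```python
import numpy as np

def dcir_image(cir: np.ndarray) -> np.ndarray:
    """cir: (frames, taps) CIR estimates, one row per 10 ms frame.
    Differencing consecutive frames cancels the static channel
    (direct sound + static reflections), leaving moving reflectors."""
    return np.abs(cir[1:] - cir[:-1])

def detect_gesture(dcir: np.ndarray, thresh: float = 1e-3) -> np.ndarray:
    """Boolean mask of frames whose dCIR variance across taps
    exceeds the threshold (placeholder value, not from the paper)."""
    return dcir.var(axis=1) > thresh
```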
CNN model - data tensor
We obtain a dCIR measurement at each microphone; combining all microphones' data together gives a data tensor. We collect dCIR images from up to 4 microphones on our self-designed speaker-microphone kit.
CNN model
We choose a CNN (Convolutional Neural Network) as the classifier in our work.
• Classifiers such as SVM and KNN may miss valuable information during the feature extraction process.
• CNNs are good at classifying high-dimensional tensor data, e.g., image classification.
CNN model - input layer
We set the input layer of our CNN model to a 160 × 140 × 4 data tensor:
• 160 frames (1.6 seconds)
• 140 dCIR indexes
• 4 microphones
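Assembling that input is just a stack of the four per-microphone dCIR images along a channel axis (placeholder data; shapes from the slide):

```python
import numpy as np

# One dCIR image per microphone: 160 frames (1.6 s) x 140 dCIR indexes.
mic_images = [np.zeros((160, 140)) for _ in range(4)]  # placeholder data
tensor = np.stack(mic_images, axis=-1)                 # shape (160, 140, 4)
```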
CNN model - structure
• CNN architecture: Input → [Conv → ReLU → Pooling] × layers → [FC] → Output
• Our UltraGesture recognition model has about 2.48 M parameters with 5 layers.
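A minimal PyTorch sketch of a network with this shape; the slides fix only the Conv→ReLU→Pool ×5 → FC pattern, the 160×140×4 input, and the 12 gesture classes, so the channel widths and kernel sizes below are assumptions:

```python
import torch
import torch.nn as nn

class UltraGestureNet(nn.Module):
    """Input -> [Conv -> ReLU -> Pool] x 5 -> FC -> 12 classes.
    Layer widths are illustrative, not the paper's exact values."""
    def __init__(self, n_classes: int = 12):
        super().__init__()
        layers, in_ch = [], 4                  # 4 microphone channels
        for out_ch in (16, 32, 64, 64, 128):   # assumed channel widths
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.LazyLinear(n_classes)  # infers flattened size

    def forward(self, x):                      # x: (batch, 4, 160, 140)
        return self.classifier(torch.flatten(self.features(x), 1))

logits = UltraGestureNet()(torch.randn(1, 4, 160, 140))  # -> shape (1, 12)
```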
Outline
• Motivation
• Doppler vs. FMCW vs. CIR
• Solution
• Evaluation
Evaluations - devices
We implemented our system and collected data from a Samsung S5 mobile phone and a self-designed kit.
• Samsung S5: 1 speaker and 2 microphones
• Self-designed kit: 2 speakers (1 used) and 4 microphones
(a) Samsung S5; (b) self-designed kit
Evaluations - gestures
We collected data for 12 basic gestures performed by 10 volunteers (8 male, 2 female) over a time span of more than a month under different scenarios.
Evaluations - 1
The average recognition accuracy is 97.92% under ten-fold cross validation.
Data source: Samsung S5, in a normal office environment with noise from air conditioners and servers.
Evaluations - 2
• We test the performance under different scenarios:
• Recognition accuracy improves from 92% (1 microphone) to 98% (4 microphones).
• The overall gesture recognition accuracy drops slightly from 98% to 93% when the noise level increases from 55 dB to 65 dB.
Evaluations - 3 We test the system performance under some typical usage scenarios: • New users: 92.56% • Left hand: 98.00% • With gloves on: 97.33% • Occlusion: 85.67% • Using UltraGesture while playing music: 88.81%
Conclusion
• Our contributions can be summarized as follows:
• We analyzed the inherent drawbacks of existing Doppler- and FMCW-based methods.
• We proposed using CIR measurements to achieve higher distance resolution.
• We applied a CNN deep learning framework to achieve high recognition accuracy.
Q & A Email: lingkang@smail.nju.edu.cn