Machine Listening in Silicon Part of: “Accelerated Perception & Machine Learning in Stochastic Silicon” project
Who? • UIUC: • Students: M. Kim, J. Choi, A. Guzman-Rivera, G. Ko, S. Tsai, E. Kim • Faculty: Paris Smaragdis, Rob Rutenbar, Naresh Shanbhag • Intel: • Jeff Parkhurst • Ryszard Dyrga, Tomasz Szmelczynski – Intel Technology Poland • Georg Stemmer – Intel, Germany • Dan Wartski, Ohad Falik – Intel Audio Voice and Speech (AVS), Israel
Project overview • Motivating ideas: • Make machines that can perceive • Use stochastic hardware for stochastic software • Discover new modes of computation • Machine Listening component: • Perceive == Listen • Escape local optimum of Gaussian/MSE/ℓ2
Machine Listening? • Making systems that understand sound • Think computer vision, but for sound • Broad range of fundamentals and applications • Machine learning, DSP, psychoacoustics, music, … • Speech, media analysis, surveying, monitoring, … What can we gather from this?
Machine listening in the wild • Some of this work is already in place • Mostly projects on recognition and detection • More apps in medical, mechanical, geological, architectural, … • Examples pictured: highlight discovery in videos, incident discovery in streets, surveillance for emergencies
And there’s more to come • The CrowdMic project • “PhotoSynth for audio”: constructing audio recordings from crowdsourced audio snippets • Collaborative audio devices • Harnessing the power of untethered open mics • E.g. a conf-call using all the phones and laptops in the room
The Challenge • Today is all about small form factors • We all carry a couple of mics in our pockets, but we don’t carry the vector processors they need! • Can we come up with new, better systems? • Which run on more efficient hardware? • And perform just as well, or better?
The Testbed: Sound Mixtures • Sound has a pesky property, additivity • We almost always observe sound mixtures • Models for sound analysis are “monophonic” • Designed for isolated, clean sounds • So we like to first extract and then process
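As a quick illustration of that additivity, here is a minimal numpy/scipy sketch (synthetic sine tones stand in for real sources; this is not project code): mixtures add in the time domain, STFTs add by linearity, but magnitude spectrograms only add approximately.

```python
# Minimal sketch: mixtures are additive in time, STFTs add by linearity,
# magnitude spectrograms only add approximately.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs                       # 1 second of audio
src_a = np.sin(2 * np.pi * 440 * t)          # stand-ins for two "clean" sources
src_b = np.sin(2 * np.pi * 660 * t)
mix = src_a + src_b                          # the observed mixture: plain addition

_, _, Xa = stft(src_a, fs=fs, nperseg=1024)
_, _, Xb = stft(src_b, fs=fs, nperseg=1024)
_, _, Xm = stft(mix,   fs=fs, nperseg=1024)

print(np.allclose(Xm, Xa + Xb))              # True: the STFT is linear
print(np.max(np.abs(np.abs(Xm) - (np.abs(Xa) + np.abs(Xb)))))  # magnitudes: only approximate
```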
Focusing on a single sound • There’s no shortage of methods (they all suck by the way) • But these are computationally some of the most demanding algorithms in audio processing • So we instead turned to a different approach that would be a good fit for hardware • i.e. Rob told me that he can do MRFs fast
A bit of background • We like to visualize sounds as spectrograms • 2D representations of energy over time and frequency • For multiple mics we observe level differences • These are known as ILDs (Interaural Level Differences)
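A hedged sketch of how a per-pixel ILD map could be computed from a two-channel recording (the array names `left`/`right`, the sample rate `fs`, and the STFT settings are illustrative assumptions, not the project’s implementation):

```python
# Sketch: magnitude spectrograms per microphone and a per-pixel ILD in dB.
import numpy as np
from scipy.signal import stft

def ild_map(left, right, fs, nperseg=1024, eps=1e-12):
    """Per time-frequency-pixel Interaural Level Difference in dB."""
    _, _, L = stft(left,  fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    mag_l, mag_r = np.abs(L), np.abs(R)
    ild = 20 * np.log10((mag_l + eps) / (mag_r + eps))  # > 0 means louder on the left
    return mag_l, mag_r, ild
```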
Finding sources • For each spectrogram pixel we take an ILD • And plot their histogram • Each sound/location will produce a mode
And we use these as labels • Assign each pixel to a source et voilà • But it looks a little ragged
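The histogram-and-modes labeling of the last two slides could look roughly like this (a sketch; the bin count, peak picking, and two-source assumption are illustrative choices):

```python
# Sketch: histogram the ILDs, take the modes as candidate sources,
# label each spectrogram pixel by its nearest mode (the "ragged" labels).
import numpy as np
from scipy.signal import find_peaks

def label_by_ild(ild, n_bins=100, n_sources=2):
    counts, edges = np.histogram(ild.ravel(), bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks, _ = find_peaks(counts)                            # candidate modes
    peaks = peaks[np.argsort(counts[peaks])[::-1][:n_sources]]
    modes = np.sort(centers[peaks])                          # one ILD value per source
    labels = np.argmin(np.abs(ild[..., None] - modes), axis=-1)  # hard assignment
    return labels, modes
```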
Thus a Markov Random Field • Each pixel is a node that influences its neighbors • Incorporates ILDs and smoothness constraints • Makes my hardware friends happy
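The actual system does MAP inference on a binary, pairwise MRF (message passing in hardware, per the publications below); as a simple software stand-in, here is a toy iterated-conditional-modes (ICM) smoother with an ILD data cost and a Potts-style smoothness term over 4-connected spectrogram neighbors. The `beta` weight and iteration count are made-up values.

```python
# Toy stand-in for the MRF step: synchronous ICM on a binary/pairwise Potts model.
import numpy as np

def icm_smooth(ild, modes, init_labels, beta=2.0, n_iters=5):
    labels = init_labels.copy()
    data_cost = np.abs(ild[..., None] - np.asarray(modes))   # (freq, time, n_sources)
    n_sources = data_cost.shape[-1]
    for _ in range(n_iters):
        total = np.empty_like(data_cost)
        for s in range(n_sources):
            # count 4-connected neighbors that currently disagree with label s
            disagree = np.zeros(labels.shape)
            disagree[1:, :]  += labels[:-1, :] != s
            disagree[:-1, :] += labels[1:, :]  != s
            disagree[:, 1:]  += labels[:, :-1] != s
            disagree[:, :-1] += labels[:, 1:]  != s
            total[..., s] = data_cost[..., s] + beta * disagree
        labels = np.argmin(total, axis=-1)                    # pick the cheapest label per pixel
    return labels
```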
The whole pipeline • Observe: left and right spectrograms (frequency × time) and their ILDs • Inference: binary, pairwise MRF • Output: a binary mask saying which frequencies belong to which source (source0 / source1) at each time point • ~15 dB SIR boost
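The final masking step can be sketched as follows (assuming `labels` is the inferred per-pixel source index with the same shape as the STFT, and the STFT settings match the analysis above):

```python
# Sketch: apply the inferred binary mask to one channel's STFT and
# resynthesize each source with the inverse STFT.
import numpy as np
from scipy.signal import stft, istft

def separate_with_mask(mix_channel, labels, fs, nperseg=1024):
    _, _, X = stft(mix_channel, fs=fs, nperseg=nperseg)
    sources = []
    for s in np.unique(labels):
        mask = (labels == s).astype(float)        # binary time-frequency mask
        _, y = istft(X * mask, fs=fs, nperseg=nperseg)
        sources.append(y)
    return sources
```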
Reusing the same core • Oh, and we use this for stereo vision too • Per-pixel 3D depth map by iterative MRF MAP inference: nodes carry the data cost, edges the smoothness cost
It’s also pretty fast • Performance result (single frame): our work outperforms up-to-date GPU implementations
And we made it error resilient • Algorithmic Noise Tolerance (ANT) • Power saving by ANT: estimated 42% at Vdd = 0.75 V • Complexity overhead = 45% • Figure: error-resilient MRF inference via ANT
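For intuition only, here is a toy software model of the ANT idea: a main block that occasionally produces large hardware-induced errors, a cheap reliable estimator of the same output, and a detector that falls back to the estimate when the two disagree too much. All numbers here are invented for illustration and are not the reported figures.

```python
# Toy illustration of Algorithmic Noise Tolerance (ANT).
import numpy as np

rng = np.random.default_rng(0)

def main_block(x):
    """Exact computation, occasionally hit by large hardware errors."""
    y = np.convolve(x, np.ones(8) / 8, mode="same")
    faults = rng.random(y.shape) < 0.02            # rare voltage-overscaling faults
    y[faults] += rng.normal(0, 10, faults.sum())   # large-magnitude errors
    return y

def estimator_block(x):
    """Low-complexity, reliable estimate of the same output (shorter filter)."""
    return np.convolve(x, np.ones(2) / 2, mode="same")

def ant(x, threshold=1.0):
    y_main, y_est = main_block(x), estimator_block(x)
    # detect-and-correct: replace outputs that stray too far from the estimate
    return np.where(np.abs(y_main - y_est) > threshold, y_est, y_main)
```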
Back to source separation again • ILDs suffer from front-back confusion and require some distance between the microphones • So we also added Interaural Phase Differences (IPDs)
Why add IPDs? • They work best when ILDs fail • E.g. when sensors are far apart • Figure: separation results at 1 cm, 15 cm, and 30 cm mic spacings, comparing the input, ILD-only, IPD-only, and joint models
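A minimal sketch of the IPD feature (same assumed `left`/`right` arrays and STFT settings as before): the wrapped phase of the cross-channel product per time-frequency pixel, which carries timing information that survives when the level cue does not.

```python
# Sketch: per-pixel Interaural Phase Difference (IPD) in radians.
import numpy as np
from scipy.signal import stft

def ipd_map(left, right, fs, nperseg=1024):
    _, _, L = stft(left,  fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    return np.angle(L * np.conj(R))   # wrapped phase difference per T-F pixel
```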
Adding one more element • Incorporated NMF-based denoisers • Systems that learn by example what to separate
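A rough sketch of such an example-driven NMF denoiser, using standard KL-divergence multiplicative updates. The basis counts, iteration counts, and the train-on-clean/fix-bases recipe are illustrative assumptions, not the exact published model.

```python
# Sketch: learn target bases on clean examples, fit noise bases on the mixture,
# then build a Wiener-style mask from the target part of the model.
import numpy as np

def kl_nmf(V, W, H, update_W=True, n_iters=100, eps=1e-9):
    """Multiplicative updates for KL-divergence NMF on a magnitude spectrogram V."""
    for _ in range(n_iters):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        if update_W:
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def nmf_denoise(V_clean, V_mix, k_target=20, k_noise=20, n_iters=100):
    rng = np.random.default_rng(0)
    n_freq = V_clean.shape[0]
    # 1) learn target bases from clean training examples of the source of interest
    Ws = rng.random((n_freq, k_target))
    Ws, _ = kl_nmf(V_clean, Ws, rng.random((k_target, V_clean.shape[1])), n_iters=n_iters)
    # 2) on the mixture, keep the learned bases fixed and fit noise bases + activations
    W = np.concatenate([Ws, rng.random((n_freq, k_noise))], axis=1)
    H = rng.random((W.shape[1], V_mix.shape[1]))
    for _ in range(n_iters):
        W, H = kl_nmf(V_mix, W, H, n_iters=1)
        W[:, :k_target] = Ws                      # re-pin the target bases each iteration
    # 3) Wiener-style soft mask built from the target part of the model
    target = W[:, :k_target] @ H[:k_target]
    return target / (W @ H + 1e-9) * V_mix        # masked magnitude estimate of the target
```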
So what’s next? • Porting the whole system to hardware • We haven’t ported the front-end yet • Evaluating the results with speech recognition • Extending this model to multiple devices • As opposed to one device with multiple mics
Relevant publications
• Kim, Smaragdis, Ko, Rutenbar. Stereophonic Spectrogram Segmentation Using Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2012
• Kim & Smaragdis. Manifold Preserving Hierarchical Topic Models for Quantization and Approximation, in International Conference on Machine Learning, 2013
• Kim & Smaragdis. Single Channel Source Separation Using Smooth Nonnegative Matrix Factorization with Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2013
• Kim & Smaragdis. Non-Negative Matrix Factorization for Irregularly-Spaced Transforms, in IEEE Workshop for Applications of Signal Processing in Audio and Acoustics, 2013
• Traa & Smaragdis. Blind Multi-Channel Source Separation by Circular-Linear Statistical Modeling of Phase Differences, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
• Choi, Kim, Rutenbar, Shanbhag. Error Resilient MRF Message Passing Hardware for Stereo Matching via Algorithmic Noise Tolerance, in IEEE Workshop on Signal Processing Systems, 2013
• Zhang, Ko, Choi, Tsai, Kim, Rivera, Rutenbar, Smaragdis, Park, Narayanan, Xin, Mutlu, Li, Zhao, Chen, Iyer. EMERALD: Characterization of Emerging Applications and Algorithms for Low-power Devices, in IEEE International Symposium on Performance Analysis of Systems and Software, 2013