Machine Listening in Silicon Part of: “Accelerated Perception & Machine Learning in Stochastic Silicon” project
Who? • UIUC: • Students: M. Kim, J. Choi, A. Guzman-Rivera, G. Ko, S. Tsai, E. Kim • Faculty: Paris Smaragdis, Rob Rutenbar, Naresh Shanbhag • Intel: • Jeff Parkhurst • Ryszard Dyrga, Tomasz Szmelczynski – Intel Technology Poland • Georg Stemmer – Intel, Germany • Dan Wartski, Ohad Falik – Intel Audio Voice and Speech (AVS), Israel
Project overview • Motivating ideas: • Make machines that can perceive • Use stochastic hardware for stochastic software • Discover new modes of computation • Machine Listening component: • Perceive == Listen • Escape local optimum of Gaussian/MSE/ℓ2
Machine Listening? • Making systems that understand sound • Think computer vision, but for sound • Broad range of fundamentals and applications • Machine learning, DSP, psychoacoustics, music, … • Speech, media analysis, surveying, monitoring, … What can we gather from this?
Machine listening in the wild • Some of this work is already in place • Mostly projects on recognition and detection • More apps in medical, mechanical, geological, architectural, … • Examples pictured: highlight discovery in videos, incident discovery in streets, surveillance for emergencies
And there’s more to come • The CrowdMic project • “PhotoSynth for audio”: constructing audio recordings from crowdsourced audio snippets • Collaborative audio devices • Harnessing the power of untethered open mics • E.g. a conf-call using all the phones and laptops in the room
The Challenge • Today is all about small form factors • We all carry a couple of mics in our pockets, but we don’t carry the vector processors they need! • Can we come up with new, better systems? • Which run on more efficient hardware? • And perform just as well, or better?
The Testbed: Sound Mixtures • Sound has a pesky property, additivity • We almost always observe sound mixtures • Models for sound analysis are “monophonic” • Designed for isolated, clean sounds • So we like to first extract and then process
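As a quick illustration of that additivity, here is a minimal numpy/scipy sketch (synthetic sine tones stand in for real sources; this is not project code): mixtures add in the time domain, STFTs add by linearity, but magnitude spectrograms only add approximately.

```python
# Minimal sketch: mixtures are additive in time, STFTs add by linearity,
# magnitude spectrograms only add approximately.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs                       # 1 second of audio
src_a = np.sin(2 * np.pi * 440 * t)          # stand-ins for two "clean" sources
src_b = np.sin(2 * np.pi * 660 * t)
mix = src_a + src_b                          # the observed mixture: plain addition

_, _, Xa = stft(src_a, fs=fs, nperseg=1024)
_, _, Xb = stft(src_b, fs=fs, nperseg=1024)
_, _, Xm = stft(mix,   fs=fs, nperseg=1024)

print(np.allclose(Xm, Xa + Xb))              # True: the STFT is linear
print(np.max(np.abs(np.abs(Xm) - (np.abs(Xa) + np.abs(Xb)))))  # magnitudes: only approximate
```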
Focusing on a single sound • There’s no shortage of methods (they all suck by the way) • But these are computationally some of the most demanding algorithms in audio processing • So we instead turned to a different approach that would be a good fit for hardware • i.e. Rob told me that he can do MRFs fast
A bit of background • We like to visualize sounds as spectrograms • 2D representations of energy over time and frequency • For multiple mics we observe level differences • These are known as ILDs (Interaural Level Differences)
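A hedged sketch of how a per-pixel ILD map could be computed from a two-channel recording (the array names `left`/`right`, the sample rate `fs`, and the STFT settings are illustrative assumptions, not the project’s implementation):

```python
# Sketch: magnitude spectrograms per microphone and a per-pixel ILD in dB.
import numpy as np
from scipy.signal import stft

def ild_map(left, right, fs, nperseg=1024, eps=1e-12):
    """Per time-frequency-pixel Interaural Level Difference in dB."""
    _, _, L = stft(left,  fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    mag_l, mag_r = np.abs(L), np.abs(R)
    ild = 20 * np.log10((mag_l + eps) / (mag_r + eps))  # > 0 means louder on the left
    return mag_l, mag_r, ild
```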
Finding sources • For each spectrogram pixel we take an ILD • And plot their histogram • Each sound/location will produce a mode
And we use these as labels • Assign each pixel to a source et voilà • But it looks a little ragged
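The histogram-and-modes labeling of the last two slides could look roughly like this (a sketch; the bin count, peak picking, and two-source assumption are illustrative choices):

```python
# Sketch: histogram the ILDs, take the modes as candidate sources,
# label each spectrogram pixel by its nearest mode (the "ragged" labels).
import numpy as np
from scipy.signal import find_peaks

def label_by_ild(ild, n_bins=100, n_sources=2):
    counts, edges = np.histogram(ild.ravel(), bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks, _ = find_peaks(counts)                            # candidate modes
    peaks = peaks[np.argsort(counts[peaks])[::-1][:n_sources]]
    modes = np.sort(centers[peaks])                          # one ILD value per source
    labels = np.argmin(np.abs(ild[..., None] - modes), axis=-1)  # hard assignment
    return labels, modes
```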
Thus a Markov Random Field • Each pixel is a node that influences its neighbors • Incorporates ILDs and smoothness constraints • Makes my hardware friends happy
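The actual system does MAP inference on a binary, pairwise MRF (message passing in hardware, per the publications below); as a simple software stand-in, here is a toy iterated-conditional-modes (ICM) smoother with an ILD data cost and a Potts-style smoothness term over 4-connected spectrogram neighbors. The `beta` weight and iteration count are made-up values.

```python
# Toy stand-in for the MRF step: synchronous ICM on a binary/pairwise Potts model.
import numpy as np

def icm_smooth(ild, modes, init_labels, beta=2.0, n_iters=5):
    labels = init_labels.copy()
    data_cost = np.abs(ild[..., None] - np.asarray(modes))   # (freq, time, n_sources)
    n_sources = data_cost.shape[-1]
    for _ in range(n_iters):
        total = np.empty_like(data_cost)
        for s in range(n_sources):
            # count 4-connected neighbors that currently disagree with label s
            disagree = np.zeros(labels.shape)
            disagree[1:, :]  += labels[:-1, :] != s
            disagree[:-1, :] += labels[1:, :]  != s
            disagree[:, 1:]  += labels[:, :-1] != s
            disagree[:, :-1] += labels[:, 1:]  != s
            total[..., s] = data_cost[..., s] + beta * disagree
        labels = np.argmin(total, axis=-1)                    # pick the cheapest label per pixel
    return labels
```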
The whole pipeline • Observe: left and right spectrograms (frequency × time) and their ILDs • Inference: binary, pairwise MRF • Output: a binary mask saying which frequencies belong to which source (source0 / source1) at each time point • ~15 dB SIR boost
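The final masking step can be sketched as follows (assuming `labels` is the inferred per-pixel source index with the same shape as the STFT, and the STFT settings match the analysis above):

```python
# Sketch: apply the inferred binary mask to one channel's STFT and
# resynthesize each source with the inverse STFT.
import numpy as np
from scipy.signal import stft, istft

def separate_with_mask(mix_channel, labels, fs, nperseg=1024):
    _, _, X = stft(mix_channel, fs=fs, nperseg=nperseg)
    sources = []
    for s in np.unique(labels):
        mask = (labels == s).astype(float)        # binary time-frequency mask
        _, y = istft(X * mask, fs=fs, nperseg=nperseg)
        sources.append(y)
    return sources
```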
Reusing the same core • Oh, and we use this for stereo vision too • Per-pixel 3D depth map by iterative MRF MAP inference: nodes carry the data cost, edges the smoothness cost
It’s also pretty fast • Performance result (single frame): our work outperforms up-to-date GPU implementations
And we made it error resilient • Algorithmic Noise Tolerance (ANT) • Power saving by ANT: estimated 42% at Vdd = 0.75 V • Complexity overhead = 45% • Figure: error-resilient MRF inference via ANT
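For intuition only, here is a toy software model of the ANT idea: a main block that occasionally produces large hardware-induced errors, a cheap reliable estimator of the same output, and a detector that falls back to the estimate when the two disagree too much. All numbers here are invented for illustration and are not the reported figures.

```python
# Toy illustration of Algorithmic Noise Tolerance (ANT).
import numpy as np

rng = np.random.default_rng(0)

def main_block(x):
    """Exact computation, occasionally hit by large hardware errors."""
    y = np.convolve(x, np.ones(8) / 8, mode="same")
    faults = rng.random(y.shape) < 0.02            # rare voltage-overscaling faults
    y[faults] += rng.normal(0, 10, faults.sum())   # large-magnitude errors
    return y

def estimator_block(x):
    """Low-complexity, reliable estimate of the same output (shorter filter)."""
    return np.convolve(x, np.ones(2) / 2, mode="same")

def ant(x, threshold=1.0):
    y_main, y_est = main_block(x), estimator_block(x)
    # detect-and-correct: replace outputs that stray too far from the estimate
    return np.where(np.abs(y_main - y_est) > threshold, y_est, y_main)
```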
Back to source separation again • ILDs suffer from front-back confusion and require some distance between the microphones • So we also added Interaural Phase Differences (IPDs)
Why add IPDs? • They work best when ILDs fail • E.g. when sensors are far apart • Figure: separation results at 1 cm, 15 cm, and 30 cm mic spacings, comparing the input, ILD-only, IPD-only, and joint models
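A minimal sketch of the IPD feature (same assumed `left`/`right` arrays and STFT settings as before): the wrapped phase of the cross-channel product per time-frequency pixel, which carries timing information that survives when the level cue does not.

```python
# Sketch: per-pixel Interaural Phase Difference (IPD) in radians.
import numpy as np
from scipy.signal import stft

def ipd_map(left, right, fs, nperseg=1024):
    _, _, L = stft(left,  fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    return np.angle(L * np.conj(R))   # wrapped phase difference per T-F pixel
```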
Adding one more element • Incorporated NMF-based denoisers • Systems that learn by example what to separate
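A rough sketch of such an example-driven NMF denoiser, using standard KL-divergence multiplicative updates. The basis counts, iteration counts, and the train-on-clean/fix-bases recipe are illustrative assumptions, not the exact published model.

```python
# Sketch: learn target bases on clean examples, fit noise bases on the mixture,
# then build a Wiener-style mask from the target part of the model.
import numpy as np

def kl_nmf(V, W, H, update_W=True, n_iters=100, eps=1e-9):
    """Multiplicative updates for KL-divergence NMF on a magnitude spectrogram V."""
    for _ in range(n_iters):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        if update_W:
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def nmf_denoise(V_clean, V_mix, k_target=20, k_noise=20, n_iters=100):
    rng = np.random.default_rng(0)
    n_freq = V_clean.shape[0]
    # 1) learn target bases from clean training examples of the source of interest
    Ws = rng.random((n_freq, k_target))
    Ws, _ = kl_nmf(V_clean, Ws, rng.random((k_target, V_clean.shape[1])), n_iters=n_iters)
    # 2) on the mixture, keep the learned bases fixed and fit noise bases + activations
    W = np.concatenate([Ws, rng.random((n_freq, k_noise))], axis=1)
    H = rng.random((W.shape[1], V_mix.shape[1]))
    for _ in range(n_iters):
        W, H = kl_nmf(V_mix, W, H, n_iters=1)
        W[:, :k_target] = Ws                      # re-pin the target bases each iteration
    # 3) Wiener-style soft mask built from the target part of the model
    target = W[:, :k_target] @ H[:k_target]
    return target / (W @ H + 1e-9) * V_mix        # masked magnitude estimate of the target
```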
So what’s next? • Porting the whole system to hardware • We haven’t ported the front-end yet • Evaluating the results with speech recognition • Extending this model to multiple devices • As opposed to one device with multiple mics
Relevant publications
• Kim, Smaragdis, Ko, Rutenbar. Stereophonic Spectrogram Segmentation Using Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2012
• Kim & Smaragdis. Manifold Preserving Hierarchical Topic Models for Quantization and Approximation, in International Conference on Machine Learning, 2013
• Kim & Smaragdis. Single Channel Source Separation Using Smooth Nonnegative Matrix Factorization with Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2013
• Kim & Smaragdis. Non-Negative Matrix Factorization for Irregularly-Spaced Transforms, in IEEE Workshop for Applications of Signal Processing in Audio and Acoustics, 2013
• Traa & Smaragdis. Blind Multi-Channel Source Separation by Circular-Linear Statistical Modeling of Phase Differences, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
• Choi, Kim, Rutenbar, Shanbhag. Error Resilient MRF Message Passing Hardware for Stereo Matching via Algorithmic Noise Tolerance, in IEEE Workshop on Signal Processing Systems, 2013
• Zhang, Ko, Choi, Tsai, Kim, Rivera, Rutenbar, Smaragdis, Park, Narayanan, Xin, Mutlu, Li, Zhao, Chen, Iyer. EMERALD: Characterization of Emerging Applications and Algorithms for Low-power Devices, in IEEE International Symposium on Performance Analysis of Systems and Software, 2013