Advances in the Parallelization of Music and Audio Applications Eric Battenberg, David Wessel & Juan Colmenares
Overview
• Parallelism today in the popular interactive music languages
• Parallel Partitioned Convolution
• Accelerating Non-Negative Matrix Factorization (NMF) for use in audio source separation and music information retrieval, and the importance of Selective, Embedded Just-In-Time Specialization (SEJITS)
• Real-time in the Tessellation OS
• A plea for more flexible I/O with GPUs
Current Support for Parallelism is Copy-Based
• The widely used languages for music and audio applications are fundamentally sequential in character; this includes Max/MSP, Pd, SuperCollider, and ChucK, among others.
• Limited multithreading.
• One approach to exploiting multi-core processors is to run copies of the application on separate cores.
• Max/MSP provides a useful multi-threading mechanism called poly~.
• Pd provides pd~, each instance of which runs in a separate thread inside a Pd patch.
Partitioned Convolution
• First real-time app in the Par Lab.
• Partitioned convolution: an efficient way to do low-latency filtering with a long (> 1 sec) impulse response.
• Important in real-time reverb processing for environment simulation.
• Sound examples: acoustic guitar; …convolved with a sine sweep (impulse response); …in a giant mausoleum (impulse response).
Partitioned Convolution
• Convolution: a way to do linear filtering with a finite impulse response (FIR) filter.
• Direct convolution:
  • For a length-L filter, O(L) ops per output point, zero delay.
  • L can be greater than 100,000 samples (> 3 sec of audio).
• Block FFT convolution:
  • Only O(log(L)) ops per output point, but a delay of L.
• How can we trade off between complexity and latency?
[Diagram: x → FFT → Complex Mult by H → IFFT → y, where H = FFT(h)]
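
The two approaches on this slide can be compared with a minimal sketch, assuming NumPy is available. The function names `direct_convolve` and `fft_convolve` are illustrative and do not come from the talk; both process a whole signal at once, so the sketch shows the arithmetic only, not the latency behaviour.

```python
import numpy as np

def direct_convolve(x, h):
    """Direct FIR convolution: up to len(h) multiply-adds per output sample."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        # Only taps k with 0 <= n - k < len(x) overlap the input.
        k_lo = max(0, n - len(x) + 1)
        k_hi = min(len(h) - 1, n)
        for k in range(k_lo, k_hi + 1):
            y[n] += h[k] * x[n - k]
    return y

def fft_convolve(x, h):
    """Block FFT convolution: y = IFFT(FFT(x) * H), with H = FFT(h)."""
    n_out = len(x) + len(h) - 1
    n_fft = 1 << (n_out - 1).bit_length()  # next power of two >= n_out
    H = np.fft.rfft(h, n_fft)              # zero-padded filter spectrum
    X = np.fft.rfft(x, n_fft)              # zero-padded input spectrum
    return np.fft.irfft(X * H, n_fft)[:n_out]
```

Both functions should return the same samples up to floating-point rounding; the FFT version trades the per-sample loop for transforms over a whole block, which is where the delay of a full block comes from in a streaming setting.
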
Uniform Partitioned Convolution
• We would like the latency to be less than 10 ms (512 samples).
• Cut the impulse response of length L into equal-sized blocks of length N.
• Then we can use a parallel layout of Block FFT convolvers with delays to implement the filter.
• The latency is now N, and we still get complexity savings.
[Diagram: an impulse response of length L cut into blocks 1 to 5 of length N; x feeds five Block FFT Convolvers, one per block; successive branches are offset by delay(N), and the branch outputs are summed to give y]
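
The identity behind the parallel layout can be checked offline with a short sketch, assuming NumPy. It sums one block FFT convolution per length-N partition, each shifted by the partition's offset in the impulse response. The helper names are hypothetical, and because the whole input is processed at once, the sketch does not demonstrate the latency of N.

```python
import numpy as np

def block_fft_convolve(x, h_block):
    """One Block FFT Convolver: IFFT(FFT(x) * FFT(h_block))."""
    n_out = len(x) + len(h_block) - 1
    n_fft = 1 << (n_out - 1).bit_length()  # next power of two >= n_out
    spectrum = np.fft.rfft(x, n_fft) * np.fft.rfft(h_block, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n_out]

def uniform_partitioned_convolve(x, h, N):
    """Sum of per-partition convolutions, each delayed by its offset in h."""
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(h), N):
        branch = block_fft_convolve(x, h[start:start + N])
        # Writing at index `start` plays the role of the delay(N) chain.
        y[start:start + len(branch)] += branch
    return y
```
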
Frequency Delay Line Convolution
• We can also exploit the linearity of the FFT so that only one FFT/IFFT is required.
• So the parallel Block FFT Convolver above becomes a Frequency Delay Line (FDL) Convolver:
[Diagram: in the parallel layout, every Block FFT Convolver (FFT → Complex Mult by H1 → IFFT) has its own transforms. In the Frequency Delay Line Convolver, x passes through a single FFT; the spectrum is multiplied by H1 and, after successive delay(N) stages, by H2 and H3; the products are summed and a single IFFT gives y]
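
A streaming FDL convolver can be sketched as follows, assuming NumPy and an overlap-save scheme with FFT size 2N. The overlap-save choice and the class name `FDLConvolver` are assumptions of this sketch; the talk does not specify them. Each call takes one block of N input samples and returns N output samples, using one FFT and one IFFT per block.

```python
import numpy as np

class FDLConvolver:
    """Frequency Delay Line convolver (overlap-save, FFT size 2N)."""

    def __init__(self, h, N):
        self.N = N
        P = -(-len(h) // N)                    # number of partitions (ceil)
        h_pad = np.zeros(P * N)
        h_pad[:len(h)] = h
        # H[p] is the spectrum of partition p, zero-padded to 2N samples.
        self.H = np.fft.rfft(h_pad.reshape(P, N), 2 * N, axis=1)
        self.fdl = np.zeros_like(self.H)       # spectra of past input frames
        self.prev = np.zeros(N)                # previous input block

    def process(self, block):
        block = np.asarray(block, dtype=float)
        # Single FFT over [previous block, current block].
        X = np.fft.rfft(np.concatenate([self.prev, block]))
        self.prev = block.copy()
        # Advance the delay line by one block and insert the new spectrum.
        self.fdl = np.roll(self.fdl, 1, axis=0)
        self.fdl[0] = X
        # Multiply each delayed spectrum by its partition and sum.
        Y = np.sum(self.fdl * self.H, axis=0)
        # Single IFFT; the last N samples are free of circular wrap-around.
        return np.fft.irfft(Y, 2 * self.N)[self.N:]
```

Feeding consecutive length-N blocks through `process` and concatenating the results should reproduce the first samples of the ordinary convolution of the input with h.
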
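Multiple FDL Convolution
• If L is big (e.g. > 100,000) and N is small (e.g. < 1000), our FDL will have hundreds of partitions to handle.
• We can connect multiple FDLs with different block sizes in parallel to get the best of both worlds: low latency from the small-block FDL and fewer partitions from the large-block FDLs.
[Diagram: the single FDL between x and y is replaced by FDL 1, FDL 2 and FDL 3 in parallel, with delay(N×6) and delay(4N×4) elements offsetting FDL 2 and FDL 3; the outputs are summed to give y]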
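
The partition counts involved can be worked out with a small sketch. The layout below (6 partitions of 512, 4 of 2048, the rest at 16384) is a hypothetical example loosely modelled on the N×6 and 4N×4 labels in the diagram, and `plan_fdls` is an illustrative helper, not part of the authors' software. It assumes each FDL starts where the previous one ends in the impulse response.

```python
from math import ceil

def plan_fdls(L, stages):
    """Lay out FDLs over an impulse response of L samples.

    stages: list of (block_size, num_partitions); use None as the count
    for the last stage to cover whatever remains of the response.
    """
    plan, offset = [], 0
    for block_size, count in stages:
        if count is None:
            count = ceil((L - offset) / block_size)
        plan.append({"block": block_size, "partitions": count, "delay": offset})
        offset += block_size * count
    if offset < L:
        raise ValueError("stages do not cover the impulse response")
    return plan

# Hypothetical comparison for L = 100,000 samples and N = 512.
uniform = plan_fdls(100_000, [(512, None)])
multi = plan_fdls(100_000, [(512, 6), (2048, 4), (16384, None)])
uniform_total = sum(stage["partitions"] for stage in uniform)
multi_total = sum(stage["partitions"] for stage in multi)
```

Under these assumed sizes the single uniform FDL needs 196 partitions, while the three-FDL layout needs 16 in total and keeps the smallest block at 512 samples.
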
Scheduling Multiple FDLs
• FDLs are run in separate threads.
• Each is allowed to compute for a length of time corresponding to its block size.
• Synchronization is performed at the vertical lines of the schedule diagram.
Auto-Tuning for Real-Time
• We are not only trying to maximize throughput.
• We are trying to improve our ability to make real-time guarantees.
• For now, we estimate a Worst-Case Execution Time (WCET) for each size of FDL.
• Then we combine the FDLs that are most likely to meet their scheduling deadlines.
• In the future, we will use a notion of predictability along with more robust scheduling.
• We are finishing development on a Max/MSP object, an Audio Unit plugin, and a portable standalone version of this.
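
One simple way to approximate the WCET step is to time a task repeatedly and keep the slowest run plus a safety margin. This is a rough sketch under that assumption, not the authors' auto-tuner: a measured maximum is only an estimate, not a guaranteed bound, and the function names, the margin and the 44.1 kHz sample rate are all illustrative choices.

```python
import time

def estimate_wcet(task, runs=200, margin=1.2):
    """Slowest observed run of task(), scaled by a safety margin (seconds)."""
    worst = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        task()
        worst = max(worst, time.perf_counter() - start)
    return worst * margin

def deadline_seconds(block_size, sample_rate=44100):
    """Time budget for one block: the duration of the audio it covers."""
    return block_size / sample_rate

def feasible_block_sizes(wcet_by_block_size, sample_rate=44100):
    """Block sizes whose estimated WCET fits within their own deadline."""
    return sorted(
        size for size, wcet in wcet_by_block_size.items()
        if wcet <= deadline_seconds(size, sample_rate)
    )
```

For example, `feasible_block_sizes({512: 0.005, 2048: 0.06})` keeps only the 512-sample FDL, because 0.06 s exceeds the roughly 0.046 s that 2048 samples last at 44.1 kHz.
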
Accelerating Non-Negative Matrix Factorization (NMF)
NMF is widely used in audio source separation. The idea is to factor the time/frequency representation (spectrogram) into source-coupled spectral (W) and gain (H) matrices.
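
The factorization V ≈ WH can be sketched with the standard multiplicative updates of Lee and Seung for the Euclidean cost, assuming NumPy. The talk does not say which cost function or update rule the authors accelerate, so this choice is an assumption, as are the function name `nmf` and its parameters.

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x time) into W (freq x rank)
    and H (rank x time) using multiplicative updates (Euclidean cost)."""
    V = np.asarray(V, dtype=float)
    if (V < 0).any():
        raise ValueError("V must be non-negative")
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral templates
    H = rng.random((rank, n_time)) + eps   # per-frame gains
    for _ in range(iters):
        # Each update rescales entries by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In a source-separation setting V would be a magnitude spectrogram; each column of W then acts as a spectral template and the matching row of H as its gain over time.
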
The Importance of SEJITS in Developing a Music Information Retrieval (MIR) Application
• Rather than using a domain-restricted language, developers write in a full-blown scripting language such as Python or Ruby.
• Functions are selected by annotation as performance critical.
• If efficiency-layer implementations of these functions are available, appropriate code is generated and JIT compiled.
• If not, the selected function is executed in the scripting language itself.
• The scripted implementation remains as the portable reference implementation.
A real-time application in Tessellation (in cooperation with the OS Group)
• With this simple music computer application we expect to initially show that Tessellation can provide acceptable performance and time predictability.
[Diagram labels: Music Program; Initial Cell (Shell); Cell A and Cell B, with 2nd-level RT schedulers A and B; Additional Cells; Audio Processing & Synthesis Engine (most of the engine's functionality); Filter F (parallel version of a partition-based convolution algorithm); Channel; Sound card; Audio Input; Input; Output; Intermediate Deadline; End-to-end Deadline]
Cell and Space Partitioning
• A Spatial Partition (or Cell) comprises a group of processors acting within a hardware boundary.
• Each cell receives a vector of basic resources:
  • Some number of processors, a portion of physical memory, a portion of shared cache memory, and potentially a fraction of memory bandwidth.
• A cell may also receive:
  • Exclusive access to other resources (e.g., certain hardware devices and raw storage partition).
  • Guaranteed fractional services (i.e., QoS guarantees) from other partitions (e.g., network service and file service).
[Diagram labels: Cell; 2nd-level Scheduling; Tessellation Kernel (Partition Support); CPUs with L1 caches; L1 Interconnect; L2 Banks; DRAM & I/O Interconnect (+ fraction of memory bandwidth); DRAM]
(*) Bottom part of the diagram was adapted from Liu and Asanovic, “Mitosys: ParLab Manycore OS Architecture,” Jan. 2008.
Example of Music Application (Preliminary)
[Diagram labels: Music program; Audio-processing / Synthesis Engine (Pinned/TT Partition); Input device (Pinned/TT Partition); Output device (Pinned/TT Partition); Time-sensitive Network Subsystem; Network Service (Net Partition); Communication with other audio-processing nodes; GUI Subsystem; Graphical Interface (GUI Partition)]
Tessellation in Server Environment
[Diagram labels: Large Compute-Bound Application; Large I/O-Bound Application; Network QoS; Monitor And Adapt; Other Devices; Persistent Storage & Parallel File System; Disk I/O Drivers; QoS Guarantees; Cloud Storage BW QoS; Tessellation OS]