Operating System Support for Fine-Grain Parallelism on Multicore Architectures
John Giacomoni, Manish Vachharajani
University of Colorado at Boulder
2007.10.14
Problem
• Uniprocessor (UP) performance at “end of life”
• Chip-Multiprocessor systems
  • Individual cores less powerful than UP
  • Asymmetric and Heterogeneous
  • 10s-100s-1000s of cores
• What do we want from multicore systems? Performance!
[Figure labels: Intel (2x2-core), MIT RAW (16-core), 100-core, 400-core]
Extracting Performance
• Task Parallelism
  • Desktop
• Data Parallelism
  • Web serving
  • Split/Join, MapReduce, etc.
• Pipeline Parallelism
  • Video decoding
  • Network processing
Extracting Performance (2)
• Stream Parallelism
  • Combines
    • Data Parallelism
    • Pipeline Parallelism
• Ad-Hoc Parallelism
  • Semi- or unstructured
  • Usual thread model
Focus on Pipeline Parallelism
• Most stringent timing requirements
• Example applications:
  • Network Processing
    • Network Intrusion Detection
    • DDoS Filtering
  • Multimedia processing
    • Transcoding
  • Signal Processing
    • Software Defined Radio
• Also applies to
  • Data Parallelism
  • Stream Parallelism
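As an illustration of the pipeline structure discussed above, the following is a minimal sketch of a multi-stage pipeline with one thread per stage and a queue between neighbours. The names `stage`, `run_pipeline`, and `DONE`, and the three toy stages, are hypothetical and not from the slides. In CPython the threads show the structure only; they do not run the stages on separate cores.

```python
import queue
import threading

DONE = object()  # sentinel that tells a stage to shut down

def stage(work, inbox, outbox):
    """Run one pipeline stage: apply `work` to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown downstream
            return
        outbox.put(work(item))

def run_pipeline(items, stages):
    """Connect `stages` with queues, one thread per stage, and collect output."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while (out := queues[-1].get()) is not DONE:
        results.append(out)
    for t in threads:
        t.join()
    return results

# Hypothetical three-stage "frame" pipeline: normalise, duplicate, wrap.
frames = ["a", "b", "c"]
result = run_pipeline(frames, [str.upper, lambda s: s * 2, lambda s: f"<{s}>"])
print(result)
```

Each stage has exactly one upstream and one downstream neighbour, so every queue has a single producer and a single consumer, and items leave the pipeline in the order they entered.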
Soft Network Processing (Soft-NP)
• How do we protect?
• GigE Network Properties:
  • 1,488,095 frames/sec
  • 672 ns/frame
  • Frame dependencies
“Frame Shared Memory: Line-Rate Networking on Commodity Hardware”. To Appear: Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems 2007 (ANCS), December 2007. John Giacomoni, John K. Bennett, Antonio Carzaniga, Douglas C. Sicker, Manish Vachharajani and Alexander L. Wolf.
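The frame rate and per-frame time above can be reproduced from standard Ethernet framing. The short check below assumes minimum-size 64-byte frames plus the 8-byte preamble and 12-byte inter-frame gap; those sizes come from the Ethernet standard, not from the slides.

```python
# Worst-case Gigabit Ethernet frame rate, assuming minimum-size frames.
LINK_BITS_PER_SEC = 1_000_000_000  # 1 Gb/s line rate
MIN_FRAME_BYTES = 64               # minimum Ethernet frame, including FCS
PREAMBLE_BYTES = 8                 # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12          # mandatory idle time between frames

# Every frame occupies the wire for its own bytes plus the framing overhead.
bits_on_wire = (MIN_FRAME_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8

frames_per_sec = LINK_BITS_PER_SEC // bits_on_wire
ns_per_frame = bits_on_wire * 1_000_000_000 // LINK_BITS_PER_SEC

print(bits_on_wire, frames_per_sec, ns_per_frame)
```

At 1 Gb/s one bit takes 1 ns, so the 672 bits each minimum-size frame occupies on the wire translate directly into the 672 ns per-frame budget.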
Frame Shared Memory (Soft-NP)
[Figure labels: Input (IP), Output (OP)]
Low-Overhead Communication
• Gigabit Ethernet
• Syscalls: ~170 ns
• pthread mutex: ~200 ns
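To put these costs in perspective, the sketch below compares them with the 672 ns per-frame time quoted on the Soft-NP slide. The function name `budget_share` is hypothetical, and the costs are the approximate figures from the slide.

```python
FRAME_BUDGET_NS = 672  # per-frame time at GigE line rate (Soft-NP slide)

# Approximate per-operation costs quoted on this slide.
COSTS_NS = {"syscall": 170, "pthread mutex": 200}

def budget_share(cost_ns, budget_ns=FRAME_BUDGET_NS):
    """Fraction of the per-frame budget consumed by one operation."""
    return cost_ns / budget_ns

for name, cost in COSTS_NS.items():
    print(f"{name}: {budget_share(cost):.0%} of the frame budget, "
          f"{FRAME_BUDGET_NS // cost} such operations fit in one frame time")
```

A single system call or mutex operation uses roughly a quarter to a third of the time available per frame, before any useful work on the frame is done.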
FastForward
• Portable, software-only framework
• ~35-40 ns/queue operation on a 2.0 GHz AMD Opteron
• ~26-28 ns/queue operation on a 2.6 GHz AMD Opteron
• Architecturally tuned concurrent lock-free (CLF) queues
• Works with strong to weak consistency models
• Hides die-to-die communication
• Robust against unbalanced stages
Poster: “FastForward for Efficient Pipeline Parallelism”. Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007. John Giacomoni, Tipp Moseley, Manish Vachharajani.
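The slide does not spell out the queue algorithm, so the following is only a sketch of one common style of single-producer/single-consumer lock-free queue, in which an empty slot is marked with a sentinel and each side keeps a private index. The class name `SlotQueue` and its methods are hypothetical, and this is not the FastForward implementation. Python offers no control over cache-line layout or memory ordering, which is where the architectural tuning mentioned above would apply.

```python
class SlotQueue:
    """Single-producer/single-consumer ring buffer (illustrative sketch).

    A slot holding None is empty, so the producer and the consumer each
    keep a private index and never read the other side's index.
    """

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # next slot to write; touched only by the producer
        self.tail = 0  # next slot to read; touched only by the consumer

    def enqueue(self, item):
        """Return True if the item was stored, False if the queue is full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self.slots[self.head] is not None:
            return False  # slot still occupied: queue is full
        self.slots[self.head] = item
        self.head = (self.head + 1) % len(self.slots)
        return True

    def dequeue(self):
        """Return the next item, or None if the queue is empty."""
        item = self.slots[self.tail]
        if item is None:
            return None  # nothing has been published in this slot yet
        self.slots[self.tail] = None  # hand the slot back to the producer
        self.tail = (self.tail + 1) % len(self.slots)
        return item


q = SlotQueue(2)
accepted = [q.enqueue("a"), q.enqueue("b"), q.enqueue("c")]
first = q.dequeue()
```

Because full and empty are detected from the slot contents, neither side needs a lock or a shared counter; both operations return immediately instead of blocking.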
FastForward Performance
[Figure labels: Lamport, FF, FF Unbalanced, FF Re-Balanced]
Gang Scheduling
• Optimize for application performance
  • Instead of system throughput or fairness
• Computer Utility -> max(System Utilization)
• Multicore system -> excess of resources
• Dedicate resources to pipeline applications
• Want selective timesharing
System Services
• Fast!
• Synchronous calls introduce too much overhead
  • System calls ~170 ns
• Asynchronous calls may limit parallelism
• Want: System services with independent I/O paths
Pipelinable System Services
• Mixing stages from multiple process domains
• Push model vs. call/return or poll
• Hardware can be an active participant
Heterogeneous Gang Scheduling
• Need a single scheduling label for every pipeline stage
  • Ensures simultaneous scheduling of every necessary resource (zero-stall guarantee)
  • Including hardware stages
• Scheduling multi-domain entities
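The all-or-nothing idea behind simultaneous scheduling can be sketched as follows. This is a toy admission check, not the scheduler described in the talk: the function `gang_admit`, the resource kinds `"cpu"` and `"nic"`, and the stage names are all hypothetical.

```python
def gang_admit(stages, free):
    """Place every stage of a pipeline at once, or place none of them.

    stages: list of (stage_name, resource_kind) pairs
    free:   mapping from resource kind to a list of free resource ids
    Returns a {stage_name: resource_id} placement, or None if any stage
    cannot get its resource (a partly scheduled pipeline would stall).
    """
    pools = {kind: list(ids) for kind, ids in free.items()}  # work on a copy
    placement = {}
    for name, kind in stages:
        if not pools.get(kind):
            return None  # one missing resource blocks the whole gang
        placement[name] = pools[kind].pop(0)
    return placement


# Hypothetical pipeline mixing hardware (NIC) and software (CPU) stages.
pipeline = [("input", "nic"), ("filter", "cpu"), ("output", "nic")]
free = {"cpu": [2, 3], "nic": ["eth0", "eth1"]}
placed = gang_admit(pipeline, free)
```

Treating hardware stages as schedulable resources alongside cores is what makes the placement heterogeneous in this sketch: the pipeline is admitted only when every stage, hardware or software, can run at the same time.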
Multi-Domain Entities
• Application state
• Shared with local stages
• Pipeline private state
• Stage state shared with pipeline and parent process
The multi-domain application model respects the private data model implicit in single-domain applications while providing first-class naming for multi-domain pipelines.
Summary of Discussion
• Low-overhead communication
• Zero-stall guarantee
• Selective timesharing
• Pipelinable system services
• Heterogeneous gang scheduling
• Pipelines as multi-domain applications
Questions? john.giacomoni@colorado.edu