Corey: An Operating System for Many Cores

Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang • MIT, Fudan University, Microsoft Research Asia, Xi’an Jiaotong University • OSDI 2008 • 강동우 redcdang@gmail.com

Agenda • Introduction • Motivation • Design • Evaluation • Conclusion

Introduction • Most PCs have, or soon will have, multicore chips • Cache-coherent shared-memory hardware is the new standard • The performance of some OS services scales very poorly with the number of cores

Motivation • Scalability Problem • Benchmark • The number of threads within a process is varied • Each thread creates a file descriptor, then repeatedly duplicates it (dup) and closes the duplicate • Platform: 4 quad-core AMD Opteron chips, Linux 2.6.25
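The shape of this benchmark can be sketched in a few lines. This is only an illustration of the workload, not the authors' benchmark: the function names are hypothetical, and Python's interpreter lock means it will not reproduce the scaling curve measured on the 16-core machine.

```python
import os
import threading

def dup_close_worker(iterations, results, index):
    # Each thread creates its own file descriptor ...
    fd = os.open(os.devnull, os.O_RDONLY)
    try:
        for _ in range(iterations):
            # ... then repeatedly duplicates it and closes the duplicate.
            os.close(os.dup(fd))
        results[index] = iterations
    finally:
        os.close(fd)

def run_benchmark(num_threads, iterations):
    # All threads live in one process, so they share one descriptor table.
    results = [0] * num_threads
    threads = [
        threading.Thread(target=dup_close_worker, args=(iterations, results, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)  # total number of dup/close pairs completed

if __name__ == "__main__":
    print(run_benchmark(num_threads=4, iterations=1000))
```

Every dup and close updates the process-wide descriptor table, which is the shared kernel structure the slide points to as the bottleneck.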

Motivation • AMD 16-core system topology • Accessing the L1 or L2 cache of another core takes more time than accessing the shared L3 cache • Accessing caches on another chip takes more time than accessing local caches • Kernels must mainly access data in the local core’s cache

MapReduce • Map phase: • Processes read parts of application’s inputs • Generate intermediary results and store them locally • Reduce phase: • Processors collate results produced by multiple map instances • Produce the output
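The two phases can be sketched as follows. Word counting is used purely as a familiar example, and the function names are hypothetical; the sketch runs sequentially, whereas in the setting described here each map instance would run on its own core.

```python
from collections import Counter

def map_phase(chunk):
    # One map instance: read one part of the input and build a local,
    # intermediate result (here, word counts for that part only).
    return Counter(chunk.split())

def reduce_phase(partials):
    # Collate the intermediate results produced by all map instances.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

def word_count(chunks):
    partials = [map_phase(chunk) for chunk in chunks]  # map
    return reduce_phase(partials)                      # reduce

if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The map phase touches only private data, while the reduce phase reads data produced elsewhere, which is why the two phases stress memory sharing differently.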

Design • Goal • Allow applications to control shared resources • 3 Abstractions • Address Range Abstraction • Control page tables and the kernel data used to manage them • Kernel Core Abstraction • Allow applications to dedicate cores to running particular kernel functions • Shares Abstraction • Control the kernel data used to resolve application references

Design - Address Range Abstraction • 2 current types of address space • Single address space (multiple threads): contention; map is bad, reduce is good • Separate address spaces (multiple processes sharing memory with mmap): extra soft page faults; map is good, reduce is bad

Design - Address Range Abstraction • Address Ranges • Give applications high performance for both private and shared memory • A kernel-provided abstraction that corresponds to a range of virtual-to-physical mappings • [Figure label: mm_struct] • No contention in the map phase • No extra page faults, because shared address ranges share the hardware page tables

Design – Kernel Core Abstraction • System calls are executed on the core of the invoking process • This performs poorly when the system call needs to access large shared data structures (many cache misses) • Applications can dedicate cores to kernel functions and data • A kernel core can manage hardware devices and execute system calls sent from other cores
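The idea of sending system calls to a dedicated core can be sketched with a request queue. This is a user-level analogy under stated assumptions, not Corey's mechanism: a thread stands in for the kernel core, and kernel_core and remote_syscall are hypothetical names.

```python
import queue
import threading

def kernel_core(requests):
    # Stand-in for a dedicated kernel core: it alone runs the requested
    # functions, so the data they touch stays in one core's cache.
    while True:
        item = requests.get()
        if item is None:  # shutdown marker
            break
        func, args, reply = item
        reply.put(func(*args))

def remote_syscall(requests, func, *args):
    # Called on an application core: ship the call to the kernel core
    # and wait for the result instead of running it locally.
    reply = queue.Queue(maxsize=1)
    requests.put((func, args, reply))
    return reply.get()

if __name__ == "__main__":
    requests = queue.Queue()
    worker = threading.Thread(target=kernel_core, args=(requests,))
    worker.start()
    print(remote_syscall(requests, len, b"abc"))
    requests.put(None)
    worker.join()
```

The trade-off is an extra message round trip per call in exchange for keeping the shared data hot in a single cache.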

Design – Shares Abstraction • File descriptors, process IDs • Many kernel operations involve looking up identifiers in tables to yield a pointer to the relevant kernel data structure • A shared FD table is a bottleneck • Shares • Applications can control how cores share the kernel data structures used to do lookups
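A minimal sketch of the lookup idea, assuming a simplified model: each core consults a private table first and only then the tables it has chosen to share. The class and method names (Share, CoreContext, resolve) are hypothetical and are not Corey's interface.

```python
class Share:
    """Hypothetical lookup table mapping identifiers to kernel objects."""

    def __init__(self, name):
        self.name = name
        self._table = {}

    def add_obj(self, ident, obj):
        self._table[ident] = obj

    def del_obj(self, ident):
        self._table.pop(ident, None)

    def lookup(self, ident):
        return self._table.get(ident)


class CoreContext:
    """Per-core view: one private share plus any explicitly shared ones."""

    def __init__(self, core_id, shared_shares=()):
        self.core_id = core_id
        self.private_share = Share(f"core{core_id}-private")
        self.shared_shares = list(shared_shares)

    def resolve(self, ident):
        # Private lookups touch no cross-core state.
        obj = self.private_share.lookup(ident)
        if obj is not None:
            return obj
        # Only identifiers the application chose to share are looked up
        # in tables that other cores can also see.
        for share in self.shared_shares:
            obj = share.lookup(ident)
            if obj is not None:
                return obj
        return None


if __name__ == "__main__":
    shared = Share("shared")
    core0 = CoreContext(0, [shared])
    core1 = CoreContext(1, [shared])
    core0.private_share.add_obj(3, "private file")
    shared.add_obj(4, "shared file")
    print(core0.resolve(3), core1.resolve(3), core1.resolve(4))
```

An identifier placed in a private share is invisible to other cores, so lookups for it need no cross-core coordination.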

Implementation • Low-level kernel • Architecture-specific functions, device drivers • 11,000 lines of C, 150 lines of assembly • Unix-like environment • 11,000 lines of C/C++ • Buffer cache, cfork, TCP/IP stack interface (lwIP)

Performance Evaluation • AMD 16-Core System • 8GB Memory • Linux kernel 2.6.25 • Pin one thread to each core • Intel Pro/1000 Ethernet Device

Evaluation - Address Range Abstraction • memclone (private memory) • Each core allocates its own 100 MB array and modifies each page of the array • mempass (shared memory) • Allocates a single 100 MB buffer on one of the cores, touches each page of the buffer, and passes it to the next core, which repeats the process

Evaluation – Kernel Cores Abstraction • Simple TCP Service • Dedicated • use a kernel core for all network processing • Polling • use a kernel core only to poll for packet notifications and transmit completions

Evaluation – Shares Abstraction • 2 microbenchmarks • Global share: each core calls share_addobj() to add a per-core segment to a global share, then calls share_delobj() to delete that segment • Local share: the same, but the per-core segment is added to a local share

Evaluation - Applications • MapReduce • wri MapReduce application • Builds a reverse index of the words in a file • 1 GB file

Evaluation - Applications • Increase performance by dedicating application data to cores • webd web server running an application called filesum • Returns the sum of the bytes in the requested file • Random mode, Locality mode
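A rough sketch of the two modes, assuming that locality mode always sends a given file to the same core while random mode picks any core. The hash-based assignment and the names filesum and pick_core are illustrative assumptions, not taken from the slides.

```python
import random
import zlib

def filesum(data):
    # Sum of the bytes in the file contents.
    return sum(data)

def pick_core(filename, num_cores, mode, rng=None):
    if mode == "locality":
        # Assumption: a stable hash keeps each file on one core,
        # so its contents stay warm in that core's cache.
        return zlib.crc32(filename.encode("utf-8")) % num_cores
    if mode == "random":
        # Any core may serve any file.
        return (rng or random).randrange(num_cores)
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    print(filesum(b"\x01\x02\x03"))
    print(pick_core("a.txt", 16, "locality"))
    print(pick_core("a.txt", 16, "random"))
```

In locality mode repeated requests for one file hit the same cache; in random mode the file's data migrates between caches.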

Conclusions • Applications should be scalable on multicore architectures • Corey is a new kernel • Address ranges, kernel cores, shares • Shows that applications can avoid scalability bottlenecks • MapReduce and web application

Backup Slides

cfork • cfork(core_id) is an extension of UNIX fork() that creates a new process (pcore) on core core_id • Application can specify multiple levels of sharing between parent and child • Default is copy-on-write

Network • Applications can decide to run • Multiple network stacks • A single shared network stack

Buffer Cache • Shared buffer like regular UNIX buffer cache • Three modifications • A lock-free tree allows multiple cores to locate cached blocks w/o contention • A write scheme tries to minimize contention • A scalable read/write lock

Spin Locks, MCS Locks

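MCS Lock • FIFO ordering of lock acquisitions • A task that wants to enter the critical section inserts itself into a queue that is initially empty • When it releases the lock, a task designates the next task to hold it • It designates the task queued immediately behind it • Advantages • Local spinning (little bus traffic) • No contention among tasks; the lock is acquired in a fixed order • Waiting time is bounded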
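The queueing behaviour can be sketched as below. This is an emulation under stated assumptions: a small threading.Lock stands in for the hardware atomic swap and compare-and-swap on the tail pointer, and time.sleep(0) stands in for a spin loop, so the sketch shows the algorithm's structure rather than its performance.

```python
import threading
import time

class _Node:
    def __init__(self):
        self.next = None     # successor in the queue
        self.locked = False  # True while this waiter must keep spinning

class MCSLock:
    def __init__(self):
        self._tail = None
        self._atomic = threading.Lock()  # emulates atomic swap / CAS only
        self._local = threading.local()

    def _swap_tail(self, new):
        with self._atomic:
            old, self._tail = self._tail, new
            return old

    def _cas_tail(self, expected, new):
        with self._atomic:
            if self._tail is expected:
                self._tail = new
                return True
            return False

    def acquire(self):
        node = _Node()
        self._local.node = node
        pred = self._swap_tail(node)  # join the end of the queue
        if pred is not None:
            node.locked = True
            pred.next = node          # link behind the predecessor
            while node.locked:        # spin on our own node only
                time.sleep(0)

    def release(self):
        node = self._local.node
        if node.next is None:
            # No visible successor: try to mark the queue empty.
            if self._cas_tail(node, None):
                return
            # A successor swapped in but has not linked itself yet.
            while node.next is None:
                time.sleep(0)
        node.next.locked = False      # hand the lock to the next waiter

if __name__ == "__main__":
    lock, counter = MCSLock(), [0]

    def work():
        for _ in range(200):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter[0])
```

Each waiter spins on a flag in its own node, which is what keeps bus traffic low, and the release path hands the lock to the immediate successor, which gives the FIFO order.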

[Figure: Thread A on Core 0 and Thread B on Core 1, each with its own root share; file descriptors (fd) for paper.pdf, text.txt and shared_avi.avi are reached through a per-process share; shares are labelled Private Share or Shared Share]
