Introduction to Container Kernel Internals

Introduction to Container Kernel Internals. 邱模炯 @ UCloud. Virtualization technology: VM vs. Container.


Presentation Transcript


  1. Introduction to Container Kernel Internals. 邱模炯 @ UCloud

  2. Virtualization technology: VM vs. Container
     [Figure: two stacks side by side. System Virtualization, bottom to top: Hardware, Host Kernel, Hypervisor (VMM), then per VM a Guest Kernel, Binaries/Libraries, App. Container Virtualization, bottom to top: Hardware, Host Kernel, then per container Binaries/Libraries, App.]

  3. Virtualization technology: VM vs. Container
     • Contents:
       • namespace
       • cgroup
       • aufs
       • Comparison of the two
     [Figure: the same two stacks as on the previous slide, with a "namespace & cgroup" layer between the Host Kernel and the containers in the Container Virtualization stack.]

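  4. namespace: a means of virtualizing process groups
     • What makes up a process's runtime environment?
       • VFS mounts, i.e. the root filesystem (rootfs)
       • uid, gid
       • network, i.e. its own network devices and TCP/IP stack
       • pid, parent pid
       • devices, host info, IPC, /proc, /sys, etc.
     • Process virtualization / isolation
     • Existing namespaces
       • pid: process IDs
       • mnt: filesystem mount points
       • net: network stack
       • uts: hostname
       • ipc: inter-process communication
       • user: user IDs, group IDs, capabilities
     • A child process automatically inherits its parent's namespaces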
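To check which namespaces a process belongs to, Linux exposes one symbolic link per namespace under /proc/<pid>/ns, with a target of the form `pid:[4026531836]`. The following is a minimal Python sketch, assuming that link format; the helper names `parse_ns_link`, `same_namespace` and `namespaces_of` are illustrative and do not come from the talk.

```python
import os
import re

# A link target looks like "pid:[4026531836]": namespace kind, then an inode number.
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")

def parse_ns_link(target):
    """Split a /proc/<pid>/ns/* link target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))

def same_namespace(target_a, target_b):
    """Two processes share a namespace when kind and inode both match."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)

def namespaces_of(pid="self"):
    """Read every namespace link of a process (requires a Linux /proc)."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}
```

On a Linux host, comparing `namespaces_of()` for a parent and a freshly forked child should show identical targets, which is the inheritance rule stated in the last bullet.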

  5. nsproxy is effectively the runtime environment
     struct task_struct { struct nsproxy *nsproxy; ... }
     struct nsproxy: uts_namespace, ipc_namespace, pid_namespace, mnt_namespace, net
     [Figure: several task_struct instances each point to one of two nsproxy objects; the namespace instances shown are UTS0, IPC0, PID0, MNT0, NET0 and UTS1, PID1, NET1.]
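The relationship between task_struct and nsproxy on this slide can be modelled in user space. The sketch below is a toy model in Python, not kernel code: the class names mirror the kernel structs, while `fork` and `unshare` are simplified stand-ins (an assumption of this sketch) for what happens when a child is created or a task asks for new namespaces.

```python
from dataclasses import dataclass, replace
import itertools

_ids = itertools.count()

@dataclass(frozen=True)
class Namespace:
    kind: str   # "uts", "ipc", "pid", "mnt" or "net"
    ident: int  # distinguishes e.g. UTS0 from UTS1

def new_namespace(kind):
    return Namespace(kind, next(_ids))

@dataclass(frozen=True)
class NsProxy:
    uts: Namespace
    ipc: Namespace
    pid: Namespace
    mnt: Namespace
    net: Namespace

@dataclass
class Task:
    name: str
    nsproxy: NsProxy

def initial_nsproxy():
    """One fresh namespace of every kind, as for the first task on a host."""
    return NsProxy(*(new_namespace(k) for k in ("uts", "ipc", "pid", "mnt", "net")))

def fork(parent, name):
    """A child starts with the very same nsproxy object as its parent."""
    return Task(name, parent.nsproxy)

def unshare(task, *kinds):
    """Give the task a new nsproxy in which only `kinds` are replaced."""
    fresh = {kind: new_namespace(kind) for kind in kinds}
    task.nsproxy = replace(task.nsproxy, **fresh)
```

After `unshare(child, "uts", "pid", "net")` the child has its own nsproxy, yet still shares the parent's ipc and mnt namespaces, which is one way to read the UTS1/PID1/NET1 labels in the figure.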

  6. The mnt namespace provides a private rootfs
     struct mnt_namespace { struct mount *root; ... }
     • Each mnt namespace has its own rootfs, located relative to the host's root directory

  7. pid namespace: the PID space after mapping
     struct pid_namespace { struct pidmap pidmap[PIDMAP_ENTRIES]; ... }
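The PID mapping can be illustrated with a small model in which each namespace numbers its processes independently and a process is also visible, under a different number, in every ancestor namespace. This is a hedged Python sketch: it uses a simple counter instead of the kernel's pidmap bitmap, and the class names `PidNamespace` and `Process` are made up for the illustration.

```python
import itertools

class PidNamespace:
    """Toy model: each namespace hands out its own PID numbers from 1."""
    def __init__(self, parent=None):
        self.parent = parent
        self._next_pid = itertools.count(1)

    def alloc(self):
        return next(self._next_pid)

    def ancestry(self):
        """This namespace first, then its parents up to the root."""
        ns = self
        while ns is not None:
            yield ns
            ns = ns.parent

class Process:
    def __init__(self, name, ns):
        self.name = name
        # One PID number per level: the process is visible in its own
        # namespace and in every ancestor namespace.
        self.numbers = {level: level.alloc() for level in ns.ancestry()}

    def pid_in(self, ns):
        """PID as seen from `ns`, or None if the process is not visible there."""
        return self.numbers.get(ns)
```

In this model the first process created in a container namespace sees itself as PID 1, while the host sees the same process under an ordinary host PID; host processes are invisible from inside the container.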

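  8. The net namespace provides an independent network stack
     Each namespace has:
     • Private network devices
       • Virtual devices such as lo and veth
       • Physical NICs
     • An independent protocol stack
       • ipv4, ipv6 (including IP addresses and routing tables)
       • tcp, sctp, dccp
       • iptables rules
       • ipvs, etc.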

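  9. Other namespaces
     • uts, ipc
     • user ns
       • ID 1001 in container C1 and ID 1001 in container C2 are not the same user
       • Mainly virtualizes uid, gid and capabilities
       • Merged in kernel 3.8; needs cooperation from filesystems and is still incomplete
     • Which namespaces are not implemented yet?
       • time, device, security keys, security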

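  10. What can namespaces achieve? A virtual machine, but with no resource guarantees
     • Independent mnt (chroot)
     • Independent pid space
     • Independent network protocol stack
     • uts
     • uid, gid
     • ipc
     [Figure: Container stack, bottom to top: Hardware, Host Kernel, namespace, then per container Binaries/Libraries, App.]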

  11. cgroup: resource isolation and accounting
     The group version of ulimit
     • Which controllers are there?
       • memory
         • usage_in_bytes
         • limit_in_bytes
         • stat
         • kmem...
         • kmem.tcp...
         • memsw...
       • cpu
       • blkio
       • cpuset, freezer, net_cls, net_prio, devices, perf, cpuacct, hugetlb
     • mount -t cgroup none /cgroup
     [Figure: a /cgroup tree with the labels cont1, cont2, apache, mysql, ftp; group1: memory limit 3072M, group2: memory limit 1024M.]
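Setting the two memory limits from the figure comes down to creating a directory per group and writing a byte count into its memory.limit_in_bytes file. Below is a minimal Python sketch, assuming a cgroup v1 style layout under a root such as /cgroup; the helpers `to_bytes` and `set_memory_limit` are hypothetical names, and on a real host the call needs root privileges.

```python
import os

_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(size):
    """Convert '3072M' style sizes to bytes; plain digits pass through."""
    size = size.strip().upper()
    if size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)

def set_memory_limit(cgroup_root, group, size):
    """Create <cgroup_root>/<group> and write its memory.limit_in_bytes."""
    group_dir = os.path.join(cgroup_root, group)
    os.makedirs(group_dir, exist_ok=True)
    limit = to_bytes(size)
    with open(os.path.join(group_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit))
    return limit
```

For example, `set_memory_limit("/cgroup", "group1", "3072M")` would write 3221225472 into /cgroup/group1/memory.limit_in_bytes. Pointing `cgroup_root` at a scratch directory lets the same code be tried without touching the real hierarchy.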

  12. cgroup: memory subsystem interface
     • Overall memory
       • memory.limit_in_bytes
       • memory.soft_limit_in_bytes
       • memory.usage_in_bytes
       • memory.failcnt
       • memory.max_usage_in_bytes
       • memory.stat
     • kmem memory
       • memory.kmem.limit_in_bytes
       • memory.kmem.usage_in_bytes
       • memory.kmem.failcnt
       • memory.kmem.max_usage_in_bytes
     • kmem.tcp memory
       • memory.kmem.tcp.xxx
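These files are plain text, so monitoring a group is mostly a matter of parsing. The sketch below is a small Python example under the assumption that memory.stat contains one `key value` pair per line; `parse_memory_stat` and `usage_ratio` are illustrative helper names.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines as found in memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        stats[key] = int(value)
    return stats

def usage_ratio(usage_in_bytes, limit_in_bytes):
    """Fraction of the limit currently in use."""
    return usage_in_bytes / limit_in_bytes
```

Feeding it the contents of memory.usage_in_bytes and memory.limit_in_bytes as integers gives a quick view of how close a group is to its limit.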

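  13. cgroup: how the memory subsystem works
     • mem_cgroup holds the group's memory limit, current usage, LRU lists and related information
     • Each process's task_struct knows which group it belongs to
     • Each page has a matching page_cgroup record, from which the page's owning group can be found
     • On a page fault, or when a page cache page is allocated, the kernel records which mem_cgroup the page_cgroup belongs to
     [Figure: relationships among task_struct, mm_struct, mem_cgroup, page and page_cgroup.]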
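The accounting path described here can be sketched as a toy model. The Python below is not kernel code: `charge` and `uncharge` are simplified stand-ins, limits are counted in pages, and the reclaim step the real kernel attempts before failing an allocation is deliberately left out.

```python
class MemCgroup:
    """Toy stand-in for mem_cgroup: a limit, a usage counter and failcnt."""
    def __init__(self, name, limit_pages):
        self.name = name
        self.limit_pages = limit_pages
        self.usage_pages = 0
        self.failcnt = 0

class Task:
    def __init__(self, name, group):
        self.name = name
        self.group = group  # the task knows which group it belongs to

class Page:
    def __init__(self):
        self.owner = None   # plays the role of the page_cgroup record

def charge(task, page):
    """Account a newly faulted-in page to the task's group."""
    group = task.group
    if group.usage_pages >= group.limit_pages:
        group.failcnt += 1  # the real kernel would try reclaim first
        return False
    group.usage_pages += 1
    page.owner = group
    return True

def uncharge(page):
    """Return a page's charge to the group that owns it."""
    if page.owner is not None:
        page.owner.usage_pages -= 1
        page.owner = None
```

The point of the per-page owner is visible in `uncharge`: freeing a page needs no reference to the task, because the page itself records which group was charged.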

  14. cgroup: CPU subsystem interface
     • cpu.shares: CPU scheduling weight between groups
     • cpu.stat
     • cpu.cfs_period_us
     • cpu.cfs_quota_us
     • cpu.rt_runtime_us
     • cpu.rt_period_us
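cpu.cfs_quota_us and cpu.cfs_period_us together cap how much CPU time a group may consume: the group may run for at most quota microseconds in every period. A short Python sketch of that arithmetic follows; `cpu_limit` is a hypothetical helper name.

```python
def cpu_limit(cfs_quota_us, cfs_period_us):
    """CPUs' worth of time a group may use per period; None means no cap."""
    if cfs_quota_us < 0:  # -1 is the "unlimited" setting
        return None
    return cfs_quota_us / cfs_period_us
```

A quota of 50000 with a period of 100000 limits the group to half of one CPU, while a quota of 200000 with the same period allows up to two CPUs' worth of time.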

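  15. cgroup: how the CPU subsystem works
     • The Completely Fair Scheduler (CFS) is built on a red-black tree, a self-balancing binary search tree
     • Before cgroups are enabled, all runnable processes are ordered together
     • After cgroups are enabled, scheduling becomes two-level
       • Level 1: between containers
       • Level 2: within a container
     [Figure: process trees before and after enabling cgroups; labels: root, group1, group2, p1 to p6.]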
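The effect of two-level scheduling on CPU distribution can be worked out with simple arithmetic. The Python sketch below computes long-run CPU fractions only; it does not model the red-black tree or vruntime, and it assumes every process is always runnable and that processes inside a group have equal weight. The function name `cpu_fractions` is illustrative.

```python
from fractions import Fraction

def cpu_fractions(groups):
    """groups maps name -> (cpu_shares, [process names]).

    Level 1 splits the CPU between groups in proportion to cpu.shares;
    level 2 splits each group's slice evenly between its processes.
    """
    total_shares = sum(shares for shares, procs in groups.values() if procs)
    result = {}
    for name, (shares, procs) in groups.items():
        if not procs:
            continue  # a group with nothing runnable gets no CPU time
        group_slice = Fraction(shares, total_shares)
        for proc in procs:
            result[proc] = group_slice / len(procs)
    return result
```

With two groups of equal shares, one holding four processes and the other two, each process in the first group gets 1/8 of the CPU and each in the second gets 1/4. Under flat scheduling all six would get 1/6, so the larger group could no longer crowd out the smaller one.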

  16. cgroup: other subsystems
     • blkio
     • cpuset, freezer, net_cls, net_prio, devices, perf, cpuacct, hugetlb

  17. What can namespace + cgroup achieve? A virtual machine with resource guarantees
     But how do we keep a container lightweight?
     [Figure: Container stack, bottom to top: Hardware, Host Kernel, namespace & cgroup, then per container Binaries/Libraries, App.]

  18. aufs overview
     • Another Union File System. Similar: Overlayfs
     • /base contains: boot bin lib lib64 usr sbin
     • /c1 contains: data
     • /union shows: boot bin lib lib64 usr sbin data
     • mount -t aufs -o br=/base=ro:/c1=rw none /union
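The merged view at /union can be mimicked with dictionaries. The sketch below is a toy Python model of union lookup, not aufs itself: branches are searched in the order given to the functions, whiteouts (deletion markers) are not modelled, and `union_lookup` and `union_listdir` are made-up names.

```python
def union_lookup(branches, path):
    """Return (branch_name, content) from the first branch that has `path`."""
    for name, files in branches:
        if path in files:
            return name, files[path]
    raise FileNotFoundError(path)

def union_listdir(branches):
    """Names visible in the union: the merged, de-duplicated set."""
    seen = set()
    for _, files in branches:
        seen.update(files)
    return sorted(seen)
```

With one branch holding the base system directories and another holding only `data`, the listing returns all seven names, matching the /union contents on the slide, and each lookup reports which branch actually supplied the entry.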

  19. aufs: how it works
     [Figure: aufs]

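  20. aufs: room for improvement
     • A single write operation causes the whole file to be copied
     • Handling of files with the same name in different branches
       • How to guarantee predictable behavior that can be reasoned about
       • A "lower-layer" file is modified directly, bypassing the union
       • A file is being edited in the "upper layer" and a newly added branch happens to contain a file of the same name
     [Figure: /rw: (none); /ro: fileA; /new: fileA; /union: fileA.]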
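The first complaint, that one write copies the whole file, follows from how copy-up works in a union filesystem. Here is a toy Python model of it, under the assumption that the top branch is the only writable one; `write_at` is a hypothetical function, and the return value is the number of bytes that had to be copied before the write could proceed.

```python
def write_at(branches, path, offset, data):
    """Write `data` into `path` through the union; return bytes copied up.

    `branches` is a list of (name, writable, files) with the top branch
    first; `files` maps path -> bytearray.
    """
    top_name, top_writable, top_files = branches[0]
    if not top_writable:
        raise PermissionError("top branch is read-only")
    copied = 0
    if path not in top_files:
        for _, _, files in branches[1:]:
            if path in files:
                # Copy-up: the whole file is duplicated, however small the write.
                top_files[path] = bytearray(files[path])
                copied = len(files[path])
                break
        else:
            top_files[path] = bytearray()  # file exists nowhere: create it
    target = top_files[path]
    if len(target) < offset:
        target.extend(b"\0" * (offset - len(target)))  # pad a sparse gap
    target[offset:offset + len(data)] = data
    return copied
```

Writing a single byte to a 1000-byte file that lives only in the read-only branch returns 1000, while a second write to the same file returns 0 because the copy already exists in the writable branch.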

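  21. Container vs. VM: the kernel is part of the runtime environment
     • The kernel version is coupled to libc to some degree
     • Kernel features
     • System calls, /proc, /sys
     • Disk device names
     • System configuration files
     [Figure: layers labelled program, user-space environment, kernel and hardware, with Container and VM marked against them.]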

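  22. Container vs. VM: use cases
     • Performance and size
       • Runtime overhead, deployment speed, startup speed
     • Migration
     • Isolation and security
     Which scenarios suit each?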

  23. Thank You
