
Kata Containers on Arm:

Let’s talk about our progress! Jia He & Penny Zheng. 4/30/2019.


Presentation Transcript


  1. Kata Containers on Arm: Let’s talk about our progress! Jia He & Penny Zheng, 4/30/2019

  2. Agenda • Kata Containers Design • Hardware Requirements • NVDIMM/DAX on AArch64 • Memory Hot-add on AArch64

  3. Kata Containers Design

  4. Hardware Requirements
     AArch64 bare metal
     kata-runtime kata-check
       • Tests whether the system can run Kata Containers
     Host kernel version: v5.0 (recommended)
       • kvm: arm64: Dynamic & >40 bit IPA support
       • IPA limit: ioctl request KVM_ARM_GET_MAX_VM_PHYS_SHIFT
       • KVM_CREATE_VM: “type” field
       • Merged in v4.20, but the nearest stable version is v5.0

     $ kata-runtime kata-check
     INFO[0000] Unable to know if the system is running inside a VM source=virtcontainers
     INFO[0000] kernel property found arch=arm64 description="Kernel-based Virtual Machine" name=kvm pid=132472 source=runtime type=module
     INFO[0000] kernel property found arch=arm64 description="Host kernel accelerator for virtio" name=vhost pid=132472 source=runtime type=module
     INFO[0000] kernel property found arch=arm64 description="Host kernel accelerator for virtio network" name=vhost_net pid=132472 source=runtime type=module
     INFO[0000] System is capable of running Kata Containers arch=arm64 name=kata-runtime pid=132472 source=runtime
     INFO[0000] device available arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=132472 source=runtime
     INFO[0000] feature available arch=arm64 check-type=full feature=create-vm name=kata-runtime pid=132472 source=runtime
     INFO[0000] System can currently create Kata Containers arch=arm64 name=kata-runtime pid=132472 source=runtime
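
The v5.0 recommendation above can be checked before running kata-check. The following is a minimal sketch, not part of kata-runtime: the function name `kernel_at_least` is hypothetical, and it only compares the leading major.minor numbers of a kernel release string.

```python
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a kernel release string is at least major.minor.

    Hypothetical helper for illustration; only the leading "X.Y" part
    of the release string is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)
```

On a Linux host, `kernel_at_least(platform.release(), 5, 0)` (after `import platform`) would apply the check to the running kernel.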

  5. NVDIMM on AArch64
     Virtio-blk as guest rootfs (original)
       • -device virtio-blk,drive=image-091275cdf53f819a,scsi=off,config-wce=off,romfile=
         -drive id=image-091275cdf53f819a,file=/usr/share/kata-containers/kata-containers-2018-11-08-02:07:30.763626711+0800-osbuilder-0123f8f-agent-0f411fd,aio=threads,format=raw,if=none
     Fatal error: couldn’t launch more than two containers simultaneously
       docker: Error response from daemon: OCI runtime create failed: qemu-system-aarch64: -device virtio-blk,drive=image-9f100592ac95eec6,scsi=off,config-wce=off: Failed to get "write" lock
       Is another process using the image?: unknown.
     Why did we switch to NVDIMM on AArch64? Because of the write lock error.

  6. NVDIMM/DAX on Kata Containers

  7. NVDIMM on AArch64
     PoP (Point of Persistence)
       • ARMv8.2 at least
       • The point in non-volatile memory (a.k.a. persistent memory)
     DC CVAP
       • Clean data cache by address to Point of Persistence
       • HWCAP_DCPOP and dcpop in /proc/cpuinfo
       • Kernel will use DC CVAC operations if DC CVAP is not supported
     [Diagram: CPU 0 to CPU 3, each with I and D caches; L2 caches (unification); L3 cache; device memory (coherency); persistent memory]
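
Whether a host advertises DC CVAP can be read from the `dcpop` flag mentioned above. This is a small sketch under the assumption that the flag appears on the `Features` line of /proc/cpuinfo, as it does on arm64 Linux; `has_dcpop` is a hypothetical name.

```python
def has_dcpop(cpuinfo_text: str) -> bool:
    """Return True if any 'Features' line in /proc/cpuinfo text lists dcpop.

    Hypothetical helper: takes the file contents as a string so it can be
    exercised without an arm64 machine.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features" and "dcpop" in value.split():
            return True
    return False
```

On a real host the input would be the contents of /proc/cpuinfo, for example `has_dcpop(open("/proc/cpuinfo").read())`.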

  8. vNVDIMM on Kata Containers: PoC (Point of Coherency) is enough for virtual NVDIMM

  9. Memory Hot-add on AArch64
     Physical memory hot-add phase
       • Bridges communication between hardware/firmware and the kernel
         • ACPI (automatically)
         • Probe interface
       • The kernel recognizes the new memory, creates new memory management tables, and creates sysfs files for the new memory
     Logical memory hot-add phase
       • Changes the memory state to available for users (online)
         • Add a goroutine that continuously listens for memory hot-add uevents (in Kata)
         • Set CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y (automatically)
     ACPI for x86_64 and the probe interface for AArch64
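
The logical phase amounts to writing `online` into each new block's sysfs `state` file. The sketch below illustrates that step only; it is not the Kata agent's code (the agent is a goroutine reacting to uevents), and the function name and the `memory_root` parameter are assumptions added so the logic can be tried against a scratch directory.

```python
from pathlib import Path
from typing import List


def online_offline_blocks(memory_root: str = "/sys/devices/system/memory") -> List[str]:
    """Write 'online' to every memoryN/state file that currently reads 'offline'.

    Returns the names of the blocks that were onlined. Hypothetical helper;
    on a real guest this needs root and a kernel with memory hotplug enabled.
    """
    onlined = []
    for state_file in sorted(Path(memory_root).glob("memory[0-9]*/state")):
        if state_file.read_text().strip() == "offline":
            state_file.write_text("online")
            onlined.append(state_file.parent.name)
    return onlined
```

With CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y the kernel performs this step itself, so a loop like this is only needed when that option is off.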

  10. Memory Hot-add on AArch64
      First step (kata-runtime, kata-agent, guest):
        • kata-runtime → kata-agent: getGuestDetails
        • kata-agent → guest: check the probe interface /sys/devices/system/memory/probe
        • kata-runtime: store the result

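  11. Memory Hot-add on AArch64
      Second step (kata-runtime, qemu-system-aarch64, kata-agent, guest):
        • kata-runtime → qemu-system-aarch64: query-memory-devices (free slot number)
        • kata-runtime → qemu-system-aarch64: object_add memory-backend-ram
        • kata-runtime → qemu-system-aarch64: device_add pc-dimm
        • kata-runtime → qemu-system-aarch64: query-memory-devices (memory device addr)
        • kata-runtime → kata-agent: memHotplugByProbe
        • kata-agent → guest: echo addr > /sys/devices/system/memory/probe
        • guest: acquire uevent, online memory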
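
The QEMU side of this step can be sketched as a list of QMP messages, followed by the string that is written to the probe file. This is an illustration under stated assumptions, not the kata-runtime implementation: the object and device ids are made up, and the nested `props` form of `object-add` is assumed to match QEMU releases of that period (later releases take the properties as flat arguments).

```python
import json
from typing import Any, Dict, List


def hotplug_commands(slot: int, size_bytes: int) -> List[Dict[str, Any]]:
    """Build QMP messages that add one RAM-backed pc-dimm, then list memory devices."""
    mem_id = f"mem{slot}"    # hypothetical backend object id
    dimm_id = f"dimm{slot}"  # hypothetical device id
    return [
        {"execute": "object-add",
         "arguments": {"qom-type": "memory-backend-ram", "id": mem_id,
                       "props": {"size": size_bytes}}},  # assumed older "props" form
        {"execute": "device_add",
         "arguments": {"driver": "pc-dimm", "id": dimm_id, "memdev": mem_id}},
        {"execute": "query-memory-devices"},
    ]


def probe_address(addr: int) -> str:
    """Format a DIMM address as the hex string written to the probe file."""
    return f"0x{addr:x}"


if __name__ == "__main__":
    for message in hotplug_commands(slot=1, size_bytes=1 << 30):
        print(json.dumps(message))
```

The address passed to `probe_address` would come from the `addr` field of the `query-memory-devices` reply for the newly added DIMM.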

  12. Memory Hot-add on AArch64
      $ docker run -it --runtime kata-runtime -m 3G ubuntu
      WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
      root@40bc669706b2:/# ls /sys/devices/system/memory/
      auto_online_blocks  memory1  memory3  memory5  probe
      block_size_bytes    memory2  memory4  power    uevent
      root@40bc669706b2:/# cat /sys/devices/system/memory/block_size_bytes
      40000000
      root@40bc669706b2:/# lsmem
      RANGE                                 SIZE  STATE REMOVABLE BLOCK
      0x0000000040000000-0x000000017fffffff   5G online        no   1-5

      Memory block size:       1G
      Total online memory:     5G
      Total offline memory:    0B

      Current status
        guest kernel
          • v5.0+
          • Probe interface (packaging repo)
        kata-runtime, kata-agent, qemu
          • upstream review
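
The numbers in this output are consistent with each other, which is easy to verify: `block_size_bytes` is printed in hexadecimal, and the lsmem range spans exactly five blocks of that size. A short worked check (the helper names are hypothetical):

```python
GIB = 1 << 30


def block_size(block_size_bytes_text: str) -> int:
    """Parse block_size_bytes, which sysfs prints as hex without a 0x prefix."""
    return int(block_size_bytes_text.strip(), 16)


def range_size(range_text: str) -> int:
    """Size in bytes of an inclusive 'start-end' range as printed by lsmem."""
    start, end = (int(part, 16) for part in range_text.split("-"))
    return end - start + 1


size = block_size("40000000")
total = range_size("0x0000000040000000-0x000000017fffffff")
print(size // GIB, total // GIB, total // size)
```

The three printed values are the block size in GiB, the online range in GiB, and the number of blocks, matching the `1G`, `5G`, and `1-5` fields shown by lsmem.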

  13. Agenda • CPU hotplug in arm64 qemu • VM templating • SR-IOV virtual functions hotplug • TODO

  14. Update Resource
      [Diagram: the Kata runtime resizes a VM through the Kata agent, from 128M memory and 1 VCPU to 2048M memory and n VCPUs, using mem/CPU hotplug via gRPC]

  15. CPU hotplug in arm64 qemu
      • Runtime (#1262, #1489), agent (#478)
      • Limitations on arm64:
        • ACPI-based CPU hotplug is not yet supported for qemu guests, and there is no clear plan for the future
        • The Arm GIC architecture can’t handle hot-adding a vCPU after boot.
      • Mentioned by Mark Rutland & Christoffer Dall (commit 716139df):
        • We'd need something along those lines. Each CPU has a notional point to point link to the interrupt controller (to the redistributor, to be precise), and this entity must pre-exist.
        • When the vgic initializes its internal state it does so based on the number of VCPUs available at the time. If we allow KVM to create more VCPUs after the VGIC has been initialized, we are likely to error out in unfortunate ways later, perform buffer overflows etc.

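  16. CPU hotplug flow (Kata, qemu via qmp, guest VM)
      • Kata → qemu: start guest with -smp 4 -append “maxcpus=1”
      • guest: start cpus (1 online, 3 offline)
      • Kata → qemu: query_hotpluggable_cpus
      • qemu → Kata: return cpu node info
      • Kata → qemu: device_add/del cpu
      • Kata → guest: online/offline cpus by gRPC
      • qemu → Kata: send DEVICE_DELETED event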
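
Inside the guest, onlining or offlining a CPU that is present but not started comes down to writing `1` or `0` to its sysfs `online` file. The sketch below shows only that guest-side step; it is not the Kata agent's gRPC handler, and the function names and `cpu_root` parameter are assumptions so it can be tried against a scratch directory.

```python
from pathlib import Path
from typing import List


def set_cpu_online(cpu: int, online: bool,
                   cpu_root: str = "/sys/devices/system/cpu") -> None:
    """Write '1' or '0' to cpuN/online (needs root on a real guest)."""
    Path(cpu_root, f"cpu{cpu}", "online").write_text("1" if online else "0")


def online_cpus(cpu_root: str = "/sys/devices/system/cpu") -> List[int]:
    """Return the indices of CPUs whose 'online' file reads 1."""
    result = []
    for online_file in Path(cpu_root).glob("cpu[0-9]*/online"):
        if online_file.read_text().strip() == "1":
            result.append(int(online_file.parent.name[len("cpu"):]))
    return sorted(result)
```

For the `-smp 4` and `maxcpus=1` example above, calling `set_cpu_online(n, True)` for n in 1 to 3 would bring the remaining three CPUs up.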

  17. Pros vs Cons:
      • Pros
        • The hypervisor doesn’t need to support ACPI or other hardware CPU hotplug. This is really helpful for arm64 or even firecracker.
        • The code is easy to implement.
      • Cons
        • docker container density on arm64.
        • Concerns from the community:
          • “this is an orchestration workaround that neither the hypervisor nor the architecture support. I'd prefer to get that feature implemented from the hypervisor.”
          • “the biggest drawback of the proposed approach is security”

  18. VM templating
      [Diagram: the Kata runtime saves a VM template (RAM as a memory backend file with share=on, plus device state) into a template pool, and loads VM templates into new VMs via incoming migration]

  19. VM templating
      • More work is still needed on arm64
      • Commit 18269069 ("migration: Introduce ignore-shared capability") adds the ignore-shared capability to bypass shared RAM blocks (e.g., a memory backend file with share=on). This benefits live migration.
      • QEMU is expected not to write anything to guest RAM until the VM starts, but arm64 qemu does
      • The rom block “dtb” is filled into RAM during rom_reset. In the incoming case, this rom filling does not seem to be required, since all the data is already stored in the memory backend file.
      • Catherine Ho submitted a proposal to fix this, which is under upstream review.

  20. SR-IOV VF hotplug
      • Update Resource
        • Mem/CPU hotplug
        • PCI hotplug
      • Eric Auger’s work (KVM PCIe/MSI Passthrough on Arm/Arm64) in qemu and kernel in 2016
      [Diagram: Kata runtime updating a VM from 128M memory and 1 VCPU to 2048M memory, n VCPUs, and PCI devices]

  21. SR-IOV VF hotplug
      • agent issue #414
        $ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet
      • The VF driver (e.g. mlx5) might generate a random MAC address that differs from the VF MAC address on the host.
      • The Kata runtime and agent use the MAC address as the top priority when deciding whether two links are the same.
      • Proposal: use the PCI BDF address to search for the link instead
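
The proposal can be illustrated with sysfs: for a PCI network device, /sys/class/net/<iface>/device is a symlink to the PCI device directory, whose name is the BDF. The sketch below is an assumption about how such a lookup could work, not the change made for agent issue #414; the function names and the `net_root` parameter are hypothetical.

```python
import os
from typing import Optional


def pci_bdf_of(iface: str, net_root: str = "/sys/class/net") -> str:
    """Return the last path component of the interface's 'device' symlink target.

    For a directly assigned PCI VF this is its BDF, e.g. '0000:01:00.1'.
    """
    device_link = os.path.join(net_root, iface, "device")
    return os.path.basename(os.path.realpath(device_link))


def find_iface_by_bdf(bdf: str, net_root: str = "/sys/class/net") -> Optional[str]:
    """Return the interface whose backing device matches bdf, or None."""
    for iface in sorted(os.listdir(net_root)):
        device_link = os.path.join(net_root, iface, "device")
        # Virtual interfaces such as 'lo' have no backing device entry.
        if os.path.exists(device_link) and pci_bdf_of(iface, net_root) == bdf:
            return iface
    return None
```

Because the lookup never reads the MAC address, it is unaffected by a VF driver that assigns a random MAC inside the guest.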

  22. TODO on arm64 • virtio-fs • nvdimm dax support • nemu support • firecracker support • Kubernetes integration test • Metrics • ...
