Introduction
Prerequisites
Basic hotplug operation
More options
Driving via libvirt
Debugging aids
Internal details
Future
Introduction
Memory hotplug is a feature that allows the amount of physical RAM available in a system to be increased or decreased dynamically. For dynamically added memory to become available to applications, memory hotplug must be supported appropriately at multiple layers, such as the firmware and the operating system. This blog post looks mainly at the emerging support for memory hotplug in KVM virtualization for PowerPC sPAPR virtual machines (pseries guests). For virtual machines, memory hotplug is typically used to vertically scale the guest's physical memory up or down at runtime based on requirements. This feature is expected to be useful for vertical scaling of PowerPC guests in KVM Cloud environments.
In KVM virtualization, an alternative way to dynamically increase or decrease the guest's memory is memory ballooning. While ballooning requires cooperation between the guest and the host, memory hotplug is a deterministic way to grow and shrink the memory of the guest.
Prerequisites
Memory hotplug support for PowerPC sPAPR guests is now part of upstream QEMU and is expected to be available starting with the QEMU 2.5 release. This means memory hotplug support is available for pseries machine types starting from pseries-2.5.
Memory hotplug support for the QEMU/KVM driver was added in libvirt 1.2.14. The libvirt support is mostly architecture-neutral, but some of the memory alignment requirements for PowerPC memory hotplug are enforced only from libvirt 1.2.20, which is therefore the recommended version for exploiting the memory hotplug feature on PowerPC.
In addition to support in QEMU, libvirt and the guest kernel (which, incidentally, has existed for a long time), some changes were also needed in the PowerPC RAS tools to support memory hotplug. The guest-side requirements are listed below:
The rtas_errd daemon, provided by the ppc64_diag package, needs to be running in the guest for memory hotplug to function correctly.
Basic hotplug operation
This section describes the steps to be followed by a QEMU user for memory hotplug operation.
* Start the guest with the command line options required for memory hotplug:
qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G
-m 4G starts the guest with an initial RAM size of 4G.
maxmem=32G specifies that this guest's RAM can grow up to 32G via memory hotplug operations.
slots=32 specifies the number of DIMM slots available to this guest for hotplugging memory. As in a physical system, each memory hotplug operation is done by populating a DIMM slot in the guest. PowerPC supports a maximum of 32 DIMM slots, of which only 31 are available for hotplug.
* Ensure that rtas_errd daemon is running inside the guest.
# ps aux | grep rtas
root 3685 0.7 0.0 5568 3712 ? Ss 16:49 0:00 rtas_errd
# grep Mem /proc/meminfo
MemTotal: 4146560 kB
MemFree: 2908544 kB
* Connect to the QEMU monitor from the host and issue the memory hotplug commands:
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 0
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true
Hotplugging memory from the QEMU monitor is a two-step operation. In the first step, a memory backend object is created, which is memory-backend-ram (ram0) in the above example. In the second step, a pc-dimm device is added with ram0 as its backing memory object.
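The same two-step sequence can also be driven programmatically over QMP, QEMU's JSON monitor protocol. The sketch below only builds the command pair; actually sending them requires a QMP socket (for example, a guest started with -qmp unix:...,server,nowait), which is assumed and not shown here:

```python
import json

def hotplug_commands(dimm_id, backend_id, size_bytes, node=None):
    """Build the QMP command pair for the two-step hotplug:
    first the memory backend object, then the pc-dimm device."""
    obj = {"execute": "object-add",
           "arguments": {"qom-type": "memory-backend-ram",
                         "id": backend_id,
                         "props": {"size": size_bytes}}}
    dev_args = {"driver": "pc-dimm", "id": dimm_id, "memdev": backend_id}
    if node is not None:
        dev_args["node"] = node  # target NUMA node, if any
    dev = {"execute": "device_add", "arguments": dev_args}
    return [obj, dev]

# The QMP equivalent of the 1G monitor example above:
for cmd in hotplug_commands("dimm0", "ram0", 1 << 30):
    print(json.dumps(cmd))
```

Each dict would be sent as one JSON line over the QMP socket after the usual qmp_capabilities handshake.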
* Check that the RAM size has grown in the guest:
# grep Mem /proc/meminfo
MemTotal: 5195136 kB
MemFree: 3020160 kB
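The growth can also be verified programmatically. A minimal sketch, using the sample /proc/meminfo values above, that extracts MemTotal and reports the delta:

```python
def mem_total_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

# The sample snapshots from before and after the 1G hotplug above
before = "MemTotal:        4146560 kB\nMemFree:         2908544 kB"
after  = "MemTotal:        5195136 kB\nMemFree:         3020160 kB"

delta_kb = mem_total_kb(after) - mem_total_kb(before)
print(delta_kb // 1024, "MB added")  # 1024 MB, i.e. the 1G DIMM
```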
More options
This section explores a few more options and other possibilities with memory hotplug.
* NUMA guest – If the guest has a NUMA topology, it is possible to hotplug memory into a particular NUMA node of the guest.
qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G -numa node,nodeid=0,mem=2G,cpus=0-7 -numa node,nodeid=1,mem=2G,cpus=8-15
Here the guest has 4G RAM divided between 2 NUMA nodes, as can be seen with the command below in the guest:
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 2020 MB
node 0 free: 1105 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 2028 MB
node 1 free: 1674 MB
node distances:
node 0 1
0: 10 40
1: 40 10
node= can be specified explicitly with the device_add command to hotplug memory to a given NUMA node:
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 1
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true
Verify that the memory has been added to NUMA node 1 in the guest:
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 2020 MB
node 0 free: 971 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 3052 MB
node 1 free: 2610 MB
node distances:
node 0 1
0: 10 40
1: 40 10
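When automating NUMA-targeted hotplug, the per-node sizes can be extracted from the numactl -H output. A small sketch (the parsing is an assumption based on the output format shown above):

```python
import re

def node_sizes_mb(numactl_output):
    """Map NUMA node id -> size in MB from `numactl -H` output."""
    sizes = {}
    for line in numactl_output.splitlines():
        m = re.match(r"node (\d+) size: (\d+) MB", line.strip())
        if m:
            sizes[int(m.group(1))] = int(m.group(2))
    return sizes

# Trimmed sample from the post-hotplug output above
sample = """available: 2 nodes (0-1)
node 0 size: 2020 MB
node 1 size: 3052 MB"""
print(node_sizes_mb(sample))  # {0: 2020, 1: 3052}
```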
* Hugetlbfs backed guest
If the guest's RAM is backed by hugetlbfs, memory-backend-file can be used to add more memory via hotplug. Assume a guest started with 16M hugepages like this:
qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G -mem-path /dev/hugepages/hugetlbfs-16M
The hotplug is then performed using a memory-backend-file object like this:
(qemu) object_add memory-backend-file,id=ram0,size=1G,mem-path=/dev/hugepages/hugetlbfs-16M
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=0
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 0
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true
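A practical point with file-backed DIMMs: the DIMM size should be a whole multiple of the backing huge page size, in addition to the pseries 256MB alignment covered in the Debugging aids section. A hedged sanity-check sketch (the multiple-of-hugepage rule is stated here as an assumption):

```python
MB = 1 << 20

def dimm_size_ok(size_bytes, hugepage_bytes=16 * MB, align_bytes=256 * MB):
    """Check that a hotplug DIMM size is a whole number of huge pages
    and meets the pseries 256MB alignment requirement."""
    return size_bytes % hugepage_bytes == 0 and size_bytes % align_bytes == 0

print(dimm_size_ok(1024 * MB))  # True: 1G is 64 x 16M pages, 256MB aligned
print(dimm_size_ok(100 * MB))   # False: neither constraint is met
```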
* Migration
If a guest that has undergone memory hotplug operations needs to be migrated to another host, the memory backend objects and pc-dimm devices must be specified explicitly on the target side using the -object and -device options respectively.
If the following hotplug operation is done at the source,
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
then at the target host, the guest should be started with the following options:
qemu-system-ppc64 … -object memory-backend-ram,id=ram0,size=1G -device pc-dimm,id=dimm0,memdev=ram0 -incoming …
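Reconstructing the target-side options from a record of the DIMMs hotplugged at the source can be automated. A minimal sketch (the record format is hypothetical):

```python
def migration_options(dimms):
    """Build the -object/-device option pairs the target QEMU needs
    for each DIMM hotplugged at the source."""
    opts = []
    for d in dimms:  # d: dict with backend id, dimm id and size string
        opts += ["-object",
                 "memory-backend-ram,id={backend},size={size}".format(**d),
                 "-device",
                 "pc-dimm,id={dimm},memdev={backend}".format(**d)]
    return opts

# The single-DIMM example above
print(" ".join(migration_options(
    [{"backend": "ram0", "dimm": "dimm0", "size": "1G"}])))
```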
Driving via libvirt
This section describes the steps to perform memory hotplug for a guest that is managed by libvirt.
The guest XML needs to have the following bits:
<maxMemory slots=’32’ unit=’KiB’>33554432</maxMemory>
<memory unit=’KiB’>8388608</memory>
<currentMemory unit=’KiB’>4194304</currentMemory>
<cpu>
<numa>
<cell id=’0′ cpus=’0-127′ memory=’8388608′ unit=’KiB’/>
</numa>
</cpu>
This describes a single NUMA node guest with 4G of current memory, 32 slots and hotpluggable memory up to 32G.
Hotplug is done using virsh:
# cat mem-2g.xml
<memory model='dimm'>
<target>
<size unit='KiB'>2097152</size>
<node>0</node>
</target>
</memory>
# virsh attach-device <domain> mem-2g.xml
Device attached successfully
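The device XML can also be generated instead of hand-written. A small sketch using Python's ElementTree (the double quotes in its output are XML-equivalent to the single quotes above):

```python
import xml.etree.ElementTree as ET

def dimm_xml(size_kib, node):
    """Build the <memory model='dimm'> device XML that
    virsh attach-device expects."""
    mem = ET.Element("memory", model="dimm")
    target = ET.SubElement(mem, "target")
    size = ET.SubElement(target, "size", unit="KiB")
    size.text = str(size_kib)
    ET.SubElement(target, "node").text = str(node)
    return ET.tostring(mem, encoding="unicode")

# The 2G-to-node-0 example above
print(dimm_xml(2097152, 0))
```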
More information about other memory hotplug related options supported by libvirt is available here.
Debugging aids
Here are some nice-to-know details about memory hotplug that could come in handy when facing problems.
* Minimum hotplug granularity – The minimum DIMM size that can be hotplugged into an sPAPR PowerPC guest is 256MB.
* Memory alignment – With the introduction of memory hotplug support, the memory alignment requirements for pseries guests have become stricter. The initial RAM size, the maxmem size and the memory size of each NUMA node must now be aligned to 256MB, failing which QEMU will refuse to start the guest. The DIMM/memory size that gets hotplugged must also be aligned to 256MB.
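These alignment rules can be checked before starting the guest. A small sketch encoding the 256MB requirement:

```python
ALIGN = 256 * (1 << 20)  # 256MB pseries alignment

def config_aligned(initial_ram, maxmem, numa_node_sizes):
    """Return True if all sizes (in bytes) meet the 256MB alignment
    QEMU enforces for pseries guests with memory hotplug."""
    sizes = [initial_ram, maxmem] + list(numa_node_sizes)
    return all(s % ALIGN == 0 for s in sizes)

G = 1 << 30
print(config_aligned(4 * G, 32 * G, [2 * G, 2 * G]))  # True
print(config_aligned(4 * G + (1 << 20), 32 * G, []))  # False: off by 1MB
```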
* Hotplugging to memory-less NUMA node is not allowed.
* With memory hotplug support, pseries guests with maxmem beyond 1TB might not work. This is due to the limited buffer size that is passed from SLOF (the guest firmware) to QEMU during the ibm,client-architecture-support call issued by the guest early during boot.
* sPAPR PowerPC guests need a data structure called the HTAB (hash table) that stores the virtual-to-physical page mappings for the guest. The HTAB for a guest is allocated by the host from the contiguous memory area (CMA), which is a limited resource (by default, 5% of host RAM is reserved as the CMA region). All guests running on the host get their HTABs allocated from this CMA region. The HTAB size depends on the maxmem size, and specifying huge values of maxmem for a guest can result in failures like the one below:
qemu-system-ppc64 … -m 4G,slots=32,maxmem=1T
qemu-system-ppc64: Failed to allocate HTAB of requested size, try with smaller maxmem
Aborted
In such cases lowering the maxmem is recommended.
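To see why large maxmem values are problematic, a rough estimate helps. The sketch below assumes the commonly cited rule of thumb that the HTAB is sized at about 1/128 of the maximum RAM, rounded up to a power of two; this is an approximation, not QEMU's exact computation:

```python
def htab_estimate_bytes(maxmem_bytes):
    """Rough HTAB size estimate: ~1/128 of maxmem, rounded up to a
    power of two (a rule of thumb, not QEMU's exact formula)."""
    target = max(maxmem_bytes // 128, 1 << 18)  # assume a small floor
    size = 1
    while size < target:
        size <<= 1
    return size

G, T = 1 << 30, 1 << 40
print(htab_estimate_bytes(32 * G) // (1 << 20), "MB")  # 256 MB
print(htab_estimate_bytes(1 * T) // G, "GB")           # 8 GB from the CMA pool
```

At maxmem=1T this estimate is around 8GB of CMA per guest, which suggests why the allocation can fail on hosts with a modest CMA pool.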
* Typically, rtas_errd is expected to be running in the guest before any memory hotplug operation is attempted. If it is not running, the hotplug operation is still reported as successful at the QEMU monitor or by virsh, but the added memory does not show up in the guest. Starting rtas_errd makes all previously added memory appear in the guest; alternatively, rebooting the guest makes such memory appear after the reboot.
* libvirt-managed guests need to be NUMA aware (at least one NUMA node should be defined in the XML) to support memory hotplug. This limitation is likely to be relaxed soon.
Internal details
TODO
Future
TODO
* libvirt NUMA node relaxation
* In-kernel hotplug
* Memory hot removal or unplug.