CPU hotplug support in PowerKVM

August 9, 2016
Introduction
Prerequisites
Basic hotplug operation
More options
Driving via libvirt
KVM Forum 2016 presentation

Introduction

CPU hotplug is a feature that can be used to dynamically increase or decrease the number of CPUs available in a system. For the dynamically added CPUs to become available to applications, CPU hotplug has to be supported at multiple layers, such as the firmware and the operating system. This blog post mainly looks at the emerging support for CPU hotplug in KVM virtualization for PowerPC sPAPR virtual machines (pseries guests). In the case of virtual machines, CPU hotplug is typically used to vertically scale the guest CPUs up or down at runtime based on the requirements. This feature is expected to be useful for supporting vertical scaling of PowerPC guests in KVM cloud environments. Memory hotplug, which is also part of the VM scale-up requirement, was discussed in my previous post.

CPU hotplug is supported for PowerPC sPAPR by means of device_add/device_del interfaces and not via the legacy cpu-add interface.

Prerequisites

CPU hotplug support for PowerPC sPAPR guests is now part of QEMU upstream and is expected to be available starting from QEMU-2.7 release. This implies that CPU hotplug support is available for pseries machine types starting from pseries-2.7.

In addition to QEMU and guest kernel support (which, by the way, has existed for a long time), some changes were also needed in the PowerPC RAS tools to support CPU hotplug. The minimum versions of these packages needed in the guest are listed below:

Package          Minimum required version
powerpc-utils    1.2.26
ppc64_diag       2.6.8
librtas          1.3.9

The rtas_errd daemon, which is provided by the ppc64_diag package, needs to be running in the guest for CPU hotplug to function correctly.

Basic hotplug operation

This section describes the steps to be followed by a QEMU user for CPU hotplug operation.

* CPU hotplug is performed at core granularity on PowerPC. For example, if a core has 8 threads, one CPU hotplug operation will result in the addition of 1 core with 8 CPU threads.

* Start a single-threaded guest with the command line options required for CPU hotplug.

qemu-system-ppc64 … -smp 2,cores=4,threads=1,maxcpus=4 -cpu POWER8E

-smp 2 will start the guest with 2 CPUs.
maxcpus=4 specifies that this guest can have a maximum of 4 CPUs, which implies 2 more CPUs can be added via CPU hotplug operations.
cores=4,threads=1 specifies the topology of the guest (4 cores each with 1 thread)
-cpu POWER8E specifies the guest CPU model

* Ensure that rtas_errd service is running inside the guest.

# ps aux | grep rtas
root      2518  0.3  0.0   6016  3968 ?        Ss   11:52   0:00 rtas_errd

[root@localhost ~]# lscpu
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1

* Connect to QEMU monitor from the host to discover hotpluggable CPUs and issue CPU hotplug command.

query-hotpluggable-cpus is the QMP interface while “info hotpluggable-cpus” is the HMP variant of the same.
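
Over QMP, the query is a plain JSON command; a minimal sketch, assuming the QMP capabilities handshake has already been done on the socket:

{ "execute": "query-hotpluggable-cpus" }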

(qemu) info hotpluggable-cpus
Hotpluggable CPUs:

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  CPUInstance Properties:
    core-id: "3"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  CPUInstance Properties:
    core-id: "2"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  qom_path: "/machine/unattached/device[2]"
  CPUInstance Properties:
    core-id: "1"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  qom_path: "/machine/unattached/device[1]"
  CPUInstance Properties:
    core-id: "0"

type specifies the CPU core device type to be used with the device_add command.
vcpus_count specifies the number of threads the core has.
core-id specifies the value for the core-id property that needs to be specified during device_add.
qom_path, if present, implies that the CPU core is already plugged in; if absent, it implies that the core can be hotplugged.

To hot add a CPU,

(qemu) device_add POWER8E-spapr-cpu-core,id=core2,core-id=2

The added CPU core should be visible within the guest as well as in the QEMU monitor.

[root@localhost ~]# lscpu
CPU(s):                3
On-line CPU(s) list:   0-2
Thread(s) per core:    1

(qemu) info hotpluggable-cpus
Hotpluggable CPUs:

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  CPUInstance Properties:
    core-id: "3"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  qom_path: "/machine/peripheral/core2" <-- hot added CPU
  CPUInstance Properties:
    core-id: "2"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  qom_path: "/machine/unattached/device[2]"
  CPUInstance Properties:
    core-id: "1"

  type: "POWER8E-spapr-cpu-core"
  vcpus_count: "1"
  qom_path: "/machine/unattached/device[1]"
  CPUInstance Properties:
    core-id: "0"

To hot remove a CPU,

(qemu) device_del core2
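
The same hot add and hot remove can also be driven over QMP; a minimal sketch, again assuming an established QMP connection:

{ "execute": "device_add", "arguments": { "driver": "POWER8E-spapr-cpu-core", "id": "core2", "core-id": 2 } }
{ "execute": "device_del", "arguments": { "id": "core2" } }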

More options

In this section a few more options and other possibilities with CPU hotplug are explored.

* NUMA guest – The NUMA semantics for CPU hotplug is likely to undergo a change, but this is how it works currently:

qemu-system-ppc64 … -smp 2,cores=4,threads=1,maxcpus=4 -numa node,nodeid=0,cpus=0,cpus=2 -numa node,nodeid=1,cpus=1,cpus=3

This guest will have 4 CPUs with existing CPU=0 and hotpluggable CPU=2 on NUMA node 0 while existing CPU=1 and hotpluggable CPU=3 on NUMA node 1. Currently the NUMA node affinity of the CPUs isn’t shown in the HMP/QMP queries.

The steps to discover, hot add and hot remove CPUs are the same as shown in the previous example; the only difference is that the hot-added CPU will automatically end up on the NUMA node that was specified during boot using the -numa command line option.

The CPU added like below

(qemu) device_add POWER8E-spapr-cpu-core,id=core2,core-id=2

will end up in NUMA node 0.

[root@localhost ~]# numactl -H | grep cpus
node 0 cpus: 0 2
node 1 cpus: 1

* CPU compat option

The legacy way to specify CPU compat option is like this:

-cpu host,compat=power7

However, with QEMU-2.7, the compat option can also be specified using the -global option so that the specified compat level is applied uniformly to the boot CPUs as well as to hotplugged CPUs.

-cpu host -global driver=host-powerpc64-cpu,property=compat,value=power7

The hotplug steps are similar to the previous example.

(qemu) device_add host-spapr-cpu-core,id=core2,core-id=2

Note that the compat option need not be specified with every hotplugged CPU, since it has already been set as a global option that applies to hotplugged CPUs as well.

* Migration
If a guest that has undergone CPU hotplug operations needs to be migrated to another host, then, like any other device, the CPU core device should be specified explicitly on the target side using the -device option.

If the following hotplug operation is done at the source,

(qemu) device_add POWER8E-spapr-cpu-core,id=core2,core-id=2

then at the target host, the guest should be started with the following options:
qemu-system-ppc64 … -device POWER8E-spapr-cpu-core,id=core2,core-id=2

* Example of SMT4 guest

-smp 4,cores=4,threads=4,maxcpus=16

(qemu) info hotpluggable-cpus
Hotpluggable CPUs:

  type: "host-spapr-cpu-core"
  vcpus_count: "4"
  CPUInstance Properties:
    core-id: "12"

  type: "host-spapr-cpu-core"
  vcpus_count: "4"
  CPUInstance Properties:
    core-id: "8"

  type: "host-spapr-cpu-core"
  vcpus_count: "4"
  CPUInstance Properties:
    core-id: "4"

  type: "host-spapr-cpu-core"
  vcpus_count: "4"
  qom_path: "/machine/unattached/device[1]"
  CPUInstance Properties:
    core-id: "0"

(qemu) device_add host-spapr-cpu-core,core-id=12,id=core4
In this example, two things are worth noting:

– The core-ids listed by the HMP query will be 0, 4, 8, … since this is an SMT4 guest.
– Any of the listed cores can be hot added in any order, as we did with core-id=12 here.

Driving via libvirt

Support for this device based CPU hotplug is still under development in libvirt.
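
Until that support lands, any example is necessarily speculative; the expectation is that the existing virsh vCPU interface will eventually drive device_add/device_del underneath, along the lines of:

# virsh setvcpus <domain> <new-vcpu-count> --live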

KVM Forum 2016 presentation

The slides for CPU hotplug KVM Forum 2016 presentation can be found here and the presentation video can be seen here.


Memory hotplug support in PowerKVM

October 9, 2015

Introduction
Prerequisites
Basic hotplug operation
More options
Driving via libvirt
Debugging aids
Internal details
Future

Introduction

Memory hotplug is a feature that can be used to dynamically increase or decrease the amount of physical RAM available in a system. For the dynamically added memory to become available to applications, memory hotplug has to be supported at multiple layers, such as the firmware and the operating system. This blog post mainly looks at the emerging support for memory hotplug in KVM virtualization for PowerPC sPAPR virtual machines (pseries guests). In the case of virtual machines, memory hotplug is typically used to vertically scale the guest's physical memory up or down at runtime based on the requirements. This feature is expected to be useful for supporting vertical scaling of PowerPC guests in KVM cloud environments.

In KVM virtualization, an alternative way to dynamically increase or decrease memory of the guest is to use memory ballooning. While memory ballooning requires cooperation between the guest and the host, memory hotplug is a deterministic way to grow and reduce the memory of the guest.

Prerequisites

Memory hotplug support for PowerPC sPAPR guests is now part of QEMU upstream and is expected to be available starting from QEMU-2.5 release. This implies that memory hotplug support is available for pseries machine types starting from pseries-2.5.

Memory hotplug support was added to the QEMU/KVM driver of libvirt in version 1.2.14. Support in libvirt is mostly architecture neutral, but some of the memory alignment requirements for PowerPC memory hotplug are enforced only from libvirt-1.2.20, which is the recommended version to exploit the memory hotplug feature on PowerPC.

In addition to support in QEMU, libvirt and the guest kernel (which, by the way, has existed for a long time), some changes were also needed in the PowerPC RAS tools to support memory hotplug. The minimum versions of these packages needed in the guest are listed below:

Package          Minimum required version
powerpc-utils    1.2.26
ppc64_diag       2.6.8
librtas          1.3.9

The rtas_errd daemon, which is provided by the ppc64_diag package, needs to be running in the guest for memory hotplug to function correctly.

Basic hotplug operation

This section describes the steps to be followed by a QEMU user for memory hotplug operation.

* Start the guest with the command line options required for memory hotplug.

qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G

-m 4G will start the guest with initial 4G RAM size.
maxmem=32G specifies that this guest’s RAM can grow till 32G via memory hotplug operations.
slots=32 specifies the number of DIMM slots available for this guest to hotplug memory. Like in a physical system, each memory hotplug operation is done by populating a DIMM slot in the guest. PowerPC supports a maximum of 32 DIMM slots, of which only 31 are available for hotplug.

* Ensure that rtas_errd daemon is running inside the guest.

# ps aux | grep rtas
root      3685  0.7  0.0   5568  3712 ?        Ss   16:49   0:00 rtas_errd

# grep Mem /proc/meminfo
MemTotal:        4146560 kB
MemFree:         2908544 kB

* Connect to QEMU monitor from the host and issue memory hotplug commands

(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 0
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true

Hotplugging memory from the QEMU monitor is a 2-step operation. In the first step, we create a memory backend object, which is memory-backend-ram (ram0) in the above example. Next, a pc-dimm device is added with ram0 as the backing memory object.

* Check that the RAM size has grown in the guest

# grep Mem /proc/meminfo
MemTotal:        5195136 kB
MemFree:         3020160 kB

More options

In this section a few more options and other possibilities with memory hotplug are explored.

* NUMA guest – If the guest has NUMA topology, it is possible to do hotplug to a particular NUMA node of the guest.

qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G -numa node,nodeid=0,mem=2G,cpus=0-7 -numa node,nodeid=1,mem=2G,cpus=8-15

Here the guest has 4G RAM divided between 2 NUMA nodes as can be seen by the below command in the guest.

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 2020 MB
node 0 free: 1105 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 2028 MB
node 1 free: 1674 MB
node distances:
node   0   1
0:  10  40
1:  40  10

node= can be specified explicitly with device_add command to hotplug to a given NUMA node.

(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 1
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true

Verify that the memory got added to NUMA node 1 in the guest

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 2020 MB
node 0 free: 971 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 3052 MB
node 1 free: 2610 MB
node distances:
node   0   1
0:  10  40
1:  40  10

* Hugetlbfs backed guest

If the guest's RAM is backed by hugetlbfs, then we could use memory-backend-file to add more memory via hotplug. Assume a guest is started with 16M hugepages like this:

qemu-system-ppc64 … -m 4G,slots=32,maxmem=32G -mem-path /dev/hugepages/hugetlbfs-16M

The hotplug is performed by using memory-backend-file object like this:

(qemu) object_add memory-backend-file,id=ram0,size=1G,mem-path=/dev/hugepages/hugetlbfs-16M
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=0
(qemu) info memory-devices
Memory device [dimm]: "dimm0"
addr: 0x100000000
slot: 0
node: 0
size: 1073741824
memdev: /objects/ram0
hotplugged: true
hotpluggable: true

* Migration

If a guest that has undergone memory hotplug operations needs to be migrated to another host, the memory backend objects and pc-dimm objects should be specified explicitly on the target side using -object and -device options respectively.

If the following hotplug operation is done at the source,

(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0

then at the target host, the guest should be started with the following options:

qemu-system-ppc64 … -object memory-backend-ram,id=ram0,size=1G -device pc-dimm,id=dimm0,memdev=ram0 -incoming …

Driving via libvirt

This section describes the steps to perform memory hotplug for a guest that is managed by libvirt.

The guest XML needs to have the following bits:

<maxMemory slots='32' unit='KiB'>33554432</maxMemory>
<memory unit='KiB'>8388608</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
<cpu>
  <numa>
    <cell id='0' cpus='0-127' memory='8388608' unit='KiB'/>
  </numa>
</cpu>

This describes a single NUMA node guest with 4G memory, 32 slots and hotpluggable memory up to 32G.

Hotplug is done by using virsh.

# cat mem-2g.xml
<memory model='dimm'>
  <target>
    <size unit='KiB'>2097152</size>
    <node>0</node>
  </target>
</memory>

# virsh attach-device <domain> mem-2g.xml
Device attached successfully
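
To also make the hotplugged DIMM persist across guest restarts, the same device XML can be attached with the --config flag as well (an illustrative variant using standard virsh attach-device flags):

# virsh attach-device <domain> mem-2g.xml --live --config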

More information about the other memory hotplug related options supported by libvirt is available here.

Debugging aids

Here are some nice-to-know details about memory hotplug that could come in handy when facing problems.

* Minimum hotplug granularity – The minimum DIMM size that can be hotplugged into an sPAPR PowerPC guest is 256MB.
* Memory alignment – With the introduction of memory hotplug support, the memory alignment requirements for pseries guests have become stricter. Now the initial RAM size, the maxmem size and the memory size of individual NUMA nodes must be aligned to 256MB, failing which QEMU will refuse to start the guest. The DIMM/memory size that gets hotplugged is also required to be aligned to 256MB.
* Hotplugging to a memory-less NUMA node is not allowed.
* After the addition of memory hotplug support, pseries guests with maxmem beyond 1TB might not work. This is due to the limited buffer size that gets passed on from SLOF (the guest firmware) to QEMU during the ibm,client-architecture-support call that the guest issues early during boot.
* sPAPR PowerPC guests need a data structure called the HTAB (hash table) that stores the virtual to physical page mappings for the guest. The HTAB for a guest is allocated by the host from the host's contiguous memory area (CMA), which is a limited resource (by default 5% of host RAM is the CMA region). All guests running on the host get their HTAB allocated from this CMA region. The HTAB size depends on the maxmem size, and specifying huge values of maxmem for the guest could result in failures like the one below:

qemu-system-ppc64 … -m 4G,slots=32,maxmem=1T
qemu-system-ppc64: Failed to allocate HTAB of requested size, try with smaller maxmem
Aborted

In such cases lowering the maxmem is recommended.

* Typically it is expected that the rtas_errd daemon is running in the guest before any memory hotplug operation is attempted. If rtas_errd isn't running, the memory hotplug operation is reported as a success at the QEMU monitor or by virsh, but the added memory doesn't get reflected in the guest. Starting rtas_errd (see the note after this list) makes all of the previously added memory appear in the guest; a reboot of the guest would also result in such memory appearing after the reboot.
* libvirt managed guests need to be NUMA aware (at least 1 NUMA node should be defined in the XML) for supporting memory hotplug. This limitation is likely to be relaxed soon.
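
As noted in the rtas_errd bullet above, if rtas_errd is found to be not running it can be checked and started like any other guest service; on a systemd based guest this would be something like the following (assuming the ppc64_diag packaging ships an rtas_errd service unit):

# systemctl status rtas_errd
# systemctl start rtas_errd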

Internal details

TODO

Future

TODO

* libvirt NUMA node relaxation
* In-kernel hotplug
* Memory hot removal or unplug.


GlusterFS Block Device Translator

November 27, 2013

Block device translator

Block device translator (BD xlator) is a new translator added to GlusterFS recently that provides a block backend for GlusterFS. It replaces the existing bd_map translator in GlusterFS that provided similar but very limited functionality. GlusterFS expects the underlying brick to be formatted with a POSIX compatible file system. BD xlator changes that and allows for bricks that are raw block devices like LVM, which needn't have any file system on them. Hence with BD xlator, it becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LVs).


BD xlator maps underlying LVs to files and hence the LVs appear as files to GlusterFS clients. Though BD volume externally appears very similar to the usual Posix volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported and the exact semantics are described in subsequent sections.

While Posix volume takes a file system directory as brick, BD volume needs a volume group (VG) as brick. In the usual use case of BD volume, a file created on BD volume will result in an LV being created in the brick VG. In addition to a VG, BD volume also needs a file system directory that should be specified at the volume creation time. This directory is necessary for supporting the notion of directories and directory hierarchy for the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.

BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Some of the salient points of BD xlator are

  • Since BD supports file level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
  • BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
  • BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to GlusterFS BD volume. Thus thin pool can be used as a VM image repository allowing access/visibility to it from multiple nodes.
  • BD supports true zerofill by using BLKZEROOUT ioctl on underlying block devices. Thus BD allows SCSI WRITESAME to be used on underlying block device if the device supports it.

Though BD xlator is primarily intended to be used with block devices, it does provide full Posix xlator compatibility for files that are created on BD volume but are not backed by or mapped to a block device. Such files which don’t have a block device mapping exist on the Posix directory that is specified during BD volume creation.

Availability

BD xlator developed by M. Mohan Kumar was committed into GlusterFS git in November 2013 and is expected to be part of upcoming GlusterFS-3.5 release.

Compiling BD translator

BD xlator needs the lvm2 development library. The --enable-bd-xlator option can be used with the ./configure script to explicitly enable the BD translator. The following snippet from the output of the configure script shows that BD xlator is enabled for compilation.

GlusterFS configure summary
===================

Block Device xlator  : yes
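
For reference, a minimal configure invocation that enables the translator would be along these lines (any other configure options you need are elided here):

./configure --enable-bd-xlator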

Creating a BD volume

BD supports hosting both linear LVs and thin LVs within the same volume. However, I will be showing them separately in the following instructions. As noted above, the prerequisite for a BD volume is a VG, which I am creating here from a loop device, but it can be any other device too.

1. Creating BD volume with linear LV backend

– Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop

– Prepare a brick by creating a VG

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg /dev/loop0

– Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta

It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.

Create and mount the volume
[root@bharata ~]# gluster volume create bd bharata:/bd-meta?bd-vg force

The general syntax for specifying the brick is host:/posix-dir?volume-group-name where “?” is the separator.

[root@bharata ~]# gluster volume start bd
[root@bharata ~]# gluster volume info bd

Volume Name: bd
Type: Distribute
Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
Status: Started
Xlator 1: BD
Capability 1: offload_copy
Capability 2: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta
Brick1 VG: bd-vg

[root@bharata ~]# mount -t glusterfs bharata:/bd /mnt

– Create a file that is backed by an LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Since the volume is empty now, so is the underlying VG.
[root@bharata ~]# lvdisplay bd-vg
[root@bharata ~]#

Creating a file that is mapped to an LV is a 2 step operation. First the file should be created on the mount point and a specific extended attribute should be set to map the file to LV.

[root@bharata ~]# touch /mnt/lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv

Now an LV got created in the VG brick and the file /mnt/lv maps to this LV. Any read/write to this file ends up as read/write to the underlying LV.
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                      available
# open                          0
LV Size                    4.00 MiB
Current LE                   1
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

The file gets created with the default LV size, which is 1 LE (4MB in this case).
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv

truncate can be used to set the required file size.
[root@bharata ~]# truncate /mnt/lv -s 256M
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                       available
# open                           0
LV Size                     256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv

The size of the file/LV can be specified during creation/mapping time itself like this:
setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv

2. Creating BD volume with thin LV backend

– Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop-thin

– Prepare a brick by creating a VG and thin pool

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg-thin /dev/loop0

Create a thin pool
[root@bharata ~]# lvcreate --thin bd-vg-thin -L 1000M
Rounding up size to full physical extent 4.00 MiB
Logical volume "lvol0" created

lvdisplay shows the thin pool
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                       lvol0
VG Name                      bd-vg-thin
LV UUID                        HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID  0
LV Pool metadata          lvol0_tmeta
LV Pool data                  lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                          0
LV Size                          1000.00 MiB
Allocated pool data     0.00%
Allocated metadata     0.88%
Current LE                   250
Segments                     1
Allocation                     inherit
Read ahead sectors     auto
– currently set to         256
Block device                253:9

– Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta-thin

Create and mount the volume
[root@bharata ~]# gluster volume create bd-thin bharata:/bd-meta-thin?bd-vg-thin force
[root@bharata ~]# gluster volume start bd-thin
[root@bharata ~]# gluster volume info bd-thin

Volume Name: bd-thin
Type: Distribute
Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta-thin
Brick1 VG: bd-vg-thin
[root@bharata ~]# mount -t glusterfs bharata:/bd-thin /mnt

– Create a file that is backed by a thin LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Creating a file that is mapped to a thin LV is a 2 step operation. First the file should be created on the mount point and a specific extended attribute should be set to map the file to a thin LV.

[root@bharata ~]# touch /mnt/thin-lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv

Now /mnt/thin-lv is a thin provisioned file that is backed by a thin LV.
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                        lvol0
VG Name                       bd-vg-thin
LV UUID                         HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 1
LV Pool metadata         lvol0_tmeta
LV Pool data                 lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                         0
LV Size                         1000.00 MiB
Allocated pool data    0.00%
Allocated metadata    0.98%
Current LE                  250
Segments                    1
Allocation                    inherit
Read ahead sectors   auto
– currently set to        256
Block device               253:9

--- Logical volume ---
  LV Path                     /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
LV Name                        081b01d1-1436-4306-9baf-41c7bf5a2c73
VG Name                       bd-vg-thin
LV UUID                         coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:43:19 +0530
LV Pool name                 lvol0
LV Status                       available
# open                           0
  LV Size                     256.00 MiB
Mapped size                  0.00%
Current LE                    64
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
– currently set to          256
Block device                 253:10

As can be seen from above, creation of a file resulted in creation of a thin LV in the brick.

Snapshots and clones

BD xlator uses the LVM snapshot and clone capabilities to provide file level snapshots and clones for files on a GlusterFS volume. Snapshots and clones work only for those files that have already been mapped to an LV. In other words, snapshots and clones aren't available for Posix-only files that exist on a BD volume.

Creating a snapshot

Say we are interested in taking snapshot of a file /mnt/file that already exists and has been mapped to an LV.

[root@bharata ~]# ls -l /mnt/file
-rw-r--r--. 1 root root 268435456 Nov 27 10:16 /mnt/file

[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                        /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                      abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

Snapshot creation is a two step process.

– Create a snapshot destination file first
[root@bharata ~]# touch /mnt/file-snap

– Then take the actual snapshot
In order to create the actual snapshot, we need to know the GFID of the snapshot file.

[root@bharata ~]# getfattr -n glusterfs.gfid.string /mnt/file-snap
getfattr: Removing leading '/' from absolute path names
# file: mnt/file-snap
glusterfs.gfid.string="bdf74e38-dc96-4b26-94e2-065fe3b8bcc3"

Use this GFID string to create the actual snapshot
[root@bharata ~]# setfattr -n snapshot -v bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                         /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                       abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV snapshot status   source of bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 [active]
LV Status                       available
# open                           0
LV Size                           256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

--- Logical volume ---
LV Path                        /dev/bd-vg/bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
LV Name                      bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
VG Name                      bd-vg
LV UUID                        9XH6xX-Sl64-uNhk-7OiH-f91m-DaMo-6AWiBD
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:20:35 +0530
LV snapshot status  active destination for abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
COW-table size             4.00 MiB
COW-table LE               1
Allocated to snapshot  0.00%
Snapshot chunk size    4.00 KiB
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
– currently set to          256
Block device                 253:7

As can be seen from the lvdisplay output, /mnt/file-snap now is the snapshot of /mnt/file.

Creating a clone

Creating a clone is similar to creating a snapshot except that “clone” attribute name should be used instead of “snapshot”.

setfattr -n clone -v <gfid-of-clone-file> <path-to-source-file>

Clone in BD volume is essentially a server off-loaded full copy of the file.
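
Putting the steps together, cloning an LV-mapped file /mnt/file into /mnt/file-clone would look something like this (the file names are illustrative and the flow mirrors the snapshot steps above):

[root@bharata ~]# touch /mnt/file-clone
[root@bharata ~]# getfattr -n glusterfs.gfid.string /mnt/file-clone
[root@bharata ~]# setfattr -n clone -v <gfid-of-file-clone> /mnt/file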

Concerns

As you have seen, creating a block device backed file on a BD volume and creating snapshots and clones involve non-standard steps, including setting extended attributes. These steps could be cumbersome for an end user, and there are plans to encapsulate all of this into nicer APIs that users could consume easily.


Troubleshooting QEMU-GlusterFS

August 21, 2013

As described in my previous blog post, QEMU supports talking to GlusterFS using libgfapi, which is a much better way to use GlusterFS to host VM images than using a FUSE mount to access GlusterFS volumes. However, due to some bugs that exist in GlusterFS-3.4, any invalid specification of a GlusterFS drive on the QEMU command line can result in completely non-obvious error messages from QEMU. The aim of this blog post is to document some of these known error scenarios in QEMU-GlusterFS so that users can figure out what could potentially be wrong in their QEMU-GlusterFS setup.

As of this writing (Aug 2013), I know about two bugs in GlusterFS that cause all this confusion in the failure messages:

  • A bug that results in GlusterFS log messages not reaching stderr because of which no meaningful failure messages are shown from QEMU. This bug has been fixed, and it should eventually make it to some release (probably 3.4.1) of GlusterFS.
  • A bug in a libgfapi routine called glfs_init() that results in returning failure with errno set to 0. QEMU depends on errno here and this leads to unpleasant segmentation faults in QEMU.

Environment

I am using Fedora 19 with distro-provided QEMU and GlusterFS rpms for all the experiments here.

# rpm -qa | grep qemu
qemu-system-x86-1.4.2-5.fc19.x86_64
ipxe-roms-qemu-20130517-2.gitc4bce43.fc19.noarch
qemu-kvm-1.4.2-5.fc19.x86_64
qemu-guest-agent-1.4.2-3.fc19.x86_64
libvirt-daemon-driver-qemu-1.0.5.1-1.fc19.x86_64
qemu-img-1.4.2-5.fc19.x86_64
qemu-common-1.4.2-5.fc19.x86_64

# rpm -qa | grep gluster
glusterfs-3.4.0-8.fc19.x86_64
glusterfs-libs-3.4.0-8.fc19.x86_64
glusterfs-fuse-3.4.0-8.fc19.x86_64
glusterfs-server-3.4.0-8.fc19.x86_64
glusterfs-api-3.4.0-8.fc19.x86_64
glusterfs-debuginfo-3.4.0-8.fc19.x86_64
glusterfs-cli-3.4.0-8.fc19.x86_64

Error scenarios

1. glusterd service not started

On Fedora 19, the glusterd service fails to start during boot, and if you don't notice it, you can end up using QEMU with GlusterFS when the glusterd service isn't running. In this scenario, you will encounter the following kind of failure:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
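
The fix here is simply to start the glusterd service on the host (and optionally enable it so that it comes up at boot); on a systemd based host like Fedora 19 this would be:

# systemctl start glusterd
# systemctl enable glusterd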

2. GlusterFS volume not started

If QEMU is used to boot a VM image on GlusterFS volume which is not in started state, the following kind of error is seen:

[root@localhost ~]# gluster volume status test
Volume test is not started
[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
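
The fix is to start the volume before pointing QEMU at it:

[root@localhost ~]# gluster volume start test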

3. QEMU run as non-root user

As of this writing (Aug 2013), GlusterFS doesn't allow non-root users to access GlusterFS volumes via libgfapi without some manual settings. This is a very typical scenario for most users who don't use QEMU directly but use it via libvirt and oVirt. In this scenario, the following kind of error is seen:

[bharata@localhost ~]$ qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
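
The manual settings commonly described for allowing non-root libgfapi access are along these lines (shown only as a sketch; consult the GlusterFS documentation for your version):

# gluster volume set test server.allow-insecure on
(and add "option rpc-auth-allow-insecure on" to /etc/glusterfs/glusterd.vol, then restart glusterd)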

As you can see, 3 different failure scenarios produced similar error messages. This will improve once error logs from GlusterFS start reaching QEMU when GlusterFS with the fix is used.

4. Specifying non-existing GlusterFS volume or server

Specifying invalid or wrong server or volume names can cause nasty failures like this:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test-xxx image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster-xxx port=0 volume=test image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

This should get fixed when the bug in libgfapi is resolved.


UNMAP/DISCARD support in QEMU-GlusterFS

August 7, 2013

In my last blog post on QEMU-GlusterFS, I described the integration of QEMU with GlusterFS using libgfapi. In this post, I give an overview of the recently added discard support to QEMU’s GlusterFS back-end and how it can be used. Newer SCSI devices support UNMAP command that is used to return the unused/freed blocks back to the storage. This command is typically useful when the storage is thin provisioned like a thin provisioned SCSI LUN. In response to a file deletion, the host device driver could send down an UNMAP command to the SCSI target and instruct it to free the relevant blocks from the thin provisioned LUN. This leads to much better utilization of the storage.

Linux support for discard

In Linux, SCSI UNMAP is supported via the generic discard framework, which I believe is also used to support the ATA TRIM command. The ATA TRIM command, typically used with SSDs, isn't the topic of this blog post. There are multiple ways in which discard functionality is invoked or used in Linux.

  • For direct block devices, one could use the BLKDISCARD ioctl to release the unused blocks.
  • File systems like EXT4 support a file level discard using the FALLOC_FL_PUNCH_HOLE option of the fallocate system call.
  • For releasing the unused blocks at the file system level, the fstrim command can be used.
  • Finally, file systems like EXT4 also support the 'discard' mount option, which controls whether the file system should issue UNMAP requests to the underlying block device when blocks are freed (see the short examples after this list).
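
To make the mechanisms listed above concrete, here is roughly what each one looks like from the command line (device names, paths and sizes are illustrative):

# blkdiscard /dev/sdX                      # discard an entire block device
# fallocate -p -o 0 -l 1M /path/to/file    # punch a hole in a file (file level discard)
# fstrim -v /mountpoint                    # release the unused blocks of a mounted file system
# mount -o discard /dev/sdX /mountpoint    # let the file system issue discards as blocks are freed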

QEMU support for discard

UNMAP is primarily useful in two ways for KVM virtualization.

  • When a file is deleted in the VM, the resulting UNMAP in the guest is passed down to the host, which results in the host sending the discard request to the thin provisioned SCSI device. This results in the blocks consumed by the deleted file being returned to the SCSI storage. The effect is the same when there is an explicit discard request from the guest using either the ioctl or fallocate methods listed in the previous section.
  • When a VM image is deleted, there is a potential to return the freed blocks back to the storage by sending the UNMAP command to the SCSI storage.

Guest UNMAP requests will end up in QEMU only if the guest is using scsi or virtio-scsi device and not virtio-blk device. QEMU will forward this request further down to the host (device or file system) only if ‘discard=on’ or ‘discard=unmap’ drive flag is used for the device on the QEMU command line.

Example1: qemu-system-x86_64 -drive file=/images/vm.img,if=scsi,discard=on
Example2: qemu-system-x86_64 -device virtio-scsi-pci -drive if=none,discard=on,id=rootdisk,file=gluster://host/volume/image -device scsi-hd,drive=rootdisk

The way the discard request is passed further down in the host is determined by the block driver inside QEMU that is serving the disk image type. While QEMU uses fallocate(FALLOC_FL_PUNCH_HOLE) for raw file backends and ioctl(BLKDISCARD) for the block device back-end, other backends use their own interfaces to pass down the discard request.

GlusterFS support for discard

GlusterFS, starting from version 3.4, supports discard functionality through a new API called glfs_discard(glfs_fd, offset, size) that is available as part of the libgfapi library. On the GlusterFS server side, the discard request is handled differently for the posix back-end and the Block Device (BD) back-end.

For the posix back-end, fallocate(FALLOC_FL_PUNCH_HOLE) is used to eventually release the blocks to the filesystem. If the posix brick has been mounted with ‘-o discard’ option, then the discard request will eventually reach the SCSI storage if the storage device supports UNMAP.

Support for the BD back-end is planned to come up as soon as the ongoing development work on the newer and feature-rich BD translator lands upstream. The BD translator is expected to use ioctl(BLKDISCARD) to UNMAP the blocks.

Discard support in GlusterFS back-end of QEMU

As described in my earlier blog, QEMU starting from version 1.3 supports GlusterFS back-end using libgfapi which is the non-FUSE way of accessing GlusterFS volumes. Recently I added discard support to GlusterFS block driver in QEMU and this support should be available in QEMU-1.6 onwards. This work involved using the glfs_discard() API from QEMU GlusterFS driver to send the discard request to GlusterFS server.

With this, the entire KVM virtualization stack using GlusterFS back-end is enabled to take advantage of SCSI UNMAP command.

Typical usage

This section describes a typical use case with QEMU-GlusterFS where a file deleted from inside the VM results in discard requests for the host storage.

Step 1: Prepare a block device that supports UNMAP

Since I don’t have a real SCSI device that supports UNMAP, I am going to use a loop device to host my GlusterFS volume. Loop device supports discard.

[root@llmvm02 bharata]# dd if=/dev/zero of=discard-loop count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.63255 s, 1.7 GB/s

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 2097152    IO Block: 4096   regular file
Device: fc13h/64531d    Inode: 15740290    Links: 1

[root@llmvm02 bharata]# losetup /dev/loop0 discard-loop

[root@llmvm02 bharata]# cat /sys/block/loop0/queue/discard_max_bytes
4294966784

Step 2: Prepare the brick directory for GlusterFS volume

[root@llmvm02 bharata]# mkfs.ext4 /dev/loop0
[root@llmvm02 bharata]# mount -o discard /dev/loop0 /discard-mnt/
[root@llmvm02 bharata]# mount  | grep discard
/dev/loop0 on /discard-mnt type ext4 (rw,discard)

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 3: Create a GlusterFS volume

[root@llmvm02 bharata]# gluster volume create discard llmvm02:/discard-mnt/ force
volume create: discard: success: please start the volume to access data

[root@llmvm02 bharata]# gluster volume start discard
volume start: discard: success

[root@llmvm02 bharata]# gluster volume info discard
Volume Name: discard
Type: Distribute
Volume ID: ed7a6f8a-9cb8-463d-a948-61974cb64c99
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: llmvm02:/discard-mnt

Step 4: Create a sparse file in the GlusterFS volume

[root@llmvm02 bharata]# glusterfs -s llmvm02 --volfile-id=discard /mnt
[root@llmvm02 bharata]# touch /mnt/file
[root@llmvm02 bharata]# truncate -s 450M /mnt/file

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 0          IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 5: Use this sparse file on GlusterFS volume as a disk drive with QEMU

[root@llmvm02 bharata]# qemu-system-x86_64 --enable-kvm -nographic -m 8192 -smp 2 -device virtio-scsi-pci -drive if=none,cache=none,id=F17,file=gluster://llmvm02/test/F17 -device scsi-hd,drive=F17 -drive if=none,cache=none,id=gluster,discard=on,file=gluster://llmvm02/discard/file -device scsi-hd,drive=gluster -kernel /home/bharata/linux-2.6-vm/kernel1 -initrd /home/bharata/linux-2.6-vm/initrd1 -append "root=UUID=d29b972f-3568-4db6-bf96-d2702ec83ab6 ro rd.md=0 rd.lvm=0 rd.dm=0 SYSFONT=True KEYTABLE=us rd.luks=0 LANG=en_US.UTF-8 console=tty0 console=ttyS0 selinux=0"

The file appears as a SCSI disk in the guest.

[root@F17-kvm ~]# dmesg | grep -i sdb
sd 0:0:1:0: [sdb] 921600 512-byte logical blocks: (471 MB/450 MiB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: 63 00 00 08
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn’t support DPO or FUA

Step 6: Generate discard requests from the guest

Format a FS on the SCSI disk, mount it and create a file on it

[root@F17-kvm ~]# mkfs.ext4 /dev/sdb
[root@F17-kvm ~]# mount -o discard /dev/sdb /guest-mnt/
[root@F17-kvm ~]# dd if=/dev/zero of=/guest-mnt/file bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.07204 s, 391 MB/s

Now we can see the blocks count growing in the host like this:

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 846880     IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 917288     IO Block: 4096   regular file

Remove the file, and this should generate discard requests which get passed down to the host, eventually resulting in the discarded blocks being released from the underlying loop device in the host.

[root@F17-kvm ~]# rm -f /guest-mnt/file

In the host,

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 46600      IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 112960     IO Block: 4096   regular file

Thus we saw the discard requests from the guest eventually resulting in the blocks getting released from the host block device. I was using a loop device in the host, but this should work for any SCSI device that supports UNMAP. It is not necessary to have a SCSI device with UNMAP support to get the benefit of space saving. In fact, if the underlying device is a thin logical volume (dm-thin) coming from a thin pool dm device, the space saving can be realized at the thin pool level itself.

Concerns with discard mount option

There have been concerns about the cost of UNMAP operation in the storage and its detrimental effect on the IO throughput. So it is unclear if everyone will want to turn on the discard mount option by default. I wish I had access to an UNMAP-capable storage array to really test the effect of UNMAP on IO performance.


QEMU-GlusterFS native integration

October 29, 2012

GlusterFS is a distributed file system implemented in user space. It is strictly not a native file system in itself but is an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE based Gluster native client.


More information on creating and mounting GlusterFS volume can be obtained from GlusterFS website.

GlusterFS for virtualization

Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has changed now with two specific enhancements:

– A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
– QEMU (starting from QEMU-1.3) will have GlusterFS block driver that uses libgfapi and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes.

GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. QEMU has to just talk to GlusterFS and GlusterFS will hide different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available for QEMU. Efforts are also on to add block device backend in Gluster via Block Device (BD) translator that will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.

QEMU with native GlusterFS block driver

GlusterFS specification in QEMU

A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:

gluster[+transport]://[server[:port]]/volname/image[?socket=…]

gluster is the protocol.

transport specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn’t specified, then tcp type is assumed.

server specifies the server where the volume file specification for the given volume resides. This can be either hostname, ipv4 address or ipv6 address. ipv6 address needs to be within square brackets [ ]. If transport type is unix, then server field should not be specified. Instead the socket field needs to be populated with the path to unix domain socket.

port is the port number on which glusterd is listening. This is optional and if not specified, QEMU will send 0 which will make gluster to use the default port. If the transport type is unix, then port should not be specified.

volname is the name of the gluster volume which contains the VM image.

image is the path to the actual VM image that resides on gluster volume.

Examples:

gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img

(GlusterFS URI description and above examples are taken from QEMU documentation)

Configuring QEMU with GlusterFS backend

While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)

Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system.
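
A quick way to verify that pkg-config can locate the GlusterFS API before configuring QEMU (a simple sanity check against the glusterfs-api.pc file mentioned above):

pkg-config --modversion glusterfs-api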

Creating a VM image on GlusterFS backend

The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:

qemu-img create gluster://server/volname/path/to/image size

Example:

To create a raw image, qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image, qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G

Booting VM image from GlusterFS backend

A VM image a.img residing on gluster volume testvol can be booted using QEMU like this:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio

In addition to VM images, gluster drives can also be used as data drives:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio

Here a-data.img from datavol gluster volume appears as a 2nd drive for the guest.

Performance numbers

The following numbers from the FIO benchmark show the performance advantage of using QEMU's GlusterFS block driver instead of the usual FUSE mount while accessing the VM image.

Test setup

Host:  Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest: Fedora 17 image, 4-way SMP, 2GB RAM, using virtio and cache=none QEMU options

QEMU options

FUSE mount: qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none => /mnt is the GlusterFS FUSE mount point
GlusterFS block driver in QEMU (FUSE bypass): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none
Base (VM image accessed directly from brick): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none => /test is the brick directory

FIO load files

Sequential read direct IO ; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write direct IO ; Write 4 files with aio at different depths

[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16


FIO READ numbers

                                             aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                   15219          3804          5792
QEMU's GlusterFS block driver (FUSE bypass)  39357          9839          12946
Base                                         43802          10950         12918

FIO WRITE numbers

                                             aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                   24579          6144          8423
QEMU's GlusterFS block driver (FUSE bypass)  42707          10676         17262
Base                                         42393          10598         15646

Updated numbers

Here are the recent FIO numbers, averaged over 5 runs, using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above with the following two changes:

  • The GlusterFS volume has write-behind translator turned off
  • The host kernel is upgraded to 3.6.7-4.fc17.x86_64

FIO READ numbers

                                             aggrb (KB/s)   % Reduction from Base
Base                                         44464          0
FUSE mount                                   21637          -51
QEMU's GlusterFS block driver (FUSE bypass)  38847          -12.6

FIO WRITE numbers

                                             aggrb (KB/s)   % Reduction from Base
Base                                         45824          0
FUSE mount                                   40919          -10.7
QEMU's GlusterFS block driver (FUSE bypass)  45627          -0.43

GlusterFS support in oVirt

While I described how to use GlusterFS as a storage backend for QEMU manually, there have been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt as well. We now have GlusterFS enabled completely from oVirt, which allows a user to use the self-help portal of oVirt to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM and the enablement of the same from oVirt allows oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE for accessing the GlusterFS volume.

Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.

UNMAP/Discard support in QEMU-GlusterFS

UNMAP support in QEMU-GlusterFS is explained here.