Troubleshooting QEMU-GlusterFS

August 21, 2013

As described in my previous blog post, QEMU supports talking to GlusterFS using libgfapi, which is a much better way to host VM images on GlusterFS than accessing the volumes through a FUSE mount. However, due to some bugs in GlusterFS 3.4, an invalid specification of a GlusterFS drive on the QEMU command line can result in completely non-obvious error messages from QEMU. The aim of this blog post is to document some of these known error scenarios in QEMU-GlusterFS so that users can figure out what could potentially be wrong in their QEMU-GlusterFS setup.

As of this writing (Aug 2013), I know about two bugs in GlusterFS that cause all this confusion in the failure messages:

  • A bug that prevents GlusterFS log messages from reaching stderr, because of which no meaningful failure messages are shown by QEMU. This bug has been fixed and should eventually make it into some release (probably 3.4.1) of GlusterFS.
  • A bug in the libgfapi routine glfs_init() that makes it return failure with errno set to 0. QEMU depends on errno here, and this leads to unpleasant segmentation faults in QEMU.

Environment

I am using Fedora 19 with distro-provided QEMU and GlusterFS rpms for all the experiments here.

# rpm -qa | grep qemu
qemu-system-x86-1.4.2-5.fc19.x86_64
ipxe-roms-qemu-20130517-2.gitc4bce43.fc19.noarch
qemu-kvm-1.4.2-5.fc19.x86_64
qemu-guest-agent-1.4.2-3.fc19.x86_64
libvirt-daemon-driver-qemu-1.0.5.1-1.fc19.x86_64
qemu-img-1.4.2-5.fc19.x86_64
qemu-common-1.4.2-5.fc19.x86_64

# rpm -qa | grep gluster
glusterfs-3.4.0-8.fc19.x86_64
glusterfs-libs-3.4.0-8.fc19.x86_64
glusterfs-fuse-3.4.0-8.fc19.x86_64
glusterfs-server-3.4.0-8.fc19.x86_64
glusterfs-api-3.4.0-8.fc19.x86_64
glusterfs-debuginfo-3.4.0-8.fc19.x86_64
glusterfs-cli-3.4.0-8.fc19.x86_64

Error scenarios

1. glusterd service not started

On Fedora 19, the glusterd service fails to start during boot, and if you don't notice that, you can end up using QEMU with GlusterFS while the glusterd service isn't running. In this scenario, you will encounter the following kind of failure:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
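The fix for this scenario is straightforward: start the glusterd service before launching QEMU. The commands below assume the systemd unit name glusterd, as shipped by the Fedora glusterfs-server package; the last command makes glusterd come up automatically on subsequent boots.

[root@localhost ~]# systemctl status glusterd
[root@localhost ~]# systemctl start glusterd
[root@localhost ~]# systemctl enable glusterd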

2. GlusterFS volume not started

If QEMU is used to boot a VM image on a GlusterFS volume that is not in the started state, the following kind of error is seen:

[root@localhost ~]# gluster volume status test
Volume test is not started
[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
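The fix here is simply to start the volume and confirm that it is online before pointing QEMU at it:

[root@localhost ~]# gluster volume start test
[root@localhost ~]# gluster volume status test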

3. QEMU run as non-root user

As of this writing (Aug 2013), GlusterFS doesn't allow non-root users to access GlusterFS volumes via libgfapi without some manual settings. This is a very typical scenario for most users who don't use QEMU directly but use it via libvirt and oVirt. In this scenario, the following kind of error is seen:

[bharata@localhost ~]$ qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
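For reference, the manual settings typically involve allowing connections from non-privileged ports, both on the volume and in glusterd itself; the exact steps can vary between versions, but a rough sketch looks like this:

[root@localhost ~]# gluster volume set test server.allow-insecure on

Then add the line 'option rpc-auth-allow-insecure on' to /etc/glusterfs/glusterd.vol, restart glusterd, and restart the volume so the new option takes effect:

[root@localhost ~]# systemctl restart glusterd
[root@localhost ~]# gluster volume stop test
[root@localhost ~]# gluster volume start test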

As you can see, 3 different failure scenarios produced similar error messages. This will improve once a GlusterFS build with the logging fix is used and its error logs start reaching QEMU.

4. Specifying a non-existent GlusterFS volume or server

Specifying a wrong or non-existent server or volume name can cause nasty failures like these:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test-xxx image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster-xxx port=0 volume=test image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

This should get fixed when the bug in libgfapi is resolved.
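Until then, a quick sanity check before launching QEMU helps avoid the crash: verify that the server name resolves and that the volume actually exists and is started. For example (if your gluster CLI supports --remote-host, it can query a remote glusterd):

[root@localhost ~]# ping -c 1 kvm-gluster
[root@localhost ~]# gluster --remote-host=kvm-gluster volume info test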


UNMAP/DISCARD support in QEMU-GlusterFS

August 7, 2013

In my last blog post on QEMU-GlusterFS, I described the integration of QEMU with GlusterFS using libgfapi. In this post, I give an overview of the recently added discard support in QEMU's GlusterFS back-end and how it can be used. Newer SCSI devices support the UNMAP command, which is used to return unused/freed blocks back to the storage. This command is typically useful when the storage is thin provisioned, such as a thin provisioned SCSI LUN. In response to a file deletion, the host device driver can send an UNMAP command down to the SCSI target and instruct it to free the relevant blocks from the thin provisioned LUN. This leads to much better utilization of the storage.

Linux support for discard

In Linux, SCSI UNMAP is supported via the generic discard framework, which I believe is also used to support the ATA TRIM command. ATA TRIM, typically used with SSDs, isn't the topic of this blog. There are multiple ways in which discard functionality is invoked or used in Linux:

  • For block devices accessed directly, one can use the BLKDISCARD ioctl to release unused blocks.
  • File systems like EXT4 support file-level discard using the FALLOC_FL_PUNCH_HOLE option of the fallocate system call.
  • For releasing unused blocks at the file system level, the fstrim command can be used.
  • Finally, file systems like EXT4 also support a 'discard' mount option that controls whether the file system issues discard requests to the underlying block device when blocks are freed (example commands follow this list).
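For reference, each of these mechanisms has a corresponding userspace tool; the commands below come from util-linux and the option spellings may differ slightly between versions. blkdiscard issues the BLKDISCARD ioctl (careful: as written it discards the entire device), fallocate punches a hole in a file, fstrim discards all free blocks of a mounted file system, and the discard mount option enables online discard in EXT4:

# blkdiscard /dev/sdX
# fallocate --punch-hole --offset 0 --length 100M /path/to/file
# fstrim -v /mountpoint
# mount -o discard /dev/sdX /mountpoint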

QEMU support for discard

UNMAP is primarily useful in two ways for KVM virtualization.

  • When a file is deleted in the VM, the resulting UNMAP in the guest is passed down to the host, which sends a discard request to the thin provisioned SCSI device. This returns the blocks consumed by the deleted file to the SCSI storage. The effect is the same when there is an explicit discard request from the guest using either the ioctl or fallocate methods listed in the previous section.
  • When a VM image is deleted, there is a potential to return the freed blocks back to the storage by sending the UNMAP command to the SCSI storage.

Guest UNMAP requests will end up in QEMU only if the guest is using a scsi or virtio-scsi device, not a virtio-blk device. QEMU will forward the request further down to the host (device or file system) only if the 'discard=on' or 'discard=unmap' drive flag is used for the device on the QEMU command line.

Example1: qemu-system-x86_64 -drive file=/images/vm.img,if=scsi,discard=on
Example2: qemu-system-x86_64 -device virtio-scsi-pci -drive if=none,discard=on,id=rootdisk,file=gluster://host/volume/image -device scsi-hd,drive=rootdisk

The way a discard request is passed further down in the host is determined by the block driver inside QEMU that serves the disk image type. While QEMU uses fallocate(FALLOC_FL_PUNCH_HOLE) for raw file back-ends and ioctl(BLKDISCARD) for block device back-ends, other back-ends use their own interfaces to pass down the discard request.

GlusterFS support for discard

GlusterFS, starting from version 3.4, supports discard functionality through a new API, glfs_discard(glfs_fd, offset, size), that is available as part of the libgfapi library. On the GlusterFS server side, a discard request is handled differently for the posix back-end and the Block Device (BD) back-end.

For the posix back-end, fallocate(FALLOC_FL_PUNCH_HOLE) is used to eventually release the blocks to the file system. If the posix brick has been mounted with the '-o discard' option, then the discard request will eventually reach the SCSI storage, provided the storage device supports UNMAP.

Support for the BD back-end is planned as soon as the ongoing development work on the newer, feature-rich BD translator lands upstream. The BD translator should use ioctl(BLKDISCARD) to UNMAP the blocks.

Discard support in GlusterFS back-end of QEMU

As described in my earlier blog, QEMU, starting from version 1.3, supports a GlusterFS back-end using libgfapi, which is the non-FUSE way of accessing GlusterFS volumes. Recently I added discard support to the GlusterFS block driver in QEMU, and this support should be available from QEMU 1.6 onwards. This work involved using the glfs_discard() API from the QEMU GlusterFS driver to send discard requests to the GlusterFS server.

With this, the entire KVM virtualization stack using the GlusterFS back-end can take advantage of the SCSI UNMAP command.

Typical usage

This section describes a typical use case with QEMU-GlusterFS where a file deleted from inside the VM results in discard requests for the host storage.

Step 1: Prepare a block device that supports UNMAP

Since I don't have a real SCSI device that supports UNMAP, I am going to use a loop device to host my GlusterFS volume. Loop devices support discard.

[root@llmvm02 bharata]# dd if=/dev/zero of=discard-loop count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.63255 s, 1.7 GB/s

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 2097152    IO Block: 4096   regular file
Device: fc13h/64531d    Inode: 15740290    Links: 1

[root@llmvm02 bharata]# losetup /dev/loop0 discard-loop

[root@llmvm02 bharata]# cat /sys/block/loop0/queue/discard_max_bytes
4294966784

Step 2: Prepare the brick directory for GlusterFS volume

[root@llmvm02 bharata]# mkfs.ext4 /dev/loop0
[root@llmvm02 bharata]# mount -o discard /dev/loop0 /discard-mnt/
[root@llmvm02 bharata]# mount  | grep discard
/dev/loop0 on /discard-mnt type ext4 (rw,discard)

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 3: Create a GlusterFS volume

[root@llmvm02 bharata]# gluster volume create discard llmvm02:/discard-mnt/ force
volume create: discard: success: please start the volume to access data

[root@llmvm02 bharata]# gluster volume start discard
volume start: discard: success

[root@llmvm02 bharata]# gluster volume info discard
Volume Name: discard
Type: Distribute
Volume ID: ed7a6f8a-9cb8-463d-a948-61974cb64c99
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: llmvm02:/discard-mnt

Step 4: Create a sparse file in the GlusterFS volume

[root@llmvm02 bharata]# glusterfs -s llmvm02 --volfile-id=discard /mnt
[root@llmvm02 bharata]# touch /mnt/file
[root@llmvm02 bharata]# truncate -s 450M /mnt/file

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 0          IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 5: Use this sparse file on GlusterFS volume as a disk drive with QEMU

[root@llmvm02 bharata]# qemu-system-x86_64 --enable-kvm -nographic -m 8192 -smp 2 -device virtio-scsi-pci -drive if=none,cache=none,id=F17,file=gluster://llmvm02/test/F17 -device scsi-hd,drive=F17 -drive if=none,cache=none,id=gluster,discard=on,file=gluster://llmvm02/discard/file -device scsi-hd,drive=gluster -kernel /home/bharata/linux-2.6-vm/kernel1 -initrd /home/bharata/linux-2.6-vm/initrd1 -append "root=UUID=d29b972f-3568-4db6-bf96-d2702ec83ab6 ro rd.md=0 rd.lvm=0 rd.dm=0 SYSFONT=True KEYTABLE=us rd.luks=0 LANG=en_US.UTF-8 console=tty0 console=ttyS0 selinux=0"

The file appears as a SCSI disk in the guest.

[root@F17-kvm ~]# dmesg | grep -i sdb
sd 0:0:1:0: [sdb] 921600 512-byte logical blocks: (471 MB/450 MiB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: 63 00 00 08
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
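Before generating discard requests, it is worth confirming from inside the guest that the disk actually advertises discard support; a non-zero discard_max_bytes value in sysfs indicates that the device accepts discard requests:

[root@F17-kvm ~]# cat /sys/block/sdb/queue/discard_max_bytes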

Step 6: Generate discard requests from the guest

Format a file system on the SCSI disk, mount it with the discard option, and create a file on it:

[root@F17-kvm ~]# mkfs.ext4 /dev/sdb
[root@F17-kvm ~]# mount -o discard /dev/sdb /guest-mnt/
[root@F17-kvm ~]# dd if=/dev/zero of=/guest-mnt/file bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.07204 s, 391 MB/s

Now we can see the block count growing in the host:

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 846880     IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 917288     IO Block: 4096   regular file

Remove the file; this should generate discard requests that get passed down to the host, eventually resulting in the discarded blocks being released from the underlying loop device in the host.

[root@F17-kvm ~]# rm -f /guest-mnt/file
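Had the guest file system not been mounted with the discard option, the same space could still be reclaimed on demand by trimming the mounted file system explicitly, for example:

[root@F17-kvm ~]# fstrim -v /guest-mnt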

In the host,

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 46600      IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 112960     IO Block: 4096   regular file

Thus we saw discard requests from the guest eventually resulting in the blocks getting released in the host block device. I was using a loop device in the host, but this should work for any SCSI device that supports UNMAP. It is not necessary to have a SCSI device with UNMAP support to get the benefit of space saving. In fact, if the underlying device is a thin logical volume (dm-thin) coming from a thin pool dm device, the space saving can be realized at the thin pool level itself.
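As a sketch of that alternative (assuming an existing LVM volume group named vg0), a thin pool and a thin volume could be created roughly like this; the thin volume then backs the brick in place of the loop device, and the space freed by discards shows up in the pool's usage:

[root@llmvm02 bharata]# lvcreate --size 10G --thinpool pool0 vg0
[root@llmvm02 bharata]# lvcreate --virtualsize 20G --thin --name thinvol vg0/pool0
[root@llmvm02 bharata]# mkfs.ext4 /dev/vg0/thinvol
[root@llmvm02 bharata]# mount -o discard /dev/vg0/thinvol /discard-mnt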

Concerns with discard mount option

There have been concerns about the cost of the UNMAP operation in the storage and its detrimental effect on IO throughput, so it is unclear if everyone will want to turn on the discard mount option by default. I wish I had access to an UNMAP-capable storage array to really test the effect of UNMAP on IO performance.