QEMU-GlusterFS native integration

GlusterFS is a distributed file system implemented in user space. Strictly speaking, it is not a native file system in itself but an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE based Gluster native client.


More information on creating and mounting GlusterFS volumes can be obtained from the GlusterFS website.
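
For quick experimentation, a single-brick volume is enough. A minimal sketch, assuming a host named server1, a brick directory /export/brick1 and a volume named testvol (all placeholders; newer GlusterFS releases may ask you to append "force" if the brick lives on the root partition):

# Create, start and inspect a single-brick volume (names and paths are placeholders)
gluster volume create testvol server1:/export/brick1
gluster volume start testvol
gluster volume info testvol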

GlusterFS for virtualization

Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:

– A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
– QEMU (starting from QEMU-1.3) will have a GlusterFS block driver that uses libgfapi, and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes.

GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. QEMU just has to talk to GlusterFS, and GlusterFS will hide the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available to QEMU. Efforts are also underway to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.

[Figure: QEMU with native GlusterFS block driver]

GlusterFS specification in QEMU

A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:

gluster[+transport]://[server[:port]]/volname/image[?socket=…]

gluster is the protocol.

transport specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn’t specified, then tcp type is assumed.

server specifies the server where the volume file specification for the given volume resides. This can be either a hostname, an IPv4 address or an IPv6 address. An IPv6 address needs to be enclosed within square brackets [ ]. If the transport type is unix, then the server field should not be specified. Instead, the socket field needs to be populated with the path to the unix domain socket.

port is the port number on which glusterd is listening. This is optional, and if not specified, QEMU will send 0, which makes gluster use the default port. If the transport type is unix, then port should not be specified.

volname is the name of the gluster volume which contains the VM image.

image is the path to the actual VM image that resides on gluster volume.

Examples:

gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img

(The GlusterFS URI description and the above examples are taken from the QEMU documentation.)

Configuring QEMU with GlusterFS backend

While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)

Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from source, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path, and you will have to explicitly add that path by executing this command before running the QEMU configure script:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

Without this, the GlusterFS driver will not be compiled into QEMU even when GlusterFS is present on the system.
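
Putting this together, a from-source build could look like the following sketch. The pkg-config path is the one mentioned above and only applies to a from-source GlusterFS install; the final grep is just a sanity check:

# Sketch: build QEMU with the GlusterFS block driver
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
./configure --enable-glusterfs
make
# 'gluster' should appear among the supported formats if the driver was built in
./qemu-img --help | grep -i gluster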

Creating a VM image on GlusterFS backend

The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:

qemu-img create gluster://server/volname/path/to/image size

Examples:

To create a raw image: qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image: qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
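
The image just created can also be inspected in place over the same URI, which is a quick way to confirm that libgfapi access works (reusing the raw image example above):

# Inspect the image over the gluster protocol, without any FUSE mount
qemu-img info gluster://1.2.3.4/testvol/dir/a.img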

Booting VM image from GlusterFS backend

A VM image a.img residing on gluster volume testvol can be booted using QEMU like this:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio

In addition to VM images, gluster drives can also be used as data drives:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio

Here, a-data.img from the datavol gluster volume appears as the second drive for the guest.
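
Inside the guest, the data drive still needs a file system and a mount point before it can be used. A minimal sketch, assuming the drive shows up as /dev/vdb and is mounted at /data1 (the directory used by the FIO jobs later in this post):

# Inside the guest: format and mount the gluster-backed data drive.
# /dev/vdb and /data1 are assumptions; check the actual device with lsblk.
mkfs.ext4 /dev/vdb
mkdir -p /data1
mount /dev/vdb /data1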

Performance numbers

The following numbers from the FIO benchmark show the performance advantage of using QEMU’s GlusterFS block driver instead of the usual FUSE mount while accessing the VM image.

Test setup

Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest: Fedora 17 image, 4-way SMP, 2GB RAM, using virtio and cache=none QEMU options

QEMU options

FUSE mount:
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none
(/mnt is the GlusterFS FUSE mount point)

GlusterFS block driver in QEMU (FUSE bypass):
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none

Base (VM image accessed directly from the brick):
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none
(/test is the brick directory)

FIO load files

Sequential read direct IO:

; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write direct IO:

; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16
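
These job files are run with fio inside the guest against /data1. A minimal sketch, assuming they are saved as seq-read.fio and seq-write.fio (hypothetical file names):

# Run the sequential read and write jobs defined above
fio seq-read.fio
fio seq-write.fio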


FIO READ numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                    15219          3804          5792
QEMU’s GlusterFS block driver (FUSE bypass)   39357          9839          12946
Base                                          43802          10950         12918

FIO WRITE numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                    24579          6144          8423
QEMU’s GlusterFS block driver (FUSE bypass)   42707          10676         17262
Base                                          42393          10598         15646

Updated numbers

Here are more recent FIO numbers averaged over 5 runs using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above, with the following two changes:

  • The GlusterFS volume has the write-behind translator turned off (see the sketch after this list)
  • The host kernel is upgraded to 3.6.7-4.fc17.x86_64
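
Write-behind is a per-volume setting; a minimal sketch of turning it off, assuming the volume is named test as in the gluster:// URI used in the QEMU options above:

# Disable the write-behind translator on the volume
gluster volume set test performance.write-behind off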

FIO READ numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                          44464          0
FUSE mount                                    21637          -51
QEMU’s GlusterFS block driver (FUSE bypass)   38847          -12.6

FIO WRITE numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                          45824          0
FUSE mount                                    40919          -10.7
QEMU’s GlusterFS block driver (FUSE bypass)   45627          -0.43

GlusterFS support in oVirt

While I have described how to use GlusterFS as a storage backend for QEMU manually, there have been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt as well. We now have GlusterFS enabled completely from oVirt, which allows users to use oVirt’s self-help portal to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM and its enablement from oVirt allow oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE to access GlusterFS volumes.

Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.

UNMAP/Discard support in QEMU-GlusterFS

UNMAP support in QEMU-GlusterFS is explained here.

30 Responses to QEMU-GlusterFS native integration

  1. SungWoo Jo says:

    Thank you ! very helpful for me.
    But, error happened…

    [root@dev156 gluster]# qemu-img create gluster://172.21.80.156/testVol/vol1 5G
    Formatting ‘gluster://172.21.80.156/testVol/vol1’, fmt=gluster size=5368709120
    qemu-img: gluster://172.21.80.156/testVol/vol1: error while creating gluster: No such file or directory

    my qemu version is 1.2.5 and gluster version is 3git latest(Oct 30).
    are these versions right?

    • Bharata says:

      SungWoo,

      – Not sure about 1.2.5 version of QEMU, the latest I see is 1.2 and a few rc versions. Can you please test with latest QEMU git tree ?
      – Looks like latest GlusterFS has problems and I am also seeing similar errors as you (glfs_creat in libgfapi is failing). I will debug this further and get back to you.

      • SungWoo Jo says:

        Bharata.
        I downloaded qemu source from github( https://github.com/qemu/qemu.git ) and branch is master.
        I tried to find QEMU-1.3 you said, but I couldn’t find.
        Where can I download QEMU-1.3?
        I will also debug my environment ~.
        thanks your reply.

      • Bharata says:

        QEMU-1.3 is not yet out and hence latest QEMU git from the URL you point out is good enough. As I said I am seeing some problems with latest GlusterFS which I need to debug. Will get back to you after that.

      • Bharata says:

        SungWoo, Please use --enable-uuid (and also --enable-glusterfs) when you configure QEMU. This will ensure that “qemu-img create …” works correctly.

      • SungWoo Jo says:

        Bharata. you’re right.
        I omitted -enable-uuid because of configuration error.
        I install libuuid-devel and configure again with enable-uuid.
        thank you very much.

  2. Deepak C Shetty says:

    It would be good to add a short summary after the FIO numbers table, so that readers don’t have to calculate how much the improvement is in their minds :). It also helps people in a hurry to quickly read the summary instead of deciphering the numbers 🙂 – just a thought. Thanks deepak.

  3. (Not sure I can comment after many months but…)

    What if I want to execute qemu-img or qemu-system-* gluster:// *as non root user*?

    I tried to fusemount volume and change permissions/acl/ownership on mounted dirs, or even to change them in /export/brick, but it doesn’t work.

    There’s a way to “authenticate” via gfapi?

    Am I supposed to necessarily run qemu as root if I want this feature?

    Thanks for help 🙂

    $ qemu-img create gluster://localhost/gv0/QEMU/a.img 20G
    Formatting ‘gluster://localhost/gv0/QEMU/a.img’, fmt=raw size=21474836480
    qemu-img: Gluster connection failed for server=localhost port=0 volume=gv0 image=QEMU/a.img transport=tcp
    qemu-img: gluster://localhost/gv0/QEMU/a.img: error while creating raw: No data available

    • Bharata says:

      If you want to run QEMU-libgfapi as non-root, you need the following two settings:

      1. volume set server.allow-insecure on
      2. option rpc-auth-allow-insecure on in glusterd.vol

      • Well, thanks. I wanted to run qemu as non-root for security reasons… but setting “allow-insecure” means that I am completely missing my goal, isn’t it?

        Well, perhaps I’ve found a better solution. I’m still experimentig with it, so the usual disclaimer applies 🙂 And any comment is welcome.

        First, I can certainly run qemu-img as root with no worries: it’s a command I run just once, not a long-standing process – heavily accessing system resources – like QEMU itself.

        Then, I can fusemount the gluster volume and change ownership of the image. (Possibly chmod’ing it to 0700).

        (At this point, I can safely unmount gluster-fuse, I don’t strictly need it anymore).

        Finally, I run qemu as root, but with the -runas option: which means that qemu initially acts as root to talk to the gluster daemon, but later it loses superuser privileges and it’s still able to write to the image it owns.

        Guido

  4. (Er, the image mode could be 0600 of course, not 0700, no need to be executable 😉 sorry).

  5. Hi, still interested/involed in the subject 🙂

    The problem is: if I run a qemu vm with gluster:// , and I add or replace a brick to the gluster volume while qemu is still running, qemu crashes 😦

    Each time…

    This doesn’t happen if i run QEMU on a FUSE mount!

    Is this – somewhat – “by design” (i.e. If I want such optimization, the good practice is to shutdown VM before each brick management operation) ?

    Is this a QEMU bug? A Gluster (or, more specifically, a libgfapi) bug?

    Thanks!

    • Bharata says:

      Last I checked, there were a few issues related to replace-brick being done with libgfapi. Can you please take this discussion to gluster-devel mailing list so that we can hear the latest on that ?

  6. SATHEESARAN says:

    Hi there,

    I have successfully created image using qemu-img info on gluster volume, using this qemu glusterfs driver.

    But I need to use this image to create a VM. I see in your example, that you use qemu-system-x86_64, but usually I use “virt-install” python wrapper. I need how to make use of this image using virt-install command.

    Please provide me the info.

    Thanks

  7. Hi,

    Thanks for the write up.
    Could you elaborate on your setup?

    How many hosts?
    How many disks? What kind of disks?
    Network bandwidth?
    Gluster setup?

    Thanks!

    • Bharata says:

      Sebastien – I am only comparing FUSE vs libgfapi FIO numbers for which I have given the setup information in the blog itself.

      • The only information related to the setup are:

        Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)

        So you’ve done everything locally?

        However you never specified the individual disk speed, so we can’t really interpret your results.

      • Bharata says:

        Yes, I have done the measurements locally. These numbers are just to show how libgfapi scores over FUSE method in a given setup. The environment is constant for both the runs (libgfapi and FUSE) and hence haven’t bothered to provide more details

  8. SATHEESARAN says:

    Hi Bharata,
    So, is there a way to delete the image that is already available on gluster volume, without mounting the volume, either by fuse or NFS ?

    • SATHEESARAN says:

      To add to my question, I have created a image using qemu glusterfs driver, but I couldn’t find a way to delete it

      • Bharata says:

        Like all other block drivers in QEMU, GlusterFS driver too supports only creation of file on the (GlusterFS) backend. Deletion should be handled outside of QEMU which means it should be done via NFS or FUSE for GlusterFS.

    • Deepak C Shetty says:

      Satheesaran,
      I believe your Q is wrong. Its like asking how i can use qemu to delete a file if the file was present on the gluster mount point. You cannot do that but yes since its a gluster mount, u can do a ‘rm’ to delete the file. Similarly in ur case its accessed via libgfapi, so you can write a small program that links with libgfapi and sends a ‘unlink’ fop to the file u r interested to delete. Qemu libgfapi integration isn’t meant to provide way to manage files on glusterfs, its a way to access files so that files on gluster volume can be used as vm disk images.

      If you Q was how to remove a vmdisk (backed by glusterfs usign libgfapi) from the VM’s config, you can use virsh detach-disk or detach-device. Again it only removes the gluster-backed-disk from the VM’s config, not from the glusterfs backend.

      HTH – Deepak

      • SATHEESARAN says:

        Thanks Deepak for explaining. I was looking for a way to unlink a file without mounting it. That was my intention behind this question.

  9. Deepak C Shetty says:

    Satheesaran,
    To add few other perspectives….

    1) If qemu-img command supported a ‘delete’ option, the same could have been extended to glusterfs and deletion would have been possible, but qemu-img supports ‘create’ option but not ‘delete’ 🙂 There must be some history to why they don’t support it, otherwise I dont see a reason why it can’t be supported! My guess is that qemu-img create was supported to be able for users to create qcow2 formatted files since its format is privy to qemu only. Deleting files is just a ‘rm’ operation, so there probably wasn’t a need to support ‘delete’ option in qemu-img. Even if u add ‘delete’ op to qemu-img it still won’t fit in a typical setup as explained in #2 below. My 2 cents! 🙂

    2) Typically file create/delete is a storage backend operation while qemu usign a file is a virt operation which are managed by storage and virt admins resp. who have separate roles, resp and permissions in a production setup. Even in openstack, we use cinder to create/delete volumes (files on glsuter backend) and then attach / detach the volumes to nova (VM/instance). Allowing qemu to create/delete files on storage backend is like mixing the virt and storage admin roles/resp./permissions which can cause issues. They are best when kept separate. For eg: In openstack setup, cinder gluster backend is setup for access only for cinder:cinder but qemu process (i.e nova instance) will be running qemu:qemu creds and thus won’t have perms anyways to create/delete files on glsuter backend which is set to cinder:cinder. This is done deliberately so that any file create/delete only happens via cinder as it knows best how to deal with gluster (or any other stoarge) backend.

    – Deepak

  10. kam270Dan says:

    Bharata, may seem like a simple question but how did you compile QUEMU. Did you pass any special parameters. Can you please provide the command you used.

    Thanks

  11. Eric says:

    That improvement in speed is very impressive. I knew I wasn’t a fan of FUSE, but didn’t know by how much. Now I do.

  12. […] long after I started using oVirt and Gluster together, the projects started talking about a way to improve Gluster performance by enabling virtualization hosts to access Gluster volumes directly, using Gluster’s […]
