QEMU-GlusterFS native integration

GlusterFS is a distributed file system implemented in user space. Strictly speaking, it is not a native file system in itself but an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE based Gluster native client.


More information on creating and mounting GlusterFS volumes can be obtained from the GlusterFS website.
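
For quick experimentation, a single-brick volume is enough. A minimal sketch, assuming a host named server1, a brick directory /export/brick1 and a volume named testvol (all placeholders; newer GlusterFS releases may ask you to append "force" if the brick lives on the root partition):

# Create, start and inspect a single-brick volume (names and paths are placeholders)
gluster volume create testvol server1:/export/brick1
gluster volume start testvol
gluster volume info testvol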

GlusterFS for virtualization

Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:

– A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
– QEMU (starting from QEMU-1.3) will have a GlusterFS block driver that uses libgfapi, and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes.

GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. QEMU just has to talk to GlusterFS, and GlusterFS will hide the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available to QEMU. Efforts are also underway to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.

[Figure: QEMU with native GlusterFS block driver]

GlusterFS specification in QEMU

A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:

gluster[+transport]://[server[:port]]/volname/image[?socket=…]

gluster is the protocol.

transport specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn’t specified, then tcp type is assumed.

server specifies the server where the volume file specification for the given volume resides. This can be either a hostname, an IPv4 address or an IPv6 address. An IPv6 address needs to be enclosed within square brackets [ ]. If the transport type is unix, then the server field should not be specified. Instead, the socket field needs to be populated with the path to the unix domain socket.

port is the port number on which glusterd is listening. This is optional, and if not specified, QEMU will send 0, which makes gluster use the default port. If the transport type is unix, then port should not be specified.

volname is the name of the gluster volume which contains the VM image.

image is the path to the actual VM image that resides on gluster volume.

Examples:

gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img

(The GlusterFS URI description and the above examples are taken from the QEMU documentation.)

Configuring QEMU with GlusterFS backend

While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)

Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from source, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path, and you will have to explicitly add that path by executing this command before running the QEMU configure script:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

Without this, the GlusterFS driver will not be compiled into QEMU even when GlusterFS is present on the system.
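
Putting this together, a from-source build could look like the following sketch. The pkg-config path is the one mentioned above and only applies to a from-source GlusterFS install; the final grep is just a sanity check:

# Sketch: build QEMU with the GlusterFS block driver
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
./configure --enable-glusterfs
make
# 'gluster' should appear among the supported formats if the driver was built in
./qemu-img --help | grep -i gluster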

Creating a VM image on GlusterFS backend

The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:

qemu-img create gluster://server/volname/path/to/image size

Examples:

To create a raw image: qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image: qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
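
The image just created can also be inspected in place over the same URI, which is a quick way to confirm that libgfapi access works (reusing the raw image example above):

# Inspect the image over the gluster protocol, without any FUSE mount
qemu-img info gluster://1.2.3.4/testvol/dir/a.img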

Booting VM image from GlusterFS backend

A VM image a.img residing on gluster volume testvol can be booted using QEMU like this:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio

In addition to VM images, gluster drives can also be used as data drives:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio

Here, a-data.img from the datavol gluster volume appears as the second drive for the guest.
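
Inside the guest, the data drive still needs a file system and a mount point before it can be used. A minimal sketch, assuming the drive shows up as /dev/vdb and is mounted at /data1 (the directory used by the FIO jobs later in this post):

# Inside the guest: format and mount the gluster-backed data drive.
# /dev/vdb and /data1 are assumptions; check the actual device with lsblk.
mkfs.ext4 /dev/vdb
mkdir -p /data1
mount /dev/vdb /data1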

Performance numbers

The following numbers from the FIO benchmark show the performance advantage of using QEMU’s GlusterFS block driver instead of the usual FUSE mount while accessing the VM image.

Test setup

Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest: Fedora 17 image, 4-way SMP, 2GB RAM, using virtio and cache=none QEMU options

QEMU options

FUSE mount:
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none
(/mnt is the GlusterFS FUSE mount point)

GlusterFS block driver in QEMU (FUSE bypass):
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none

Base (VM image accessed directly from the brick):
qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none
(/test is the brick directory)

FIO load files

Sequential read direct IO:

; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write direct IO:

; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16
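
These job files are run with fio inside the guest against /data1. A minimal sketch, assuming they are saved as seq-read.fio and seq-write.fio (hypothetical file names):

# Run the sequential read and write jobs defined above
fio seq-read.fio
fio seq-write.fio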


FIO READ numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                    15219          3804          5792
QEMU’s GlusterFS block driver (FUSE bypass)   39357          9839          12946
Base                                          43802          10950         12918

FIO WRITE numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                    24579          6144          8423
QEMU’s GlusterFS block driver (FUSE bypass)   42707          10676         17262
Base                                          42393          10598         15646

Updated numbers

Here are more recent FIO numbers averaged over 5 runs using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above, with the following two changes:

  • The GlusterFS volume has the write-behind translator turned off (see the sketch after this list)
  • The host kernel is upgraded to 3.6.7-4.fc17.x86_64
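
Write-behind is a per-volume setting; a minimal sketch of turning it off, assuming the volume is named test as in the gluster:// URI used in the QEMU options above:

# Disable the write-behind translator on the volume
gluster volume set test performance.write-behind off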

FIO READ numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                          44464          0
FUSE mount                                    21637          -51
QEMU’s GlusterFS block driver (FUSE bypass)   38847          -12.6

FIO WRITE numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                          45824          0
FUSE mount                                    40919          -10.7
QEMU’s GlusterFS block driver (FUSE bypass)   45627          -0.43

GlusterFS support in oVirt

While I have described how to use GlusterFS as a storage backend for QEMU manually, there have been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt as well. We now have GlusterFS enabled completely from oVirt, which allows users to use oVirt’s self-help portal to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM and its enablement from oVirt allow oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE to access GlusterFS volumes.

Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.

UNMAP/Discard support in QEMU-GlusterFS

UNMAP support in QEMU-GlusterFS is explained here.

30 Responses to QEMU-GlusterFS native integration

  1. SungWoo Jo says:

    Thank you ! very helpful for me.
    But, error happened…

    [root@dev156 gluster]# qemu-img create gluster://172.21.80.156/testVol/vol1 5G
    Formatting ‘gluster://172.21.80.156/testVol/vol1’, fmt=gluster size=5368709120
    qemu-img: gluster://172.21.80.156/testVol/vol1: error while creating gluster: No such file or directory

    my qemu version is 1.2.5 and gluster version is 3git latest(Oct 30).
    are these versions right?

    • Bharata says:

      SungWoo,

      – Not sure about 1.2.5 version of QEMU, the latest I see is 1.2 and a few rc versions. Can you please test with latest QEMU git tree ?
      – Looks like latest GlusterFS has problems and I am also seeing similar errors as you (glfs_creat in libgfapi is failing). I will debug this further and get back to you.

      • SungWoo Jo says:

        Bharata.
        I downloaded qemu source from github( https://github.com/qemu/qemu.git ) and branch is master.
        I tried to find QEMU-1.3 you said, but I couldn’t find.
        Where can I download QEMU-1.3?
        I will also debug my environment ~.
        thanks your reply.

      • Bharata says:

        QEMU-1.3 is not yet out and hence latest QEMU git from the URL you point out is good enough. As I said I am seeing some problems with latest GlusterFS which I need to debug. Will get back to you after that.

      • Bharata says:

        SungWoo, Please use --enable-uuid (and also --enable-glusterfs) when you configure QEMU. This will ensure that “qemu-img create …” works correctly.

      • SungWoo Jo says:

        Bharata. you’re right.
        I omitted -enable-uuid because of configuration error.
        I install libuuid-devel and configure again with enable-uuid.
        thank you very much.

  2. Deepak C Shetty says:

    It would be good to add a short summary after the FIO numbers table, so that readers don’t have to calculate how much the improvement is in their minds :). It also helps people in a hurry to quickly read the summary instead of deciphering the numbers 🙂 – just a thought. Thanks deepak.

  3. (Not sure I can comment after many months but…)

    What if I want to execute qemu-img or qemu-system-* gluster:// *as non root user*?

    I tried to fusemount volume and change permissions/acl/ownership on mounted dirs, or even to change them in /export/brick, but it doesn’t work.

    There’s a way to “authenticate” via gfapi?

    Am I supposed to necessarily run qemu as root if I want this feature?

    Thanks for help 🙂

    $ qemu-img create gluster://localhost/gv0/QEMU/a.img 20G
    Formatting ‘gluster://localhost/gv0/QEMU/a.img’, fmt=raw size=21474836480
    qemu-img: Gluster connection failed for server=localhost port=0 volume=gv0 image=QEMU/a.img transport=tcp
    qemu-img: gluster://localhost/gv0/QEMU/a.img: error while creating raw: No data available

    • Bharata says:

      If you want to run QEMU-libgfapi as non-root, you need the following two settings:

      1. volume set server.allow-insecure on
      2. option rpc-auth-allow-insecure on in glusterd.vol

      • Well, thanks. I wanted to run qemu as non-root for security reasons… but setting “allow-insecure” means that I am completely missing my goal, isn’t it?

        Well, perhaps I’ve found a better solution. I’m still experimentig with it, so the usual disclaimer applies 🙂 And any comment is welcome.

        First, I can certainly run qemu-img as root with no worries: it’s a command I run just once, not a long-standing process – heavily accessing system resources – like QEMU itself.

        Then, I can fusemount the gluster volume and change ownership of the image. (Possibly chmod’ing it to 0700).

        (At this point, I can safely unmount gluster-fuse, I don’t strictly need it anymore).

        Finally, I run qemu as root, but with the -runas option: which means that qemu initially acts as root to talk to the gluster daemon, but later it loses superuser privileges and it’s still able to write to the image it owns.

        Guido

  4. (Er, the image mode could be 0600 of course, not 0700, no need to be executable 😉 sorry).

  5. Hi, still interested/involed in the subject 🙂

    The problem is: if I run a qemu vm with gluster:// , and I add or replace a brick to the gluster volume while qemu is still running, qemu crashes 😦

    Each time…

    This doesn’t happen if i run QEMU on a FUSE mount!

    Is this – somewhat – “by design” (i.e. If I want such optimization, the good practice is to shutdown VM before each brick management operation) ?

    Is this a QEMU bug? A Gluster (or, more specifically, a libgfapi) bug?

    Thanks!

    • Bharata says:

      Last I checked, there were a few issues related to replace-brick being done with libgfapi. Can you please take this discussion to gluster-devel mailing list so that we can hear the latest on that ?

  6. SATHEESARAN says:

    Hi there,

    I have successfully created image using qemu-img info on gluster volume, using this qemu glusterfs driver.

    But I need to use this image to create a VM. I see in your example, that you use qemu-system-x86_64, but usually I use “virt-install” python wrapper. I need how to make use of this image using virt-install command.

    Please provide me the info.

    Thanks

  7. Hi,

    Thanks for the write up.
    Could you elaborate on your setup?

    How many hosts?
    How many disks? What kind of disks?
    Network bandwidth?
    Gluster setup?

    Thanks!

    • Bharata says:

      Sebastien – I am only comparing FUSE vs libgfapi FIO numbers for which I have given the setup information in the blog itself.

      • The only information related to the setup are:

        Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)

        So you’ve done everything locally?

        However you never specified the individual disk speed, so we can’t really interpret your results.

      • Bharata says:

        Yes, I have done the measurements locally. These numbers are just to show how libgfapi scores over FUSE method in a given setup. The environment is constant for both the runs (libgfapi and FUSE) and hence haven’t bothered to provide more details

  8. SATHEESARAN says:

    Hi Bharata,
    So, is there a way to delete the image that is already available on gluster volume, without mounting the volume, either by fuse or NFS ?

    • SATHEESARAN says:

      To add to my question, I have created a image using qemu glusterfs driver, but I couldn’t find a way to delete it

      • Bharata says:

        Like all other block drivers in QEMU, GlusterFS driver too supports only creation of file on the (GlusterFS) backend. Deletion should be handled outside of QEMU which means it should be done via NFS or FUSE for GlusterFS.

    • Deepak C Shetty says:

      Satheesaran,
      I believe your Q is wrong. Its like asking how i can use qemu to delete a file if the file was present on the gluster mount point. You cannot do that but yes since its a gluster mount, u can do a ‘rm’ to delete the file. Similarly in ur case its accessed via libgfapi, so you can write a small program that links with libgfapi and sends a ‘unlink’ fop to the file u r interested to delete. Qemu libgfapi integration isn’t meant to provide way to manage files on glusterfs, its a way to access files so that files on gluster volume can be used as vm disk images.

      If you Q was how to remove a vmdisk (backed by glusterfs usign libgfapi) from the VM’s config, you can use virsh detach-disk or detach-device. Again it only removes the gluster-backed-disk from the VM’s config, not from the glusterfs backend.

      HTH – Deepak

      • SATHEESARAN says:

        Thanks Deepak for explaining. I was looking for a way to unlink a file without mounting it. That was my intention behind this question.

  9. Deepak C Shetty says:

    Satheesaran,
    To add few other perspectives….

    1) If qemu-img command supported a ‘delete’ option, the same could have been extended to glusterfs and deletion would have been possible, but qemu-img supports ‘create’ option but not ‘delete’ 🙂 There must be some history to why they don’t support it, otherwise I dont see a reason why it can’t be supported! My guess is that qemu-img create was supported to be able for users to create qcow2 formatted files since its format is privy to qemu only. Deleting files is just a ‘rm’ operation, so there probably wasn’t a need to support ‘delete’ option in qemu-img. Even if u add ‘delete’ op to qemu-img it still won’t fit in a typical setup as explained in #2 below. My 2 cents! 🙂

    2) Typically file create/delete is a storage backend operation while qemu usign a file is a virt operation which are managed by storage and virt admins resp. who have separate roles, resp and permissions in a production setup. Even in openstack, we use cinder to create/delete volumes (files on glsuter backend) and then attach / detach the volumes to nova (VM/instance). Allowing qemu to create/delete files on storage backend is like mixing the virt and storage admin roles/resp./permissions which can cause issues. They are best when kept separate. For eg: In openstack setup, cinder gluster backend is setup for access only for cinder:cinder but qemu process (i.e nova instance) will be running qemu:qemu creds and thus won’t have perms anyways to create/delete files on glsuter backend which is set to cinder:cinder. This is done deliberately so that any file create/delete only happens via cinder as it knows best how to deal with gluster (or any other stoarge) backend.

    – Deepak

  10. kam270Dan says:

    Bharata, may seem like a simple question but how did you compile QUEMU. Did you pass any special parameters. Can you please provide the command you used.

    Thanks

  11. Eric says:

    That improvement in speed is very impressive. I knew I wasn’t a fan of FUSE, but didn’t know by how much. Now I do.

  12. […] long after I started using oVirt and Gluster together, the projects started talking about a way to improve Gluster performance by enabling virtualization hosts to access Gluster volumes directly, using Gluster’s […]
