GlusterFS is a distributed file system implemented in user space. Strictly speaking, it is not a native file system in itself but an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access a GlusterFS namespace is via the FUSE-based Gluster native client.
More information on creating and mounting GlusterFS volumes can be obtained from the GlusterFS website.
GlusterFS for virtualization
Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:
- A new library called libgfapi is now available as part of GlusterFS. It provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
- QEMU (starting from QEMU-1.3) will have a GlusterFS block driver that uses libgfapi, so there is no longer any FUSE overhead when QEMU works with VM images on gluster volumes.
GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. QEMU only has to talk to GlusterFS, and GlusterFS hides the different file systems and storage types underneath. GlusterFS storage features like replication and striping automatically become available to QEMU. Efforts are also under way to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file-based and block-based storage types.
GlusterFS specification in QEMU
A VM image residing on a gluster volume can be specified on the QEMU command line using the URI format:
gluster[+transport]://[server[:port]]/volname/image[?socket=...]
gluster is the protocol.
transport specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If no transport type is specified, tcp is assumed.
server specifies the server where the volume file specification for the given volume resides. This can be a hostname, an IPv4 address or an IPv6 address. An IPv6 address must be enclosed in square brackets [ ]. If the transport type is unix, the server field should not be specified; instead, the socket field must be set to the path of the unix domain socket.
port is the port number on which glusterd is listening. It is optional; if not specified, QEMU sends 0, which makes gluster use the default port. If the transport type is unix, port should not be specified.
volname is the name of the gluster volume which contains the VM image.
image is the path to the actual VM image that resides on gluster volume.
Examples:
gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img
(GlusterFS URI description and above examples are taken from QEMU documentation)
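To see how the fields of such a URI break apart, here is a minimal Python sketch that uses only the standard library. It illustrates the format described above and is not QEMU's actual parser; the function name parse_gluster_uri and the layout of the returned dictionary are assumptions made for this example.

```python
from urllib.parse import urlsplit, parse_qs

VALID_TRANSPORTS = {"tcp", "unix", "rdma"}


def parse_gluster_uri(uri):
    """Split a gluster URI into its fields (illustrative helper, not QEMU code)."""
    parts = urlsplit(uri)
    protocol, _, transport = parts.scheme.partition("+")
    if protocol != "gluster":
        raise ValueError("not a gluster URI: %r" % uri)
    transport = transport or "tcp"  # tcp is assumed when no transport is given
    if transport not in VALID_TRANSPORTS:
        raise ValueError("unknown transport: %r" % transport)

    # The first path component is the volume, the rest is the image path.
    volname, _, image = parts.path.lstrip("/").partition("/")
    if not volname or not image:
        raise ValueError("URI must contain volname/image: %r" % uri)

    socket_path = parse_qs(parts.query).get("socket", [None])[0]
    if transport == "unix":
        if parts.netloc:
            raise ValueError("server and port must be omitted for unix transport")
        if socket_path is None:
            raise ValueError("unix transport needs ?socket=<path>")

    return {
        "transport": transport,
        "server": parts.hostname,  # brackets around an IPv6 address are stripped
        "port": parts.port or 0,   # 0 means: let gluster use its default port
        "volname": volname,
        "image": image,
        "socket": socket_path,
    }


if __name__ == "__main__":
    for example in (
        "gluster://1.2.3.4/testvol/a.img",
        "gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img",
        "gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket",
    ):
        print(parse_gluster_uri(example))
```

The helper rejects a unix-transport URI that carries a server or port, mirroring the rule stated in the field descriptions.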
Configuring QEMU with GlusterFS backend
While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes the use of --enable-uuid unnecessary for GlusterFS support in QEMU.)
Creating a VM image on GlusterFS backend
The qemu-img command can be used to create VM images on a gluster backend. The general syntax for image creation looks like this:
qemu-img create gluster://server/volname/path/to/image size
Examples:
To create a raw image:
qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image:
qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
Booting VM image from GlusterFS backend
A VM image a.img residing on gluster volume testvol can be booted using QEMU like this:
qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio
In addition to VM images, gluster drives can also be used as data drives:
qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio
Here a-data.img from the datavol gluster volume appears as a second drive to the guest.
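When several gluster-backed drives need to be attached, the command line can be assembled programmatically. The following Python sketch only builds and prints the command shown above without running QEMU; the function names qemu_drive_args and qemu_command are hypothetical helpers introduced for this illustration.

```python
import shlex


def qemu_drive_args(image_uris, interface="virtio"):
    """Return one '-drive file=<uri>,if=<interface>' pair per image URI."""
    args = []
    for uri in image_uris:
        args += ["-drive", "file=%s,if=%s" % (uri, interface)]
    return args


def qemu_command(image_uris, binary="qemu-system-x86_64"):
    """Build the argument list; the first URI is the boot image, the rest are data drives."""
    return [binary] + qemu_drive_args(image_uris)


if __name__ == "__main__":
    cmd = qemu_command([
        "gluster://1.2.3.4/testvol/a.img",       # boot image
        "gluster://1.2.3.4/datavol/a-data.img",  # data drive
    ])
    # Print the command instead of executing it, so it can be reviewed first.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the command looks right.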
Performance numbers
The following numbers from the FIO benchmark show the performance advantage of using QEMU’s GlusterFS block driver instead of the usual FUSE mount when accessing the VM image.
Test setup
| Host | Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64) |
| Guest | Fedora 17 image, 4 way SMP, 2GB RAM, using virtio and cache=none QEMU options |
QEMU options
| FUSE mount | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none => /mnt is GlusterFS FUSE mount point |
| GlusterFS block driver in QEMU (FUSE bypass) | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none |
| Base (VM image accessed directly from brick) | qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none => /test is brick directory |
FIO load files
Sequential read direct IO:
; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16

Sequential write direct IO:
; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16
FIO READ numbers
| Configuration | aggrb (KB/s) | minb (KB/s) | maxb (KB/s) |
| FUSE mount | 15219 | 3804 | 5792 |
| QEMU’s GlusterFS block driver (FUSE bypass) | 39357 | 9839 | 12946 |
| Base | 43802 | 10950 | 12918 |
FIO WRITE numbers
| Configuration | aggrb (KB/s) | minb (KB/s) | maxb (KB/s) |
| FUSE mount | 24579 | 6144 | 8423 |
| QEMU’s GlusterFS block driver (FUSE bypass) | 42707 | 10676 | 17262 |
| Base | 42393 | 10598 | 15646 |
Updated numbers
Here are more recent FIO numbers, averaged over 5 runs using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above, with the following two changes:
- The GlusterFS volume has the write-behind translator turned off
- The host kernel is upgraded to 3.6.7-4.fc17.x86_64
FIO READ numbers
| Configuration | aggrb (KB/s) | % change from Base |
| Base | 44464 | 0 |
| FUSE mount | 21637 | -51 |
| QEMU’s GlusterFS block driver (FUSE bypass) | 38847 | -12.6 |
FIO WRITE numbers
| Configuration | aggrb (KB/s) | % change from Base |
| Base | 45824 | 0 |
| FUSE mount | 40919 | -10.7 |
| QEMU’s GlusterFS block driver (FUSE bypass) | 45627 | -0.43 |
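The percentage column can be reproduced from the aggrb values with a few lines of Python. This is a sketch of the arithmetic only, assuming the percentage is computed as (value - base) / base * 100; the function name pct_change_from_base and the dictionary keys are labels chosen for this example.

```python
def pct_change_from_base(base, value):
    """Relative change of value with respect to base, in percent."""
    return (value - base) / base * 100.0


# aggrb values in KB/s, copied from the updated tables above.
READ = {"Base": 44464, "FUSE mount": 21637, "GlusterFS block driver": 38847}
WRITE = {"Base": 45824, "FUSE mount": 40919, "GlusterFS block driver": 45627}

if __name__ == "__main__":
    for label, results in (("READ", READ), ("WRITE", WRITE)):
        base = results["Base"]
        for config, aggrb in results.items():
            change = pct_change_from_base(base, aggrb)
            print("%-5s %-24s %6d KB/s  %+7.2f%%" % (label, config, aggrb, change))
        # How many times faster the block driver is than the FUSE mount.
        ratio = results["GlusterFS block driver"] / results["FUSE mount"]
        print("%-5s block driver / FUSE mount = %.2fx" % (label, ratio))
```

The last line printed for each workload gives the block driver throughput relative to the FUSE mount, which is the comparison the tables are meant to support.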
GlusterFS support in oVirt
While I have described how to use GlusterFS as a storage backend for QEMU manually, there have also been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt. GlusterFS is now enabled completely from oVirt, which allows a user to use oVirt's self-help portal to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM, and its enablement from oVirt, allows oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE to access the GlusterFS volume.
Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.


Thank you ! very helpful for me.
But, error happened…
[root@dev156 gluster]# qemu-img create gluster://172.21.80.156/testVol/vol1 5G
Formatting 'gluster://172.21.80.156/testVol/vol1', fmt=gluster size=5368709120
qemu-img: gluster://172.21.80.156/testVol/vol1: error while creating gluster: No such file or directory
my qemu version is 1.2.5 and gluster version is 3git latest(Oct 30).
are these versions right?
SungWoo,
- Not sure about 1.2.5 version of QEMU, the latest I see is 1.2 and a few rc versions. Can you please test with latest QEMU git tree ?
- Looks like latest GlusterFS has problems and I am also seeing similar errors as you (glfs_creat in libgfapi is failing). I will debug this further and get back to you.
Bharata.
I downloaded the qemu source from github ( https://github.com/qemu/qemu.git ) and the branch is master.
I tried to find the QEMU-1.3 you mentioned, but I couldn’t find it.
Where can I download QEMU-1.3?
I will also debug my environment.
Thanks for your reply.
QEMU-1.3 is not yet out and hence latest QEMU git from the URL you point out is good enough. As I said I am seeing some problems with latest GlusterFS which I need to debug. Will get back to you after that.
SungWoo, Please use --enable-uuid (and also --enable-glusterfs) when you configure QEMU. This will ensure that “qemu-img create …” works correctly.
Bharata. you’re right.
I omitted --enable-uuid because of a configuration error.
I installed libuuid-devel and configured again with --enable-uuid.
thank you very much.
It would be good to add a short summary after the FIO numbers table, so that readers don’t have to calculate in their heads how much the improvement is. It also helps people in a hurry to quickly read the summary instead of deciphering the numbers. Just a thought. Thanks deepak.
Deepak – Re-generated the numbers with latest QEMU & GlusterFS and added a column that shows the % change. Thanks for your suggestion.
(Not sure I can comment after many months but…)
What if I want to execute qemu-img or qemu-system-* gluster:// *as non root user*?
I tried to fusemount volume and change permissions/acl/ownership on mounted dirs, or even to change them in /export/brick, but it doesn’t work.
Is there a way to “authenticate” via gfapi?
Am I supposed to necessarily run qemu as root if I want this feature?
Thanks for help
$ qemu-img create gluster://localhost/gv0/QEMU/a.img 20G
Formatting 'gluster://localhost/gv0/QEMU/a.img', fmt=raw size=21474836480
qemu-img: Gluster connection failed for server=localhost port=0 volume=gv0 image=QEMU/a.img transport=tcp
qemu-img: gluster://localhost/gv0/QEMU/a.img: error while creating raw: No data available
If you want to run QEMU-libgfapi as non-root, you need the following two settings:
1. gluster volume set <volname> server.allow-insecure on
2. option rpc-auth-allow-insecure on in glusterd.vol
Well, thanks. I wanted to run qemu as non-root for security reasons… but setting “allow-insecure” means that I am completely missing my goal, doesn’t it?
Well, perhaps I’ve found a better solution. I’m still experimenting with it, so the usual disclaimer applies, and any comment is welcome.
First, I can certainly run qemu-img as root with no worries: it’s a command I run just once, not a long-standing process – heavily accessing system resources – like QEMU itself.
Then, I can fusemount the gluster volume and change ownership of the image. (Possibly chmod’ing it to 0700).
(At this point, I can safely unmount gluster-fuse, I don’t strictly need it anymore).
Finally, I run qemu as root, but with the -runas option: which means that qemu initially acts as root to talk to the gluster daemon, but later it loses superuser privileges and it’s still able to write to the image it owns.
Guido
(Er, the image mode could be 0600 of course, not 0700, no need to be executable, sorry).