GlusterFS Block Device Translator


Block device translator (BD xlator) is a recently added GlusterFS translator that provides a block device backend for GlusterFS. It replaces the existing bd_map translator, which provided similar but very limited functionality. GlusterFS normally expects the underlying brick to be formatted with a POSIX-compatible file system. BD xlator changes that and allows bricks that are raw block devices, such as LVM volumes, which needn’t have any file system on them. Hence with BD xlator, it becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LVs).


BD xlator maps the underlying LVs to files, so the LVs appear as files to GlusterFS clients. Though a BD volume externally appears very similar to the usual POSIX volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported; the exact semantics are described in subsequent sections.

While a POSIX volume takes a file system directory as a brick, a BD volume needs a volume group (VG) as a brick. In the usual use case, a file created on a BD volume results in an LV being created in the brick VG. In addition to a VG, a BD volume also needs a file system directory, which must be specified at volume creation time. This directory is necessary for supporting the notion of directories and a directory hierarchy on the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.

BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Some of the salient points of BD xlator are:

  • Since BD supports file level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
  • BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
  • BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to the GlusterFS BD volume. Thus a thin pool can be used as a VM image repository that is accessible from multiple nodes.
  • BD supports true zerofill by using BLKZEROOUT ioctl on underlying block devices. Thus BD allows SCSI WRITESAME to be used on underlying block device if the device supports it.

Though BD xlator is primarily intended to be used with block devices, it does provide full POSIX xlator compatibility for files that are created on a BD volume but are not backed by or mapped to a block device. Such files, which have no block device mapping, live in the POSIX directory that is specified during BD volume creation.

Availability

BD xlator, developed by M. Mohan Kumar, was committed into GlusterFS git in November 2013 and is expected to be part of the upcoming GlusterFS-3.5 release.

Compiling BD translator

BD xlator needs the lvm2 development library. The --enable-bd-xlator option can be passed to the ./configure script to explicitly enable the BD translator. The following snippet from the output of the configure script shows that BD xlator is enabled for compilation.

GlusterFS configure summary
===================

Block Device xlator  : yes
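
If you build from a script, the summary line above can be checked automatically. Here is a minimal sketch, assuming the summary line has exactly the form shown above; bd_xlator_enabled is a hypothetical helper name, not something shipped with GlusterFS.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the configure summary read from stdin
# reports "Block Device xlator : yes" (line format assumed from the snippet above).
bd_xlator_enabled() {
    grep -Eq '^Block Device xlator[[:space:]]*:[[:space:]]*yes[[:space:]]*$'
}

# Example with the summary text from this post fed in by hand.
if printf 'GlusterFS configure summary\n\nBlock Device xlator  : yes\n' | bd_xlator_enabled; then
    echo "BD xlator will be built"
fi
```

In a real build the input would come from the configure run itself, for example ./configure --enable-bd-xlator | bd_xlator_enabled.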

Creating a BD volume

BD supports hosting both linear LVs and thin LVs within the same volume. However, I will be showing them separately in the following instructions. As noted above, the prerequisite for a BD volume is a VG, which I am creating here from a loop device, but it can be any other device too.

1. Creating BD volume with linear LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop

- Prepare a brick by creating a VG

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg /dev/loop0

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta

It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.

Create and mount the volume
[root@bharata ~]# gluster volume create bd bharata:/bd-meta?bd-vg force

The general syntax for specifying the brick is host:/posix-dir?volume-group-name where “?” is the separator.
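
To make the brick syntax concrete, here is a small sketch that splits such a specification into its three parts using plain shell parameter expansion. The function names are hypothetical helpers for illustration only; the gluster CLI does its own parsing.

```shell
#!/bin/sh
# Hypothetical helpers that split host:/posix-dir?volume-group-name.
# They are not part of the gluster CLI.

# Text before the first ':' is the host.
brick_host() {
    host=${1%%:*}
    printf '%s\n' "$host"
}

# Text between the first ':' and the '?' separator is the POSIX directory.
brick_dir() {
    rest=${1#*:}
    dir=${rest%%\?*}
    printf '%s\n' "$dir"
}

# Text after the '?' separator is the volume group name.
brick_vg() {
    vg=${1#*\?}
    printf '%s\n' "$vg"
}

spec='bharata:/bd-meta?bd-vg'
echo "host=$(brick_host "$spec") dir=$(brick_dir "$spec") vg=$(brick_vg "$spec")"
```

For the brick used in this post the three parts are bharata, /bd-meta and bd-vg.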

[root@bharata ~]# gluster volume start bd
[root@bharata ~]# gluster volume info bd

Volume Name: bd
Type: Distribute
Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
Status: Started
Xlator 1: BD
Capability 1: offload_copy
Capability 2: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta
Brick1 VG: bd-vg

[root@bharata ~]# mount -t glusterfs bharata:/bd /mnt

- Create a file that is backed by an LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Since the volume is empty now, so is the underlying VG.
[root@bharata ~]# lvdisplay bd-vg
[root@bharata ~]#

Creating a file that is mapped to an LV is a two-step operation. First the file is created on the mount point, and then a specific extended attribute is set to map the file to an LV.

[root@bharata ~]# touch /mnt/lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv

An LV has now been created in the brick VG, and the file /mnt/lv maps to this LV. Any read/write to this file ends up as a read/write to the underlying LV.
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                      available
# open                          0
LV Size                    4.00 MiB
Current LE                   1
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

The file is created with the default LV size of 1 logical extent (LE), which is 4 MiB in this case.
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv

truncate can be used to set the required file size.
[root@bharata ~]# truncate /mnt/lv -s 256M
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                       available
# open                           0
LV Size                     256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv
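
The relation between file size and the "Current LE" count in the lvdisplay output is simple arithmetic. The sketch below rounds a size up to whole extents, assuming the 4 MiB extent size seen in this setup; extents_for_mib is a hypothetical helper name, and other VGs may use a different extent size.

```shell
#!/bin/sh
# Hypothetical helper: number of logical extents needed for SIZE_MIB,
# rounded up to a whole extent. EXTENT_MIB defaults to 4, the extent size
# in this post's VG (an assumption for any other VG).
extents_for_mib() {
    size_mib=$1
    extent_mib=${2:-4}
    echo $(( (size_mib + extent_mib - 1) / extent_mib ))
}

extents_for_mib 4      # default-sized LV: prints 1
extents_for_mib 256    # after truncate -s 256M: prints 64
```

These values match the "Current LE" fields of 1 and 64 shown before and after the truncate.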

The size of the file/LV can also be specified at creation/mapping time itself, like this:
setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv

2. Creating BD volume with thin LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop-thin

- Prepare a brick by creating a VG and thin pool

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg-thin /dev/loop0

Create a thin pool
[root@bharata ~]# lvcreate --thin bd-vg-thin -L 1000M
Rounding up size to full physical extent 4.00 MiB
Logical volume "lvol0" created

lvdisplay shows the thin pool
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                       lvol0
VG Name                      bd-vg-thin
LV UUID                        HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID  0
LV Pool metadata          lvol0_tmeta
LV Pool data                  lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                          0
LV Size                          1000.00 MiB
Allocated pool data     0.00%
Allocated metadata     0.88%
Current LE                   250
Segments                     1
Allocation                     inherit
Read ahead sectors     auto
- currently set to         256
Block device                253:9

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta-thin

Create and mount the volume
[root@bharata ~]# gluster volume create bd-thin bharata:/bd-meta-thin?bd-vg-thin force
[root@bharata ~]# gluster volume start bd-thin
[root@bharata ~]# gluster volume info bd-thin

Volume Name: bd-thin
Type: Distribute
Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta-thin
Brick1 VG: bd-vg-thin
[root@bharata ~]# mount -t glusterfs bharata:/bd-thin /mnt

- Create a file that is backed by a thin LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Creating a file that is mapped to a thin LV is a two-step operation. First the file is created on the mount point, and then a specific extended attribute is set to map the file to a thin LV.

[root@bharata ~]# touch /mnt/thin-lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv

Now /mnt/thin-lv is a thinly provisioned file that is backed by a thin LV.
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                        lvol0
VG Name                       bd-vg-thin
LV UUID                         HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 1
LV Pool metadata         lvol0_tmeta
LV Pool data                 lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                         0
LV Size                         1000.00 MiB
Allocated pool data    0.00%
Allocated metadata    0.98%
Current LE                  250
Segments                    1
Allocation                    inherit
Read ahead sectors   auto
- currently set to        256
Block device               253:9

--- Logical volume ---
LV Path                     /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
LV Name                        081b01d1-1436-4306-9baf-41c7bf5a2c73
VG Name                       bd-vg-thin
LV UUID                         coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:43:19 +0530
LV Pool name                 lvol0
LV Status                       available
# open                           0
LV Size                     256.00 MiB
Mapped size                  0.00%
Current LE                    64
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
- currently set to          256
Block device                 253:10

As can be seen from above, creation of a file resulted in creation of a thin LV in the brick.
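
The two-step mapping used in both walkthroughs (touch, then setfattr with a value of "lv", "lv:SIZE" or "thin:SIZE") can be sketched as a small helper. This is only an illustration: bd_map_cmds is a hypothetical name, it prints the commands instead of running them, and it assumes only the attribute values shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: print the two commands that map PATH to an LV.
# Usage: bd_map_cmds PATH TYPE [SIZE]   (TYPE is "lv" or "thin", SIZE e.g. 256MB)
# It only prints; nothing is created. Paths with spaces are not quoted.
bd_map_cmds() {
    path=$1
    type=$2
    size=${3:-}
    case $type in
        lv|thin) ;;
        *) echo "unknown type: $type" >&2; return 1 ;;
    esac
    value=$type
    if [ -n "$size" ]; then
        value="$type:$size"
    fi
    printf 'touch %s\n' "$path"
    printf 'setfattr -n "user.glusterfs.bd" -v "%s" %s\n' "$value" "$path"
}

bd_map_cmds /mnt/lv lv
bd_map_cmds /mnt/thin-lv thin 256MB
```

Printing the commands keeps the sketch safe to try on a machine without a BD volume; piping the output to sh on a real mount would run the same steps as above.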

Snapshots and clones

BD xlator uses LVM snapshot and clone capabilities to provide file-level snapshots and clones for files on a GlusterFS volume. Snapshots and clones work only for those files that have already been mapped to an LV. In other words, snapshots and clones are not available for POSIX-only files that exist on a BD volume.

Creating a snapshot

Say we are interested in taking a snapshot of a file /mnt/file that already exists and has been mapped to an LV.

[root@bharata ~]# ls -l /mnt/file
-rw-r--r--. 1 root root 268435456 Nov 27 10:16 /mnt/file

[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                        /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                      abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

Snapshot creation is a two-step process.

- Create a snapshot destination file first
[root@bharata ~]# touch /mnt/file-snap

- Then take the actual snapshot
In order to create the actual snapshot, we need to know the GFID of the snapshot file.

[root@bharata ~]# getfattr -n glusterfs.gfid.string /mnt/file-snap
getfattr: Removing leading '/' from absolute path names
# file: mnt/file-snap
glusterfs.gfid.string="bdf74e38-dc96-4b26-94e2-065fe3b8bcc3"

Use this GFID string to create the actual snapshot
[root@bharata ~]# setfattr -n snapshot -v bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                         /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                       abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV snapshot status   source of bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 [active]
LV Status                       available
# open                           0
LV Size                           256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

--- Logical volume ---
LV Path                        /dev/bd-vg/bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
LV Name                      bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
VG Name                      bd-vg
LV UUID                        9XH6xX-Sl64-uNhk-7OiH-f91m-DaMo-6AWiBD
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:20:35 +0530
LV snapshot status  active destination for abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
COW-table size             4.00 MiB
COW-table LE               1
Allocated to snapshot  0.00%
Snapshot chunk size    4.00 KiB
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
- currently set to          256
Block device                 253:7

As can be seen from the lvdisplay output, /mnt/file-snap is now the snapshot of /mnt/file.

Creating a clone

Creating a clone is similar to creating a snapshot, except that the attribute name “clone” is used instead of “snapshot”.

setfattr -n clone -v <gfid-of-clone-file> <path-to-source-file>

A clone in a BD volume is essentially a full copy of the file that is off-loaded to the server.
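
The snapshot and clone steps above differ only in the attribute name, so they can be wrapped in one function. The following is a rough sketch, not a tested tool: gfid_from_getfattr and bd_offload are hypothetical names, and I am assuming that a clone, like a snapshot, needs its destination file to be created first.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GlusterFS) wrapping the steps above.

# Read getfattr output on stdin and print the bare GFID string.
gfid_from_getfattr() {
    sed -n 's/^glusterfs\.gfid\.string="\(.*\)"$/\1/p'
}

# Usage: bd_offload OP SOURCE DEST, where OP is "snapshot" or "clone".
# SOURCE must already be mapped to an LV on a mounted BD volume.
bd_offload() {
    op=$1
    src=$2
    dest=$3
    case $op in
        snapshot|clone) ;;
        *) echo "usage: bd_offload snapshot|clone SOURCE DEST" >&2; return 2 ;;
    esac
    # Step 1: create the destination file (assumed to apply to clones too).
    touch "$dest" || return 1
    # Step 2: look up the destination's GFID and set it on the source.
    gfid=$(getfattr -n glusterfs.gfid.string "$dest" 2>/dev/null | gfid_from_getfattr)
    if [ -z "$gfid" ]; then
        echo "could not read GFID of $dest" >&2
        return 1
    fi
    setfattr -n "$op" -v "$gfid" "$src"
}
```

With this in place, bd_offload snapshot /mnt/file /mnt/file-snap would repeat the snapshot walkthrough, and bd_offload clone /mnt/file /mnt/file-clone (a made-up destination name) would request a clone instead.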

Concerns

As you have seen, creating a block device backed file on a BD volume and creating snapshots and clones all involve non-standard steps, including setting extended attributes. These steps could be cumbersome for an end user, and there are plans to encapsulate all of them in nice APIs that users could use easily.


2 Responses to GlusterFS Block Device Translator

  1. Attb2 says:

    If BD Xlator supports only one brick how could it be a redundant cluster filesystem?

    • Bharata says:

      The first version of BD xlator (which was called bd_map) supported only one brick. The current version, which replaces the old bd_map and is called just bd (and will be part of GlusterFS-3.5), supports multiple bricks.
