Cinder’s Ceph Replication Sneak peek


Have you been dying to try out the Volume Replication functionality in OpenStack, but you didn’t have enterprise-level storage with replication features lying around to play with? Then you are in luck, because thanks to Ceph’s new RBD mirroring functionality and Jon Bernard’s work on Cinder, you can now have the full replication experience on commodity hardware, and I’m going to tell you how you can get a preview of what’s about to come to Cinder.


RBD mirroring

I’m sure by now there’s no need to explain the power of Ceph or how it has changed IT infrastructures everywhere, but you may have missed the new mirroring feature introduced back in April, which adds support for mirroring (asynchronous replication) of RBD images across clusters. If you missed it, you may want to go ahead and have a look at the official documentation or Sébastien Han’s blog post on the subject.

The short version is that RBD images can now be asynchronously mirrored between two Ceph clusters. The capability that makes this possible is the RBD journaling image feature: by leveraging the journal, Ceph can ensure crash-consistent replication between clusters.

RBD mirroring has a good level of granularity: even though it is configured on a per-pool basis, you can decide whether you want to automatically mirror all images within the pool or only specific images. This means that you could have your Ceph cluster replicated to another cluster using automatic pool mirroring even without the new Ceph replication functionality in Cinder; but that’s not the approach taken by Cinder, since it would be too restrictive, and that’s not the Cinder way ;-).
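
For reference, this is where the per-pool mirroring mode is chosen; a quick sketch of the two options, using the volumes pool that will appear throughout this post:

user@localhost:$ sudo rbd mirror pool enable volumes pool    # mirror every image that has journaling enabled

user@localhost:$ sudo rbd mirror pool enable volumes image   # mirror only explicitly enabled images (what Cinder relies on)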

Mirroring is configured using rbd commands, and the rbd-mirror daemon is responsible for pulling image updates from one peer cluster -thanks to the journaling- and applying them to the image within the other cluster, performing the cross-cluster replication.

One interesting thing to know is that mirroring doesn’t need to be enabled both ways: mirroring just from your primary cluster to your secondary cluster is enough for failover to work, but the reverse direction will be necessary for the failback, so in this post we will be enabling it both ways.

Replication in Cinder

Initial volume replication support was introduced as v1; then, in the Liberty cycle, v2 was implemented, and it was improved in the following release. It is this Replication v2.1 that the new patch implements.

This replication version covers only one very specific use case, as mentioned in the specifications: a catastrophic event has occurred on the primary backend storage device and we want to fail over to the volumes available on the replicated backend.

The replication mechanism depends on the driver’s implementation, so there will be drivers that treat all created volumes as replicated, and others that use a per-volume policy and require a volume type to tell them which volumes to replicate. This means that you have to know your backend driver and its features, since Cinder doesn’t enforce one behavior or the other.

There is no automatic failover since the use case is Disaster Recovery, and it must be done manually when the primary backend is out.

Volumes that were attached/in-use are a special case and will require that either the tenants or the administrator detach and reattach the volumes, since the instances’ attached volumes are addressing the primary backend that is no longer available.

There’s also the possibility to freeze the creation, deletion, and extension of volumes and snapshots on a backend that is in failover mode until the admin decides that the system is stable again and those operations can be re-enabled using the thaw command. Freezing the management plane will not stop R/W I/O from being performed on resources that already exist.
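
For reference, freezing and thawing a backend are done with the corresponding cinderclient commands; a quick sketch using the backend host name from the deployment in this post:

user@localhost:$ cinder freeze-host devstack@ceph

user@localhost:$ cinder thaw-host devstack@ceph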

Ceph Replication in Cinder

To test Ceph’s replication in Cinder there are only a few things that you will need:

  1. Cinder, Nova, and Keystone services. These will usually be a part of a full OpenStack deployment.
  2. Two different Ceph clusters.
  3. Configure the Ceph clusters in mirrored mode and set them to mirror the pool used by Cinder.
  4. Copy cluster keys to the Cinder Volume node.
  5. Configure Ceph driver in Cinder to use replication.

The RBD driver in Cinder could have taken care of the third step, but we believe that would violate the separation of concerns between the infrastructure deployment and its usage by Cinder. Not to mention that it would introduce unnecessary restrictions on deployments -since the driver would probably not be as flexible as Ceph’s tools- and avoidable complexity in the RBD driver. So it is the system administrator’s responsibility to properly set up the Ceph clusters.

While the RBD driver will not enable mirroring between the two pools -it expects this to be set up by the system administrator- it will enable journaling and image mirroring for replicated volumes, as we have to explicitly tell rbd-mirror which volumes should be mirrored.
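
What the driver does for each replicated volume is roughly equivalent to running these rbd commands by hand (the image name is just a placeholder; journaling also requires the exclusive-lock feature, which is enabled by default on format 2 images):

user@localhost:$ sudo rbd feature enable volumes/volume-<id> journaling

user@localhost:$ sudo rbd mirror image enable volumes/volume-<id>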

For these tests I’ve decided to use Ubuntu instead of my usual Fedora deployment, since it’s the Operating System used at the gate; in my last test setup people ran into problems when testing things under Ubuntu, and I wanted to avoid that situation if possible, so all instructions are meant to be run under Ubuntu Xenial.

Jon Bernard has created not only the patch implementing the feature, but also some Replication Notes that were of great help when automating my deployment.

Since Jon Bernard is on paternity leave and I was familiar with his work, I will be taking care of following up on the reviews and updating his patch.

ATTENTION: The following steps are only here to illustrate what needs to be done, for those interested in the details; the whole process has been automated, so you don’t need to do any of this and can skip to the Automated deployment section if you are not interested in the deployment side of things.

Cinder services

We’ll be testing the RBD driver updated with the replication functionality using DevStack, pulling the RBD replication patch directly from Gerrit. Under normal circumstances you would be able to try it with any deployment, because you would only need to pull the rbd.py file from the patch under review to get the new functionality, but unfortunately there were some other issues in Cinder 1, 2, 3 that required fixing for the Ceph Replication patch to work.

We’ll be using this configuration to deploy DevStack with 2 different backends, LVM and Ceph. DevStack will not be provisioning either of our two Ceph clusters; we’ll be doing that ourselves.
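
The gist of that configuration is a local.conf along these lines -a minimal sketch of the relevant settings; the linked file is the authoritative one and is what also enables the Ceph DevStack plugin:

[[local|localrc]]
CINDER_ENABLED_BACKENDS=lvm:lvm,ceph:ceph
REMOTE_CEPH=True
ENABLE_CEPH_NOVA=False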

Ceph clusters

There are multiple ways to deploy your Ceph clusters: you can do it manually -using the ceph-deploy tool or doing all the steps yourself-, you can use ceph-ansible or TripleO, and DevStack can even deploy one of the clusters when stacking.

There are multiple benefits to keeping your Ceph clusters outside your DevStack deployment instead of letting DevStack deploy one for you: there’s no need to recreate the primary Ceph cluster every time you run stack.sh, and since the Ceph cluster nodes have 2 network interfaces and the Ceph communications are done only on the 10.0.1.x network, you can easily simulate losing the network connection just by bringing this interface down on one of the Ceph clusters.

In this explanation I will be using ceph-deploy in a custom script, but you can use any other method to deploy the 2 clusters; it’s all good as long as you end up with 2 clusters that have connectivity between them and with the Cinder Volume service and the Nova Computes.

The script we’ll be using to do the deployment is quite simple, thanks to the use of ceph-deploy, and will be run on each node to create both clusters, receiving the node’s binding IP address as an argument. This script will also install and deploy the rbd-mirror daemon, but it will not link both pools, since we need to wait until both of them are created; we’ll be doing the linking from the DevStack node.

#!/bin/bash

my_ip=$1
hostname=`hostname -s`

cd /root

echo "Installing ceph-deploy from pip to avoid existing package conflict"
apt-get update
apt-get install -y python-pip
pip install ceph-deploy

echo "Creating cluster"
ceph-deploy new $hostname --public-network $my_ip/24 || exit 1
echo -e "\nosd pool default size = 1\nosd crush chooseleaf type = 0\n" >> "ceph.conf"

echo "Installing from upstream repository"
echo -e "\n[myrepo]\nbaseurl = http://gitbuilder.ceph.com/ceph-deb-xenial-x86_64-basic/ref/master\ngpgkey = https://download.ceph.com/keys/autobuild.asc\ndefault = True" >> .cephdeploy.conf
ceph-deploy install $hostname || exit 2

echo "Deploying Ceph monitor"
ceph-deploy mon create-initial || exit 3

echo "Creating OSD on /dev/vdb"
ceph-deploy osd create $hostname:/dev/vdb || exit 3

echo "Creating default volumes pool"
ceph osd pool create volumes 100

echo "Deploying Ceph MDS"
ceph-deploy mds create $hostname || exit 4

echo "Health should be OK"
ceph -s

echo "Put Ceph to autostart"
systemctl enable ceph.target
systemctl enable ceph-mon.target
systemctl enable ceph-osd.target

echo "Installing ceph-mirror package and run it as a service"
apt-get install -y rbd-mirror || exit 5
systemctl enable ceph-rbd-mirror@admin
systemctl start ceph-rbd-mirror@admin

Configure Ceph clusters

As mentioned earlier, the system administrator is expected to set up the mirroring of the clusters, and the first step in this process is installing rbd-mirror and running it, which we already did in the previous step when the script ran:

user@localhost:$ apt-get install -y rbd-mirror || exit 5

user@localhost:$ sudo systemctl enable ceph-rbd-mirror@admin

user@localhost:$ sudo systemctl start ceph-rbd-mirror@admin

Next we need to tell the rbd-mirror in each of the clusters about its own cluster and the user:group to use:

user@localhost:$ sudo rbd-mirror --setuser ceph --setgroup ceph

Now we have to configure the volumes pool to use per-image mirroring (in both clusters as well):

user@localhost:$ sudo rbd mirror pool enable volumes image

Since each of the rbd-mirror services will be talking with the other cluster, it needs to know that cluster’s configuration and have a key to access it. That’s why we now need to copy the configuration and key from one cluster to the other. Since /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring already exist on the clusters, we will copy these files using other names: on the primary we’ll copy the files from the secondary as ceph-secondary.conf and ceph-secondary.client.admin.keyring, and on the secondary we’ll do the same, calling them ceph-primary.conf and ceph-primary.client.admin.keyring.

Once we have copied the files, remember to set the owner:group to ceph:ceph -since that’s what we told the rbd-mirror to use- and the file mode to 0600.
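
On the primary node that would be something along these lines -a sketch that assumes root SSH access between the cluster nodes; the secondary does the same in the opposite direction using the ceph-primary names:

user@localhost:$ sudo scp ceph-secondary:/etc/ceph/ceph.conf /etc/ceph/ceph-secondary.conf

user@localhost:$ sudo scp ceph-secondary:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph-secondary.client.admin.keyring

user@localhost:$ sudo chown ceph:ceph /etc/ceph/ceph-secondary.conf /etc/ceph/ceph-secondary.client.admin.keyring

user@localhost:$ sudo chmod 0600 /etc/ceph/ceph-secondary.conf /etc/ceph/ceph-secondary.client.admin.keyring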

The last step is to configure the peers of the pools so they know where they should be replicating, primary to secondary, and secondary to primary.

NOTE: It is very important to know that Ceph only works with host names, not IPs, so we need to add the other cluster’s node to our /etc/hosts file: in our case we need to add ceph-secondary to the primary node’s hosts file and ceph-primary to the secondary’s hosts file. If we don’t do this we won’t be able to pair the nodes, since rbd-mirror won’t be able to locate the other cluster.
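
For example, with the addresses used later in this deployment -10.0.1.11 for the primary and 10.0.1.12 for the secondary; adjust them to your environment- on the primary we would run:

user@localhost:$ echo "10.0.1.12 ceph-secondary" | sudo tee -a /etc/hosts

And on the secondary:

user@localhost:$ echo "10.0.1.11 ceph-primary" | sudo tee -a /etc/hosts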

On the primary cluster we run:

user@localhost:$ sudo rbd mirror pool peer add volumes client.admin@ceph-secondary

And on the secondary cluster we run:

user@localhost:$ sudo rbd mirror pool peer add volumes client.admin@ceph-primary

NOTE: For mirroring to work, pools in both clusters need to have the same name.

Copy cluster keys

Just like we needed the host names of the mirroring Ceph clusters in each of our clusters, we’ll also need them in our DevStack deployment, and the same goes for the Ceph configuration and keyring files.

We have to copy these files and make them readable by the user that will be running DevStack, in our case the vagrant user, and for convenience we won’t be renaming the files from the primary node, since those are the default names for the RBD driver in Cinder.
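
On the DevStack node this boils down to something like the following sketch, again assuming root SSH access to both cluster nodes and vagrant as the user running DevStack:

user@localhost:$ sudo scp ceph-primary:/etc/ceph/ceph.conf /etc/ceph/ceph.conf

user@localhost:$ sudo scp ceph-primary:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring

user@localhost:$ sudo scp ceph-secondary:/etc/ceph/ceph.conf /etc/ceph/ceph-secondary.conf

user@localhost:$ sudo scp ceph-secondary:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph-secondary.client.admin.keyring

user@localhost:$ sudo chown vagrant: /etc/ceph/*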

Configure Ceph driver in Cinder to use replication

The configuration in Cinder is easy and straightforward; we need two things:

  • Add replication_device configuration option to the driver’s section.
  • Create a specific volume type that enables replication.

The Cinder configuration for replication_device should go in the specific Ceph driver section in /etc/cinder/cinder.conf, defining the following:

  • The backend_id, which is the host name of the secondary server.
  • The conf file, which will have the configuration for the secondary ceph cluster. There’s no need to define it if the default of /etc/ceph/{backend_id}.conf is valid.
  • The user of the secondary cluster that Cinder RBD driver should use to communicate. Defaults to the same user as the primary.

This is an example of the configuration:

[ceph]
replication_device = backend_id:ceph-secondary, conf:/etc/ceph/ceph-secondary.conf, user:cinder
rbd_max_clone_depth = 5
rbd_flatten_volume_from_snapshot = False
rbd_secret_uuid = 6842adea-8feb-43ac-98d5-0eb89b19849b
rbd_user = cinder
rbd_pool = volumes
rbd_ceph_conf =
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph

NOTE: I would recommend having the same user and credentials in both clusters, otherwise we’ll have problems when connecting from Nova. In our case we handle this in local.sh, copying the cinder user from the primary to the secondary.

Multiple replication backends can be defined by adding more replication_device entries to the configuration.
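
For example, a hypothetical third cluster -ceph-tertiary is just an illustrative name here- would simply be another line in the same section:

replication_device = backend_id:ceph-secondary, conf:/etc/ceph/ceph-secondary.conf, user:cinder
replication_device = backend_id:ceph-tertiary, conf:/etc/ceph/ceph-tertiary.conf, user:cinder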

Assuming that our backend name for the Ceph driver is ceph, creating the volume type would look like this:

user@localhost:$ cinder type-create replicated

user@localhost:$ cinder type-key replicated set volume_backend_name=ceph

user@localhost:$ cinder type-key replicated set replication_enabled='<is> True'

Configure Nova

We don’t want Nova to use Ceph for ephemeral disks, because if we did we would need another DevStack deployment to test the contents of the attached volume after the failover. So we need to manually configure rbd_user and rbd_secret_uuid in nova.conf, since DevStack doesn’t do it when we set REMOTE_CEPH=True and ENABLE_CEPH_NOVA=False like we are doing.

In our automated deployment we handle this in local.conf, but it could be added manually to nova.conf in the [libvirt] section. The rbd_user we’ll be using is cinder, and the rbd_secret_uuid is created by DevStack, but we can see it using the virsh command:

user@localhost:$ sudo virsh secret-list
UUID                                  Usage
--------------------------------------------------------------------------------
0fb7d33c-d87b-4edf-a89a-76e0412a103e  ceph client.cinder secret
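
With that UUID, the [libvirt] section in nova.conf ends up with something like this (the UUID being, of course, the one from your own deployment):

[libvirt]
rbd_user = cinder
rbd_secret_uuid = 0fb7d33c-d87b-4edf-a89a-76e0412a103e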

Automated deployment

Since the process is quite long and all these manual steps can be easily automated, that’s what I did: automate them in such a way that just a couple of commands will get me what I want:

  • One VM with my primary Ceph cluster
  • One VM with my secondary Ceph cluster
  • One VM with my all-in-one DevStack
  • A volume type for replicated volumes
  • Everything configured and running

To achieve this I used Vagrant with the libvirt plugin, and some simple bash scripts. If you are using VirtualBox or other hypervisor you can easily adapt the Vagrantfile.

The Vagrantfile uses the local hypervisor and online configuration files by default, but this can be changed using the USE_LOCAL_DATA and REMOTE_PROVIDER variables.

The steps are these:

  • Create a directory to hold our vagrant box
  • Download vagrant configuration file
  • Bring the VMs up with the basic provisioning
  • Copy Ceph clusters ssh keys around
  • Deploy devstack

Which can be achieved with:

user@localhost:$ mkdir replication

user@localhost:$ cd replication

user@localhost:$ curl -O https://gorka.eguileor.com/files/cinder/rbd_replication/Vagrantfile

user@localhost:$ vagrant up

user@localhost:$ vagrant ssh -c '/home/vagrant/copy_ceph_credentials.sh' -- -l root

user@localhost:$ vagrant ssh -c 'cd devstack && ./stack.sh'

Since some of these steps take a long time and I don’t like to pay much attention to the process, I usually run all this with a one-liner:

user@localhost:$ mkdir replication; cd replication; curl -O https://gorka.eguileor.com/files/cinder/rbd_replication/Vagrantfile; vagrant up | tee r.log && vagrant ssh -c 'sudo ~/copy_ceph_credentials.sh; cd devstack && ./stack.sh' | tee -a r.log

I could have done all this using Ansible, but I didn’t want to add more requirements for the tests.

What we get

The default VM is devstack, so once this has finished you will be able to access your 3 VMs with:

user@localhost:$ vagrant ssh

user@localhost:$ vagrant ssh ceph-primary

user@localhost:$ vagrant ssh ceph-secondary

Or from within any of the nodes you can do:

user@localhost:$ sudo ssh devstack

user@localhost:$ sudo ssh ceph-primary

user@localhost:$ sudo ssh ceph-secondary

And you can easily query your clusters from the devstack VM:

user@localhost:$ sudo rbd mirror pool info volumes

user@localhost:$ sudo rbd --cluster ceph-secondary mirror pool info volumes

NOTE: The cluster name provided in --cluster must match a configuration file in /etc/ceph, as this parameter is only used to locate the file and has nothing to do with the actual cluster name specified within the configuration file. So both clusters will actually have the “ceph” name -which is the default- even if we use ceph-secondary to run the commands from the devstack node.

Once the deployment has completed you’ll have your clusters mirroring and the replicated volume type already available:

user@localhost:$ vagrant ssh

user@localhost:$ source ~/devstack/openrc admin admin

user@localhost:$ cinder type-list
+--------------------------------------+------------+-------------+-----------+
| ID | Name | Description | Is_Public |
+--------------------------------------+------------+-------------+-----------+
| 4f30dfdc-ef93-4a63-92a6-37adf1e053dc | replicated | - | True |
| a97e694a-8fa0-4eb3-9303-0240af374092 | lvm | - | True |
| f78397c8-2725-48df-a734-d7b4eef86edf | ceph | - | True |
+--------------------------------------+------------+-------------+-----------+

And you can unstack.sh your deployment and stack.sh it again; each time, the volumes pool on the secondary Ceph cluster will be removed, just like the primary’s, a new one will be created, the mirroring links will be set up from the primary to the secondary and vice versa, and the replicated volume type will be created. All this thanks to the custom local.sh script.

Tests

OK, we have deployed everything and we have our 3 VMs, so it’s time to run some tests.

The tests in the following sections assume you are already connected to the DevStack node and have set the right environment variables to run OpenStack commands:

user@localhost:$ vagrant ssh

user@localhost:$ source ~/devstack/openrc admin admin

1. Sanity Checks

1.1 – Service

The most basic sanity check to do in Cinder is to confirm that the Cinder Volume service is up and running, since a misconfiguration will make it report as being down and we’ll have to go to /opt/stack/logs/c-vol.log to see what’s wrong.

user@localhost:$ cinder service-list
+------------------+---------------+------+---------+-------+----------------------------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+------------------+---------------+------+---------+-------+----------------------------+-----------------+
| cinder-backup | devstack | nova | enabled | up | 2016-09-21T14:18:52.000000 | - |
| cinder-scheduler | devstack | nova | enabled | up | 2016-09-21T14:18:59.000000 | - |
| cinder-volume | devstack@ceph | nova | enabled | up | 2016-09-21T14:18:56.000000 | - |
| cinder-volume | devstack@lvm | nova | enabled | up | 2016-09-21T14:18:59.000000 | - |
+------------------+---------------+------+---------+-------+----------------------------+-----------------+

1.2 – Create Ceph volume

Now we want to check that we can create a normal Ceph volume, confirming that Cinder is properly accessing the primary cluster not only by looking at the volume status, but also by checking the primary’s volumes pool.

user@localhost:$ cinder create --volume-type ceph --name normal-ceph 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:19:23.000000 |
| description | None |
| encrypted | False |
| id | 4dab45a2-34d6-497d-a5e3-40dd34015264 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | normal-ceph |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | ceph |
+--------------------------------+--------------------------------------+

user@localhost:$ cinder list
+--------------------------------------+-----------+-------------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+-------------+------+-------------+----------+-------------+
| 4dab45a2-34d6-497d-a5e3-40dd34015264 | available | normal-ceph | 1 | ceph | false | |
+--------------------------------------+-----------+-------------+------+-------------+----------+-------------+

user@localhost:$ sudo rbd ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-4dab45a2-34d6-497d-a5e3-40dd34015264 1024M 2

1.3 – Delete Ceph volume

To finish the sanity check we want to confirm that deletion of Ceph volumes also works in our deployment.

user@localhost:$ cinder delete normal-ceph
Request to delete volume normal-ceph has been accepted.

user@localhost:$ cinder list
+----+--------+------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+----+--------+------+------+-------------+----------+-------------+
+----+--------+------+------+-------------+----------+-------------+

user@localhost:$ sudo rbd ls -l volumes

user@localhost:$

Ok, so at least the basic access to Ceph is working correctly and we can continue with the replication tests.

2. Testing Replication

2.1 – Create replicated volume

We’ll now create a replicated volume and check that it gets created first on the primary cluster and after a little bit it’s also available on the secondary cluster.

Please be patient with the replication, as it may not be instantaneous.

user@localhost:$ cinder create --volume-type replicated --name replicated-ceph 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:27:08.000000 |
| description | None |
| encrypted | False |
| id | e44311dd-8977-4728-a773-0695deca00fc |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | replicated-ceph |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | replicated |
+--------------------------------+--------------------------------------+


user@localhost:$ sudo rbd ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-e44311dd-8977-4728-a773-0695deca00fc 1024M 2


user@localhost:$ sudo rbd --cluster ceph-secondary ls volumes


user@localhost:$ sudo rbd mirror image status volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
volume-e44311dd-8977-4728-a773-0695deca00fc:
global_id: a2d50987-76f9-48a7-b879-3ec19b0a09cf
state: down+unknown
description: status not found
last_update: 1970-01-01 03:00:00


user@localhost:$ sudo rbd --cluster ceph-secondary mirror image status volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
rbd: error opening image volume-e44311dd-8977-4728-a773-0695deca00fc: (2) No such file or directory

user@localhost:$ sudo rbd mirror image status volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
volume-e44311dd-8977-4728-a773-0695deca00fc:
global_id: a2d50987-76f9-48a7-b879-3ec19b0a09cf
state: up+stopped
description: remote image is non-primary or local image is primary
last_update: 2016-09-21 17:27:36


user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-e44311dd-8977-4728-a773-0695deca00fc 1024M 2 excl


user@localhost:$ sudo rbd --cluster ceph-secondary mirror image status volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
volume-e44311dd-8977-4728-a773-0695deca00fc:
global_id: a2d50987-76f9-48a7-b879-3ec19b0a09cf
state: up+replaying
description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0
last_update: 2016-09-21 17:30:06


user@localhost:$ cinder list
+--------------------------------------+-----------+-----------------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+-----------------+------+-------------+----------+-------------+
| e44311dd-8977-4728-a773-0695deca00fc | available | replicated-ceph | 1 | replicated | false | |
+--------------------------------------+-----------+-----------------+------+-------------+----------+-------------+


user@localhost:$ cinder show replicated-ceph
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:27:08.000000 |
| description | None |
| encrypted | False |
| id | e44311dd-8977-4728-a773-0695deca00fc |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | replicated-ceph |
| os-vol-host-attr:host | devstack@ceph#ceph |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | enabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | available |
| updated_at | 2016-09-21T14:27:09.000000 |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | replicated |
+--------------------------------+--------------------------------------+

NOTE: replication_status should be ignored in the response from the create request, as this will be updated after the driver has completed the volume creation. We can see the updated value in the response to the show command.

When checking the image status of the volume being created we’ve seen 3 different results in the state and description fields:

When the volume is not replicated yet:

  state:       down+unknown
  description: status not found

When the volume has been replicated and we are checking the primary:

  state:       up+stopped
  description: remote image is non-primary or local image is primary

When the volume has been replicated and we are checking a replica:

  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=1, entry_tid=3], mirror_position=[object_number=3, tag_tid=1, entry_tid=3], entries_behind_master=0

If we had checked the non replicated volume we created during our sanity check, we would have seen that while it also says down+unknown, just like the volume that was going to be replicated did, it does not show any global_id:

user@localhost:$ sudo rbd mirror image status volumes/volume-4dab45a2-34d6-497d-a5e3-40dd34015264
volume-4dab45a2-34d6-497d-a5e3-40dd34015264:
global_id:
state: down+unknown
description: status not found
last_update: 1970-01-01 03:00:00

And if we hadn’t deleted it and had checked the info of both volumes, we’d have seen that the non replicated volume does not have journaling enabled -although this will depend on how we deployed our cluster- and that the replicated volume has additional key/value pairs related to mirroring.

user@localhost:$ sudo rbd info volumes/volume-4dab45a2-34d6-497d-a5e3-40dd34015264
rbd image 'volume-4dab45a2-34d6-497d-a5e3-40dd34015264':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.10331190cde7
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:


user@localhost:$ sudo rbd info volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
rbd image 'volume-e44311dd-8977-4728-a773-0695deca00fc':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.105522221a70
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
flags:
journal: 105522221a70
mirroring state: enabled
mirroring global id: a2d50987-76f9-48a7-b879-3ec19b0a09cf
mirroring primary: true


user@localhost:$ sudo rbd --cluster ceph-secondary info volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
rbd image 'volume-e44311dd-8977-4728-a773-0695deca00fc':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.101941b71efb
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
flags:
journal: 101941b71efb
mirroring state: enabled
mirroring global id: a2d50987-76f9-48a7-b879-3ec19b0a09cf
mirroring primary: false

2.2 – Create replicated snapshot

Now we’ll go ahead and create a snapshot of the replicated volume and confirm that this snapshot will also be available on the secondary Ceph cluster.

user@localhost:$ cinder snapshot-create --name replicated-snapshot replicated-ceph
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| created_at | 2016-09-21T14:35:51.232338 |
| description | None |
| id | 6aca1779-619f-4222-86f2-037d50d932d3 |
| metadata | {} |
| name | replicated-snapshot |
| size | 1 |
| status | creating |
| updated_at | None |
| volume_id | e44311dd-8977-4728-a773-0695deca00fc |
+-------------+--------------------------------------+


user@localhost:$ sudo rbd snap ls volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
SNAPID NAME SIZE
10 snapshot-6aca1779-619f-4222-86f2-037d50d932d3 1024 MB


user@localhost:$ sudo rbd --cluster ceph-secondary snap ls volumes/volume-e44311dd-8977-4728-a773-0695deca00fc
SNAPID NAME SIZE
10 snapshot-6aca1779-619f-4222-86f2-037d50d932d3 1024 MB


user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+-----------+---------------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+-----------+---------------------+------+
| 6aca1779-619f-4222-86f2-037d50d932d3 | e44311dd-8977-4728-a773-0695deca00fc | available | replicated-snapshot | 1 |
+--------------------------------------+--------------------------------------+-----------+---------------------+------+

2.3 – Delete replicated snapshot & volume

Now that we’ve seen how new volumes and snapshots are replicated we want to confirm that deletion works as well, right? What’s more basic than that?

First we’ll delete the snapshot and confirm that it gets deleted on both Ceph clusters:

user@localhost:$ cinder snapshot-delete replicated-snapshot

user@localhost:$ sudo rbd snap ls volumes/volume-e44311dd-8977-4728-a773-0695deca00fc

user@localhost:$ sudo rbd --cluster ceph-secondary snap ls volumes/volume-e44311dd-8977-4728-a773-0695deca00fc

user@localhost:$

Now that there is no snapshot on the volume we’ll proceed to delete the replicated volume.

user@localhost:$ cinder delete replicated-ceph
Request to delete volume replicated-ceph has been accepted.


user@localhost:$ sudo rbd ls -l volumes


user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-e44311dd-8977-4728-a773-0695deca00fc 1024M 2


user@localhost:$ sudo rbd --cluster ceph-secondary ls volumes

user@localhost:$

2.4 – Setup resources for failover

To test that RBD replication is actually working properly within the Cinder service we’ll want to test the failover mechanism, so we’ll create a couple of resources to test that everything works as expected:

  • Non replicated volume
  • Replicated available volume with snapshot
  • Replicated in-use volume with data written to it (numbers 1 to 100)

So we’ll set them up first:

user@localhost:$ cinder create --volume-type ceph --name normal-ceph 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:45:01.000000 |
| description | None |
| encrypted | False |
| id | 1466c553-35ee-419a-9d5c-cbe36c22aed0 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | normal-ceph |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | ceph |
+--------------------------------+--------------------------------------+


user@localhost:$ cinder create --volume-type replicated --name replicated-ceph-snapshot 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:47:35.000000 |
| description | None |
| encrypted | False |
| id | 4950f13d-e95d-41ea-85ae-80f0ac22aa0e |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | replicated-ceph-snapshot |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | replicated |
+--------------------------------+--------------------------------------+


user@localhost:$ cinder snapshot-create replicated-ceph-snapshot --name replicated-snapshot
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| created_at | 2016-09-21T14:47:48.931379 |
| description | None |
| id | c3d947e6-d2d6-47fb-9a54-41a6baf40e63 |
| metadata | {} |
| name | replicated-snapshot |
| size | 1 |
| status | creating |
| updated_at | None |
| volume_id | 4950f13d-e95d-41ea-85ae-80f0ac22aa0e |
+-------------+--------------------------------------+


user@localhost:$ nova keypair-add --pub-key ~/.ssh/id_rsa.pub mykey


user@localhost:$ nova keypair-list
+-------+------+-------------------------------------------------+
| Name | Type | Fingerprint |
+-------+------+-------------------------------------------------+
| mykey | ssh | dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6 |
+-------+------+-------------------------------------------------+


user@localhost:$ nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
WARNING: Command secgroup-add-rule is deprecated and will be removed after Nova 15.0.0 is released. Use python-neutronclient or python-openstackclient instead.
+-------------+-----------+---------+-----------+--------------+
| IP Protocol | From Port | To Port | IP Range | Source Group |
+-------------+-----------+---------+-----------+--------------+
| tcp | 22 | 22 | 0.0.0.0/0 | |
+-------------+-----------+---------+-----------+--------------+


user@localhost:$ nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
WARNING: Command secgroup-add-rule is deprecated and will be removed after Nova 15.0.0 is released. Use python-neutronclient or python-openstackclient instead.
+-------------+-----------+---------+-----------+--------------+
| IP Protocol | From Port | To Port | IP Range | Source Group |
+-------------+-----------+---------+-----------+--------------+
| icmp | -1 | -1 | 0.0.0.0/0 | |
+-------------+-----------+---------+-----------+--------------+


user@localhost:$ nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec --key-name mykey --security-groups default --nic net-name=private myvm
+--------------------------------------+----------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | myvm |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-SRV-ATTR:kernel_id | 10fcc280-2b42-4e5f-9e5d-3a0b0c220767 |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | c0e20554-cd2c-40e8-a0da-6f749d72f126 |
| OS-EXT-SRV-ATTR:reservation_id | r-n0zl5kiu |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | vMepeoN8PY6h |
| config_drive | |
| created | 2016-09-21T14:49:09Z |
| description | - |
| flavor | m1.nano (42) |
| hostId | |
| host_status | |
| id | 84b97e11-955a-4568-8e49-163e5477acf6 |
| image | cirros-0.3.4-x86_64-uec (a335801e-26fe-4ed9-8079-06230121e78a) |
| key_name | mykey |
| locked | False |
| metadata | {} |
| name | myvm |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tags | [] |
| tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| updated | 2016-09-21T14:49:09Z |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
+--------------------------------------+----------------------------------------------------------------+


user@localhost:$ cinder create --volume-type replicated --name ceph-attached 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T14:50:07.000000 |
| description | None |
| encrypted | False |
| id | 890313d7-c72b-4f97-b0c7-defd3641bfd4 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | ceph-attached |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | replicated |
+--------------------------------+--------------------------------------+


user@localhost:$ nova list
+--------------------------------------+------+--------+------------+-------------+--------------------------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+--------------------------------------------------------+
| 84b97e11-955a-4568-8e49-163e5477acf6 | myvm | ACTIVE | - | Running | private=10.0.0.4, fde7:d51b:1893:0:f816:3eff:fe4e:818e |
+--------------------------------------+------+--------+------------+-------------+--------------------------------------------------------+


user@localhost:$ nova volume-attach myvm 890313d7-c72b-4f97-b0c7-defd3641bfd4
+----------+--------------------------------------+
| Property | Value |
+----------+--------------------------------------+
| device | /dev/vdb |
| id | 890313d7-c72b-4f97-b0c7-defd3641bfd4 |
| serverId | 84b97e11-955a-4568-8e49-163e5477acf6 |
| volumeId | 890313d7-c72b-4f97-b0c7-defd3641bfd4 |
+----------+--------------------------------------+


user@localhost:$ ssh -o StrictHostKeychecking=no cirros@10.0.0.4 "sudo su - -c 'seq 1 100 > /dev/vdb; head -c 292 /dev/vdb'"
Warning: Permanently added '10.0.0.4' (RSA) to the list of known hosts.
1
2
...
99
100


user@localhost:$ cinder list
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+
| 1466c553-35ee-419a-9d5c-cbe36c22aed0 | available | normal-ceph | 1 | ceph | false | |
| 4950f13d-e95d-41ea-85ae-80f0ac22aa0e | available | replicated-ceph-snapshot | 1 | replicated | false | |
| 890313d7-c72b-4f97-b0c7-defd3641bfd4 | in-use | ceph-attached | 1 | replicated | false | 84b97e11-955a-4568-8e49-163e5477acf6 |
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+


user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+-----------+---------------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+-----------+---------------------+------+
| c3d947e6-d2d6-47fb-9a54-41a6baf40e63 | 4950f13d-e95d-41ea-85ae-80f0ac22aa0e | available | replicated-snapshot | 1 |
+--------------------------------------+--------------------------------------+-----------+---------------------+------+


user@localhost:$ sudo rbd ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-1466c553-35ee-419a-9d5c-cbe36c22aed0 1024M 2
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e 1024M 2
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e@snapshot-c3d947e6-d2d6-47fb-9a54-41a6baf40e63 1024M 2 yes
volume-890313d7-c72b-4f97-b0c7-defd3641bfd4 1024M 2 excl


user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e 1024M 2 excl
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e@snapshot-c3d947e6-d2d6-47fb-9a54-41a6baf40e63 1024M 2 yes
volume-890313d7-c72b-4f97-b0c7-defd3641bfd4 1024M 2 excl

Everything looks fine, we have 3 volumes and 1 snapshot on the primary and 2 volumes and 1 snapshot on the secondary.

2.5 – Failover

There are two failover scenarios contemplated by the driver: when the primary cluster is still accessible, and when a connection between the primary and secondary Ceph clusters is no longer possible. In the first case a clean promotion of the secondary cluster is possible, and that’s how the failover will be done; in the second case only a forced promotion is possible.
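
Under the hood the driver uses the RBD mirroring primitives for this: roughly speaking, a clean failover demotes the images on the primary and promotes them on the secondary, while in the second scenario the images can only be force-promoted on the secondary, which boils down to something like this illustrative command (with a placeholder image name):

user@localhost:$ sudo rbd --cluster ceph-secondary mirror image promote --force volumes/volume-<id>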

Since the main use case for Replication v2.1 is the Smoking Hole scenario, we’ll be testing the failover when there’s no connection between the clusters and forced promotion is the only available path. To achieve this we’ll be shutting down the network interface that is used to link the clusters, and since this network interface is also used to communicate with the DevStack node, we will bring it down on the primary cluster.

NOTE: Currently there is a bug in the RBD mirror daemon that causes force-promoted volumes to remain in read-only mode until the daemon is restarted, so we’ll need a workaround for the tests until this gets fixed. Update: It looks like the fix has already been merged, so the workaround is probably no longer necessary.

So let’s bring the network down on the primary cluster.

user@localhost:$ sudo ssh ceph-primary ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:7a:af:7c
inet addr:192.168.121.241 Bcast:192.168.121.255 Mask:255.255.255.0
inet6 addr: fe80::5054:ff:fe7a:af7c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:115697 errors:0 dropped:0 overruns:0 frame:0
TX packets:35908 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:168262393 (168.2 MB) TX bytes:2675665 (2.6 MB)

eth1 Link encap:Ethernet HWaddr 52:54:00:09:45:41
inet addr:10.0.1.11 Bcast:10.0.1.255 Mask:255.255.255.0
inet6 addr: fe80::5054:ff:fe09:4541/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:75007 errors:0 dropped:0 overruns:0 frame:0
TX packets:59203 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:64542899 (64.5 MB) TX bytes:70810095 (70.8 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:38973 errors:0 dropped:0 overruns:0 frame:0
TX packets:38973 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:6757038 (6.7 MB) TX bytes:6757038 (6.7 MB)


user@localhost:$ sudo ssh 192.168.121.241 'ifconfig eth1 down'


user@localhost:$ ping ceph-primary -c 1 -w 2
PING ceph-primary (10.0.1.11) 56(84) bytes of data.

--- ceph-primary ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms

NOTE: We’ll still be able to access the primary cluster using eth0 IP -192.168.121.241- if we want to bring the interface back up.

The manual failover is performed using the cinder failover-host command, passing as an argument the host of the service we are failing over.

user@localhost:$ cinder service-list --binary cinder-volume --withreplication
+---------------+---------------+------+---------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Replication Status | Active Backend ID | Frozen | Disabled Reason |
+---------------+---------------+------+---------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| cinder-volume | devstack@ceph | nova | enabled | up | 2016-09-21T15:11:03.000000 | enabled | - | False | - |
| cinder-volume | devstack@lvm | nova | enabled | up | 2016-09-21T15:11:09.000000 | disabled | - | False | - |
+---------------+---------------+------+---------+-------+----------------------------+--------------------+-------------------+--------+-----------------+


user@localhost:$ cinder failover-host devstack@ceph

user@localhost:$ cinder service-list --binary cinder-volume --withreplication
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Replication Status | Active Backend ID | Frozen | Disabled Reason |
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| cinder-volume | devstack@ceph | nova | disabled | up | 2016-09-21T15:13:54.000000 | failing-over | ceph-secondary | False | failed-over |
| cinder-volume | devstack@lvm | nova | enabled | up | 2016-09-21T15:13:49.000000 | disabled | - | False | - |
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+


user@localhost:$ cinder service-list --binary cinder-volume --withreplication
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Replication Status | Active Backend ID | Frozen | Disabled Reason |
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+
| cinder-volume | devstack@ceph | nova | disabled | up | 2016-09-21T15:13:54.000000 | failed-over | ceph-secondary | False | failed-over |
| cinder-volume | devstack@lvm | nova | enabled | up | 2016-09-21T15:13:49.000000 | disabled | - | False | - |
+---------------+---------------+------+----------+-------+----------------------------+--------------------+-------------------+--------+-----------------+

NOTE: Since the driver first tries to do a clean failover, there will be a delay of around 30 seconds in going from the failing-over state to failed-over.

After issuing the failover-host command we can see that the service got disabled, and will stay like that until we re-enable it.

We can check the progress of the failover in the c-vol screen window, and once it has been completed we’ll see something similar to this:

2016-... DEBUG cinder.utils [req-...] Failed attempt 3 from (pid=22483) _print_stop /opt/stack/cinder/cinder/utils.py:795
2016-... DEBUG cinder.utils [req-...] Have been at this for 36.022 seconds from (pid=22483) _print_stop /opt/stack/cinder/cinder/utils.py:797
2016-... DEBUG cinder.volume.drivers.rbd [req-...] Failed to demote {'volume': 'volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e', 'error': VolumeBackendAPIException(u'Bad or unexpected response from the storage volume backend API: Error connecting to ceph cluster.',)}(volume)s with error: Bad or unexpected response from the storage volume backend API: Error connecting to ceph cluster.. from (pid=22483) _demote_volumes /opt/stack/cinder/cinder/volume/drivers/rbd.py:1076
2016-... DEBUG cinder.volume.drivers.rbd [req-...] Skipping failover for non replicated volume volume-1466c553-35ee-419a-9d5c-cbe36c22aed0 with status: available from (pid=22483) _failover_volume /opt/stack/cinder/cinder/volume/drivers/rbd.py:1040
2016-... DEBUG cinder.volume.drivers.rbd [req-...] connecting to ceph-secondary (timeout=2). from (pid=22483) _connect_to_rados /opt/stack/cinder/cinder/volume/drivers/rbd.py:421
2016-... DEBUG cinder.volume.drivers.rbd [req-...] connecting to ceph-secondary (timeout=2). from (pid=22483) _connect_to_rados /opt/stack/cinder/cinder/volume/drivers/rbd.py:421
2016-... INFO cinder.volume.drivers.rbd [req-...] RBD driver failover completed.

If we look at the volumes we can see that the non replicated normal-ceph volume is now in error state, because it is not available on the secondary cluster.

user@localhost:$ cinder list
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+
| 1466c553-35ee-419a-9d5c-cbe36c22aed0 | error | normal-ceph | 1 | ceph | false | |
| 4950f13d-e95d-41ea-85ae-80f0ac22aa0e | available | replicated-ceph-snapshot | 1 | replicated | false | |
| 890313d7-c72b-4f97-b0c7-defd3641bfd4 | in-use | ceph-attached | 1 | replicated | false | 84b97e11-955a-4568-8e49-163e5477acf6 |
+--------------------------------------+-----------+--------------------------+------+-------------+----------+--------------------------------------+

And we can check the DB for some internal replication information, where we can see the original status of the non replicated volume:

user@localhost:$ mysql cinder -e "select display_name, status, replication_status, replication_extended_status, replication_driver_data from volumes where not deleted;"
+--------------------------+-----------+--------------------+--------------------------------------------------------+---------------------------------------------------------------------------------+
| display_name | status | replication_status | replication_extended_status | replication_driver_data |
+--------------------------+-----------+--------------------+--------------------------------------------------------+---------------------------------------------------------------------------------+
| normal-ceph | error | disabled | {"status":"available","replication_status":"disabled"} | NULL |
| replicated-ceph-snapshot | available | failed-over | NULL | {"had_journaling":false} |
| ceph-attached | in-use | failed-over | NULL | {"had_journaling":false} |
+--------------------------+-----------+--------------------+--------------------------------------------------------+---------------------------------------------------------------------------------+

Since the rbd-mirror workaround mentioned earlier may no longer be necessary, we’ll first confirm whether rbd-mirror is still watching the volume and keeping it read-only, then restart the daemon if it is, and finally confirm that the volume is no longer being watched by the daemon. The following sequence illustrates the case where the bug was still affecting rbd-mirror:

user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e 1024M 2 excl
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e@snapshot-c3d947e6-d2d6-47fb-9a54-41a6baf40e63 1024M 2 yes
volume-890313d7-c72b-4f97-b0c7-defd3641bfd4 1024M 2 excl


user@localhost:$ sudo rbd --cluster ceph-secondary status volumes/volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e
Watchers:
watcher=10.0.1.12:0/3468660901 client.4121 cookie=139940840521952


user@localhost:$ sudo rbd --cluster ceph-secondary status volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4
Watchers:
watcher=10.0.1.12:0/3468660901 client.4121 cookie=139940840635408


user@localhost:$ sudo ssh ceph-secondary 'systemctl kill -s6 ceph-rbd-mirror@admin'


user@localhost:$ sudo ssh ceph-secondary 'systemctl kill -s6 ceph-rbd-mirror@admin'


user@localhost:$ sudo rbd --cluster ceph-secondary status volumes/volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e
Watchers: none


user@localhost:$ sudo ssh ceph-secondary 'systemctl start ceph-rbd-mirror@admin'


user@localhost:$ sudo rbd --cluster ceph-secondary status volumes/volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e
Watchers: none


user@localhost:$ sudo rbd --cluster ceph-secondary status volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4
Watchers: none

2.6 – Freeze functionality

According to the specs and the devref the freeze functionality should prevent deletes and extends, but as I realized during these tests, this is not working. I have opened a bug.

For the time being only operations going through the scheduler will be prevented, so it’s basically the same as disabling the service. I don’t see much point in testing this, since the driver has nothing to do here; it’s all handled by the scheduler.

2.7 – Deleting failedover volume & snapshot

Let’s confirm we can delete replicated snapshots and volumes once we have failed over.

user@localhost:$ cinder snapshot-delete replicated-snapshot

user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-4950f13d-e95d-41ea-85ae-80f0ac22aa0e 1024M 2
volume-890313d7-c72b-4f97-b0c7-defd3641bfd4 1024M 2 excl

user@localhost:$ cinder delete replicated-ceph-snapshot

user@localhost:$ sudo rbd --cluster ceph-secondary ls -l volumes
NAME SIZE PARENT FMT PROT LOCK
volume-890313d7-c72b-4f97-b0c7-defd3641bfd4 1024M 2 excl

2.8 – Reattaching volume

As mentioned earlier, attached volumes are a special case that requires manual detaching and reattaching, or at least that’s the procedure mentioned in the spec and what seems logical to me. Unfortunately Nova -as far as I know- lacks the ability to detach a volume, at least an RBD volume, from an instance if the backend is not accessible. So we’ll just attach the volume to the same instance again (though we could create a new instance and attach it to that one) after changing the status and attach_status of the volume.

In a real smoking-hole situation we probably wouldn’t have to detach the volume, since our compute nodes would most likely be dead as well.

After attaching the volume from the secondary Ceph cluster we’ll check the contents to confirm that the data we wrote on the primary is also there.

user@localhost:$ cinder reset-state ceph-attached --state available --attach-status detached


user@localhost:$ nova volume-attach myvm 890313d7-c72b-4f97-b0c7-defd3641bfd4
+----------+--------------------------------------+
| Property | Value |
+----------+--------------------------------------+
| device | /dev/vdc |
| id | 890313d7-c72b-4f97-b0c7-defd3641bfd4 |
| serverId | 84b97e11-955a-4568-8e49-163e5477acf6 |
| volumeId | 890313d7-c72b-4f97-b0c7-defd3641bfd4 |
+----------+--------------------------------------+


user@localhost:$ ssh -o StrictHostKeyChecking=no cirros@10.0.0.4 "sudo su - -c 'head -c 292 /dev/vdc'"
1
2
...
99
100

2.9 – Creating volumes

Now we’ll confirm that after enabling the service we can still create volumes, both normal and replicated.

NOTE: It is the system administrator’s responsibility to know if allowing the creation of volumes after a failover is a good idea or not, since non-replicated volumes will not be available on the primary cluster if we ever failback.

user@localhost:$ cinder service-enable devstack@ceph cinder-volume
+---------------+---------------+---------+
| Host | Binary | Status |
+---------------+---------------+---------+
| devstack@ceph | cinder-volume | enabled |
+---------------+---------------+---------+


user@localhost:$ cinder create --volume-type ceph --name failedover-normal 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T18:29:28.000000 |
| description | None |
| encrypted | False |
| id | c39a8c16-210e-45ff-8453-20f527f9fbd1 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | failedover-normal |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | ceph |
+--------------------------------+--------------------------------------+


user@localhost:$ sudo rbd --cluster ceph-secondary mirror image status volumes/volume-c39a8c16-210e-45ff-8453-20f527f9fbd1
volume-c39a8c16-210e-45ff-8453-20f527f9fbd1:
global_id:
state: down+unknown
description: status not found
last_update: 1970-01-01 03:00:00


user@localhost:$ cinder create --volume-type replicated --name failedover-replicated 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-09-21T18:34:21.000000 |
| description | None |
| encrypted | False |
| id | 0e8c732a-0a4a-4d8f-bd67-b337a2e26954 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | failedover-replicated |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 8b6e0a34d91a446d8719acdf4145a9d4 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 0358f5a11feb4585856694a076bd86b0 |
| volume_type | replicated |
+--------------------------------+--------------------------------------+


user@localhost:$ sudo rbd --cluster ceph-secondary mirror image status volumes/volume-0e8c732a-0a4a-4d8f-bd67-b337a2e26954
volume-0e8c732a-0a4a-4d8f-bd67-b337a2e26954:
global_id: 14348cf5-3b4b-4ec0-8fb7-7b5fc9836ba9
state: down+unknown
description: status not found
last_update: 1970-01-01 03:00:00

We can see that, as expected, the non-replicated volume has no global_id while the replicated one does, and that both are down+unknown since they are not currently being replicated anywhere.
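
If you want to double-check which image actually has mirroring enabled, you can also look at the image features on the secondary cluster; the replicated image should list the journaling feature while the normal one shouldn’t. This is just a quick check using the two images created above, and the exact feature list will depend on your Ceph configuration:

user@localhost:$ sudo rbd --cluster ceph-secondary info volumes/volume-c39a8c16-210e-45ff-8453-20f527f9fbd1 | grep features


user@localhost:$ sudo rbd --cluster ceph-secondary info volumes/volume-0e8c732a-0a4a-4d8f-bd67-b337a2e26954 | grep features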

Future work

You are probably thinking that there’s something missing in these tests: where’s the failback? Well, we don’t have a failback mechanism in Cinder yet. We have specs for the failback but no implementation, so I cannot test it.
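
If you really had to move back by hand in the meantime, the Ceph side of it would be done with the rbd mirroring commands, demoting and promoting the images and letting rbd-mirror resync them. The sketch below only shows the image-level flow for the attached volume used in this post, as I understand the RBD mirroring documentation, and it says nothing about fixing up the Cinder database, which would also be needed:

# demote the stale copy on the old primary and ask for a resync from the secondary
# (just a sketch; adjust the cluster and host names to your own setup)
user@localhost:$ sudo ssh ceph-primary 'rbd mirror image demote volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4'
user@localhost:$ sudo ssh ceph-primary 'rbd mirror image resync volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4'

# once it has caught up, swap the roles back
user@localhost:$ sudo rbd --cluster ceph-secondary mirror image demote volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4
user@localhost:$ sudo ssh ceph-primary 'rbd mirror image promote volumes/volume-890313d7-c72b-4f97-b0c7-defd3641bfd4'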

After playing around with replication I see there are a couple of places in the feature that still require further work, but what is there already seems to work fine.


Picture: “Peeking” by niXerKG is licensed under CC BY-NC 2.0


9 thoughts on “Cinder’s Ceph Replication Sneak peek”

  • Kemo

    Thanks for your work. I tried to repeat this scenario, unfortunately without success, and I need some help. First of all I would like to know which host operating system you are using and which versions of Vagrant, vagrant-libvirt, devstack and ceph. Thanks for your answer.

    • geguileo Post author

      Sorry to hear you are having problems testing this out. The versions used are:

      – Vagrant: 1.8.1
      – vagrant-libvirt: 0.0.32
      – devstack: master
      – ceph: nightly build

      The problem may be related to the local.conf file being out of date, as it references patch number 15 while the latest patch is 19. I have updated the file in case that helps.

      • Kemo

        I’m using an Ubuntu 16.04 virtual machine as my host. I installed Vagrant 1.8.1 with vagrant-libvirt 0.0.33 (version 0.0.32 doesn’t work). Basic Vagrant scenarios work.
        I downloaded your Vagrantfile from https://gorka.eguileor.com/files/cinder/rbd_replication/Vagrantfile and changed all http://gorka.eguileor.com/… references to https://gorka.eguileor.com/… in it.
        After vagrant up the following error occurred:
        File upload source file /home/kpog/.ssh/vagrant_insecure_key.pub must exist
        How can I get this key?

          • Kemo

            I’m getting further step by step, but I’m still not at the end.
            I encountered some problems:
            – In the provision_ceph.sh script the keyring files should be copied to /etc/ceph on ceph-primary and ceph-secondary
            – The copy_ceph_credentials.sh script cannot be run as root

            Now my ceph clusters are running fine, but I have some problems with devstack. On ceph-primary and ceph-secondary version 10.2.5 was installed, but devstack got ceph 11.1.0. No ceph commands work there because of an internal error:

            vagrant@devstack:~/devstack$ ceph -c /etc/ceph/ceph.conf osd pool ls
            /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.1.0-6032-g8d7d5ae/src/mon/MonMap.cc: In function ‘void MonMap::sanitize_mons(std::map<std::__cxx11::basic_string, entity_addr_t>&)’ thread 7f74effff700 time 2016-12-16 17:13:53.811569
            /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.1.0-6032-g8d7d5ae/src/mon/MonMap.cc: 70: FAILED assert(mon_info.count(p.first))
            ceph version 11.1.0-6032-g8d7d5ae (8d7d5ae3ff601153cfe6724cceb0fe767557344d)

            Is it possible to somehow tell devstack to install a stable version of ceph?

          • Kemo

            Today I finally succeeded. I will run some more tests tomorrow. I made some changes to your scripts.
            I can send you those scripts if you want.

              • Kemo

                Here they are:
                https://gitlab.com/kemopq/CindersCephReplicationAutDeployment.git

                As a next step I also played with two OpenStacks, each connected to its own ceph cluster, with rbd replication between them. Unfortunately Cinder does not support this scenario and it seems no work is planned on it, so I did it by hand: some Cinder DB data was copied from the primary to the secondary site, the switchover was done with Ceph commands (demote, promote), and the replicated volume was then attached to an app on the secondary site.
                In my opinion this solution is even more appealing and could be used as part of a Disaster Recovery as a Service solution on OpenStack.
                If somebody is interested in that issue, my email address is klemen@psi-net.si

                • geguileo Post author

                  Thank you for the link!
                  I agree, for the Disaster Recovery use case in OpenStack, Cinder’s replication could be considered just the first step, keeping your data safe; but then there are many other things to consider in a DR scenario, among others DB synchronization, network reconstruction, image synchronization between sites, and DB transformation to reflect the new states (Nova device mappings won’t match anymore).
                  And the proper coordination of all of those is a big endeavor that is already being worked on, even if there may not be a definitive solution yet.