Manual validation of Cinder A/A patches


At the Cinder Midcycle I agreed to create some sort of document explaining the manual tests I’ve been doing to validate the work on Cinder’s Active-Active High Availability -as a starting point for other testers and for the automation of the tests-. Writing a blog post was the most convenient way for me to do so, so here it is.


Scope

The Active-Active High Availability work in Cinder is formed by a good number of specs and patches; most of them have not yet been merged and some have not even been created, yet we are at a point where we can start testing things to catch bugs and performance bottlenecks as soon as possible.

We have merged in master -Newton cycle- most of the DLM work and all of the patches that form the foundation needed for the new job distribution and cleanup mechanisms, but we decided not to include in this cycle any patches that changed the way we do the job distribution or the cleanup since those also affect non clustered deployments; we wanted to be really sure we are not introducing any bugs in normal deployments.

The scope of the tests I’m going to be discussing in this post is limited to the job distribution and cleanup mechanisms using the Tooz library with local file locks instead of a DLM. This way we’ll be able to follow a classic crawl-walk-run approach where we first test these mechanisms removing the DLM variable from the equation, together with the potential configuration and communication issues it brings. Later we’ll add the DLM as well as simulated connection failures.
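
For reference, the coordination backend Cinder uses is selected by the backend_url option in the [coordination] section of cinder.conf, and its default already points to Tooz’s file driver, so local file locks should not require any extra configuration for these tests. A minimal sketch of the relevant stanza -assuming the default state_path- would be:

[coordination]
# Tooz file driver: local file locks, no external DLM required (this is the default).
backend_url = file://$state_path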

Since the two mechanisms to test are highly intertwined we’ll be testing them both at the same time.

The explanation provided in this post is not only useful for testing the existing code, but it’s also interesting for driver maintainers, as they can start testing their drivers to confirm they are ready for the Active-Active work. It is true that this would be a very basic check -since proper tests require a DLM and having the services deployed on different hosts- but it would allow them to get familiar with the feature and catch the most obvious issues.

It is important that all driver maintainers are able to start working on this at the same time to ensure fair treatment for all; otherwise driver maintainers working on the A/A feature would have an unfair advantage.

Deployment

For the initial tests we want to keep things as simple as possible, so we’ll only have 1 cinder-api service -so no HAProxy configuration is needed- and only 1 scheduler (it’s really easy to add more, but that would make it harder to debug, so only 1 for now). We’ll not be using a DLM, like we said earlier, and we’ll be using local storage with the LVM driver -yes, you read that right- to simplify our deployment. For this we’ll be using an all-in-one deployment with DevStack, as it reduces configuration requirements since we don’t need services deployed in one host -or VM- to communicate with another host -or VM-. I know this sounds counter-intuitive, but it’s good enough for now, as you’ll see, and in the near future we’ll expand this configuration to do more realistic tests.

To run 2 cinder-volume services in a clustered configuration under the same DevStack deployment all you really need is to pull the latest patch in the HA A/A series from Gerrit and then run both services with the same cluster configuration option and different host options. But since we are going to perform some additional tests it will be good to configure a little bit more.

Our DevStack configuration will do these things:

  • Download the Cinder code from the latest patch in the A/A branch (at the time of this writing it’s the Make Image Volume Cache cluster aware patch).
  • Use the Cinder client from Gerrit instead of PyPI and use the latest patch in the A/A branch (at the time of this writing it’s the Add service cleanup command patch).
  • Set the over subscription ratio to 10.0 (since we won’t be really writing anything in most cases).
  • Configure the host parameter instead of using the default value.
  • Create 2 LVM backends of 5GB each
  • Set backends to use thin provisioning

So first we must edit the local.conf and make sure we have included these lines:

# Retrieve Cinder code from gerrit's A/A work
CINDER_REPO=https://review.openstack.org/p/openstack/cinder
CINDER_BRANCH=refs/changes/69/353069/16

# We want cinder client to be downloaded from Git instead of pypi
LIBS_FROM_GIT=python-cinderclient

# And we want to use Gerrit repo and download our patch
CINDERCLIENT_REPO=https://review.openstack.org/p/openstack/python-cinderclient
CINDERCLIENT_BRANCH=refs/changes/07/363007/3

# 5GB LVM backends
VOLUME_BACKING_FILE_SIZE=5125M

# 2 backends
CINDER_ENABLED_BACKENDS=${CINDER_ENABLED_BACKENDS:-lvm:lvmdriver-1,lvm:lvmdriver-2}

[[post-config|$CINDER_CONF]]
[DEFAULT]
# Don't use default host name
host = host1
[lvmdriver-1]
lvm_type = thin
lvm_max_over_subscription_ratio = 10.0
[lvmdriver-2]
lvm_type = thin
lvm_max_over_subscription_ratio = 10.0

For reference this is the local.conf file I use. If you want to use other storage backends you just need to adapt the above configuration to your backend driver.

We didn’t configure the cluster option on purpose; don’t worry, we’ll do it later after we’ve done some tests.

You’ll need to create 2 files, both with the cluster option -the same value in both- and only one of them with the host option.

I used to run these commands manually after updating my VM and making sure it had git installed. They basically clone devstack, download my devstack configuration, deploy DevStack, create the 2 configuration files for later, create a new screen window with logging, leave ready the command I’ll need to run to start the second service, and attach to the stack screen session.

user@localhost:$ git clone https://git.openstack.org/openstack-dev/devstack

user@localhost:$ cd devstack

user@localhost:$ curl -o local.conf http://gorka.eguileor.com/files/cinder/manual_ha_aa_local.conf

user@localhost:$ ./stack.sh

user@localhost:$ echo -e "[DEFAULT]\ncluster = mycluster" > /etc/cinder/host1.conf

user@localhost:$ echo -e "[DEFAULT]\ncluster = mycluster\nhost = host2" > /etc/cinder/host2.conf

user@localhost:$ screen -S stack -X screen -t c-vol2

user@localhost:$ screen -S stack -p c-vol2 -X logfile /opt/stack/logs/c-vol2.log

user@localhost:$ screen -S stack -p c-vol2 -X log on

user@localhost:$ touch /opt/stack/logs/c-vol2.log

user@localhost:$ screen -S stack -p c-vol2 -X stuff $'cinder-volume --config-file /etc/cinder/cinder.conf --config-file /etc/cinder/host2.conf & echo $! >/opt/stack/status/stack/c-vol2.pid; fg || echo "c-vol failed to start" | tee "/opt/stack/status/stack/c-vol2.failure"'

user@localhost:$ screen -x stack -p 0

But the second time I had to do this I decided to automate the whole thing in my Vagrant provisioning, aided by a custom local.sh file that runs after DevStack has finished deploying OpenStack.

So all you have to do is clone devstack, download local.conf and local.sh into the devstack directory, set execution permissions on local.sh, and run stack.sh.

Code changes

There are some cases where Cinder may perform operations too fast for us to act on them, be it to check that something has occurred or to make Cinder do something, so we’ll be making some small modifications to Cinder’s code to introduce delays that will give us some leeway.

Required changes to the code are:

cinder/volume/utils.py

def introduce_delay(seconds, operation='-', resource_id='-'):
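    # Sleep one second at a time, logging each second, so the delay is visible
    # in the service log while it is in progress.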
    for __ in range(seconds):
        time.sleep(1)
        LOG.debug(_('Delaying %(op)s operation on %(id)s.'),
                  {'op': operation, 'id': resource_id})

And then we need to introduce calls to it from cinder/volume/flows/manager/create_volume.py and cinder/volume/manager.py in create volume, delete volume, create snapshot and delete snapshot, so that we have a 30 seconds delay before actually performing the operation and in the case of doing a volume creation from an image, the delay should be right after we have changed the status to “downloading”.
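
As an illustration, this is roughly what one of those calls could look like. This is just a sketch: the exact method signatures and placement depend on the patch revision you are working from, and the import alias for cinder/volume/utils.py varies between modules.

from cinder.volume import utils as volume_utils  # the module we edited above

# Hypothetical placement: near the beginning of delete_volume() in
# cinder/volume/manager.py, before the driver is asked to do the real work,
# so the resource stays in the "deleting" status for ~30 seconds.
volume_utils.introduce_delay(30, operation='delete volume',
                             resource_id=volume.id)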

We can do these code changes manually or we can just change the CINDER_BRANCH inside local.conf to point to a patch that I specifically created to introduce these delays for my manual tests.

CINDER_BRANCH=refs/changes/69/353069/12

If you are using my configuration you are already pointing to that patch.

1. Non clustered tests

My recommendation is to split the screen session so we can see multiple windows at the same time as it will allow us to execute commands and follow the flow from the API, SCH, and VOL nodes.

We all have our preferences, but when I’m working on these tests I usually have the screen session horizontally split in at least 5 regions -command line, c-api, c-sch, c-vol, c-vol2- and I tend to reorder my windows so the Cinder windows come first, in the same order I just listed, going from 0 to 4, with the c-back window as number 5, a mysql connection as number 6, and a vim editor as number 7.

The reason why we didn’t add the cluster configuration when deploying DevStack is because we wanted to check the non clustered deployment first.

1.0 – Sanity checks

The first thing we should do, now that we have DevStack running and before doing any tests, is run some checks that will serve as a baseline for the sanity checks we’ll repeat once we run 2 services in the same cluster:

  • Check that there are no clusters:
user@localhost:$ cinder --os-volume-api-version 3.11 cluster-list --detail
+------+--------+-------+--------+-----------+----------------+----------------+-----------------+------------+------------+
| Name | Binary | State | Status | Num Hosts | Num Down Hosts | Last Heartbeat | Disabled Reason | Created At | Updated at |
+------+--------+-------+--------+-----------+----------------+----------------+-----------------+------------+------------+
+------+--------+-------+--------+-----------+----------------+----------------+-----------------+------------+------------+

  • Check services and notice that the Cluster field is empty for all services:
user@localhost:$ cinder --os-volume-api-version 3.11 service-list
+------------------+-------------------+------+---------+-------+----------------------------+---------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Cluster | Disabled Reason |
+------------------+-------------------+------+---------+-------+----------------------------+---------+-----------------+
| cinder-backup | host1 | nova | enabled | up | 2016-08-08T17:33:26.000000 | - | - |
| cinder-scheduler | host1 | nova | enabled | up | 2016-08-08T17:33:23.000000 | - | - |
| cinder-volume | host1@lvmdriver-1 | nova | enabled | up | 2016-08-08T17:33:30.000000 | - | - |
| cinder-volume | host1@lvmdriver-2 | nova | enabled | up | 2016-08-08T17:33:31.000000 | - | - |
+------------------+-------------------+------+---------+-------+----------------------------+---------+-----------------+

  • Check there’s no RabbitMQ cluster queue:
user@localhost:$ sudo rabbitmqctl list_queues name | grep cinder-volume.mycluster

user@localhost:$
  • Check that the workers table is empty:
user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

1.1 – Creation

The most basic thing we need to test is that we are able to create a volume and that we are creating the workers table entry:

user@localhost:$ cinder create --name mydisk 1; sleep 3; mysql cinder -e 'select * from workers;'
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-08T17:52:47.000000 |
| description | None |
| encrypted | False |
| id | 16fcca48-8729-44ab-b024-ddd5cfd458a4 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | mydisk |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 2e00c4d79a5f49708438f8d3761a6d3d |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | f78b297498774851a758b08385e39b77 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-08 17:52:47 | 2016-08-08 17:52:48 | NULL | 0 | 2 | Volume | 16fcca48-8729-44ab-b024-ddd5cfd458a4 | creating | 3 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+

user@localhost:$

We can see that we have a new entry in the workers table for the volume that is being created, and that this operation is being performed by service #3.

It is important that once the volume has been created we check that the workers table is empty. Just remember this is going to take some time due to the delay we’ve introduced, but you’ll see a “Created volume successfully.” message in the c-vol window once it’s done.

user@localhost:$ cinder list
+--------------------------------------+-----------+--------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------+------+-------------+----------+-------------+
| 16fcca48-8729-44ab-b024-ddd5cfd458a4 | available | mydisk | 1 | lvmdriver-1 | false | |
+--------------------------------------+-----------+--------+------+-------------+----------+-------------+

user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

1.2 – Deletion

Now we proceed to delete the newly created volume and make sure that we also have the workers DB entry while the operation is in progress and that it is removed once it has completed.

user@localhost:$ cinder delete mydisk; sleep 3; mysql cinder -e 'select * from workers;'
Request to delete volume mydisk has been accepted.
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-08 18:26:16 | 2016-08-08 18:26:16 | NULL | 0 | 3 | Volume | 16fcca48-8729-44ab-b024-ddd5cfd458a4 | deleting | 3 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+

user@localhost:$ cinder list
+----+--------+------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+----+--------+------+------+-------------+----------+-------------+
+----+--------+------+------+-------------+----------+-------------+

user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

1.3 – Cleanup

We are going to test that basic cleanup works when the node dies. For this we’ll create a volume that we’ll attach to a VM, create another volume and start creating a snapshot of it, start creating a volume from an image, start creating a plain volume, and start deleting another volume.

So in the end we’ll have the following cleanable volume statuses:

  • “in-use”
  • “creating”
  • “deleting”
  • “downloading”

Snapshot:

  • “creating”

NOTE: We’ll have to wait in between some commands, like creating the volume and attaching it, since it needs to be available.

The sequence of commands and results would look like this:

user@localhost:$ nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec myvm
+--------------------------------------+----------------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | myvm |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000001 |
| OS-EXT-SRV-ATTR:kernel_id | 4c1a9ce2-a78e-43ec-99e3-5b532359d62c |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | 7becc8ae-0153-44be-9b81-9c94e5c7849a |
| OS-EXT-SRV-ATTR:reservation_id | r-qy2z7bpp |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | Xk7g2tbzVNjg |
| config_drive | |
| created | 2016-08-09T10:43:23Z |
| description | - |
| flavor | m1.nano (42) |
| hostId | |
| host_status | |
| id | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
| image | cirros-0.3.4-x86_64-uec (432c9a2b-8ed2-4957-8d12-063217f26a3f) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | myvm |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tags | [] |
| tenant_id | 83e1beb749d74956b664ef58c001af29 |
| updated | 2016-08-09T10:43:23Z |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
+--------------------------------------+----------------------------------------------------------------+


user@localhost:$ cinder create --name attached 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T10:43:31.000000 |
| description | None |
| encrypted | False |
| id | a9102b47-37ff-4fd2-a76c-44e50c00e1fd |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | attached |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+


user@localhost:$ nova volume-attach myvm a9102b47-37ff-4fd2-a76c-44e50c00e1fd
+----------+--------------------------------------+
| Property | Value |
+----------+--------------------------------------+
| device | /dev/vdb |
| id | a9102b47-37ff-4fd2-a76c-44e50c00e1fd |
| serverId | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
| volumeId | a9102b47-37ff-4fd2-a76c-44e50c00e1fd |
+----------+--------------------------------------+


user@localhost:$ cinder create --name deleting_vol 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T12:25:06.000000 |
| description | None |
| encrypted | False |
| id | 36fcb60b-83fc-420b-94cb-1f8f7979ea9d |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | deleting_vol |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+


user@localhost:$ cinder create --name snapshot_vol 1
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T12:11:33.000000 |
| description | None |
| encrypted | False |
| id | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | snapshot_vol |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+


user@localhost:$ mysql cinder -e 'select * from workers;'


user@localhost:$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| 36fcb60b-83fc-420b-94cb-1f8f7979ea9d | available | deleting_vol | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+


user@localhost:$ cinder create --name downloading --image-id cirros-0.3.4-x86_64-uec 1; cinder create --name creating 1; cinder snapshot-create snapshot_vol --name creating_snap; cinder delete deleting_vol; sleep 3; kill -9 -- -`cat /opt/stack/status/stack/c-vol.pid`
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T12:26:06.000000 |
| description | None |
| encrypted | False |
| id | 58d2c5aa-9334-46a1-9246-0bc893196454 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | downloading |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T12:26:08.000000 |
| description | None |
| encrypted | False |
| id | a7443e99-b87a-4e0a-bb44-6b63bdef477b |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | creating |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| created_at | 2016-08-09T12:26:10.353392 |
| description | None |
| id | acc8b408-2148-4de7-9774-ccb123650244 |
| metadata | {} |
| name | creating_snap |
| size | 1 |
| status | creating |
| updated_at | None |
| volume_id | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 |
+-------------+--------------------------------------+
Request to delete volume deleting_vol has been accepted.


user@localhost:$ mysql cinder -e 'select * from workers;'
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+-------------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+-------------+------------+
| 2016-08-09 12:26:06 | 2016-08-09 12:26:08 | NULL | 0 | 45 | Volume | 58d2c5aa-9334-46a1-9246-0bc893196454 | downloading | 3 |
| 2016-08-09 12:26:08 | 2016-08-09 12:26:09 | NULL | 0 | 46 | Volume | a7443e99-b87a-4e0a-bb44-6b63bdef477b | creating | 3 |
| 2016-08-09 12:26:10 | 2016-08-09 12:26:10 | NULL | 0 | 47 | Snapshot | acc8b408-2148-4de7-9774-ccb123650244 | creating | 3 |
| 2016-08-09 12:26:11 | 2016-08-09 12:26:11 | NULL | 0 | 48 | Volume | 36fcb60b-83fc-420b-94cb-1f8f7979ea9d | deleting | 3 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+-------------+------------+


user@localhost:$ cinder list
+--------------------------------------+-------------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-------------+--------------+------+-------------+----------+--------------------------------------+
| 36fcb60b-83fc-420b-94cb-1f8f7979ea9d | deleting | deleting_vol | 1 | lvmdriver-1 | false | |
| 58d2c5aa-9334-46a1-9246-0bc893196454 | downloading | downloading | 1 | lvmdriver-1 | false | |
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| a7443e99-b87a-4e0a-bb44-6b63bdef477b | creating | creating | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-------------+--------------+------+-------------+----------+--------------------------------------+


user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+----------+---------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+----------+---------------+------+
| acc8b408-2148-4de7-9774-ccb123650244 | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | creating | creating_snap | 1 |
+--------------------------------------+--------------------------------------+----------+---------------+------+

We can see how we didn’t have any entries in the workers table before we executed all the operations and killed the service, and that at the end we have the expected entries, and that they match the status of the volumes and snapshots. This part is crucial, because if they don’t match they will not be cleaned up.

This would be a simulation of a service that is abruptly interrupted in the middle of some operations and that will need to recover on the next restart.

NOTE: One of the lines to execute is too long and is not completely visible without horizontal scrolling, but at the end you can see the command that does the abrupt stop of the service: kill -9 -- -`cat /opt/stack/status/stack/c-vol.pid`

Before we check the restart of the service we want to remove the iSCSI target to make sure that it gets recreated on service start, but note that you need to replace the UUID in the command with the UUID of the volume that was attached to the instance:

user@localhost:$ sudo tgt-admin --force --delete iqn.2010-10.org.openstack:volume-c632fd5d-bd05-4eda-a146-796136376ece

user@localhost:$ sudo tgtadm --lld iscsi --mode target --op show

user@localhost:$

And now we can restart the c-vol service we just killed -going to the c-vol window, pressing Ctrl+p, and hitting enter- and check that the service is actually doing what we expect it to do, which is reclaiming the workers entries -updating the updated_at field, since the service_id is already its own- and doing the cleanup. Since most cleanups just set the status field to “error”, those entries will be quickly removed from the workers table and we won’t see them anymore; only the delete operation remains until it is completed.

When we restart the service we’ll see some useful INFO level log entries indicating that the service is starting the cleanup and which operations are being performed:

2016-10-18 11:29:18.351 INFO cinder.manager [req-c26a5697-9d82-4a08-97e1-1da69fdcea79 None None] Initiating service 3 cleanup
2016-10-18 11:29:18.418 INFO cinder.manager [req-c26a5697-9d82-4a08-97e1-1da69fdcea79 None None] Cleaning Volume with id 58d2c5aa-9334-46a1-9246-0bc893196454 and status downloading
2016-10-18 11:29:18.548 INFO cinder.manager [req-c26a5697-9d82-4a08-97e1-1da69fdcea79 None None] Cleaning Volume with id a7443e99-b87a-4e0a-bb44-6b63bdef477b and status creating
2016-10-18 11:29:18.650 INFO cinder.manager [req-c26a5697-9d82-4a08-97e1-1da69fdcea79 None None] Cleaning Snapshot with id acc8b408-2148-4de7-9774-ccb123650244 and status creating
2016-10-18 11:29:18.905 INFO cinder.manager [req-c26a5697-9d82-4a08-97e1-1da69fdcea79 None None] Cleaning Volume with id 36fcb60b-83fc-420b-94cb-1f8f7979ea9d and status deleting
user@localhost:$ mysql cinder -e 'select * from workers;'
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-09 12:26:11 | 2016-08-09 12:32:52 | NULL | 0 | 48 | Volume | 36fcb60b-83fc-420b-94cb-1f8f7979ea9d | deleting | 3 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+


user@localhost:$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 36fcb60b-83fc-420b-94cb-1f8f7979ea9d | deleting | deleting_vol | 1 | lvmdriver-1 | false | |
| 58d2c5aa-9334-46a1-9246-0bc893196454 | error | downloading | 1 | lvmdriver-1 | false | |
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | creating | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+


user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+--------+---------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+--------+---------------+------+
| acc8b408-2148-4de7-9774-ccb123650244 | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | error | creating_snap | 1 |
+--------------------------------------+--------------------------------------+--------+---------------+------+


user@localhost:$ sudo tgtadm --lld iscsi --mode target --op show
Target 1: iqn.2010-10.org.openstack:volume-a9102b47-37ff-4fd2-a76c-44e50c00e1fd
System information:
Driver: iscsi
State: ready
I_T nexus information:
I_T nexus: 2
Initiator: iqn.1994-05.com.redhat:d434849ec720 alias: localhost
Connection: 0
IP Address: 192.168.121.80
LUN information:
LUN: 0
Type: controller
SCSI ID: IET 00010000
SCSI SN: beaf10
Size: 0 MB, Block size: 1
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
SWP: No
Thin-provisioning: No
Backing store type: null
Backing store path: None
Backing store flags:
LUN: 1
Type: disk
SCSI ID: IET 00010001
SCSI SN: beaf11
Size: 1074 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
SWP: No
Thin-provisioning: No
Backing store type: rdwr
Backing store path: /dev/stack-volumes-lvmdriver-1/volume-a9102b47-37ff-4fd2-a76c-44e50c00e1fd
Backing store flags:
Account information:
MoYuxFwQJQvaWJNmz47H
ACL information:
ALL

And after 30 seconds or so the volume named “deleting_vol” will finish deleting and we won’t have the workers table entry anymore:

user@localhost:$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 58d2c5aa-9334-46a1-9246-0bc893196454 | error | downloading | 1 | lvmdriver-1 | false | |
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | creating | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

2. Clustered tests

One thing to remember -as it will help us know in which service window we can expect operations to show up- is that clustered operations are scheduled using round-robin. So if we create 2 volumes, one will be created on each service, and if we do an attach, each service will perform one part of the attachment, since we have the reservation and the connection initiation.

Let’s not forget that we already have some resources -volumes and snapshots- in our backend, and we should check the contents of the DB for the volumes to confirm that they don’t belong to any cluster.

user@localhost:$ mysql cinder -e 'select display_name, id, status, host, cluster_name from volumes where not deleted;'
+--------------+--------------------------------------+-----------+-------------------------------+--------------+
| display_name | id | status | host | cluster_name |
+--------------+--------------------------------------+-----------+-------------------------------+--------------+
| downloading | 58d2c5aa-9334-46a1-9246-0bc893196454 | error | host1@lvmdriver-1#lvmdriver-1 | NULL |
| snapshot_vol | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | host1@lvmdriver-1#lvmdriver-1 | NULL |
| creating | a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | host1@lvmdriver-1#lvmdriver-1 | NULL |
| attached | a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | host1@lvmdriver-1#lvmdriver-1 | NULL |
+--------------+--------------------------------------+-----------+-------------------------------+--------------+

Now we have to stop the c-vol service by pressing Ctrl+c in the c-vol window and then press Ctrl+p to bring back the command that started the service, so we can modify it and add --config-file /etc/cinder/host1.conf right after --config-file /etc/cinder/cinder.conf before running it. With this we are effectively starting the service in the cluster. The command would look something like this (the path to cinder-volume is different in Ubuntu):

user@localhost:$ /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --config-file /etc/cinder/host1.conf & echo $! >/opt/stack/status/stack/c-vol.pid; fg || echo "c-vol failed to start" | tee "/opt/stack/status/stack/c-vol.failure"

Now we go to the c-vol2 window and run the command that is already written there.

2.0 – Sanity checks

We now have 2 services running in the same cluster.

  • Check cluster status
user@localhost:$ cinder --os-volume-api-version 3.11 cluster-list --detail
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+
| Name | Binary | State | Status | Num Hosts | Num Down Hosts | Last Heartbeat | Disabled Reason | Created At | Updated at |
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+
| mycluster@lvmdriver-1 | cinder-volume | up | enabled | 2 | 0 | 2016-08-09T13:42:20.000000 | - | 2016-08-09T13:41:13.000000 | |
| mycluster@lvmdriver-2 | cinder-volume | up | enabled | 2 | 0 | 2016-08-09T13:42:20.000000 | - | 2016-08-09T13:41:13.000000 | |
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+

  • Check services status:
user@localhost:$ cinder --os-volume-api-version 3.11 service-list
+------------------+-------------------+------+---------+-------+----------------------------+-----------------------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Cluster | Disabled Reason |
+------------------+-------------------+------+---------+-------+----------------------------+-----------------------+-----------------+
| cinder-backup | host1 | nova | enabled | up | 2016-08-09T13:42:45.000000 | - | - |
| cinder-scheduler | host1 | nova | enabled | up | 2016-08-09T13:42:47.000000 | - | - |
| cinder-volume | host1@lvmdriver-1 | nova | enabled | up | 2016-08-09T13:42:50.000000 | mycluster@lvmdriver-1 | - |
| cinder-volume | host1@lvmdriver-2 | nova | enabled | up | 2016-08-09T13:42:50.000000 | mycluster@lvmdriver-2 | - |
| cinder-volume | host2@lvmdriver-1 | nova | enabled | up | 2016-08-09T13:42:46.000000 | mycluster@lvmdriver-1 | - |
| cinder-volume | host2@lvmdriver-2 | nova | enabled | up | 2016-08-09T13:42:46.000000 | mycluster@lvmdriver-2 | - |
+------------------+-------------------+------+---------+-------+----------------------------+-----------------------+-----------------+

  • Check RabbitMQ cluster queues:
user@localhost:$ sudo rabbitmqctl list_queues name | grep cinder-volume.mycluster | sort
cinder-volume.mycluster@lvmdriver-2
cinder-volume.mycluster@lvmdriver-2.host1
cinder-volume.mycluster@lvmdriver-2.host2
cinder-volume.mycluster@lvmdriver-2_fanout_0ad93e1623d3497fa43ad0e14abb97ef
cinder-volume.mycluster@lvmdriver-2_fanout_3cb4445f028b496f821488f492b7f159
cinder-volume.mycluster@lvmdriver-1
cinder-volume.mycluster@lvmdriver-1.host1
cinder-volume.mycluster@lvmdriver-1.host2
cinder-volume.mycluster@lvmdriver-1_fanout_0f51e0a5c82147a08c36e2a55ec137a2
cinder-volume.mycluster@lvmdriver-1_fanout_187f8d0d2d0e43b487f935d9fd3dfe4a

We have 2 cluster specific queues where all our services will be listening to get jobs on a round-robin schedule, and then the clustered service specific fanout queues that subscribe to the 2 cluster fanout exchanges. We require the fanout functionality on the clusters because replication needs to inform all services when a failover has been performed by another service in the cluster, so they can change their connection information accordingly.

The remaining 4 queues are one for each specific service backend, but they will not be used, since we can already access those services using the existing host topic queues. They are only created because of Oslo Messaging’s behavior.

Here we can see the 2 cluster fanout exchanges:

user@localhost:$ sudo rabbitmqctl list_exchanges name | grep cinder-volume.mycluster
cinder-volume.mycluster@lvmdriver-1_fanout
cinder-volume.mycluster@lvmdriver-2_fanout

  • Check existing volumes were moved to the cluster when the service was started
user@localhost:$ mysql cinder -e 'select display_name, id, status, host, cluster_name from volumes where not deleted;'
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+
| display_name | id | status | host | cluster_name |
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+
| downloading | 58d2c5aa-9334-46a1-9246-0bc893196454 | error | host1@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
| snapshot_vol | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | host1@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
| creating | a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | host1@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
| attached | a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | host1@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+

2.1 – Volume creation

The most basic thing we need to test is that we are able to create a volume in a cluster and that we are creating the workers table entry. To see that we are really sending it to the cluster we’ll just create 2 volumes instead of one. It is useful to have c-vol and c-vol2 windows open to see in the logs how each one is processing one of the creations.

user@localhost:$ cinder create --name host1 1; cinder create --name host2 1; sleep 3; mysql cinder -e 'select * from workers;'
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T14:04:15.000000 |
| description | None |
| encrypted | False |
| id | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | host1 |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-08-09T14:04:16.000000 |
| description | None |
| encrypted | False |
| id | 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | host2 |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 83e1beb749d74956b664ef58c001af29 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | c21ca8dae0644e52afe624a518e5e8f2 |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-09 14:04:15 | 2016-08-09 14:04:15 | NULL | 0 | 53 | Volume | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | creating | 3 |
| 2016-08-09 14:04:16 | 2016-08-09 14:04:17 | NULL | 0 | 54 | Volume | 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d | creating | 5 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+

We can see that we have 2 new entries in the workers table for the volumes we are creating, and service 3 and service 5 are performing these operations.

It is important that once the volumes have been created we check that the workers table is empty and both volumes are in “available” status.

user@localhost:$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d | available | host2 | 1 | lvmdriver-1 | false | |
| 58d2c5aa-9334-46a1-9246-0bc893196454 | error | downloading | 1 | lvmdriver-1 | false | |
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | available | host1 | 1 | lvmdriver-1 | false | |
| a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | creating | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

Checking the DB contents for the new volumes will reveal that the host field reflects the service that handled the creation of each volume. In theory this was not necessary, since these volumes will be addressed by their cluster_name and even the cleanup would work regardless of the host field, but we wanted to keep it as consistent as possible with reality. That’s why the scheduler fills in the host field with one of the services that is up before sending the job to the cluster, and the volume service then updates the host field in the DB once it has received the job.

user@localhost:$ mysql cinder -e 'select display_name, id, status, host, cluster_name from volumes where display_name in ("host1", "host2");'
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+
| display_name | id | status | host | cluster_name |
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+
| host2 | 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d | available | host1@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
| host1 | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | available | host2@lvmdriver-1#lvmdriver-1 | mycluster@lvmdriver-1#lvmdriver-1 |
+--------------+--------------------------------------+-----------+-------------------------------+-----------------------------------+

As you can see the host field of each volume points to the service that ended up handling its creation, which doesn’t necessarily match the name we gave the volume, since the scheduler doesn’t know which host from the cluster will take the job; it just assigns one host that is up and the handling service updates the field afterwards.

This is not an issue for the cleanup, and it’s only relevant for operations that are not cluster aware yet, since those are still directed using the host field.

2.2 – Snapshot creation


In previous versions of this post we mentioned that snapshot creation was not yet cluster aware -it still used the host DB field to direct the job, so creating 2 snapshots from the volumes we created in the previous section would send both to “host1”- but that’s no longer the case and now snapshots are also spread among all the existing services in the cluster.

user@localhost:$ cinder snapshot-create host1 --name host1_snap; cinder snapshot-create host2 --name host2_snap; sleep 3; mysql cinder -e 'select * from workers;'
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| created_at | 2016-08-09T14:21:28.497009 |
| description | None |
| id | 7d0923dd-c666-41df-ab12-2887e6a04bc3 |
| metadata | {} |
| name | host1_snap |
| size | 1 |
| status | creating |
| updated_at | None |
| volume_id | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 |
+-------------+--------------------------------------+
+-------------+--------------------------------------+
| Property | Value |
+-------------+--------------------------------------+
| created_at | 2016-08-09T14:21:30.478635 |
| description | None |
| id | 6d2124b8-2cdd-48ad-b525-27db18470587 |
| metadata | {} |
| name | host2_snap |
| size | 1 |
| status | creating |
| updated_at | None |
| volume_id | 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d |
+-------------+--------------------------------------+
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-09 14:21:28 | 2016-08-09 14:21:28 | NULL | 0 | 55 | Snapshot | 7d0923dd-c666-41df-ab12-2887e6a04bc3 | creating | 3 |
| 2016-08-09 14:21:30 | 2016-08-09 14:21:30 | NULL | 0 | 56 | Snapshot | 6d2124b8-2cdd-48ad-b525-27db18470587 | creating | 5 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+

We can see that we have 2 new entries in the workers table for the snapshots we are creating, and that they are properly distributed between services 3 and 5, just like with the volume creation.

As usual, we check the results:

user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+-----------+---------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+-----------+---------------+------+
| 6d2124b8-2cdd-48ad-b525-27db18470587 | 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d | available | host2_snap | 1 |
| 7d0923dd-c666-41df-ab12-2887e6a04bc3 | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | available | host1_snap | 1 |
| acc8b408-2148-4de7-9774-ccb123650244 | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | error | creating_snap | 1 |
+--------------------------------------+--------------------------------------+-----------+---------------+------+


user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

2.3 – Deletion

Volume, Snapshot, Consistency Group, and Consistency Group Snapshot deletions are cluster aware, so regardless of which host created each snapshot, the deletions will be spread between the 2 hosts, as we can see in the log output.

user@localhost:$ cinder snapshot-delete host1_snap; cinder snapshot-delete host2_snap; cinder delete downloading; sleep 3; mysql cinder -e 'select * from workers;'
Request to delete volume downloading has been accepted.
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-08-09 15:40:53 | 2016-08-09 15:40:53 | NULL | 0 | 59 | Volume | 58d2c5aa-9334-46a1-9246-0bc893196454 | deleting | 3 |
+---------------------+---------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+


user@localhost:$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 271aac9a-e9c2-4a89-87d2-c6fd13d81a5d | available | host2 | 1 | lvmdriver-1 | false | |
| 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | available | snapshot_vol | 1 | lvmdriver-1 | false | |
| a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | available | host1 | 1 | lvmdriver-1 | false | |
| a7443e99-b87a-4e0a-bb44-6b63bdef477b | error | creating | 1 | lvmdriver-1 | false | |
| a9102b47-37ff-4fd2-a76c-44e50c00e1fd | in-use | attached | 1 | lvmdriver-1 | false | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+


user@localhost:$ cinder snapshot-list
+--------------------------------------+--------------------------------------+--------+---------------+------+
| ID | Volume ID | Status | Name | Size |
+--------------------------------------+--------------------------------------+--------+---------------+------+
| acc8b408-2148-4de7-9774-ccb123650244 | 6a12f169-c6a7-4de5-85f2-c8259cbd6924 | error | creating_snap | 1 |
+--------------------------------------+--------------------------------------+--------+---------------+------+


user@localhost:$ mysql cinder -e 'select * from workers;'

user@localhost:$

If you are wondering why we don’t have workers table entries for the snapshot deletions, that’s because snapshot deletion is not cleanable in the existing code. It’s something we’ll probably want to add, but it’s not specific to the High Availability Active-Active work.

2.4 – Attach/Detach volume

Due to a regression recently introduced by a patch, we cannot properly attach 2 volumes to the same instance, so for this test we’ll have to detach the attached volume first.

user@localhost:$ nova volume-detach myvm a9102b47-37ff-4fd2-a76c-44e50c00e1fd

user@localhost:$

Once detaching is completed we’ll stop the “host1” service to confirm that the value of the host field is not relevant for clustered actions: any service in the same cluster can handle the attach/detach operations regardless of which host the volume’s host field points to.

Go to the c-vol window and stop the service using Ctrl+c. Now we have to wait until the service is no longer considered to be alive in the cluster and is reported as being down. We can easily check this with:

user@localhost:$ cinder --os-volume-api-version 3.11 cluster-list --detail
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+
| Name | Binary | State | Status | Num Hosts | Num Down Hosts | Last Heartbeat | Disabled Reason | Created At | Updated at |
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+
| mycluster@lvmdriver-1 | cinder-volume | up | enabled | 2 | 1 | 2016-08-09T15:52:55.000000 | - | 2016-08-09T13:41:13.000000 | |
| mycluster@lvmdriver-2 | cinder-volume | up | enabled | 2 | 1 | 2016-08-09T15:52:55.000000 | - | 2016-08-09T13:41:13.000000 | |
+-----------------------+---------------+-------+---------+-----------+----------------+----------------------------+-----------------+----------------------------+------------+
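If you would rather not re-run that command by hand, a small loop like the following can poll until the down host shows up. This is only a convenience sketch: it greps the human-readable table above for a cell whose value is exactly 1 -the “Num Down Hosts” column in this deployment- so it is fragile and tied to this particular output format.

user@localhost:$ until cinder --os-volume-api-version 3.11 cluster-list --detail | grep -qE 'mycluster@lvmdriver-1.*[|] +1 +[|]'; do sleep 5; done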

Once the service is no longer considered to be up -this is not really necessary for this test, but it’s good to check that the reporting works- we’ll proceed to attach the “host1” volume, confirm that the remaining volume service is able to attach it and creates the right entries in the DB, and then detach it.
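Before attaching, it is also easy to see in the DB why the host field doesn’t matter here: the volume still points at “host1”, while clustered operations are routed by the cluster the service belongs to. A quick check -assuming the cluster_name column added by the A/A patches is present in your volumes table- would be:

user@localhost:$ mysql cinder -e 'select display_name, host, cluster_name, status from volumes where not deleted;'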

user@localhost:$ nova volume-attach myvm a592ff26-d70c-4a0e-92a3-ad3f5b8ac599
+----------+--------------------------------------+
| Property | Value |
+----------+--------------------------------------+
| device | /dev/vdc |
| id | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 |
| serverId | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 |
| volumeId | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 |
+----------+--------------------------------------+


user@localhost:$ mysql cinder -e 'select id, status, attach_status from volumes where status="in-use"; select id, volume_id, instance_uuid, attach_status from volume_attachment where not deleted;'
+--------------------------------------+--------+---------------+
| id | status | attach_status |
+--------------------------------------+--------+---------------+
| a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | in-use | attached |
+--------------------------------------+--------+---------------+
+--------------------------------------+--------------------------------------+--------------------------------------+---------------+
| id | volume_id | instance_uuid | attach_status |
+--------------------------------------+--------------------------------------+--------------------------------------+---------------+
| 4b6595f3-d2e9-46ae-86e5-353d8ad984b0 | a592ff26-d70c-4a0e-92a3-ad3f5b8ac599 | 74fa7147-a0c5-4b17-b9b3-ce4ebfbea911 | attached |
+--------------------------------------+--------------------------------------+--------------------------------------+---------------+


user@localhost:$ nova volume-detach myvm a592ff26-d70c-4a0e-92a3-ad3f5b8ac599

2.5 – Cleanup

We already tested non-clustered cleanup; now we’ll do a simple test to confirm that it also works on a clustered service -not that there’s really any difference.

We only have 1 service up and running, “c-vol2”, so we’ll request 1 volume creation and 1 volume deletion, stop the service, check the workers table, and restart the service.

user@localhost:$ mysql cinder -e "select * from workers;"

user@localhost:$ cinder delete creating; cinder create --name creating2 1; sleep 3; kill -9 -- -`cat /opt/stack/status/stack/c-vol2.pid`
Request to delete volume creating has been accepted.
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-10-18T15:14:40.000000 |
| description | None |
| encrypted | False |
| id | 29d19119-468b-4aab-93d0-c1a9cf5ca3b8 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | creating2 |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 2c61bc5ef2f143d9a8b16ac1f77ec6a6 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 6fbbe3c7c72145f7856b8c2681c51eeb |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+

user@localhost:$ mysql cinder -e "select * from workers;"
+---------------------+----------------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+----------------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+
| 2016-10-18 15:14:39 | 2016-10-18 15:14:39.126670 | NULL | 0 | 95 | Volume | a7443e99-b87a-4e0a-bb44-6b63bdef477b | deleting | 5 |
| 2016-10-18 15:14:41 | 2016-10-18 15:14:41.467478 | NULL | 0 | 96 | Volume | 29d19119-468b-4aab-93d0-c1a9cf5ca3b8 | creating | 5 |
+---------------------+----------------------------+------------+---------+----+---------------+--------------------------------------+----------+------------+

A very important part of this test is confirming that a service doesn’t automatically clean up after other services in the cluster on restart, because that’s one of the issues that the new cleanup mechanism is supposed to resolve.

So we’ll restart the “c-vol” service and check that neither the workers table nor the resources change.
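A simple way to verify this is to dump the relevant state before and after the restart and diff it; this is just a convenience sketch and the temporary file names are made up:

user@localhost:$ mysql cinder -e "select * from workers;" > /tmp/workers_before; mysql cinder -e "select id, status from volumes where not deleted;" > /tmp/resources_before

(restart the service in the “c-vol” window and give it a few seconds to finish initializing)

user@localhost:$ mysql cinder -e "select * from workers;" > /tmp/workers_after; mysql cinder -e "select id, status from volumes where not deleted;" > /tmp/resources_after

user@localhost:$ diff /tmp/workers_before /tmp/workers_after && diff /tmp/resources_before /tmp/resources_after && echo "c-vol did not clean up after c-vol2"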

Once that is confirmed, we can restart the “c-vol2” service and, just like before, check that one of the volumes goes to error, the other one is deleted, and the workers table ends up empty.
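Once the cleanup on start has had a few seconds to run, the same commands we have been using are enough to confirm the end state: the volume that was being deleted (“creating”) should be gone, the one that was being created (“creating2”) should end up in error, and the workers table should be empty again.

user@localhost:$ cinder list

user@localhost:$ mysql cinder -e "select * from workers;"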

2.6 – Manual cleanup

Now that we know a service can properly clean up after itself regardless of its cluster status, we have to cover the case where a service doesn’t come back up at all, so we want another service in the cluster to do the cleanup for it.

For this we have the new cleanup API introduced in microversion 3.18, which can be triggered with the work-cleanup CLI command.

So we stop the “c-vol2” service; that way the resources will be created by the “c-vol” service, which is the one that will end up in the host field of the resources in the DB.

Then request a creation and deletion much like we did before:

user@localhost:$ kill -- -`cat /opt/stack/status/stack/c-vol2.pid`

user@localhost:$ mysql cinder -e "select * from workers;"

user@localhost:$ cinder delete creating2; cinder create --name creating3 1; sleep 3; kill -9 -- -`cat /opt/stack/status/stack/c-vol.pid`
Request to delete volume creating2 has been accepted.
+--------------------------------+--------------------------------------+
| Property | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-10-18T15:46:48.000000 |
| description | None |
| encrypted | False |
| id | 11f4107d-b02b-413b-ac74-5d142d89d146 |
| metadata | {} |
| migration_status | None |
| multiattach | False |
| name | creating3 |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 2c61bc5ef2f143d9a8b16ac1f77ec6a6 |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| updated_at | None |
| user_id | 6fbbe3c7c72145f7856b8c2681c51eeb |
| volume_type | lvmdriver-1 |
+--------------------------------+--------------------------------------+

user@localhost:$ mysql cinder -e "select * from workers;"
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+
| 2016-10-18 15:46:47 | 2016-10-18 15:46:47.077675 | NULL | 0 | 103 | Volume | 29d19119-468b-4aab-93d0-c1a9cf5ca3b8 | deleting | 3 |
| 2016-10-18 15:46:48 | 2016-10-18 15:46:49.236839 | NULL | 0 | 104 | Volume | 11f4107d-b02b-413b-ac74-5d142d89d146 | creating | 3 |
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+

We now restart the “c-vol2” service, wait until “c-vol” is reported as down, and send the cleanup request:

user@localhost:$ cinder --os-volume-api-version 3.18 work-cleanup
Following services will be cleaned:
+----+-----------------------+-------------------+---------------+
| ID | Cluster Name | Host | Binary |
+----+-----------------------+-------------------+---------------+
| 3 | mycluster@lvmdriver-1 | host1@lvmdriver-1 | cinder-volume |
| 4 | mycluster@lvmdriver-2 | host1@lvmdriver-2 | cinder-volume |
+----+-----------------------+-------------------+---------------+

And if we check the workers table we’ll see how the resources are claimed by the service doing the cleanup as it processes them (they are not all claimed at once):

user@localhost:$ mysql cinder -e "select * from workers;"
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+
| created_at | updated_at | deleted_at | deleted | id | resource_type | resource_id | status | service_id |
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+
| 2016-10-18 15:46:47 | 2016-10-18 15:56:52.424241 | NULL | 0 | 103 | Volume | 26d284b6-1fcc-41d1-b9d3-d7f84c9174fa | deleting | 5 |
| 2016-10-18 15:46:48 | 2016-10-18 15:46:49.236839 | NULL | 0 | 104 | Volume | 11f4107d-b02b-413b-ac74-5d142d89d146 | creating | 3 |
+---------------------+----------------------------+------------+---------+-----+---------------+--------------------------------------+----------+------------+
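To watch the claims happen in real time, something like this is handy -it just refreshes the same query every second:

user@localhost:$ watch -n1 'mysql cinder -e "select id, resource_id, status, service_id from workers;"'

You should see the service_id of each row switch from 3 to 5 as “c-vol2” claims it, and the row disappear once that resource’s cleanup finishes.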

We didn’t specify any parameters when requesting the cleanup because we wanted to clean up all the services that were down at the moment, but we could have cleaned up only specific hosts or resources. The command’s full help explains the available options, and a more targeted example follows it.

usage: cinder work-cleanup [--cluster <cluster-name>] [--host <hostname>]
                           [--binary <binary>]
                           [--is-up <True|true|False|false>]
                           [--disabled <True|true|False|false>]
                           [--resource-id <resource-id>]
                           [--resource-type <Volume|Snapshot>]

Request cleanup of services with optional filtering.

Optional arguments:
  --cluster <cluster-name>
                        Cluster name. Default=None.
  --host <hostname>     Service host name. Default=None.
  --binary <binary>     Service binary. Default=None.
  --is-up <True|true|False|false>
                        Filter by up/down status, if set to true services need
                        to be up, if set to false services need to be down.
                        Default is None, which means up/down status is
                        ignored.
  --disabled <True|true|False|false>
                        Filter by disabled status. Default=None.
  --resource-id <resource-id>
                        UUID of a resource to cleanup. Default=None.
  --resource-type <Volume|Snapshot>
                        Type of resource to cleanup.
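
For example, a more targeted request -not run in this session, just a sketch using the names from this deployment- could limit the cleanup to one cluster and to services that are currently down:

user@localhost:$ cinder --os-volume-api-version 3.18 work-cleanup --cluster mycluster@lvmdriver-1 --is-up false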

Future manual tests

These are just some manual tests to illustrate how things work and how we can test them, and I think they give a good sample of the different scenarios. I have tested more cases -and manually debugged the code to make sure specific paths are followed in some corner cases- but these should be a good start and can serve as a guide for testing the other cluster aware operations.

I have also tested more cluster aware operations than the ones included here, but the ones above give a decent sample of the tests and their expected results, so more tests can easily be run in this setup. We can also easily add another cinder-volume service outside the cluster to confirm that clustered and non-clustered services can peacefully coexist; a rough sketch of how that could be started follows.
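
For that last test the extra service basically needs its own host value and no cluster option. A possible way to start it -the file name and host value are made up for illustration- reusing the main configuration would be:

user@localhost:$ cinder-volume --config-file /etc/cinder/cinder.conf --config-file /etc/cinder/vol3.conf

Here vol3.conf would only contain a [DEFAULT] section setting host = host3 and leaving cluster unset, so the service runs the same backends under its own host name, outside the cluster.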

I’ll be doing more tests on all operations that are already cluster aware, current and new, and even more once new features are added, like triggering service cleanups from the API.

Current cluster aware operations


– Manage existing
– Get pools
– Create volume
– Retype
– Migrate volume to host
– Create consistency group
– Update service capabilities
– Delete volume
– Delete snapshot
– Delete consistency group
– Delete consistency group snapshot
– Attach volume
– Detach volume
– Initialize connection
– Terminate connection
– Remove export

With the latest patches all operations in the Cinder volume service, including replication and image volume caching, are cluster aware.

On Automated Tests

I believe that to be reasonably certain the Active-Active implementation is reliable we’ll need some sort of Fault Injection mechanism for Cinder; not one intended for real deployments, like Tempest is, but one designed specifically for test deployments.

The reason for this is that you cannot automatically create a real-life workload, make it fail, and then check the results without really knowing what specific part of the code was actually running at the moment the failure occurred. Some failures can be simulated externally, but simulating others presents its own challenges.

Again we’ll take the crawl-walk-run approach, beginning with manual tests, then adding some kind of automated tests, then multi-node CI jobs, and finally -hopefully- introducing the Fault Injection mechanism to add additional tests.


Picture: “Checklist” by [Claire Dela Cruz](https://pixabay.com/en/users/ellisedelacruz-2310550/) is licensed under CC0 1.0



4 thoughts on “Manual validation of Cinder A/A patches”

  • Scott DAngelo

    Great Stuff, Gorka.
    I’d add that for Ubuntu the path to c-vol is:
    /usr/local/bin/cinder-volume
    so that would need to change for Section 2 and for the start command in the c-vol2 window.

    • geguileo Post author

      Thanks for letting me know, I’ll update the post to use cinder-volume without the path so it will work regardless of the system.

      • Erlon R. Cruz

        Great stuff Gorka. I noticed that after I stop one c-vol service it takes around 15 seconds for the status to be shown as down. Why does it take so long? Heartbeat configuration I guess? And what happens to the requests during this lapse?

        • geguileo Post author

          Thanks Erlon.
          Yes, it’s a matter of the heartbeats, and there are 2 factors in this equation: the heartbeat frequency and the time since the last heartbeat after which a service is considered to be down.
          If we want faster detection we have to reduce the time necessary for a node to be considered down, but we must fit at least a couple of heartbeats in that interval or we risk false positives when there are transient issues accessing the DB. We must also consider that the higher the reporting frequency, the more queries to the DB we’ll have.
          The configuration parameters are “report_interval”, which defaults to 10 seconds, and “service_down_time”, which defaults to 60 seconds.
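          For reference, both options go in cinder.conf’s [DEFAULT] section; a purely illustrative sketch of a faster-detection setup that still keeps a few heartbeats inside the down window:

          [DEFAULT]
          # Heartbeat frequency; defaults to 10 seconds
          report_interval = 5
          # Time without heartbeats before a service is reported as down; defaults to 60 seconds
          service_down_time = 20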