A Cinder Road to Active/Active HA


We all want to see OpenStack’s Block Storage Service operating in High Availability with Active/Active node configurations, and we are all keen to contribute to make it happen, but what does it take to get there?

Follow the yellow brick road!

Wasn’t Cinder already Active-Active?

Maybe you’ve been told that Cinder could be configured as Active/Active, or you’ve even seen deployments configured this way, or you just assumed that this was supported; after all, it’s the Cloud, and we like our Clouds redundant, resilient and unknowing of the word downtime.

The hard truth is that no matter what you’ve been told, what you’ve seen or what you thought, Cinder does not support High-Availability Active/Active configurations. Just because it doesn’t support Active/Active doesn’t mean that Cinder is not Highly Available; you can still have Active/Passive, N+1, N+M and N-to-1 node configurations.

As for Active/Active, sure, you can deploy your Cloud as Active/Active, and you may not see many errors, or any at all, but you are playing with fire my friend, because when it fails, and it eventually will, it will be disastrous.

Cinder running as Active-Active

So what’s the plan?

There has been quite some effort in the community to prepare Cinder for Active/Active support; unfortunately, after all that effort we don’t have much to show for it. It is my belief that this is because we have decided to take the hard road, one that will fix everything while providing a state-of-the-art Active/Active solution.

Usually I am all in favor of this approach, but maybe in this case it is not the best one, because it requires not only drastic changes in Cinder – which any solution will require – but also changes to other OpenStack components, in particular Nova. I think we would be better served by a different, incremental approach, where we first deliver a functional but basic solution and then work up from there until we reach the solution with all the bells and whistles, which is the one we really wanted all along.

This post tries not only to provide a bird’s-eye view of the Active/Active problem, but also to break it down into smaller problems so we can have a clear view of how the different parts interconnect and interact with each other. This way we can have a clear view of the individual tasks, with their issues, possible solutions and improvements; which should make it easier to coordinate our efforts.

My intention is not to tell the community what we have to do, but to provide a rough solution to be analyzed and torn apart, one we can use as a starting point for our conversations.

I considered creating specs instead of a post, but I ended up dismissing the idea, because I think it would be harder to follow as a whole, since it would end up being 5 or 6 specs inside a blueprint. Once we agree on a solution for each of the problems I have no problem creating the corresponding specs.

When you think of an Active/Active service you probably have a list of features you think the service should have. Some are basic features that we all agree on, and others are more advanced ones we may disagree on, but for the sake of discussion let’s assume this list:

  1. All active nodes from the same cluster should be able to handle requests for any resource managed by the common back-end configuration.
  2. Simultaneous action calls on the same resource on different nodes must be properly handled.
  3. On node startup, after adding a new node or restarting a failed one, ongoing jobs on other nodes should be unaffected.
  4. On failure, a node is expected to leave things as tidy as possible.
  5. After a node failure, pending and unfinished jobs from the failed node are expected to be picked up by active nodes.
  6. When a node loses connectivity, no data corruption may happen because of 2 nodes accessing the same resource, be it the node that picks up jobs from the failed node or nodes whose jobs operate on the same resource.
  7. Simultaneous actions on the same back-end should be properly handled.

Instead of going through each of these points and explaining what Cinder has now, what it needs for Active/Active, possible solutions, pros and cons of each, etc., I’m going to take a different approach: I’ll give a list of tasks with only one solution each, and that would be my chosen path. Please bear with me until you have finished reading all the steps in that path before dismissing it; I’m not saying it’s perfect, but some of those “issues” are resolved in other steps. Once you have read that section you’ll probably be wondering why I have chosen those solutions, and you may even have your own ideas or know of other options that have been proposed in meetings; that’s when we go into the alternatives. Finally I’ll discuss how we can implement some of those alternatives at a later time, once the basic solution is finished.

We engineers know that a snippet of code is sometimes worth a thousand words, so I’ve created some sample patches for most of the things I talk about in the draft solution. These patches are not complete, as they don’t fix all instances of the issue they address and most of them don’t even have tests, but they are working patches and can be tried. In some cases there are PoC patches from other community members already under review in Gerrit, and I’ll be referencing them in their respective sections.

The road taken

Although this could still be an interesting read, there is a newer post that gives a simpler solution to the problem.

Job distribution

We are currently distributing jobs to Volume Nodes using an AMQP topic queue with the node’s host configuration parameter as the topic. This implementation was well thought out from the start, so we can still use it with multiple nodes getting jobs from the same queue, since the broker makes sure that the same job is not delivered to multiple nodes.

Job distribution using same host

Even though the current job distribution system is fine as it is, we run into a problem in the Volume Nodes, since they assume they are the only ones accessing their respective back-ends, which is true for all currently supported node configurations. When a Volume Node starts the service it looks in the DB for any of its resources – identified by the host field – that is not in a permanent state and fixes that state. Usually this means that resources in deleting status are requested to be deleted again, since it’s easy to recover these stuck operations. Other operations are not that easy to resume, like creating and downloading, so for those cases resources are set to error. We also make sure in-use resources are exported by the back-end.

This creates a problem for Active/Active configurations, because you can no longer be sure that all jobs with your host belong to you; you can’t even tell if you are the only node currently working with that back-end. As a result, nodes don’t know which jobs are actually stuck and need to be fixed.

The solution to this problem is quite easy and elegant: we just need to change the way Volume Nodes acknowledge AMQP messages/jobs. Right now we acknowledge the reception of the jobs, so as soon as they arrive on the node we send a message to the broker to let it know we have the message/job, and the broker takes it out of the queue. All we need to do is change when we send the acknowledgement: instead of sending it as soon as we receive the message/job, we wait until we have completed the job. By doing this the broker will still take the message/job out of the queue once it has been received by the Volume Node, but it will not be completely removed from the broker; it will be left there awaiting acknowledgement and will only be removed once the node sends it. So what happens when a Volume Node dies? The broker detects that the connection with the node is lost and returns the unacknowledged messages/jobs to the queue for another node to retrieve them, or the same node if it’s the only one for the back-end.

To change the ACK order we change get_rpc_server in cinder/rpc.py when using oslo.messaging 1.17.0 or higher; it’s slightly different for version 1.16.0 or earlier:

class RPCDispatcherPostAck(messaging.rpc.dispatcher.RPCDispatcher):
    """RPC dispatcher that acknowledges messages only after processing them."""

    def __call__(self, incoming, executor_callback=None):
        return messaging.rpc.dispatcher.utils.DispatcherExecutorContext(
            incoming, self._dispatch_and_reply,
            executor_callback=executor_callback)

    def _dispatch_and_reply(self, incoming, executor_callback):
        # Dispatch the request first; acknowledge the message only once the
        # job has completed, so the broker can requeue it if this node dies
        # mid-operation.
        super(RPCDispatcherPostAck, self)._dispatch_and_reply(
            incoming, executor_callback)
        incoming.acknowledge()

def get_rpc_server(transport, target, endpoints, executor='blocking',
                   serializer=None):
    # Same as oslo.messaging's get_rpc_server, but using our dispatcher.
    dispatcher = RPCDispatcherPostAck(target, endpoints, serializer)
    return messaging.server.MessageHandlingServer(transport, dispatcher,
                                                  executor)

def get_server(target, endpoints, serializer=None):
    assert TRANSPORT is not None
    serializer = RequestContextSerializer(serializer)
    return get_rpc_server(TRANSPORT,
                          target,
                          endpoints,
                          executor='eventlet',
                          serializer=serializer)

So, instead of having the stuck-job detection and cleanup in the Volume manager’s init_host method, we will now do it when we are about to perform a task. This means that in order to do the cleanup we need a way to differentiate between tasks that have just been queued and those that come from a failed node. For this we will add new states to mark when a job is queued and when it is actually being worked on.

To avoid breaking current API behavior we will make these new queued states internal states, so they will be reported as the old status to anyone who isn’t interested in the difference between queued and ongoing states. This can be handled in the base Cinder Versioned Object as the status and raw_status attributes.

Extract from the new cinder/objects/status.py file that will contain the new status and mapping function:

STATUS_MAPPING = {
    DELETE_REQUESTED: DELETING,
    CREATE_REQUESTED: CREATING,
    BACKUP_REQUESTED: BACKING_UP,
    MIGRATION_REQUESTED: MIGRATING,
}

def map_status(to_map):
    # Accept either a raw status string or an object with a raw_status
    # attribute, and return the external status, hiding the internal
    # queued states from API consumers.
    if not isinstance(to_map, six.string_types):
        to_map = to_map.raw_status
    return STATUS_MAPPING.get(to_map, to_map)

We’ll also make changes to Cinder’s Versioned Object base:

class CinderObjectRegistry(base.VersionedObjectRegistry):
    def _register_class(self, cls):
        super(CinderObjectRegistry, self)._register_class(cls)
        if not hasattr(cls, 'status'):
            return

        # Keep the original field reachable as raw_status...
        setattr(cls, 'raw_status', cls.status)
        cls.obj_extra_fields.append('raw_status')

        # ...and turn status into a property that reports the mapped
        # (external) status while writes still go to the original field.
        new_prop = property(objects.status.map_status,
                            cls.status.fset,
                            cls.status.fdel)
        setattr(cls, 'status', new_prop)

class CinderObject(base.VersionedObject):
    def obj_get_changes(self):
        # Persist the internal (raw) status, not the mapped one.
        result = super(CinderObject, self).obj_get_changes()
        if 'status' in result:
            result['status'] = self.raw_status
        return result

This solution has an additional and very convenient benefit: recovered jobs from failed Volume Nodes will no longer be considered stuck just because the RPC request had arrived from the API Node at a failed Volume Node; the node will have to actually start working on them before they can be considered stuck. As you can imagine this is helpful for all jobs that were waiting on a resource lock when the node failed, since they’ll now be able to complete instead of being lost.

Let’s see a couple of diagrams of how jobs are processed and how failure is handled. The showcase consists of 2 jobs: one to delete a volume – vol1 – and another to create a new volume from that same volume.

The status cleanup should be quite straightforward, because we can differentiate between the different situations (a sketch of this decision logic follows the list):

  • Resource is in queued status:
    • Queued for the same operation we want to do: Then it doesn’t matter whether this comes from a failed node or is a new job that has just been queued; we can proceed with the operation, because we know the operation has never been started and the API cannot queue the same writing operation twice.
    • Queued for a writing operation and we want to read from it: We can proceed with the operation, as the original operation is in the AMQP queue and when it reaches another Volume Node it will wait on the lock until we release the resource.
    • Queued for a different operation and we want to write: We can ignore the current operation, because it has already been completed (duplicate message) by another node that lost its connection to the broker but kept all other connections, as explained later in the data corruption section.
  • Resource is in an ongoing status (deleting, creating, etc.):
    • If it’s in deleting, attaching or detaching status, we’ll proceed if we want to perform that same operation, or fail if we are performing any other operation.
    • In any other case we set the volume to error and proceed with current operation if we can.
  • Resource is at rest:
    • We want to write to the resource: Abort, since this is a duplicate job like the one mentioned before.
    • We want to read from the resource: Proceed normally.
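
For those who prefer code, here is a minimal sketch of that decision logic; the status constants and helper names are hypothetical, and the real check would live in the Volume manager right before each operation starts:

# Hypothetical constants: map each internal queued status to its operation.
QUEUED_STATUSES = {'delete_requested': 'deleting',
                   'create_requested': 'creating'}
ONGOING_STATUSES = ('deleting', 'creating', 'attaching', 'detaching')
RESUMABLE_STATUSES = ('deleting', 'attaching', 'detaching')

def cleanup_check(resource, requested_op, is_read):
    """Return what to do with a picked-up job: 'proceed', 'abort' or 'error'."""
    status = resource.raw_status

    if status in QUEUED_STATUSES:
        if QUEUED_STATUSES[status] == requested_op or is_read:
            # Never started (the API can't queue the same write twice), or we
            # only read and the original writer will simply wait on the lock.
            return 'proceed'
        return 'abort'        # duplicate message, operation already completed

    if status in ONGOING_STATUSES:
        if status in RESUMABLE_STATUSES:
            return 'proceed' if status == requested_op else 'abort'
        return 'error'        # e.g. creating/downloading: set to error, go on

    # Resource is at rest: reads proceed, duplicate writes abort.
    return 'proceed' if is_read else 'abort'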

1. Update vol1 status to deleting queued
2. Queue vol1 deletion
3. Worker picks deletion job
4. Deletion job is now waiting for acknowledgement in Message Broker
5. Queue new volume creation from vol1
6. Delete task requests and acquires lock for vol1 deletion
7. Worker picks creation job
8. Creation job is now waiting for acknowledgement in Message Broker
9. Create task requests lock for vol1 and blocks
10. Check status in DB for vol1 and update it to deleting
11. Begin deleting vol1

1. Node dies
2. Connection to Message Broker is broken
3. Deletion job returns to Jobs Queue in the Message Broker
4. Worker picks deletion job
5. Delete task requests lock for vol1 and blocks
6. Failed node lock on vol1 expires
7. Create task acquires lock on vol1
8. Create task checks vol1 status in DB and fails (deleting)
9. Create task releases lock on vol1
10. Creation job is acknowledged in the Message Broker


1. Delete task acquires lock on vol1
2. Checks status in DB (deleting)
3. Deletion of vol1 on storage
4. Update vol1 status to deleted
5. Release lock on vol1
6. Deletion job is acknowledged in the Message Broker

There are 3 patches in Gerrit that illustrate the above-mentioned changes.

API Nodes

There are several places in Cinder’s API code where we can have races in the DB updates, and even though these races are quite the rare sight, we still need to fix them.

To make sure we don’t have races on DB updates we’ll implement a compare-and-swap mechanism with retries on DB deadlocks since it’s the best balanced technique to solve these races.

You can read the in-depth analysis of the API race problem, including different solutions with implementations and test results, that helped choose the compare-and-swap solution in my previous post, Cinder’s API Races.

Like with the new queued status, we will be adding more functionality to our base Cinder Versioned Object – an ORM_MODEL attribute to know the object’s model and a conditional_update method – but in this case we will be adding new functionality to our DB API as well.

The idea is to have a method that ensures atomic changes in the DB if certain conditions of the resource are met. Since the difference in effort is small, we should implement a flexible solution that allows multiple field changes in the same atomic operation, which may or may not include the status, and that supports not only matching multiple fields but also matching different values for the same field, matching on exclusion – as in a value different from X or not in a list – as well as more complex filters, like a subquery on another table.

This implementation is not to my liking, because it’s not as object oriented as it should be, but a slight modification will fix that, and right now it is enough for a working PoC.

Changes to the DB:

class Not(object):
    """Marker to request exclusion matching (field != value or NOT IN list)."""
    def __init__(self, value):
        self.value = value

@_retry_on_deadlock
@require_context
def conditional_update(context, model, values, expected_values, filters=()):
    # Build the WHERE conditions from the expected values plus any extra
    # SQLAlchemy filters the caller provides.
    where_conds = list(filters)
    for field, value in expected_values.items():
        if isinstance(value, db.Not):
            value = value.value
            if (isinstance(value, collections.Iterable) and
                    not isinstance(value, six.string_types)):
                where_conds.append(~getattr(model, field).in_(value))
            else:
                where_conds.append(getattr(model, field) != value)

        elif (isinstance(value, collections.Iterable) and
                not isinstance(value, six.string_types)):
            where_conds.append(getattr(model, field).in_(value))
        else:
            where_conds.append(getattr(model, field) == value)

    # Atomic UPDATE ... WHERE; returns True if a row matched and was updated.
    session = get_session()
    with session.begin():
        query = model_query(context, model, read_deleted='no',
                            project_only=True)
        result = query.filter(*where_conds).update(values,
                                                   synchronize_session=False)
        return 0 != result

CinderObject changes:

    def conditional_update(self, values, expected_values, filters=()):
        if not hasattr(self, 'ORM_MODEL') or 'id' not in self.fields:
            msg = (_('VersionedObject %s does not support conditional update')
                   % self.obj_name())
            raise NotImplementedError(msg)

        # The update is always conditioned on this object's id as well.
        expected = expected_values.copy()
        expected['id'] = self.id
        result = db.api.conditional_update(self._context, self.ORM_MODEL,
                                           values, expected, filters)
        if result:
            # The row was atomically updated; reflect the changes locally.
            self.update(values)
            self.obj_reset_changes()

        return result

This is how we can use this new method to update a Volume’s time of deletion and its status to delete requested (queued), taking into account the force argument and whether the volume has snapshots – there is no ORM relationship between the Volume model and the Snapshot model, which makes it harder.

    def volume_has_snapshots_filter():
        # Subquery filter: True if any snapshot row references the volume
        # (there is no ORM relationship we can rely on).
        return sql.exists().where(
            models.Volume.id == models.Snapshot.volume_id)

    def delete(self, context, volume, force=False, unmanage_only=False):
        # ...
        now = timeutils.utcnow()
        expected = {'attach_status': db.Not('attached'),
                    'migration_status': None,
                    'consistencygroup_id': None}
        good_status = ('available', 'error', 'error_restoring',
                       'error_extending')
        if not force:
            expected.update(status=good_status)

        # Volume cannot have snapshots if we want to delete it
        filters = [~volume_has_snapshots_filter()]

        updated = vol_obj.conditional_update(
            {'status': objects.status.DELETE_REQUESTED,
             'terminated_at': now},
            expected,
            filters)

There are 2 patches in Gerrit that illustrate the above-mentioned changes, and there is also a spec.

Volume Nodes

Cinder Volume Nodes make use of back-end stored resources in 2 ways, for modification and for reading, and in general these are mutually exclusive. Given the currently supported node configurations we use local file locks to restrict operations on resources. While this solution is perfect for Active-Passive configurations, it is insufficient for Active-Active configurations, where we need to preserve exclusive access between different Volume Nodes.

An example of exclusive access is when we want to delete a volume – a writing operation – because it requires that there are no readers using the resource, like an operation creating a new volume from that same volume. A case of a reader wanting to prevent write access during the operation would be when we are performing a backup.

The solution to this problem is to use a Distributed Lock Manager (DLM) to share locks among the Volume Cluster Nodes. To abstract ourselves from the actual DLM used, and to give users more flexibility when choosing one, we’ll be using a distributed-systems helper library called Tooz. Ceilometer already uses Tooz in HA deployments of the central and compute agent services.

This will allow us to pick the locking back-end through configuration depending on the deployment: for example, for Active-Passive we could use local file locks and for Active-Active we could use a real DLM.
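
As a rough sketch of what this could look like – the backend URL, member id and lock names are illustrative assumptions, not the final configuration options – a manager operation could do something like this with Tooz:

from tooz import coordination

# 'file://...' would keep today's behaviour of local file locks, while
# something like 'zookeeper://zk1:2181' would provide a real DLM.
coordinator = coordination.get_coordinator('file:///var/lib/cinder/locks',
                                           b'volume-node-1')
coordinator.start()

def delete_volume(volume_id):
    lock = coordinator.get_lock(b'volume-' + volume_id.encode())
    lock.acquire()     # blocks until no other node of the cluster holds it
    try:
        # check the resource status in the DB, set it to 'deleting',
        # then call the back-end driver...
        pass
    finally:
        lock.release()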

Locks are needed not only in the manager, as seen in the previous examples, but, depending on the back-end driver used, they may be needed in the driver itself as well, as proven by a recently discovered LVM LIO bug where we saw that we couldn’t concurrently access configfs using rtslib.

It is important to note that the lock would only be held for the duration of the Cinder code execution, but some operations will still keep the resource in use for writing after the code in Cinder has finished executing, as is the case when attaching a volume.

This is not a problem, since we are not trying to queue operations using locks, simply avoiding concurrent access to the same resource while keeping current behavior (some queuing).

So on a write operation we’d acquire the lock, and then, if the resource is in a state where we can do the operation, we’ll set the status of the resource to creating, attaching, etc. If we cannot perform the operation after acquiring the lock, it will fail just as it does now, for example when trying multi-attach on a back-end that doesn’t allow it or trying to delete a volume that is attached.

On a read operation we’ll acquire the lock, waiting if there is a write operation, and once we acquire the lock we’ll check if we can do our operation and if we can’t, for example because there is an ongoing write operation, we’ll raise an exception.

For those who doubt that we really need to use a DLM, the answer is yes, because we cannot tell just by looking at the status of a resource whether it is being used or not, since only writing operations modify resource status. Even if we decided to go with a DB solution we would basically be implementing a DLM: atomic changes to the DB are locks on a row at the node level, and at the cluster level they require consensus for the update.

There were already 2 PoC patches using Tooz locks as well as a spec for the manager and drivers, and I have added a couple more to show resource locking within TaskFlows, starting and stopping the coordinator in Services, and Object locking.

It would probably be a good idea to create a specific decorator that would acquire the locks and, for each acquired lock, fix its resource’s status according to the operation/method being decorated. That way we wouldn’t have to worry about fixing resource statuses when we are about to perform an operation on them. A possible shape for such a decorator is sketched below.
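
A possible shape for that decorator, with hypothetical helper and attribute names, could be:

import functools

def locked_operation(working_status):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, resource, *args, **kwargs):
            lock = self.coordinator.get_lock(b'volume-' + resource.id.encode())
            lock.acquire()
            try:
                # Clean up whatever a failed node may have left behind and
                # mark the resource as being worked on (deleting, creating...).
                self._fix_resource_status(resource, working_status)
                return func(self, context, resource, *args, **kwargs)
            finally:
                lock.release()
        return wrapper
    return decorator

# Usage would be something like @locked_operation('deleting') on the Volume
# manager's delete_volume method.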

Service state reporting

Another benefit of using the same host for all cluster nodes is that we won’t need to change the Service state reporting mechanism.

To detect when a service is down we use a heartbeat that is stored in a Service table in the DB; so the only change we’ll need to make to the code is to make sure that Cinder doesn’t break when different nodes from the same cluster simultaneously update the same entry in the DB with a heartbeat. For this we’ll use the same atomic update for Cinder Versioned Objects we used for the API.
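
As a quick sketch, assuming the Service versioned object gets the same conditional_update method and using illustrative field names, the heartbeat could look like this:

def report_heartbeat(service):
    now = timeutils.utcnow()
    # Apply the update only if it moves the heartbeat forward, so concurrent
    # reports from other nodes of the cluster can never clash or go backwards.
    service.conditional_update({'updated_at': now},
                               expected_values={},
                               filters=[models.Service.updated_at < now])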

That way as long as one of the nodes in the cluster is up the service will be reported as up.

There is one patch that illustrates this.

Capabilities reporting

All Volume Nodes periodically report their capabilities to the schedulers to keep them updated with their stats, that way they can make informed decisions on where to perform operations.

In a similar way to the Service state reporting we need to prevent concurrent access to the data structure when updating this information. Fortunately for us we are storing this information in a Python dictionary, and since we are using an eventlet executor for the RPC server we don’t have to worry about using locks; the inherent behavior of the executor will prevent concurrent access to the dictionary.

We got lucky and we don’t need to update anything in the code, capabilities reporting is ready for Active-Active configurations.

Prevent data corruption

Volume Nodes must try their best to prevent situations where losing the network connection to the DB, the broker, the DLM or the storage results in undesired situations, like a node picking up a job from another node that lost its connection to the broker and both accessing the same DB entry or back-end data simultaneously.

There are currently no PoC patches for these 4 different connectivity problems:

1- DB:
This case is when a node has no access to the DB or can only access nodes outside the primary component – it doesn’t really matter if this is a simple partition or an equal-weight split-brain situation where the quorum algorithm can’t find a primary component – of a multi-master partitioned DB cluster.

Here we should probably:

  • Break connection to the AMQP broker so that we receive no more jobs and pending jobs are redirected to other nodes.
  • For operations that have not yet changed from the queued status we should release the locks and other nodes will pick up the jobs and hopefully will be able to complete them.
  • For operations that were already underway we complete them and release read locks, but keep write locks until the DB situation is resolved. An alternative would be to release the locks and let other nodes set the status to error.

It’s easy to identify that we are accessing a node outside the primary component because we receive a very specific error. In the case of a Galera cluster it is: ERROR 1047 (08S01): WSREP has not yet prepared node for application use. But we don’t need to handle that as a specific case; we can treat it the same way as when we lose access to the DB.

Maybe a good place to do this is in the heartbeat periodic task that updates Service’s model_disconnected in cinder.service.Service.report_state.

2- AMQP broker:

When the connection to the AMQP broker is lost we should continue as if nothing had happened. Jobs/messages sent to this node will be sent again to other nodes, where they’ll wait on the resource lock; once this node has completed the operations and released the locks, the other nodes that got the “duplicated jobs” will know that the jobs have already been completed, because the status of the resources will not be requested/queued. The exception to this general rule would be synchronous operations – like attach and detach – which should be rolled back and their locks released, since we cannot send back a response to the API Node, but other nodes may be able to do it.

3- DLM:

As soon as the connection to the DLM is considered as lost, all access to the storage back-end and DB should be stopped.

A good system configuration is important to prevent simultaneous access. There are 2 parameters that should be carefully set for Tooz locks: one is the heartbeat rate at which we send lock keepalives to the DLM, and the other is the lock expiration timeout.

Failed heartbeats will be used to determine when we consider the connection to the DLM lost; we may decide that the connection has been dropped only after N heartbeat errors.

So it’s important that the lock expiration timeout is greater than N * heartbeat rate to give us time to detect the connection loss and update the DB status before locks are automatically released and another node that is waiting on the resource is allowed to access it.
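
To make the relationship explicit, with purely illustrative numbers:

heartbeat_rate = 1.0            # seconds between lock keepalives sent to the DLM
max_missed_heartbeats = 5       # N errors before declaring the connection lost
lock_expiration_timeout = 10.0  # seconds before the DLM auto-releases our locks

# The lock TTL must outlive the detection window (plus some margin for the
# DB status update) or another node could grab the resource too early.
assert lock_expiration_timeout > max_missed_heartbeats * heartbeat_rate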

4- Storage:

For the time being Storage access errors will not be treated in a specific way and monitoring tools should be used to detect problematic nodes – those with 100% failure rate.

Service status reporting

It is important to modify the is_working method in the managers to return False when we have decided the service is down. By doing this we will stop sending heartbeats to the DB Service table.

Limitations of this solution:

As with any solution where you make compromises, there are limitations. In this case performance took second place and more importance was given to compatibility with the existing solution, as well as to implementation time, so there are some things we should be aware of.

There are ways to overcome these limitations, but they require more effort and are not appropriate for the first implementation. They will be explained in the last section of this post.

One constraint we should at least keep in mind is that, with the chosen job distribution, we will no longer be able to address each individual node, since all Volume Nodes from the cluster share the same AMQP queue. It doesn’t affect any current functionality, so we don’t need to worry about it right now.

A performance bottleneck of this solution is that we are disregarding the fact that multiple readers of the same resource do not need exclusive access, like when 2 users create a volume using a public volume as the source; given the proposed use of locks, we are imposing a one-reader-at-a-time limitation.

The road not taken

Multiple alternative solutions were disregarded because they were considered inferior to the chosen ones, but we are going to have a quick view of them for the sake of completeness.

Job Distribution storing a list in the Host field

Although this solution is very intuitive, it has too many limitations. Whenever we add a new node to the cluster we would need to update all resources in the DB to include the new host – otherwise it would not be addressed by the API or Scheduler Nodes – and we would also need to either update all other nodes’ configurations to include the new node’s host or implement a mechanism to inform other nodes of node addition and removal, as new resources would need an updated list of hosts to store in the host field. It would also complicate the API and Scheduler, as they would be sending requests to only 1 node and would need to implement failover strategies in case the addressed node does not pick up the job.

Job cleanup

To replace the start-up cleanup of jobs left behind by failed Volume Nodes, we could go through all resources from the current host, even those that are being worked on by other active nodes, and check if they are in a processing state; if they are, we would try a non-blocking acquisition of the locks, and if we got the lock we would consider the resource stuck (see the sketch below). For this to work we would need to have a lock acquisition timeout greater than the lock expiration timeout; otherwise we could fail to acquire a lock that was just waiting to expire after the owner node had already failed.
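
A sketch of how this could look, assuming Tooz’s acquire accepts a timeout and using hypothetical helper names:

def cleanup_stuck_resources(coordinator, resources, acquire_timeout):
    # acquire_timeout must be greater than the lock expiration timeout, so a
    # lock held by a dead node has time to expire and be granted to us.
    for resource in resources:
        if resource.status not in PROCESSING_STATUSES:
            continue
        lock = coordinator.get_lock(b'volume-' + resource.id.encode())
        if lock.acquire(blocking=acquire_timeout):  # a live owner keeps it held
            try:
                fix_stuck_resource(resource)  # e.g. retry deletes, set to error
            finally:
                lock.release()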

Replacing locks with a GC

At some point it was suggested to use a Garbage Collector instead of locks: we would change the state in the DB and leave the operation to be performed at a later time by the Garbage Collector, while readers can continue operating on the resource.

I may be missing something in this approach, but as I see it this solution would only work for delete operations and would not be applicable to all other exclusive access situations, like preventing writing to a resource when it is being used as the source, so it is insufficient.

Replace locks using the DB

Using DB locks is not an option since they are problematic in DB Cluster deployment, so we would need to manually implement the locks with their expiration timeouts.

Even with the considerable benefits this alternative presents – not adding a DLM requirement or any other external tool, easy cleanup of failed jobs using expired heartbeats in the DB – we would be reinventing the wheel by implementing a DLM in the DB, and we would be adding a lot of complexity to the Cinder code, even more so if we decide to allow multiple readers for the same resource.

The road that might be taken

Once initial implementation is up and running we will certainly want to improve it to remove limitations and increase performance. There are some interesting options available for following iterations.

Job Distribution

To overcome the individual node addressing limitation, where we cannot address one specific node anymore, we could add a new cluster identification field to the services. It would be a unique identifier to unite all nodes from the same cluster.

We would change the host field to cluster_id + host + back-end, and we would stop using the host AMQP topic queue for job distribution – that one would be kept for direct individual node addressing – and start using a new topic queue, cluster_id.*.back_end (a rough illustration follows).
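
The naming scheme and variables below are assumptions, not a final design; they just illustrate keeping a per-node topic for direct addressing alongside a shared cluster topic for job distribution:

import oslo_messaging as messaging

# Illustrative values only.
cluster_id = 'cluster1'
backend = 'lvmdriver-1'
host = 'node1@lvmdriver-1'

# Per-node topic kept for direct addressing of one specific node.
direct_target = messaging.Target(topic='cinder-volume.%s' % host, server=host)

# Shared topic for the whole cluster/back-end, used for job distribution.
cluster_target = messaging.Target(topic='cinder-volume.%s.%s' % (cluster_id,
                                                                 backend))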

As usual, configuration should be done carefully, using the same cluster-id among nodes of the same Cluster that share the same back-ends and never repeating it between different Clusters.

This solution would give us additional node configuration options besides the standard Active-Passive, Active-Active, N+1, N+M, N-to-1 and N-to-M that we have with the first solution; we could also have hybrid configurations (what we could call a poor man’s configuration).

In the next example Node 3 would be a multi back-end node providing redundancy for nodes 1 and 2.

Possible Job Distribution using Cluster-id

Job recovery

Although I am in favor of fail-fast solutions over recovering ones for Cloud infrastructure, because the latter drastically increase complexity, there have been efforts in the community to add persistence to Cinder TaskFlows in order to allow failed job recovery, and maybe it is something we should explore a little more before dismissing it.

Performance improvement

As mentioned before, the proposed implementation limits the number of concurrent readers to 1, which is very inefficient; it would make more sense to give multiple readers simultaneous access to a resource and just prevent writers from accessing it while there are readers. The reverse would also hold true: if a resource is undergoing a writing operation, then access would be restricted for other writers and for readers.

We would need to incorporate a new type of locks in Tooz, the classic read-write locks, as well as add specific driver implementations.

Required functionality for these new locks is (a purely local sketch of these semantics follows the list):

  • Allow readers to acquire lock when it’s free or only other readers have acquired it.
  • Prevent a writer from acquiring a lock if there are readers that haven’t released the lock (or the lock hasn’t expired).
  • Prevent a second writer from acquiring a lock if there’s already a writer who has it.
  • Allow setting a limit on the number of readers that can concurrently access a lock. Setting a sensible value would prevent resource hogging.
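
As a purely local illustration of these semantics (the real implementation in Tooz would be distributed and driver-specific), a read-write lock could behave like this:

import threading

class ReadWriteLock(object):
    """Local illustration only; writer starvation and distribution concerns
    are deliberately ignored."""

    def __init__(self, max_readers=None):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._max_readers = max_readers  # optional cap to avoid hogging

    def acquire_read(self):
        with self._cond:
            while self._writer or (self._max_readers is not None and
                                   self._readers >= self._max_readers):
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()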

Under certain circumstances it could prove useful to allow multiple writers to acquire the same lock, like we do with the readers; for example for multi-attaching.

In addition to fixing our problem in Cinder, adding this new type of lock to Tooz would mean that other OpenStack projects that require similar functionality can reuse it.

API Fast Fail

A lot of effort is being directed at changing how the API behaves to make it capable of fast-failing when resources are busy and the requested operation cannot progress.

The idea here is to return an error to the caller when the resource is busy, as in VolumeIsBusy, instead of getting the request queued in the Volume Node waiting for the locks to be released.
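
A sketch of how this could look at the API layer, reusing the conditional_update method from earlier sections; the statuses and exception are illustrative:

def begin_backup_or_fail(volume):
    updated = volume.conditional_update({'status': 'backing-up'},
                                        expected_values={'status': 'available'})
    if not updated:
        # Resource is busy elsewhere: fail right away instead of letting the
        # request queue up behind the locks on a Volume Node.
        raise exception.VolumeIsBusy(volume_name=volume.name)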

This is a really big change that affects the Nova-Cinder interactions, the cinder client as well as any custom script people may have out there that rely on current behavior.

Even though we are not changing any documented behavior it could be argued that current behavior is an implicit contract that should be kept in future releases.

This is a long term change that can greatly benefit OpenStack’s Nova-Cinder integration but that requires changes to Nova, probably with Specs, as well as Cinder, so it should be worked on in parallel with proposed changes.

For fast-failing we’ll probably want to check locks from the API as well, to see if resources are being used by readers, since the status field is not enough to know that; but this is a completely different story that is as complex as the whole HA Active-Active issue. We could also create a new reading status to clarify when a resource is being used for reading.

 


Modified picture from source: “Follow the yellow brick road!” by duncan c is licensed under CC BY-NC 2.0