Last week I presented a possible solution to support Active-Active configurations in Cinder, and as much as it pains me to admit it, it was too complex, so this week I’ll present a simpler solution.
Change of heart
I really liked last week’s solution for allowing Active-Active HA configurations in Cinder, but it was brought to my attention that the complexity it added to the component was not worth the modest benefits it brought (like recovering queued jobs).
It didn’t take me too long to go from the initial denial stage to acceptance; after all, it was quite clear that the solution left something to be desired regarding complexity. A simpler solution was required, and here I present what I believe is a reasonable alternative.
I’ll follow the same format as in the last post and give solutions to the same problems, so it will look familiar to those who read the old post; although in this case I didn’t have time to create any patches or diagrams, as I wanted to have it ready for the Cinder Midcycle Sprint.
So how will we distribute jobs between the different nodes in the cluster?
Since we still don’t want our API or Scheduler nodes to worry about how many nodes are available in a cluster at a given time, or which ones they are, we will use the same method as before to distribute the jobs: topic queues. The only difference is that we’ll change the topic of those queues: instead of using host@backend as the queue topic, we’ll use cluster@backend.
Cluster will be a new configuration option that must be kept identical among all nodes of the same cluster; if it’s not configured, it will default to the value of the host configuration option. As with any other Cinder deployment, the host key must be unique for each node.
Having the cluster key default to the host key value has multiple benefits: non Active-Active configurations will see services reported as they were before (host@backend), and if a single node configuration or an Active-Passive deployment later wants to add an additional active node to form an Active-Active configuration, it can be done without downtime, just by configuring the new node’s cluster key with the first node’s host key. It’s not clean, but it will help the admin in a pinch until downtime is acceptable.
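As a sketch of that in-a-pinch scenario, and assuming the new option is literally named cluster, the configuration could look like this (backend and node names are made up):

```
# Node 1 (the original node, unchanged):
[DEFAULT]
host = node1
# cluster is not set, so it defaults to host and the node keeps
# consuming from the node1@lvm topic queue, just as before.

# Node 2 (the new active node), reusing node1's host value as the cluster:
[DEFAULT]
host = node2
cluster = node1
# Both nodes now consume from the same node1@lvm topic queue.
```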
To handle the cleanup we’ll require a new field in all resource tables (volumes, backups, snapshots, etc.). We’ll call this new field worker and it can take 3 different values:
- Empty value: Nobody is working on the resource
- cluster@backend: The resource is queued to be worked on
- host@backend: The resource has been picked up by a worker
We could probably make it work without the cluster@backend value, but I like to keep track of where things are.
We will also be using the recently introduced previous_status field, like the backup operation does for in-use volumes.
Normal workflow for an operation on an existing resource will be like this:
- The API takes in the request, and it will change not only the status but also the previous_status, and set the worker field to the resource’s host field value (cluster@backend), before making the RPC call that queues the message in the Volume topic queue.
- When a Volume Node receives the job it will change the worker field in the DB to host@backend and proceed as usual.
- Once it has completed the operation it will blank the worker field, and on successful completion the previous_status value will be restored to the status field.
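The steps above can be sketched as follows; the Resource class and helper functions are hypothetical illustrations, not actual Cinder code or models:

```python
# Minimal sketch of the worker-field lifecycle described above.
class Resource:
    def __init__(self, host, status):
        self.host = host            # e.g. "cluster@backend"
        self.status = status
        self.previous_status = None
        self.worker = None          # empty: nobody is working on it

def api_accept(resource, new_status):
    """API node: save the old status and mark the job as queued."""
    resource.previous_status = resource.status
    resource.status = new_status
    resource.worker = resource.host          # cluster@backend: queued

def volume_node_pickup(resource, node_id):
    """Volume node: claim the job before working on it."""
    resource.worker = node_id                # host@backend: picked up

def volume_node_complete(resource, success=True):
    """Blank the worker; on success restore the saved status."""
    resource.worker = None
    if success:
        resource.status = resource.previous_status

vol = Resource(host="mycluster@lvm", status="in-use")
api_accept(vol, "backing-up")
volume_node_pickup(vol, "node1@lvm")
volume_node_complete(vol)
print(vol.status, vol.worker)    # in-use None
```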
This is fine and dandy, but how do we do the cleanup? Mostly like we are doing it today, only with small changes.
When a node starts up, it looks in the DB for resources that have its host@backend in the worker field and does a compare-and-swap blanking of that field to make sure nobody else tries to fix them. Then, for resources that only require DB changes, those changes are made; for resources that require operations (like deleting), we set the worker field to cluster@backend and make an RPC call for the operation to be performed. This way we distribute the cleanup workload among all the cluster’s nodes.
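The compare-and-swap claim can be done with a conditional UPDATE; here is a sketch using plain SQL through sqlite3, with hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE volumes (id TEXT PRIMARY KEY, worker TEXT)")
conn.execute("INSERT INTO volumes VALUES ('vol-1', 'node1@lvm')")

def claim_for_cleanup(conn, resource_id, dead_worker):
    """Blank the worker field only if it still holds the dead node's id.
    The WHERE clause makes this a compare-and-swap: if somebody else
    already claimed the resource, rowcount is 0 and we skip it."""
    cur = conn.execute(
        "UPDATE volumes SET worker = NULL WHERE id = ? AND worker = ?",
        (resource_id, dead_worker))
    conn.commit()
    return cur.rowcount == 1

print(claim_for_cleanup(conn, "vol-1", "node1@lvm"))   # True: we got it
print(claim_for_cleanup(conn, "vol-1", "node1@lvm"))   # False: already claimed
```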
Again, this is nice, but what about cases where the node is not brought back up, or where there is no failover to a Passive node? Those cases will be handled by the Scheduler, as I explain later in the service state reporting section.
Another option, instead of adding the new field to all the resource tables, is to create a specific table for resources that are being worked on. If we opt for this solution we would need to store the resource ID, the worker, and the table the ID belongs to. This solution requires fewer changes and makes it easier to get all resources being used by a node, since you only have to check one table; on the other hand, it makes it harder to know who is working on a specific resource, and it makes the atomic change of status and worker a pain.
Here we’ll need to remove the races, just like I explained in the section of the same name in last week’s post. But we may also need additional changes on these nodes, depending on how we resolve mutual exclusion for resources.
I am not going to go over how we need to change the current local file locking mechanism again; it’s already explained in the Volume Nodes section of the previous post.
There is a division in the community between those who think we should use a DLM to solve resource locking and those who want to spare cloud administrators from having to deploy and configure even more software (Redis, ZooKeeper, etc.).
My personal opinion is that using a DLM is great as an intermediate step until we reach the final stage of Active-Active, because it can be implemented using Tooz with little effort, and that is what I like about this option: we can get a first solution quickly and then work on removing locks from the Volume Manager and the drivers.
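To illustrate the usage pattern a DLM gives us, here is a process-local stand-in: with Tooz, a coordinator’s get_lock call would return a distributed lock backed by ZooKeeper, Redis, etc., but the calling code would look the same. All names here are illustrative, not Cinder or Tooz code:

```python
import threading
from collections import defaultdict

class LocalLockManager:
    """Hands out one named lock per resource id (process-local stand-in
    for the per-resource distributed locks a DLM would provide)."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def get_lock(self, name):
        with self._guard:
            return self._locks[name]

manager = LocalLockManager()
with manager.get_lock("volume-uuid-1234"):
    # critical section: only one worker operates on this volume at a time
    pass
```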
As a side note, some drivers may be able to remove locks once we remove API races and add a few missing locks.
Since the DLM solution would only be affecting administrators of Active-Active deployments, it may not affect that many initially.
But it’s always good to think of alternatives; just because I like an option doesn’t mean it’s the best thing to do. So here’s an alternative that I also like: using the resource status.
We can add a new reading status for all those operations that currently make no status changes in the resource, like cloning. This way the API would know that the resource is not available and could fail fast.
For this to work, resource locking – setting the status to reading – should happen in the API Nodes instead of the Volume Nodes, to prevent multiple API calls from trying to work on the same resource simultaneously, and we should also set the worker field in the resource so that cleanup on failure is possible. The cleanup for a reading resource basically consists of restoring the status from previous_status and, if that status is in-use, calling the ensure_export method on the resource.
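That cleanup step can be sketched as follows; the resource and driver objects are hypothetical stand-ins, not actual Cinder models:

```python
def cleanup_reading(resource, driver):
    """Restore the saved status and re-export the volume if it was attached."""
    resource.status = resource.previous_status
    resource.previous_status = None
    resource.worker = None
    if resource.status == "in-use":
        # same call a Volume Node makes on startup for attached volumes
        driver.ensure_export(resource)
```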
This is quite the change, as it also means that all operations in the Volume Nodes will need to remove this reading status from the resource. Even if it’s some work, at least it is not complex work, so it’s unlikely to result in bugs.
As for possible problems with Nova, I think they would be minimal, because we are now marking all resources that are in use, and Nova would even know that resources used for reading are not available when it checks. That only leaves us with problems that come from races inside Nova, between the point where it checks the status of a resource and the point where it makes the call to use the resource. And in time even those will be properly handled by Hemna’s work.
Another alternative would be working with the new workers table, but I think that one would be more complex.
Service state reporting
To be able to detect when nodes die, and to do the cleanup for nodes that are not brought back up (or that take some time to do so), each node in the cluster will report its own state in a different row in the DB.
We can use the current host field and report services in the form of cluster@backend@host, or create a new field called backend or cluster that will contain the cluster@backend section and leave the host part in the host field.
That’s actually irrelevant at this point, as it’s just an implementation detail. In any case, Scheduler Nodes will have a periodic task that checks the DB contents and builds a dictionary keyed by cluster@backend, storing in the value whether the service is up, the different nodes with their information, and whether we have done the cleanup for each node. Nodes that are up will have cleanup_done set to False.
A service is up if any of the nodes from the cluster is up.
For those nodes that are down, we will perform the cleanup just like we do on Volume Node startup, and then set cleanup_done to True so we don’t check for jobs the next time the task runs.
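A sketch of the periodic task’s bookkeeping, under the assumption that the service rows give us (cluster@backend, host, is_up) tuples; all names are made up:

```python
from collections import defaultdict

def build_cluster_state(service_rows):
    """Build the per-cluster dictionary from DB service rows."""
    clusters = defaultdict(lambda: {"up": False, "nodes": {}})
    for cluster_backend, host, is_up in service_rows:
        entry = clusters[cluster_backend]
        # A service is up if any of the nodes from the cluster is up.
        entry["up"] = entry["up"] or is_up
        entry["nodes"][host] = {"is_up": is_up, "cleanup_done": False}
    return dict(clusters)

def nodes_needing_cleanup(clusters):
    """Yield (cluster, host) for down nodes whose cleanup isn't done yet."""
    for cluster, entry in clusters.items():
        for host, node in entry["nodes"].items():
            if not node["is_up"] and not node["cleanup_done"]:
                yield cluster, host

rows = [("mycluster@lvm", "node1", True),
        ("mycluster@lvm", "node2", False)]
state = build_cluster_state(rows)
# The service is up because node1 is up; node2 still needs cleanup,
# after which the task would set its cleanup_done to True.
```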
It doesn’t really matter if multiple schedulers try to do the same node’s cleanup at the same time, or if a scheduler and a previously passive node that is now active are simultaneously repairing the resources: since we are doing an atomic change in the DB with compare-and-swap and skipping failures, only one of them will do the cleanup of each resource.
If we report our capabilities using cluster@backend as the host we won’t need to make any changes like I explained before.
Prevent data corruption
Unlike with last week’s solution, preventing data corruption is much easier here, and that’s one of the main reasons why this solution is simpler.
We don’t care if we lose the connection to the storage back-end, and we don’t care if we lose the connection to the message broker.
If we lose the connection to the DB we should stop all ongoing operations, and the easiest place to do this is in the heartbeat periodic method; there we can stop all ongoing operations against the backend.
When using a DLM, we should likewise detect when its connection is lost, stop all operations, and stop sending heartbeats, since we can’t do anything; this is the same as losing the DLM connection in the complex solution.
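The heartbeat-side checks could look something like this; the db, dlm and manager objects are illustrative stand-ins, not actual Cinder code:

```python
def heartbeat(db, manager, dlm=None):
    """Periodic task: only report liveness while the DB (and the DLM,
    when one is used) are reachable; otherwise stop ongoing operations
    and skip the heartbeat, since we can't safely do anything."""
    if not db.is_connected() or (dlm is not None and not dlm.is_connected()):
        manager.stop_ongoing_operations()
        return False
    db.report_heartbeat(manager.service_id)
    return True
```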