Cinder Active-Active HA – Newton mid-cycle


Last week the OpenStack Cinder mid-cycle sprint took place in Fort Collins, and on the first day we discussed the Active-Active HA effort that has been going on for a while now and the plans for the future. This is a summary of that session.

Just like in previous mid-cycles, the Cinder community did its best to accommodate remote attendees and make them feel included in the sessions, with hangouts, live video streaming, IRC pings as reminders, and even moving the microphone around the room so they could hear better; it's an effort that I, for one, appreciate.

The video above includes two sessions: the first is the discussion about the mascot, and the second is the one related to Active-Active, which begins 23 minutes and 34 seconds into the video; you can skip ahead manually or just jump straight to it here.

We had a clear agenda on the Etherpad prior to the meeting, and even though we didn't follow the order closely, we covered everything on it and then some:

  • Status of patches
  • Status of testing
  • Help with reviewing
  • Set goals for Newton: DB changes

Getting the code in

During the Austin summit we reached an agreement on the job distribution for clusters, and the patches were changed accordingly, so we are now ready to start merging them; but people are having trouble reviewing the patches because there is too much to process.

Even though the Cinder A-A blueprint has links to all the specs and patches, it's sometimes hard to understand what each patch in the series is trying to accomplish in the grand scheme of things.

This is understandable, because with the new job distribution approach patches from different specs are now interleaved in the chain. That's why I try to keep the specs up to date with any changes or additional information I deem relevant while coding the feature, and why one must pay attention to the commit message on each review to see which of the specs the patch relates to.

But this may not be enough; even after reading the specs, there may be things that aren't clear in the patches. That would mean that either the specs lack information or are not clear enough and need more work, or the problem lies in the patches and they require more or better comments, or refactoring, to make them easier to follow. In any case, we agreed that reviewers should not be shy about pointing out anything that is unclear, and I would make the appropriate changes to the specs or the code to fix it.

There were a couple of controversial patches at the beginning of the chain that would delay the merging of the rest until an agreement could be reached. Dulek suggested removing them from the chain and reworking the other patches as necessary, if this wasn't too much work, to move things along. Since this mostly meant adding some more code to a couple of patches, it seemed reasonable, and we agreed to separate them from the chain.

We want to review and merge as many patches as possible in Newton, but we also want to keep the risk of disruption low, so a list of low-risk patches that can be reviewed and merged in N is necessary to focus our efforts.

Testing

Quoting Bruce Eckel, “If it’s not tested, it’s broken”, we are going to need automated multinode tests at the gate. The Tempest patch for this is under review, but that won’t be enough, because Tempest is meant for API validation, not negative tests, and much less the kind of tests required to verify the correctness of an Active-Active cluster on failure.

To have proper tests we’ll need a fault injection mechanism, but that’s a big effort that probably won’t be ready in time for the tech-preview release of Active-Active HA, so for now we’ll go with manual testing, forcing corner-case error conditions as well as normal error conditions and non-error conditions.

We agreed that the Tempest tests, together with this manual testing, will be enough for the tech preview, but the manual process we follow will need to be properly documented so people can see which test cases are being checked (managing their expectations accordingly and allowing them to suggest new tests) and so it can serve as a starting point for the automated tests.

In order to coordinate testing, resolve doubts, and create focused reviews, Scottda suggested creating a Working Group that will meet weekly on IRC, and we all voted in favor.

Drivers

We discussed multiple topics related to drivers and Active-Active configurations. The first was that driver maintainers need to get their drivers ready for the feature if they don’t want to be at a disadvantage for not supporting it, and they can already start testing their drivers today with the existing patches, although there is no documentation to help them through the process.

Szymon pointed out that there is an issue with drivers inheriting from the same base class when that base class uses external locks, because it means that changing the base class to use distributed locks changes all drivers at the same time, and that’s not ideal, as some drivers may not want or need them. The case presented was the RemoteFS driver and the drivers that inherit from it.
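To illustrate the coupling, here is a minimal, hypothetical sketch (the class and method names are illustrative, not the actual Cinder RemoteFS hierarchy): whatever locking primitive the base class picks is automatically inherited by every driver built on top of it.

    # Hypothetical sketch of the inheritance coupling, not real Cinder code.
    from oslo_concurrency import lockutils


    class RemoteFSLikeBase(object):
        """Base driver that serializes snapshot work with an external file lock."""

        @lockutils.synchronized('remotefs-snapshot', external=True)
        def create_snapshot(self, snapshot):
            # The external lock is a local file lock: it only protects
            # processes running on this host.
            self._do_create_snapshot(snapshot)

        def _do_create_snapshot(self, snapshot):
            raise NotImplementedError()


    class NfsLikeDriver(RemoteFSLikeBase):
        def _do_create_snapshot(self, snapshot):
            pass  # backend-specific work goes here


    # If the decorator in the base class were switched to a distributed lock,
    # NfsLikeDriver and every other subclass would switch with it, whether or
    # not they want or need distributed locking.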

The proposed approach is to create two different classes: one that leaves all drivers as they are, using local locks, and another that uses the Tooz abstraction layer for drivers that want distributed locks. I suggested an alternative approach: change the base class to use the Tooz abstraction layer, which uses local file locks by default, equivalent to the ones the drivers are using today, and move the Tooz configuration options from the DEFAULT section to the driver-specific sections. That way we can have a multibackend deployment where one driver uses local locks and another uses distributed locks.
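Here is a minimal sketch of that alternative, assuming a hypothetical per-backend backend_url option (the option name and helper function are illustrative, not the actual Cinder code): a backend section that doesn’t set a coordination URL falls back to Tooz’s file driver, which behaves like the local file locks drivers already use, while another backend can point at a distributed backend.

    # Sketch only: choosing the Tooz coordination backend per configured backend.
    from tooz import coordination


    def get_backend_lock(backend_conf, member_id, lock_name):
        """Return a Tooz lock for one configured backend.

        backend_conf is a dict standing in for that backend's config section;
        a missing 'backend_url' falls back to Tooz's file driver, which is
        equivalent to the local file locks drivers use today.
        """
        url = backend_conf.get('backend_url', 'file:///var/lib/cinder/locks')
        coordinator = coordination.get_coordinator(url, member_id.encode())
        coordinator.start()
        return coordinator.get_lock(lock_name.encode())


    # Mixed multibackend deployment: backend A keeps local file locks, while
    # backend B points its own section at a real distributed backend.
    lock_a = get_backend_lock({}, 'host1@backend-a', 'volume-123')
    lock_b = get_backend_lock({'backend_url': 'zookeeper://zk.example.org:2181'},
                              'host1@backend-b', 'volume-123')

    with lock_a:
        pass  # protected only on this host

    with lock_b:
        pass  # protected across the whole cluster

With the Tooz options living in each backend’s section rather than in DEFAULT, the same deployment can mix backends that only need local locking with backends that need cluster-wide locking.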

Review sprint

On Friday, after all the sessions, the Cinder team decided to give a big push to the Active-Active patches, so we all synchronized (those in Fort Collins and those attending remotely) for a review sprint where I updated patches and replied to Gerrit and IRC comments as reviewers went through the patches. It was a great success, as we were able to get through the 12 patches (many of considerable size) that we wanted to merge in Newton.

Some of the patches haven’t merged yet, but they are in the process of merging and are being verified by Zuul as I write this.

Actions

During the meeting we agreed upon the following actions:

  • Remove the 2 controversial patches from the chain and update other patches ➞ Done
  • Create a list of low-risk patches that can merge in N ➞ The first patch in the series that shouldn’t merge is https://review.openstack.org/303021
  • Create a document with manual tests
  • Look into fault injection
  • Pick a time for the Working Group’s weekly meeting ➞ Will be held on #openstack-cinder on Tuesdays at 16:00 UTC, as announced on the dev mailing list

Other

Session notes were taken in Etherpads as usual; they are split according to the day they were held, but they are all linked from the mid-cycle Cinder Etherpad and are available for reference.

All videos from the mid-cycle are also available at the openstack-cinder YouTube channel.