So you may have read my previous post about iSCSI multipathing in OpenStack and decided to try the new code, and everything seemed to be working fine, but then you start pushing it harder and harder and you find yourself back at a point where things don’t go as expected. So what’s the deal there? Were the issues fixed or not? The short answer is yes and no, but let’s see what I mean by this.
At Red Hat not only do we fix bugs as they are being reported by our customers, but we also like to be proactive and analyze trends in our BugZilla cases to detect clustered cases around specific areas of code and/or functionality, and we have detected that the multipath feature was not as resilient as we would like it to be.
So the storage team decided to dedicate engineering resources to poke at it until it reached our quality standards. The work I discussed in my earlier post was part of this effort, where we detected that devices and multipaths were appearing unexpectedly in our systems due to rescanning issues, triggered both by Cinder without any filtering whatsoever and by the Open-iSCSI initiator on login and on reception of AEN/AER packets.
So narrowing the rescans in Cinder, adding the feature in Open-iSCSI, and supporting it in Cinder resolved those annoying issues, but unfortunately they are not the only multipath issues in OpenStack.
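To illustrate what narrowing a rescan means, here is a minimal sketch of how Linux exposes SCSI scans through sysfs. The helper names are hypothetical and this is not the actual Cinder/OS-Brick implementation, just the underlying idea:

```python
# Hypothetical sketch: targeted vs. wildcard SCSI rescans via sysfs.

def scan_payload(channel=None, target=None, lun=None):
    """Build the string written to /sys/class/scsi_host/hostN/scan.

    A '-' wildcard in any position asks the kernel to scan everything in
    that dimension, which is how unrelated devices can show up; passing
    the exact channel/target/LUN narrows the rescan to a single device.
    """
    def fmt(value):
        return '-' if value is None else str(value)
    return '{} {} {}'.format(fmt(channel), fmt(target), fmt(lun))

def scan_path(host_number):
    """Path of the sysfs scan file for a given SCSI host number."""
    return '/sys/class/scsi_host/host{}/scan'.format(host_number)

# A global rescan ("- - -") can pull in unrelated devices:
#   open(scan_path(3), 'w').write(scan_payload())
# A narrowed rescan only looks at one channel/target/LUN:
#   open(scan_path(3), 'w').write(scan_payload(0, 1, 2))
```

Writing to the scan file requires root, so the comments above only show the intended usage rather than performing it.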
Into the wilderness
Even though no battle plan survives contact with the enemy, and code is in a way similar when exposed to the messiness and craziness of the real world, we must do our best to make it as robust as possible, and while the current code works fine in most situations, unfortunately there are still cases where it doesn’t. You could see it as our current code behaving as expected when everything in our cloud is working fine or relatively fine, but when things go awry so does our code.
This behavior is not good enough for us, and certainly not for our customers, so we need to put the code into the wilderness, see how it breaks, fix it, and repeat.
Putting on band-aids
Once the scan issues were taken care of, it was time to give the code both a visual pass and an empirical one. So after a thorough read of the code came the poking of the connect and disconnect code, simulating network errors until it broke.
The initial approach was reading and poking the code, finding an issue and fixing it, and a good number of issues were found this way; here are some:
- Host, target, and channel detection used for rescans had issues with short-circuiting and matching.
- Individual paths were not flushed or removed on disconnect if they were not part of the multipath.
- iSCSI nodes and sessions were not cleared on disconnect if they were not part of the multipath.
- Failure to get devices on connect would leave nodes and sessions behind.
- Unexpected exceptions on connect would result in nodes, sessions, and possibly devices not being cleaned up.
- On multipaths, if a device’s path went down after being populated in sysfs we would fail, even if other devices were accessible, also leaving an unclean system.
- On multipaths, disconnect could fail if the individual device chosen to get the WWN was one with a failed path.
- We were not waiting for the system to remove the devices before actually logging out, which could leave symlinks behind.
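That last point, waiting for the kernel to actually remove the devices before logging out, can be sketched as a small polling helper. This is a hypothetical illustration, not the real OS-Brick code; the `exists` callable is injected so the wait logic is testable:

```python
# Hypothetical sketch: wait for a block device to disappear before
# logging out of the iSCSI session, to avoid stale /dev symlinks.
import time

def wait_for_removal(device, exists, interval=0.2, attempts=10):
    """Return True once `device` is gone, False if it never disappears.

    Logging out of the session before the device node and its symlinks
    are actually removed is what leaves leftovers behind.
    """
    for _ in range(attempts):
        if not exists(device):
            return True
        time.sleep(interval)
    return False

# Real usage would first request removal, e.g. by writing '1' to
# /sys/block/sdX/device/delete, then call:
#   wait_for_removal('/dev/sdb', os.path.exists)
```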
Some more issues were found and fixed, but new ones kept cropping up, so instead of continuing down this path the approach changed: find problematic sections of code where one or more issues had been found and refactor them with a more robust solution. And it was going well, since it:
- Stopped using multipath -l as the mechanism to get the multipath name, since it would be extremely slow on flaky network connections and doesn’t work when called with a path that is down.
- Stopped using the scsi_id tool to get the WWN.
- Manually force the creation of a multipath even when there’s only one device available, so the others can populate the system if the connection to them is slow.
- Changed the removal code to check all devices to find the multipath.
- We no longer lose data if the disconnect fails.
Even after all that, there were still issues that could not be cleanly fixed with the existing code, so, after flipping the table and putting it back up, the decision was made to change the whole connect and disconnect mechanism in OS-Brick.
This new code will bring considerable advantages in terms of speed and reliability, mostly when things are not so great. The short version is that the code now tries to make logins in parallel, and the use of CLI tools has been reduced to a minimum in favor of using information readily available in sysfs.
A longer description of the improvements would be:
- Clean all leftovers on exceptions on a connection.
- Parallelize logins on multipath to increase speed on flaky network connections.
- As long as there is at least one good path when working with multipath we will return a multipath device instead of a single path device, which helps with temporary flaky connections.
- Don’t use the rescan retry parameter as login retry on multipath connections so both single and multipath cases are consistent.
- We no longer rely on just one device to get the wwn and look for the multipath. This would be problematic with flaky connections.
- No more querying iSCSI devices for their WWN (page 0x83), removing delays and issues on flaky connections.
- A device existing but not being accessible is no longer a problem for the mechanism.
- We use links in /dev/disk/by-id to get the WWID on connect, so we make sure there are no leftovers on disconnect, but we no longer use symlinks from /dev/mapper to find devices.
- We no longer need to rely on the WWN to determine the multipath; we have the session and the LUN, so we trace the devices, and from those we determine if they belong to a multipath.
- Stop polluting the logs with unnecessary exceptions from checking if the node or session exists.
- Action retries (like a multipath flush) will now only log the final exception instead of logging all the exceptions.
- Warn when a multipath could not be formed and a single device is being used, as performance will be degraded.
- We no longer do global rescans on single connections that could bring in unrelated and unwanted devices (iscsiadm -T iqn -p portal --rescan).
- Fixed the scan mechanism, which would not request all scans when the same iqn was shared between portals and could send a scan request to the wrong IP.
- When doing single-pathed connections we could end up with a multipath because we didn’t clean up unfruitful connections.
- Common code for multipath and single path disconnects.
- All devices are properly cleaned even if they are not part of the multipath.
- We wait for device removal and do it in parallel if there are multiple.
- Removed usage of multipath -l to find devices, which is really slow with flaky connections and didn’t work when called with a device from a path that is down.
- Prevent losing data when detaching: currently, if the multipath flush fails for any reason other than “in use”, we silently continue with the removal. That is the case when all paths are momentarily down.
- Adds a new mechanism for the caller of the disconnect to specify that it’s acceptable to lose data and that it’s more important to leave a clean system. That is the case if we are creating a volume from an image, since the volume will just be set to error, but we don’t want leftovers. Optionally we can tell os-brick to ignore errors and don’t raise an exception if the flush fails.
- Add a warning when we could be leaving leftovers behind due to disconnect issues.
- Flushes of individual paths now use exponential backoff retries instead of random retries between 0.2 and 2 seconds (from oslo library).
- We no longer use symlinks from /dev/mapper to find devices or multipaths, as they could be leftovers from previous runs.
- With high failure rates (above 30%) some CLI calls will enter a weird state where they wait forever, so we added a timeout mechanism in our execute method and applied it to those specific calls.
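The timeout-plus-retry combination described above can be sketched as follows. The function names are hypothetical (OS-Brick’s real execute path goes through rootwrap/privsep and oslo’s retry utilities), but the shape of the logic is the same: bound every CLI call with a timeout, retry with exponential backoff, and only surface the final failure:

```python
# Hypothetical sketch: a timed-out execute with exponential-backoff
# retries, where only the last exception propagates (and gets logged).
import subprocess
import time

def execute(cmd, timeout=None):
    """Run a CLI command, raising TimeoutExpired past `timeout` seconds."""
    return subprocess.run(cmd, capture_output=True, timeout=timeout)

def retry_execute(cmd, attempts=3, timeout=120, base_delay=0.5):
    """Retry `cmd` with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return execute(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            if attempt == attempts - 1:
                raise  # only the final exception is raised/logged
            time.sleep(base_delay * (2 ** attempt))

# e.g. a bounded multipath flush attempt:
#   retry_execute(['multipath', '-f', wwid], attempts=3, timeout=120)
```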
This is a change of considerable magnitude, since we are basically adding everything anew, so not only will it require thorough reviewing before merging, but I’m also sure that reviewers will be reluctant to merge it even after reviewing it, since they cannot be sure how much of an improvement this new code actually is (better the devil you know). That’s why a good number of tests have been performed on the new code to put everybody at ease.
Tests have been performed with 2 different iSCSI backends, QNAP and XtremIO, with multipath and single path configurations, with and without error simulation, and doing the connect/disconnect on Nova and in Cinder.
Errors were simulated using iptables filters to drop input packets with a statistic match in random mode at different rates. The drop percentages used were:
- 0% drop in all channels
- 10% drop in all channels
- 20% drop in all channels
- 100% drop in all channels
- 20% drop in half the channels
- 20% drop in the other half of the channels
- 100% drop in half the channels
- 100% drop in the other half of the channels
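The drop rules above can be sketched with a small helper that builds the iptables invocation. This is a hypothetical reconstruction; the post only states that the statistic match in random mode was used, not this exact command:

```python
# Hypothetical sketch: build an iptables rule that randomly drops a
# fraction of the input packets coming from one storage portal.

def drop_rule(portal_ip, drop_rate):
    """Return the iptables argv dropping `drop_rate` (0.0-1.0) of the
    INPUT packets whose source is `portal_ip`."""
    return ['iptables', '-A', 'INPUT', '-s', portal_ip,
            '-m', 'statistic', '--mode', 'random',
            '--probability', str(drop_rate), '-j', 'DROP']

# e.g. a 20% drop on one path (requires root):
#   subprocess.check_call(drop_rule('192.168.1.10', 0.2))
```

Applying one such rule per portal makes it easy to reproduce the “20% drop in half the channels” style of scenario listed above.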
Due to existing issues in the current code and the Open-iSCSI auto-scan mechanism, the current code, unlike the new code, was only tested with the QNAP backend, as a reference for comparison. The following table shows the results of the test cases where we are dropping packets for create (average success rate, percentage of times there were leftovers, and percentage of times a multipath was used) for the new and current code, as well as results for the attach-detach tests.
Graph representation of test results:
It’s worth mentioning that the success rate and the create multipath-used figures include the test case where all paths are down, which makes it impossible to succeed, so the real maximum rate is 85.71%. Also note that the detach success rate is relative to the attach success rate, since we cannot say a detach has failed just because we couldn’t do an attach.
It could happen that the success rate for the new code is lower than with the current code, since we have added a sensible timeout of 2 minutes to flush the multipath data (we still do 3 tries), while the current code will just wait forever, which isn’t reasonable.
The XtremIO tests were analogous to the QNAP tests; there were no leftovers found in any of the tests run, regardless of the packets dropped.
Even though the number of tests run for this post on the new code was individually small (10 for each test, adding up to over 300 attach/detach tries in total), the number of unmeasured tests was considerably higher, and none of those left any leftovers either.
Based on these tests the new code seems to be more reliable than the old one, but there could still be issues these tests have missed, since the number of systems at my disposal is limited and the statistical nature of the packet drops does not guarantee that all potential cases are covered. So hopefully other vendors will try the patches with their backends and provide feedback on the results.
In terms of performance, even though the code in OS-Brick tries to parallelize the logins, we find ourselves limited by the serial nature of the current oslo privsep daemon, so even if we are requesting 4 iscsiadm login calls, they get serialized and will not actually be processed in parallel. From the point of view of the patches in OS-Brick this is not a big issue, since the code will just end up executing serially like it does now, and once privsep supports running commands asynchronously we will immediately benefit from it.
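The intended parallel-login shape can be sketched with a thread pool. This is a hypothetical illustration, not the real OS-Brick code, and as noted above the underlying iscsiadm calls are currently serialized by privsep anyway; the `login` callable is injected so the logic is testable:

```python
# Hypothetical sketch: attempt logins to all portals concurrently and
# keep going as long as at least one path comes up.
from concurrent.futures import ThreadPoolExecutor

def login_all(portals, login):
    """Try to log in to every portal concurrently; return the portals
    that succeeded. `login` is the per-portal login callable, returning
    True on success."""
    with ThreadPoolExecutor(max_workers=len(portals)) as pool:
        results = list(pool.map(lambda p: (p, login(p)), portals))
    return [portal for portal, ok in results if ok]

# `login` would wrap something like:
#   iscsiadm -m node -T <iqn> -p <portal> --login
```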
The patches are labeled as WIP because the unit tests are not ready, as I’d like to get some feedback before doing that, since it will be a considerable effort given the amount of code that has been changed and how much the mechanism has diverged from the old one.
Additional patches will be necessary in Cinder and Nova to make sure we use the new force parameter when calling disconnect for cases like creating a volume from an image, where we want to do a thorough cleanup on disconnect error, since we don’t mind losing data but we don’t want to have any leftovers in our systems in case of a failure. For these tests we modified the Cinder code to do so, as well as to ensure that we try to terminate the connection and remove the export even if the disconnect fails when we are forcing the disconnection, as can be seen in these 2 patches: Do proper cleanup if connect volume fails, and Add support for OS-Brick force disconnect.