iSCSI multipath issues in OpenStack


Multipathing is a technique frequently used in enterprise deployments to increase throughput and reliability on external storage connections, and it’s been a little bit of a pain in the neck for OpenStack users. If you’ve nodded while reading the previous statement, then this post will probably be of interest to you, as we’ll be going over some iSCSI multipath issues found in OpenStack and how they can be solved.


Pain in the neck

Multipath is a great feature that we all like to use, because it not only provides an excellent way to increase the throughput of our remotely attached storage, which is always nice, but also adds fault tolerance to our system, which is a must for enterprises. But what happens when things don’t go as expected? Then it’s sighing time: you roll up your sleeves and start digging through your logs, increasing your logging levels, capturing and analyzing network packets, checking your storage array’s management tool, looking for updates… You probably know the drill.

In OpenStack, block storage devices are controlled by Cinder, and attachments/detachments are done by the os-brick library, both in Nova and in Cinder. Multipathing has been a bit of a bittersweet affair for os-brick and OpenStack in general, as it is unfortunately common to end up with device map leftovers in our systems when using multipath for our instances and for the data plane in Cinder.

Some time ago I worked on some issues related to multipathing in Fibre Channel’s multipath cleanup and multipath rescan, as well as iSCSI multipath cleanup, but even if that alleviated the problem it didn’t fix everything, so last week I decided to put aside my work on Cinder Active-Active and dig into the iSCSI issues with the intention of finding a solution to the problems that are haunting us.

The setup

For my setup I decided to use a driver that has been around for a while, since mature drivers are less likely to have unexpected issues of their own, and to use the slowest possible system and connection to the storage backend, in order to expose issues that would not be revealed under normal circumstances on a normal system.

I deployed all the OpenStack services I required -Keystone, Glance, Cinder, Nova- and all other services -DBMS, message broker- into a constrained VM – 8GB of RAM and 1 CPU – that connected to an XtremIO backend located over 3,000 km away using a VPN, which resulted in a ping of more than 100ms.

I want to be very clear that the issues exposed in this post are not specific to only the XtremIO driver, as they happen in other drivers as well.

Most tests were performed using the create volume from image operation, since this would not only attach the volume using multipath in Cinder, but it would also verify that the attached volume is writable.

Friendly names

When setting up our iSCSI multipath configuration in Linux we can decide we don’t want to use the volume’s WWID for the device mapping if we prefer more user friendly names like mpatha or mpathb instead of 3514f0c5a51600542 and 3514f0c5a51600548. This is accomplished with the user_friendly_names configuration option.
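For reference, enabling the option looks roughly like this in /etc/multipath.conf (a minimal sketch; real configurations will usually have more entries in the defaults section):

```
defaults {
    user_friendly_names yes
}
```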

Cinder doesn’t recommend using friendly names because, as mentioned in the code, our multipath flushing doesn’t remove the entry if friendly names are used:

           # There is a bug in multipath where the flushing
           # doesn't remove the entry if friendly names are on
           # we'll try anyway.

This seemed like a good starting point for the multipath investigation because, even though there is a simple workaround, there will always be people who don’t know about it and end up in serious trouble: not flushing the multipath device may leave device map leftovers and, worst of all, may lead to data loss, since the data is not flushed before disconnecting.

As I reported in the Launchpad Bug, the problem comes from the way we are doing the multipath flush, as we are using the WWID as the device identifier regardless of the mapping mechanism in use. This works fine when the system is using the WWID for the mapping, as the device name will match the WWID we are using in os-brick, but with user friendly names the WWID doesn’t identify any device, since the name of the device is in the form mpathYY, and unfortunately the multipath CLI tool does not return an error exit code when flushing a non-existent device, so we mistakenly think we have flushed the device when in reality we haven’t.

This affects both iSCSI and FC connections, and fortunately the fix is quite easy.

Map in use

Because the tests were performed on a resource-constrained system with a slow data connection, I was able to see an unexpected error during the detaching phase of the operation. Just as os-brick was flushing the multipath device with multipath -f devicename, we would see the operation fail with exit code 1, but no more information was provided, and the disconnection was not halted after the error occurred.

Again, not flushing a multipath device properly can lead to data loss and device map leftovers.

After changing the logging to report more information when the operation failed, and trying a couple of times until the operation failed again, I could see what the error was:

WARNING os_brick.initiator.linuxscsi [...] multipath call failed exit 1: Unexpected error while running command.
Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf multipath -f mpathap
Exit code: 1
Stdout: u'Feb 10 05:45:07 | mpathap1: map in use\nFeb 10 05:45:07 | failed to remove multipath map mpathap\n'
Stderr: u''

This is clearly a transient error, linked in the Launchpad bug to the limitations of the system, so a simple retry mechanism with exponential backoff fixed it. In all my tests, whenever this error happened, the multipath flushing would succeed on the second try, after the 2-second backoff delay.
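A minimal sketch of such a retry with exponential backoff (illustrative names, not the actual os-brick code):

```python
import time


def flush_with_retries(flush, attempts=3, backoff=2):
    """Call flush(), retrying transient failures such as 'map in use'.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts,
    re-raising the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return flush()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```

With backoff=2 the second attempt runs after the 2-second delay mentioned above, which in my tests was always enough for the flush to succeed.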

As with the previous issue this has a very simple fix, and even though it’s unlikely to happen in most systems, it’s always good to cover all corner cases we are aware of.

Cinder scans

Similar to the FC multipath rescan issue, I found that the iSCSI multipath rescan was also afflicted by this problem, because we are using iscsiadm --rescan for both the node and the session, which does a full rescan of the target node and adds any new LUNs that are not already present in our system.

This means that if we do a rescan of the bus during an attach operation just when a detach operation is at the precise moment between the removal of the device from the system and the removal of the export in the storage backend, we will unintentionally re-add the device to our system.

The flow would be like this: we are creating volume A from an image, so we create it in the storage backend, export the LUN, attach it to the system, download the image, copy the image to the attached volume, and detach the volume from the system, though the volume is still exported in the storage backend. At this point another operation that is creating volume B from an image has reached the point where it’s going to attach volume B to the system, so it does a rescan and sees not only volume B but also volume A, and it adds them both to the system before continuing. The creation of volume A then continues and removes the export from the storage backend, which leaves us with a multipath device for volume A that is not actually exported in the storage backend.

Since this is a very specific situation that isn’t easy to reproduce on its own, some modifications were introduced to the code while testing the fix. A synchronization mechanism was added to the create volume operation to force this unfavorable situation: the first creation operation would wait for a signal after the local detach of the volume from the system, and the second creation operation would signal the first after attaching the volume, the moment at which we would see the error.
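The forced interleaving can be reproduced in miniature with two events, one per creation flow (a simplified sketch, not the actual test modification):

```python
import threading

# Flow A creates volume A, flow B creates volume B on the same target.
detach_done = threading.Event()   # A finished its local detach
attach_done = threading.Event()   # B finished its rescan + attach
timeline = []

def create_volume_a():
    timeline.append("A: local detach done, volume still exported")
    detach_done.set()             # hand control to B at the worst moment
    attach_done.wait()            # keep the export alive until B rescanned
    timeline.append("A: export removed")   # A's map is now a leftover

def create_volume_b():
    detach_done.wait()
    timeline.append("B: full rescan re-adds A's LUN")  # the bug
    attach_done.set()

threads = [threading.Thread(target=create_volume_a),
           threading.Thread(target=create_volume_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The two events pin the operations into exactly the interleaving described above, so the failure triggers on every run instead of depending on timing luck.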

The solution to this problem is the same as the one I used for the FC multipath rescan issue: narrow our scans of the bus in the target node to search only for the specific LUN we are interested in, thus preventing the system from inadvertently introducing unwanted devices.
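In sysfs terms, narrowing the scan means changing what gets written to the SCSI host’s scan file, which takes a "channel target lun" triple where "-" acts as a wildcard (a small illustration; os-brick builds these strings with its own helpers):

```python
def scan_string(channel="-", target="-", lun="-"):
    """Build the string written to /sys/class/scsi_host/hostN/scan."""
    return "%s %s %s" % (channel, target, lun)

# Full rescan, which is what a plain `iscsiadm --rescan` amounts to:
assert scan_string() == "- - -"
# Narrow scan for just LUN 1 on the connection's channel/target:
assert scan_string("0", "0", "1") == "0 0 1"
```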

The patch for this problem is a little more complex than the previous two, but I believe it’s still only of moderate complexity.

Once I had coded the solution and went on to test it in the system I noticed, to my surprise, that the problem hadn’t been removed! How was this possible? I was certain that we should narrow the rescans and that the rescans were being properly done, so what was going on here? Was there something else going on that I wasn’t aware of?

iSCSId scans

Since I couldn’t think of any reason why this would be happening from anything in the Cinder logs, I decided to go down protocol lane and capture all communications between my test system and the storage array to analyze the problem. Another option would have been to increase the SCSI log levels or run iscsid in debug mode, but I wanted to abstract away from the system and look for the problem from the bottom up.

So looking at the packets I found that when we did the iscsiadm --login, a full scan request was being sent to the storage backend; the LUN that had just been removed from the system was returned as still exposed by the storage backend and then added back to our system.

As I see it there are 2 solutions:

  • Changing Cinder to account for this undesired behavior.
  • Changing iscsid.

Changing Cinder

This would be the easiest solution for me, because I know the code in Cinder well enough to do this, and I could tell by looking at the issue that it would be just a couple of lines of code, since all that would be required is a lock to prevent detach and attach operations from happening concurrently in Cinder, just like we do in os-brick.

The solution is simple in Cinder: we would override _detach_volume and _attach_volume in any driver affected by the issue -those that use the same target node to export multiple LUNs- and add the locking mechanism:

    @utils.synchronized('_attach_volume')
    def _detach_volume(self, *args, **kwargs):
        super(DriverClassName, self)._detach_volume(*args, **kwargs)

    @utils.synchronized('_attach_volume')
    def _attach_volume(self, *args, **kwargs):
        return super(DriverClassName, self)._attach_volume(*args, **kwargs)

That would fix the issue in Cinder, but unfortunately it would not solve it for compute nodes, where attach and detach operations would still run without synchronization with the removal of exports.

So this would be a fix from Cinder’s perspective, but I think we should be aiming for a global solution that would benefit the whole OpenStack ecosystem and not limit ourselves to our own projects, which brings me to the next solution.

Changing iscsid

Wanting to fix this globally I looked into the iscsid source code – I felt so nostalgic and happy looking at C code again 😉 – and I found the problematic code:

static void session_conn_process_login(void *data)
{
        [...]

        if (session->r_stage == R_STAGE_NO_CHANGE ||
            session->r_stage == R_STAGE_SESSION_REDIRECT) {
                /*
                 * scan host is one-time deal. We
                 * don't want to re-scan it on recovery.
                 */
                session_scan_host(session, session->hostno,
                                 c->qtask);
                session->notify_qtask = NULL;

                log_warning("Connection%d:%d to [target: %s, portal: %s,%d] "
                            "through [iface: %s] is operational now",
                            session->id, conn->id, session->nrec.name,
                            session->nrec.conn[conn->id].address,
                            session->nrec.conn[conn->id].port,
                            session->nrec.iface.name);
        } else {
                session->notify_qtask = NULL;
                mgmt_ipc_write_rsp(c->qtask, ISCSI_SUCCESS);
        }

        [...]
}

We are calling session_scan_host on login, so I modified the method to accept an extra parameter that allows scanning only LUN 0 instead of doing a full scan, changed the login call to skip the full scan, and changed all other callers to keep doing full scans:

pid_t iscsi_sysfs_scan_host(int hostno, int async, int full_scan)
{
       char id[NAME_SIZE];
       char write_buf[] = "- - 0";  /* scan only LUN 0 by default */
       pid_t pid = 0;

       if (full_scan)
               write_buf[4] = '-';  /* "- - -": wildcard LUN, full scan */

       [...]
}

After compiling iscsid and replacing it in my test system I confirmed that the device was no longer being added, so this could be a possible path to fix the issue.

AEN/AER

I continued doing more tests, and just as I was feeling good about the state of things I broadened the type of tests and found that I was still getting leftover device maps, so once again I had to go down the rabbit hole of the protocol stack.

In this case the issue was happening in exactly the same spot, but not when we were doing the login, as that was already fixed; instead it happened when we already had a session open with the target node and we attached another volume.

So the flow would be like this: one create operation detaches the volume and reaches the point right before removing the export, another create does the attach and reaches the point where it’s copying data to the attached volume, and when a third create operation creates the export for its volume we see the maps from the first volume come back to life.

Analyzing the iSCSI packets revealed that when we created the third volume’s export, the storage backend sent an Asynchronous Event Report (AER), also known as an Asynchronous Event Notification (AEN), with Sense Data indicating Reported LUNs Data Has Changed, to report that a new LUN had been added to the target node.

Following this AER packet we could see the initiator sending a Report LUNs request and then proceeding to send inquiries for all reported LUNs, adding every existing LUN to the system.

Once again we have multiple ways to solve this:

Changing Cinder

This issue would be fixed in Cinder if we applied the locking solution proposed in the iSCSId scans section above, but that would still fail to solve the issue for non-Cinder nodes.

With udev rules

AEN/AER notifications are also sent on LUN removal, so we could create a rule that would remove the devices when a change is detected and the device is no longer accessible.

I am personally against this solution, because I cringe at the thought of automatically removing mappings in these situations/environments.

Changing iscsid

So once again I found myself looking for the solution in iscsid, and the code doing the automatic scan was in the asynchronous event handling:

static void iscsi_recv_async_msg(iscsi_conn_t *conn, struct iscsi_hdr *hdr)
{
    [...]

    case ISCSI_ASYNC_MSG_SCSI_EVENT:
        senselen = (buf[0] << 8) | buf[1];
        buf += 2;

        if (!scsi_normalize_sense((uint8_t *)buf, senselen, &sshdr)) {
            log_error("Could not handle AEN %d. Invalid sense.",
                  async_hdr->async_event);
            break;
        }

        if (sshdr.asc == 0x3f && sshdr.ascq == 0x0e)
            session_scan_host(session, session->hostno, NULL);
        break;

    [...]

So I modified the session_scan_host method there as well, and in reviewing all the calls to this method I found that we also do this on driver start, which will become a problem once we are able to run the Cinder Volume service in Active-Active mode, which is so close now.

After making these changes I no longer had device map leftovers, hurrah!!

Once it was confirmed that this would fix the issue, I decided to create a proper patch for iscsid that would keep the current behavior by default and make disabling it optional, so I added a session configuration option called autoscan_en to control the target node’s autoscan in iscsid. This way we can disable autoscan globally using iscsid.conf, or on a per-session basis in Cinder when the iscsiadm CLI tool present in the system supports it.

If you want to try the changed versions of iscsid and iscsiadm but don’t want to go through the trouble of compiling them, you can download them here and here (I have tested them in Fedora 23 and RHEL 7.3 environments). To use them, just back up your existing files from /usr/sbin and copy these there, though you’ll probably need to stop the iscsid service before replacing its binary, as it will most likely be in use.

Summary

These investigations have helped us discover some existing issues in our current multipathing code, not only in Cinder but also in iscsid:

  • Flushing multipaths with friendly names.
  • Map in use when flushing multipaths in corner cases.
  • Doing too broad scans when looking for new LUNs.
  • Automatic iscsid scans on startup, login, and AEN/AER.

Fortunately we have solutions for all these issues, and we are already on the path to getting rid of them with the currently proposed patches.


Picture: “Path” by Tim Green is licensed under CC BY 2.0



3 thoughts on “iSCSI multipath issues in OpenStack”

  • Igal

    Hello Gorka,
    Thanks for your in depth articles!
    I have a question regarding backup & restore volumes;
    During my testing, manually and also using tempest, I found that the restore times are much longer when using iSCSI protocol in comparison to FC connectivity to the Backend storage.
    Could it be because of multipathing messing something up?
    Do you have an idea why restore times are as long as 10 minutes for a 1G volume?
    thanks,
    Igal

    • geguileo Post author

      Hi Igal,

      One must be very careful when making comparisons, as it’s really easy to end up making unfair comparisons. For example it could be unfair to use different backups to restore, but if we use the same backup we have to be careful as well, because the storage array could be caching data and then give an unreasonably good result on all the tests except the first one, where there was a cache miss.

      In general I don’t think there should be a big difference between the two as long as we have equivalent connection speeds, both have multipath properly configured, and the backends are of similar efficiency (this will depend on the network connections of the storage array, the type of volume used in Cinder…).

      If we are using a compression algorithm in the backup and restore processes, defined with “backup_compression_algorithm”, one must also take into consideration the CPU load of the server at the time of the backup/restore.

      Given all this I would first confirm that the backends are properly configured to do multipath (you can check this in the logs), change the order of the tests, and manually check the speed of both connections.

      Cheers,
      Gorka.

      • Igal

        Thanks for your Response Gorka,

        I have checked a couple of things and found that there are lots of TCP retransmissions.
        Using the command “netstat -s | grep retrans” I found that many packets were being re-transmitted, which pointed to suspicious physical connectivity.
        After changing some cabling and a switch, the backup and restore times became ‘normal’ (100s more or less),
        so these tests can actually point to faulty infrastructure.

        Igal