Shenanigans with ZFS flushing and intelligent arrays…

Posted by Jason | Posted in DigiTar, Solaris | Posted on 12-14-2006

NOTE: ZFS has been enhanced to better address the situation described below by using ZFS configuration directives. This article is still accurate and provides decent background on the problem. However, an update has been posted with the newer, stronger, better way of resolving the problem: Back in the sandbox…ZFS flushing shenanigans revisited. :-)

Running operations for a start-up company is interesting…you learn a lot of things the hard way. Among the things you learn is how nuanced it is to deal with databases and storage under heavy traffic. Before I start my little diatribe, let me profusely thank Richard Elling and Roch Bourbonnais at Sun for saving our bacon. They are stellar engineers, and we are more grateful than words can say for their help in resolving our ZFS roadblocks. They’re another example of the people who make Sun the company we love to work with. Sun is blessed to have them.

About 6 months ago we moved to a new Sun StorageTek FC array, and used the opportunity to move to ZFS. We loved ZFS in development, and frankly it kicks the pants off of UFS/SVM (Solaris Volume Manager). It is SO much easier to deal with for volume management, and the block checksums help you quickly rule your storage in or out when tracking down corruption. That being said, ZFS really has some interesting quirks. One of them is that it is truly designed to deal with dumb-as-a-rock storage. If you have a box of SATA disks with firmware flakier than Paris Hilton on a coke binge, then ZFS has truly been designed for you.
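
As an aside, if you ever want to watch those block checksums earn their keep, a scrub walks every allocated block in the pool and verifies it against its checksum. A minimal sketch, with the pool name "tank" standing in for whatever you called yours:

# Walk every block in the pool and verify it against its checksum
zpool scrub tank

# The CKSUM column and the "errors:" line tell you whether the storage is lying to you
zpool status -v tank

If the CKSUM counters stay at zero, you can stop glaring at your storage and start glaring at your application.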

As a result, ZFS doesn’t trust that anything it writes to the ZFS Intent Log (ZIL) made it to your storage until it flushes the storage cache. After every write to the ZIL, ZFS issues a cache flush command (think of it as an fsync() aimed at the storage) to instruct the device to flush its write cache to the disk. In fact, ZFS won’t return on a synchronous write operation until both the ZIL write and the flush have completed. If the devices making up your zpool are individual hard drives…particularly SATA ones…this is great behavior. If the power goes kaput during a write, you don’t end up with writes that made it to the drive cache but never to the disk.
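
If you’re curious whether a given drive is even running with its volatile write cache turned on, Solaris will usually show you through format’s expert mode. The menu walk below is a sketch from memory and varies by driver and disk type, so treat it as a starting point rather than gospel:

# Run format in expert mode, pick the disk in question, then walk the cache menus
format -e
format> cache
cache> write_cache
write_cache> display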

The major problem with this strategy only shows up when you layer ZFS over an intelligent storage array with a decent battery-backed cache. Enter our issues with ZFS on StorageTek/Engenio arrays.

Most of these arrays have sizable 2GB-or-greater caches with 72-hour batteries. The cache gives a huge performance boost, particularly on writes. Since cache is so much faster than disk, the array can tell the writer really quickly, “I’ve got it from here, you can go back to what you were doing”. Essentially, as fast as the data goes into the cache, the array can release the writer. Unlike drive-based caches, the array cache has a 72-hour battery attached to it, so if the array loses power and dies, you don’t lose the writes in the cache. When the array boots back up, it flushes the writes in the cache to the disk. However, ZFS doesn’t know that it’s talking to an array, so it assumes the cache isn’t trustworthy and still issues a cache flush after every ZIL write. So every time a ZIL write occurs, the write goes into the array write cache, and then the array is immediately instructed to flush the cache contents to the disk. This means ZFS doesn’t get the benefit of a quick return from the array; instead it has to wait the amount of time it takes to flush the write cache to the slow disks. If the array is under heavy load and the disks are thrashing away, your write return time (latency) can be awful with ZFS. Even when the array is idle, your latency with flushing is typically higher than the latency under heavy load with no flushing. With our array honoring ZFS ZIL flushes, we saw idle latencies of 54ms and heavy-load latencies of 224ms. This crushed the MySQL database running on top of it.
InnoDB tables are particularly sensitive to this, because they issue 3x more writes than MyISAM tables. Also, since InnoDB tables use disk-based transactions, you can get write loads that are orders of magnitude greater than MyISAM. If the disk latency gets bad enough, InnoDB will completely lock up the MySQL process with deadlocks.
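
By the way, those latency numbers aren’t anything exotic; you can watch the same thing on a live box with plain old iostat. The asvc_t column is the average service time per I/O in milliseconds. A sketch (your device names and intervals will differ):

# Extended per-device stats, descriptive device names, skip idle devices, 10-second samples
iostat -xnz 10

# While a sync-heavy workload like InnoDB is running, the pain shows up as asvc_t
# in the tens-to-hundreds of milliseconds on the pool's LUNs.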

So where does this leave a hapless start-up? Fortunately, you don’t have to give up ZFS. You have two options to rid yourself of the bane of existence known as write cache flushing: *** Please check out the update to this article here! There’s a better way now! ***

  • Disable the ZIL. The ZIL is how ZFS maintains consistency until it can get the blocks written to their final place on the disk. That’s why the ZIL flushes the cache. If you don’t have the ZIL and a power outage occurs, your blocks may go poof in your server’s RAM…’cause they never made it to the disk, Kemosabe. (A sketch of what this looks like follows this list.)
  • Tell your array to ignore ZFS’ flush commands. This is pretty safe, and massively beneficial.
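
For the morbidly curious, here’s roughly what the first option looks like on Solaris builds of this vintage. This is a sketch that assumes the zil_disable kernel tunable found in contemporary Solaris/OpenSolaris ZFS; verify it exists on your build, and, again, I’m not recommending it:

# In /etc/system (takes effect after a reboot):
set zfs:zil_disable = 1

# Or flip it on a live kernel with mdb; check that the symbol exists on your build first:
echo "zil_disable/W 1" | mdb -kw

Don’t do this to data you care about.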

The former option is really a no-go because it opens you up to losing data. The second option works well and is darn safe. It ends up being safe because if ZFS is waiting for the write to complete, that means the write made it to the array, and if it’s in the array cache you’re golden. Whether famine or flood or a loose power cable comes, your array will get that write to the disk eventually. So it’s OK to have the array lie to ZFS and release it almost immediately after the flush command arrives. On our StorageTek FLX210 this took the idle latencies to 1ms and the heavy-load latencies to 9ms. 9 bloody milliseconds! Our InnoDB problems disappeared like sand down a rat hole.

So how do you get your array to ignore SCSI flush commands from ZFS? That differs depending on the array, but I can tell you how to do it on an Engenio array. If you’ve got any of the following arrays, it’s made by Engenio and this may work for you:

  • Sun StorageTek FlexLine 200/300 series
  • Sun StorEdge 6130
  • Sun StorageTek 6140/6540
  • IBM DS4x00
  • many SGI InfiniteStorage arrays (you’ll need to check to make sure your array is actually OEM’d from Engenio)
  • (If you have another array that’s OEM’d from Engenio, just let me know and I’ll update the list.)

Before I give you the instructions, I must warn you that the following instructions come with no warranty whatsoever. These instructions are from me alone and have no blessing conferred by, warranty from, acceptability by, or connection with my employer DigiTar. Neither I nor my employer can be held responsible for the consequences resulting from the use of these instructions, and if you use them you absolve us both individually and collectively from any responsibility for the accuracy of these instructions
or the consequences of using these instructions. These instructions are potentially dangerous and may cause massive data loss. Caveat Emptor.

Okay, tush-covering mumbo jumbo over. On a StorageTek FLX210 with SANtricity 9.15, the following command script will instruct the array to ignore flush commands issued by Solaris hosts:

//Show Solaris ICS option
show controller[a] HostNVSRAMbyte[0x2, 0x21];
show controller[b] HostNVSRAMbyte[0x2, 0x21];

//Enable ICS
set controller[a] HostNVSRAMbyte[0x2, 0x21]=0x01;
set controller[b] HostNVSRAMbyte[0x2, 0x21]=0x01;

// Make changes effective
// Rebooting controllers
show "Rebooting A controller.";
reset controller[a];
show "Rebooting B controller.";
reset controller[b];

If you read that carefully, you noticed I said the script will cause the array to ignore flush commands from Solaris hosts. So all Solaris hosts attached to the array will have their flush commands ignored; you can’t turn this behavior on and off on a per-host basis. To run the script, cut and paste it into the script editor of the “Enterprise Management Window” of the SANtricity management GUI. That’s it! A key note here is that you should definitely have your server shut down, or at minimum your ZFS zpool exported, before you run this. Otherwise, when your array reboots its controllers, ZFS will kernel panic the server. In our experience, this happens even if you reboot only one controller at a time, waiting for it to come back online before rebooting the other. For whatever reason, MPxIO, which normally works beautifully to keep a LUN available when you lose a controller, fails miserably in this situation. It’s probably the array’s fault, but whatever the issue, that’s the reality. Plan for downtime when you do this.
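
If taking the whole server down isn’t an option, the gentlest sequence we know of is to quiesce ZFS around the controller reboots. A sketch, with "tank" as a placeholder pool name:

# Stop whatever uses the pool, then export it before running the SANtricity script
zpool export tank

# ...run the script and wait for BOTH controllers to come back online...

# Bring the pool back and make sure it is healthy
zpool import tank
zpool status tank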

In the words of the French, c’est tout…that’s all folks. This cleared up all of the ZFS latency problems we’ve been having. Hopefully, this experience will be helpful to other people. This behavior isn’t well documented outside of the ZFS mailing lists, which is why we’re documenting it here for the world to index and find. More importantly, public documentation on Engenio-based arrays is downright abysmal. If you search hard enough, you’ll find an IBM Redbook that tells you the array can ignore flush commands, but happy hunting if you actually want to know how to enable the behavior.

Just a quick note before closing…ZFS rocks. It’s that simple. So much arcane black magic disappears under the skirt of ZFS, but as always you can’t make it all go away. If anyone has instructions on how to configure non-Engenio arrays to ignore flush commands, please let me know. Stay tuned for a diatribe..er..discussion on the kernel panic behavior of ZFS. G’night y’all.