NOTE: ZFS has been enhanced to better address the situation described below by using ZFS configuration directives. This article is still accurate and provides decent background on the problem. However, an update has been posted with the newer, stronger, better way of resolving the problem: Back in the sandbox…ZFS flushing shenanigans revisited.
Running operations for a start-up company is interesting…you learn a lot of things the hard way. Among the things you learn is how nuanced it is to deal with databases and storage under heavy traffic. Before I start my little diatribe, let me profusely thank Richard Elling and Roch Bourbonnais at Sun for saving our bacon. They are stellar engineers and we are more grateful than words can say for their help in resolving our ZFS roadblocks. They’re another example of the people who make Sun the company we love to work with. Sun is blessed to have them.
About 6 months ago we moved to a new Sun StorageTek FC array, and used the opportunity to move to ZFS. We loved ZFS in development and frankly it kicks the pants off of UFS/SVM (Solaris Volume Manager). It is SO much easier to deal with for volume management, and the block checksums help you quickly eliminate your storage when tracking down corruption. That being said, ZFS really has some interesting quirks. One of them is that it is truly designed to deal with dumb-as-a-rock storage. If you have a box of SATA
disks with firmware flakier than Paris Hilton on a coke binge, then ZFS has truly been designed for you.
As a result, ZFS doesn’t trust that anything it writes to the ZFS Intent Log (ZIL) made it to your storage, until it flushes the storage cache. After every write to the ZIL, ZFS executes an fsync() call to instruct the storage to flush its write cache to the disk. In fact, ZFS won’t return on a write operation until the ZIL write and flush have completed. If the devices making up your zpool are individual hard drives…particularly
SATA ones…this is a great behavior. If the power goes kaput during a write, you don’t have the problem that the write made it to drive cache but never to the disk.
The major problem with this strategy only occurs when you try to layer ZFS over an intelligent storage array with a decent battery-backed cache. Enter our issues with ZFS on StorageTek/Engenio arrays.
Most of these arrays have sizable 2GB or greater caches with 72-hour batteries. The cache gives a huge performance boost, particularly on writes. Since cache is so much faster than disk, the array can tell the writer really quickly, “I’ve got it from here, you can go back to what you were doing”. Essentially, as fast as the data goes into the cache, the array can release the writer. Unlike the drive-based caches, the array cache has a 72-hour battery attached to it. So, if the array loses power and dies, you don’t lose the writes in the cache. When the array boots back up, it flushes the writes in the cache to the disk. However, ZFS doesn’t know that it’s talking to an array, so it assumes that the cache isn’t trustworthy, and still issues an fsync() after every ZIL write. So every time a ZIL write occurs, the write goes into the array write cache, and then the array is immediately instructed to flush the cache contents to the disk. This means ZFS doesn’t get the benefit of a quick return from the array; instead, it has to wait the amount of time it takes to flush the write cache to the slow disks. If the array is under heavy load and the disks are thrashing away, your write return time (latency) can be awful with ZFS. Even when the array is idle, your latency with flushing is typically higher than the latency under heavy load with no flushing. With our array honoring ZFS ZIL flushes, we saw idle latencies of 54ms, and heavy load latencies of 224ms. This crushed the MySQL database running on top of it.
The InnoDB tables are particularly sensitive to this, because they issue 3x more writes than MyISAM tables. Also, since InnoDB tables use disk-based transactions, you can get write loads that are orders of magnitude greater than MyISAM. If the disk latency gets bad enough, InnoDB will completely lock up the MySQL process with deadlocks.
So where does this leave a hapless start-up? Fortunately, you don’t have to give up ZFS. You have two options to rid yourself of the bane of existence known as write cache flushing:
1. Disable the ZIL.
2. Configure your array to ignore the flush commands ZFS sends.
*** Please check out the update to this article here! There’s a better way now! ***
The former option is really a no go because it opens you up to losing data. The second option really works well and is darn safe. It ends up being safe because if ZFS is waiting for the write to complete, that means the write made it to the array, and if it’s in the array cache you’re golden. Whether famine or flood or a loose power cable come, your array will get that write to the disk eventually. So it’s OK to have the array lie to ZFS and release ZFS almost immediately after the ZIL flush command executes.
On our StorageTek FLX210 this took the idle latencies to 1ms and the heavy load latencies to 9ms. 9 bloody milliseconds! Our InnoDB problems disappeared like sand down a rat hole.
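To put those latencies in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a single writer that waits for each synchronous write to be acknowledged before issuing the next one, which is a simplification of what MySQL actually does. The function and variable names are made up for illustration; only the four latency figures come from our measurements.

```python
# Latencies measured on our FLX210, in milliseconds (figures from this article).
LATENCY_MS = {
    "idle": {"flush_honored": 54, "flush_ignored": 1},
    "heavy load": {"flush_honored": 224, "flush_ignored": 9},
}


def max_sync_writes_per_sec(latency_ms):
    """Upper bound on writes/sec for one writer that waits on every write."""
    return 1000.0 / latency_ms


def speedup(before_ms, after_ms):
    """How many times lower the latency is after the change."""
    return before_ms / after_ms


for load, ms in LATENCY_MS.items():
    before = ms["flush_honored"]
    after = ms["flush_ignored"]
    print(
        f"{load}: {speedup(before, after):.1f}x lower latency, "
        f"{max_sync_writes_per_sec(before):.1f} -> "
        f"{max_sync_writes_per_sec(after):.1f} serialized writes/sec"
    )
```

Under that assumption, a writer stuck behind 54ms flushes tops out below 20 synchronous writes per second, while 1ms acknowledgements allow on the order of 1,000.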
So how do you get your array to ignore SCSI flush commands from ZFS? That differs depending on the array, but I can tell you how to do it on an Engenio array. If you’ve got any of the following arrays, it’s made by Engenio and this may work for you:
Before I give you the instructions, I must warn you that the following instructions come with no warranty whatsoever. These instructions are from me alone and have no blessing conferred by, warranty from, acceptability by, or connection with my employer DigiTar. Neither I nor my employer can be held responsible for the consequences resulting from the use of these instructions, and if you use them you absolve us both individually and collectively from any responsibility for the accuracy of these instructions
or the consequences of using these instructions. These instructions are potentially dangerous and may cause massive data loss. Caveat Emptor.
Okay, tush-covering mumbo jumbo over. On a StorageTek FLX210 with SANtricity 9.15, the following command script will instruct the array to ignore flush commands issued by Solaris hosts:
//Show Solaris ICS option
show controller[a] HostNVSRAMbyte[0x2, 0x21];
show controller[b] HostNVSRAMbyte[0x2, 0x21];
//Enable ICS
set controller[a] HostNVSRAMbyte[0x2, 0x21]=0x01;
set controller[b] HostNVSRAMbyte[0x2, 0x21]=0x01;
// Make changes effective
// Rebooting controllers
show "Rebooting A controller.";
reset controller[a];
show "Rebooting B controller.";
reset controller[b];
If you read carefully, I said the script will cause the array to ignore flush commands from Solaris hosts. So all Solaris hosts attached to the array will have their flush commands ignored. You can’t turn this behavior on and off on a per-host basis. To run this script, cut and paste the script into the script editor of the “Enterprise Management Window” of the SANtricity management GUI. That’s it! A key note here is that you should definitely have your server shut down, or at minimum your ZFS zpool exported, before you run this. Otherwise, when your array reboots ZFS will kernel panic the server. In our experience, this will happen even if you only reboot one controller at a time, waiting for one controller to come back online before rebooting the other. For whatever reason, MPXIO, which normally works beautifully to keep a LUN available when losing a controller, fails miserably in this situation. It’s probably the array’s fault, but whatever the issue, that’s the reality. Plan for downtime when you do this.
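The ordering matters more than the commands themselves, so here is a minimal sketch of the downtime sequence in Python. `zpool export` and `zpool import` are the standard ZFS commands for taking a pool offline and bringing it back; the pool name `dbpool` and the helper name are hypothetical. The sketch only prints the plan and does not run anything.

```python
import shlex


def downtime_plan(pools):
    """Return the ordered steps for applying the array change with the pools offline."""
    # Anything still writing to the pool would make the export fail.
    steps = ["# stop MySQL (or anything else writing to the pools)"]
    # Export every pool on the array before the controllers reboot.
    steps += [f"zpool export {shlex.quote(pool)}" for pool in pools]
    steps.append("# run the SANtricity script and wait for both controllers to come back")
    # Only re-import once the array is fully back online.
    steps += [f"zpool import {shlex.quote(pool)}" for pool in pools]
    return steps


for step in downtime_plan(["dbpool"]):  # "dbpool" is a made-up pool name
    print(step)
```

Printing the plan first and running the steps by hand keeps a human in the loop at the point where a mistake means a kernel panic.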
In the words of the French, c’est tout…that’s all folks. This cleared up all of the ZFS latency problems we’ve been having. Hopefully, this experience will be helpful to other people. This behavior isn’t well documented outside of the ZFS mailing lists, which is why we’re documenting it here for the world to index and find. More importantly, public documentation on Engenio-based arrays is downright abysmal. If you search hard enough, you’ll
find an IBM Red Book paper that tells you the array can ignore flush commands, but happy hunting if you actually want to know how to enable the behavior.
Just a quick note before closing…ZFS rocks. It’s that simple. So much arcane black magic disappears under the skirt of ZFS, but as always you can’t make it all go away. If anyone has instructions on how to configure non-Engenio arrays to ignore flush commands, please let me know. Stay tuned for a diatribe..er..discussion on the kernel panic behavior of ZFS. G’night y’all.
Leon Koll
January 31st, 2007 at 3:19 pm
Thanks for the great posting!
Matthew Ahrens
February 2nd, 2007 at 4:28 pm
FYI, I think what you call "fsync()" is actually the "flush write cache scsi command". Also, you should not turn off the zil. If your storage device has a non-volatile order-preserving cache, then you can safely turn off the flush write cache command by setting zfs_nocacheflush=1 in /etc/system.
Albert Chin
February 24th, 2007 at 6:22 pm
What type of ZFS config do you have? We have a 6140 controller shelf and 6140 expansion shelf. We're interested in running ZFS on this, but the 6140 isn't JBOD. You need to configure volumes as RAID 0, 1, 3, 5, or 1+0. Because we want RAID-6, we'll do ZFS RAID-Z2. And, as we're forced to do some type of RAID on this device, we're thinking of creating separate RAID 0 volumes for *every* drive, essentially making a RAID JBOD config for ZFS.
Curious what you are doing.
SD
February 25th, 2007 at 1:21 am
any tips on using Santricity with a Sun STK 6130: do you need specific firmware on 6130 to have SANtricity recognizing this particular storage?
thanks
sd
CJ Keist
April 11th, 2007 at 12:55 pm
Thank you for this write up. Our setup is a Sunfire T1000 connected via FC to an Apple XRaid using ZFS. We were seeing very poor performance over NFS. A folder of about 57MB with 115 files would take over two minutes to copy over using "cp -r".
After disabling the host cache flush settings on the XRaids, that same copy now only takes 2.59 seconds!! Disabling the host cache flush is a simple thing to do with Apple XRaid, all done through a GUI of course.
We were pointed to your blog via a Sun Engineer, so I also tip my hat off to Sun Support Engineers!!!
admin
May 7th, 2007 at 10:36 pm
Geez…I guess I haven't checked the comments in WAY too long.
Matt: That's terrific! I don't think that tunable was there in the Nevada build we were using at the time. That would have been a life-saver. Would've kept the array from rebooting.
Albert: At the time we were running RAID-Z over RAID-5. We've since moved the databases that were on the FLX210 to a Thumper using RAID-Z2. Frankly, if I had to do it on the FLX210 again, I'd do RAID-Z2 over a bunch of RAID-0 volumes (one disk per volume). We've got a 6140 in the dev lab that we're doing that with and getting some sweet performance.
SD: I'm not sure about the 6130. I can say that the Sun Common Array Manager is a bloated pain in the rear. If you download the latest Storage Manager (aka Santricity) from IBM (9.22 I think at this time) it will work like a charm with any Engenio storage that's out. Works great with the 6140 since Sun doesn't ship Santricity with it.
admin
May 7th, 2007 at 10:38 pm
CJ: Sun Support Engineers rock! I'm glad this was helpful to you, makes my stupidity seem worth it.
We used to run Xserve RAIDs, and replaced them with the Engenio kit (and now Thumpers). If we'd had ZFS then (2004), the controller reliability issues wouldn't have mattered as much. Again, glad this was of benefit to someone.
Grant
May 23rd, 2007 at 7:57 pm
Anyone know how I can tell my 6140 controller to ignore the zfs flushes? I tried setting zfs:zfs_nocacheflush in /etc/system, but it wasn't recognized by my kernel (Solaris 10 11/06). Thanks.
Ulrich Gräf
June 29th, 2007 at 1:02 pm
zfs:zfs_nocacheflush is supported in Nevada and will probably be in the next S10 update.
Unfortunately it switches off the »flush write cache scsi command« for all disks in the system – which is a bad idea if you have also zpools on local disks with write caches, like the newer SAS or SATA disks.
These commands help a lot if you have this mixture.
Andy Lubel
September 5th, 2007 at 8:47 am
can't seem to use the IBM storage manager for managing the 6130, at least the package I saw, it wasn't generic and was looking for a ds4100.. can you tell us in more detail how to get this download from IBM?
tim
April 20th, 2008 at 3:57 pm
>any tips on using Santricity with a Sun STK 6130: do you need specific firmware on 6130 to have SANtricity recognizing this particular storage?
You can't use SANtricity with a 6130 because it is not an Engenio box. STORade or CAM only.
Raj
May 3rd, 2008 at 7:40 am
I have a disk array unit of 24 1T disks with 512MB cache and battery backup. The unit is from infortrend and I have created 1 big LUN of Raid6 (h/w raid in infortrend) and attached to a solaris 10 machine and am using single LUN zfs pool. Doing this I am seeing very high IO performance compared with using ufs fs.
I am not using Raid-Z since OS only sees 1 big disk.
Is it possible to get much better performance if I make 1 LUN per disk and share all the 24 disks in the unit as 24 LUNs and then make zfs pool of Raid-Z using these 24 LUNs?
Brett Morrow
July 2nd, 2008 at 8:52 am
With the Infortrend units we saw worse performance with zfs. The problem lies in the cache flush, and Infortrend confirmed it cannot be turned off. I asked them to make a note about possibly including this. We did get a lot better performance with adding this to /etc/system:
set noexec_user_stack = 1
set noexec_user_stack_log = 1
* NOTE: Cache flushing is commonly done as part of the ZIL operations.
* While disabling cache flushing can, at times, make sense, disabling the
* ZIL does not.
* If you tune this parameter, please reference this URL in shell
* script or in an /etc/system comment.
* http://www.solarisinternals...
set zfs:zfs_nocacheflush = 1
Rejean
October 7th, 2008 at 5:03 pm
Solaris 10 11/06 and Solaris Nevada (snv_52) Releases
Set dynamically:
echo zfs_nocacheflush/W0t1 | mdb -kw
Revert to default:
echo zfs_nocacheflush/W0t0 | mdb -kw
Set the following parameter in the /etc/system file:
set zfs:zfs_nocacheflush = 1
Risk: Some storage might revert to working like a JBOD disk when their battery is low, for instance. Disabling the caches can have adverse effects here. Check with your storage vendor.
Earlier Solaris Releases
Set the following parameter in the /etc/system file:
set zfs:zil_noflush = 1
Set dynamically:
echo zil_noflush/W0t1 | mdb -kw
Revert to default:
echo zil_noflush/W0t0 | mdb -kw
Risk: Some storage might revert to working like a JBOD disk when their battery is low, for instance. Disabling the caches can have adverse effects here. Check with your storage vendor.
the second method is working.
source:
http://www.solarisinternals...
Wang
January 14th, 2009 at 10:35 pm
Thanks for the useful article!
In your article: “*** Please check out the update to this article [here]! There’s a better way now! ***”
But I can’t find the “better way” in that link, please point a way to me
Thank you very much!
Jason
January 15th, 2009 at 12:33 pm
Hi Wang, thank you for catching that. Just moved the blog from Nucleus to WordPress a few weeks ago, and it seems some of the old URLs are broken. I’ve fixed them in this entry, so it should work now.
John
September 12th, 2009 at 3:29 am
Hi There, we have a 6140 with expansion tray, 32×450GB 15000K drives attached to 2×5240’s 1.4GHz. I asked the project for a week of performance testing but was turned down. At the time I believed it made sense to allow the built in raid controllers to do the work and created 4xhardware-RAID6 LUNs and presented two luns to each T5240 in RAID0.
Each host runs 8 Oracle 10G databases and 3×10gAS app servers, all with probably different load requirements. DBA’s can’t tell me those requirements though. What I can tell you is the machines have plenty of cpu and approx 10GB real memory of head room with all apps running if I wanted the server cpu’s to do the raidz2 crunching…
Has anyone ever compared a 6140 with each disk presented as a LUN with raidz2 (mentioned above), to a 6140 with hardware raid6?
Jason
September 24th, 2009 at 4:45 pm
On a 6140 I’m not sure. On an older StorageTek FLX280 ZFS was a lot faster. I believe both are xScale procs.