OpenSolaris as Storage Server @ BizExpo


Posted by Jason | Posted in Solaris | Posted on 01-22-2008

I'm giving a presentation today at the Boise BizExpo about using OpenSolaris as the foundation for your storage infrastructure. It's a partial re-work of my ZFS talk at FORUM 2007, but focuses more on building a NAS/SAN platform out of OpenSolaris and less on ZFS in particular. If you'd like to see it in person, it's at 11:00am MST (01/23/2008) at the Boise BizExpo (Boise Center on the Grove, Room 4). Hopefully it's helpful to someone…and folks will forgive the stuttering. ;-)  The actual slides can be downloaded here: Liberating Storage with OpenSolaris


SnapBack: The joys of backing up MySQL with ZFS…


Posted by Jason | Posted in Solaris | Posted on 01-17-2008

A while back (~July 2006) we moved our core MySQL clusters to ZFS in order to…among other things…simplify our backup regimen. Nearly two years later, I can honestly say I'm in love with ZFS, mostly because of how much it's simplified and shored up our MySQL infrastructure. ZFS and MySQL go together like chips and salsa.

Now, backing up live MySQL servers is a bit like trying to take a picture of a 747's fan blades while it's in flight. There are basically three camps of MySQL backups:

 

  • Use mysqldump, which locks your tables longer than a Tokyo stoplight.
  • Quickly quiesce/lock your tables and pull a volume snapshot.
  • Or…use InnoBase Hot Backup. Outside of being licensed per server, ibbackup works very well and allows InnoDB activity to continue uninterrupted. The only downside is if you have to back up MyISAM tables (lock city!).

The mysqldump method worked best when our apps were in the development and testing phase…i.e. when the datasets were tiny. However, once you get above a few hundred megabytes, the server had better be a non-production one. Frankly, for our infrastructure, the snapshot method was the most appealing. Among other reasons, it didn't require per-server licensing, and it was the best-performing option for our mixed bag of InnoDB and MyISAM tables. Initially, the plan was to use the snapshot feature in our shiny new STK arrays…however, just about the time our new arrays arrived, ZFS was released in a reasonably stable form. While snapshotting is not unique to ZFS (and it is a widely used approach to MySQL backups), there are a few benefits to relying on ZFS snapshots for MySQL backups:

  • ZFS snapshots are bleeding fast. When you're backing up a production master, the shorter lock time is critical. Our master backups typically complete within about 3-4 seconds during high load periods.
  • No network communication for the snapshot to complete. Using ZFS, snapshot commands don't have to travel over a network to a storage controller to be executed. Fewer moving parts mean greater reliability and fewer failure conditions to handle in your backup scripts.
  • 100GB database restores/rollbacks are lightning quick…typically under 30 seconds. (Unique to the snapshot approach…not ZFS).
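Just to make that last point concrete, a restore in this scheme is literally a one-liner once mysqld is stopped. The dataset and snapshot names below are made up for illustration:

# Roll the dataset back to the state it was in when the snapshot was taken.
# (Works directly only against the most recent snapshot; add -r to discard
# any snapshots taken after it.)
zfs rollback tank/mysql-data@2008-01-16-0330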

However, settling on a backup approach was only part of the battle. Frankly, at the time, there was no commercial backup software that would do what we wanted. MySQL was the red-headed step-child of the backup software world…and to a large degree still is (Zmanda notwithstanding). So we rolled our own MySQL/ZFS backup program that we called SnapBack. It's not fancy, and you need to pair it with your own scheduler, but it is highly suited to fast and reliable production backups of MySQL in a ZFS environment. We use it for all sorts of tasks, from the original purpose of backing up busy masters to snapping datasets to build new slaves. SnapBack addresses quite a few of the issues we encountered with the existing open source MySQL backup scripts we found:

  • SnapBack understands how to quiesce MySQL for consistent backups of InnoDB tables (usually avoiding InnoDB recovery on a restore). Most of the open source scripts focus exclusively on InnoDB, and forget about disabling AUTOCOMMIT.
  • SnapBack records the current master log file name and position in the naming of the snapshot (to aid creating replication slaves). Frankly, you can take any SnapBack backup and create a slave from that point-in-time. You don't really need to know you want to do that at the time you pull the backup.
  • SnapBack is aware that InnoDB logs and table space are usually on different zpools for performance, and can snap both pools in a single backup.
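To make those bullets concrete, here's a bare-bones sketch of the cycle SnapBack automates. This is not SnapBack itself…the dataset names and the MYSQL_PW variable are made up, and SnapBack does a lot more bookkeeping…but it shows the trick: the mysql client's system command (Unix only) cuts the snapshots while the session, and therefore the global read lock, is still open:

# Bare-bones lock/snapshot/unlock cycle -- NOT SnapBack itself.
# Dataset names and the MYSQL_PW variable are hypothetical.
STAMP=`date +%Y%m%d-%H%M%S`

mysql -u root -p"${MYSQL_PW}" > /var/tmp/master-status-${STAMP}.txt <<EOF
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
system zfs snapshot tank/mysql-data@backup-${STAMP}
system zfs snapshot logs/mysql-innodb-logs@backup-${STAMP}
UNLOCK TABLES;
EOF

# The saved SHOW MASTER STATUS output (log file + position) is what you'd
# fold into the snapshot name if you want to seed a slave from it later.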

All this blathering is really just a preface to the fact that we're releasing SnapBack to the world. Hopefully it will save other folks some time, and be a useful tool in their MySQL toolbox. Below you'll find the requirements for SnapBack and a download link. If there's interest in enhancing SnapBack, please feel free to do so, as we're releasing it under the BSD license. If there's enough interest, we'll try and post it in a format more conducive to community collaboration. So without further ado…

SnapBack Rev. 1524 Requirements 

Download: SnapBack Rev. 1524 

 

 

Back in the sandbox…ZFS flushing shenanigans revisited.


Posted by Jason | Posted in DigiTar, Solaris | Posted on 10-31-2007

Nearly a year has passed since our descent into the 9th ring of latency Hades, and I wanted to make an update post on ZFS' interaction with SAN arrays containing battery-backed cache. (For the full details, please check out this older post.)

For one thing, the instructions I previously gave for ignoring cache flushes on the STK FLX200/300 series (and similar LSI OEM'd products) don't seem to work very well on the new-generation Sun StorageTek 6x00 arrays. Not to mention it's kind of nasty to have to modify your array's NVRAM settings to get good write latency.

But thanks to the brilliant engineers on the ZFS team, you no longer have to modify your array (since circa May '07 in the OpenSolaris tree). Simply add this line to your Solaris /etc/system file and ZFS will no longer issue SYNCHRONIZE CACHE commands to your array:

set zfs:zfs_nocacheflush=1 

I can confirm that this works REALLY well on both the older (FLX200/300) and newer (6140/6540) Sun/Engenio arrays! It seems to me that since the new way is a ZFS configuration directive, it should be portable/functional against any array in existence. Please note that setting this directive will disable cache flushing for ALL zpools on the system, which would be dangerous for any zpools using local disks. As always, caveat emptor. Your mileage may vary so please do let others know through the comments what works/doesn't work for you.
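If you'd rather test the effect before committing it to /etc/system, you should be able to flip the switch on the live kernel with mdb…the usual ZFS tuning approach. Verify the variable exists on your build before poking it, and remember the change evaporates at reboot:

# Check the current value (0 means ZFS is still issuing cache flushes)
echo zfs_nocacheflush/D | mdb -k

# Turn flushing off on the running kernel (reverts at the next reboot)
echo zfs_nocacheflush/W0t1 | mdb -kw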

P.S.
We've tested the zfs:zfs_nocacheflush directive successfully in Build 72 of OpenSolaris. It should also work in Solaris 10 Update 4, though we haven't tested that ourselves.


On the way to the FORUM….


Posted by Jason | Posted in Solaris | Posted on 10-05-2007


Today (10/10), by the graciousness of Sun, I've been given the opportunity to speak on ZFS and all that it's meant to us as a company. Overall, ZFS is truly an amazing and unique technology that can transform any company in ways tailored to its needs. If you'd like to see the talk live, I'll be speaking again tomorrow (10/11) at Sun's FORUM 2007 at the Adam's Mark Hotel in Denver, CO. The talk will be at 1:15PM MST. (I believe registration is required.)

If you happen to want to look at my blathering for free, I've uploaded my presentation (with talking notes) here.

BASH-like working directory prompt in PowerShell


Posted by | Posted in General | Posted on 09-07-2007


Most Linuxes (e.g. Gentoo) will set your BASH prompt to the current working directory…which for BASH means only displaying the name of the deepest directory you're sitting in. PowerShell also defaults its prompt to the current working directory, but for PS that means returning the full path. What do you do if the directory you're in is "C:\Documents and Settings\Test User\My Documents\Subversion Projects\Project 1\Development Branch"? Well, it ain't exactly pretty. There's no option within the get-location cmdlet that will return a "short" working directory (the default PS prompt uses get-location). So kemosabe, I proffer up this solution to acquiring a BASH-like working directory prompt in PowerShell:

  1. Create a directory called %DefaultUserProfile%\My Documents\WindowsPowerShell
  2. Inside your shiny new directory, create a file called profile.ps1 and place this code inside (if you already have a profile.ps1, just add this code to it):

function prompt
{
  # Split the full path on backslashes and show only the last element (the deepest directory)
  "PS " + (get-location).Path.Split("\")[(get-location).Path.Split("\").Length - 1] + ">"
}

As they say in the land of Brie…c'est tout. That's all folks. :-) Hopefully, this helps someone equally frustrated.


Nagios Remote Plug-In Executor (NRPE) under SMF


Posted by Jason | Posted in DigiTar, Solaris | Posted on 02-22-2007

NRPE (Nagios Remote Plug-In Executor) is a critical part of a lot of IT environments. In ours it provides Nagios with all sorts of interesting health info local to the host that NRPE is running on. Whether it's RAM, open connections, hard drive space or something else, NRPE helps alert you to strange happenings that simply interrogating a TCP port remotely won't reveal. Hence, it's a deal breaker for moving to OpenSolaris if you can't have it. Luckily, the benevolent gents at Blastwave provide a pre-packaged NRPE that's ready to go (Run: pkg-get -i nrpe). Unfortunately, the Blastwave NRPE package leaves the last step of placing it under init.d or SMF control as an exercise for the admin. Well, if you're like me and would like SMF to be able to manage NRPE, then you're in luck. Below are a manifest and installation instructions that will start, stop and refresh an NRPE daemon (as installed from the Blastwave package).

It's important to note that this NRPE manifest expects your NRPE configuration to be in /opt/csw/etc/nrpe.cfg and that it contains the line: pid_file=/var/run/nrpe.pid If your config file is in a different location, just edit method/nagios-nrpe in the manifest package to match where your nrpe.cfg lives. If for some reason you don't want to specify pid_file in your nrpe.cfg, then the refresh method will not operate properly. The start and stop methods will operate whether you specify a pid_file value or not. Technically, just restarting the NRPE daemon will accomplish the same thing as the refresh method, which just sends a SIGHUP to the NRPE daemon. Again, caveat emptor. This manifest and the installation instructions below are provided with absolutely no warranty whatsoever, as specified in the BSD license in the manifest header.
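To give you a feel for what the method script does (and what you'd be editing if your paths differ), here's a hypothetical skeleton of an SMF method script for NRPE. It is not the actual method/nagios-nrpe from the package, and the binary path is an assumption…adjust it to wherever Blastwave put nrpe on your system:

#!/sbin/sh
# Hypothetical skeleton of an NRPE method script -- not the packaged one.
. /lib/svc/share/smf_include.sh

NRPE_BIN=/opt/csw/bin/nrpe          # assumption: adjust to your Blastwave install
NRPE_CFG=/opt/csw/etc/nrpe.cfg
PIDFILE=/var/run/nrpe.pid           # must match pid_file= in nrpe.cfg

case "$1" in
start)
        [ -x $NRPE_BIN ] || exit $SMF_EXIT_ERR_CONFIG
        $NRPE_BIN -c $NRPE_CFG -d || exit $SMF_EXIT_ERR_FATAL
        ;;
stop)
        [ -f $PIDFILE ] && kill `cat $PIDFILE`
        ;;
refresh)
        # SIGHUP tells NRPE to re-read nrpe.cfg
        [ -f $PIDFILE ] && kill -HUP `cat $PIDFILE`
        ;;
*)
        echo "Usage: $0 {start|stop|refresh}"
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac
exit $SMF_EXIT_OK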

To install the manifest please follow these steps:

  1. Download the NRPE manifest package here.
  2. Unpack the package on your system.
  3. Change to the root of the unpacked package.
  4. Run: cp ./manifest/nagios-nrpe.xml /var/svc/manifest/network/
  5. Run: cp ./method/nagios-nrpe /lib/svc/method/
  6. Run: svccfg import /var/svc/manifest/network/nagios-nrpe.xml
  7. You're done!

If everything went smoothly, running svcadm enable nrpe should start the daemon without incident. Similarly, svcadm disable nrpe should kill it. As mentioned before, there's also svcadm refresh nrpe, which will send a SIGHUP to NRPE. That will cause NRPE to re-read its nrpe.cfg file. An interesting note on refresh is that NRPE will reliably crash on a second SIGHUP. If you were using standard init.d, this could really hose you, as NRPE would randomly terminate and you wouldn't know. With SMF however, it doesn't matter! If NRPE dies when you send it a SIGHUP, SMF will loyally restart the daemon for you. Another reason to use SMF with all of your critical services, where an automatic restart won't risk data corruption! Hope y'all find this of use!
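If you want to see the self-healing in action (and convince yourself the crash-on-second-SIGHUP quirk really is harmless under SMF), something like this little sketch will do it…the PIDs are whatever your box hands out:

# Note the current NRPE process ID
pgrep -fl nrpe

# Poke it with SIGHUPs via SMF (the second one is usually fatal to NRPE)
svcadm refresh nrpe
svcadm refresh nrpe

# Give SMF a few seconds, then look again -- a new PID means SMF restarted it
sleep 5
pgrep -fl nrpe
svcs -p nrpe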


OpenSolaris & SMF adventures with PowerDNS


Posted by Jason | Posted in DigiTar, Solaris | Posted on 02-21-2007

One of the quiet parts that powers our logistics infrastructure is PowerDNS. It's a very powerful way to serve DNS records that you need the ability to update programmatically. Unfortunately, OpenSolaris (or Solaris 10 for that matter) isn't exactly considered kosher over in PowerDNS-land. Like a lot of OSS projects, PDNS hasn't kept up with the times and treats OpenSolaris like a red-headed step-child. If you like red-headed step-children like we do, then you're in for about 8 hours of greasing, coaxing and pleading to get it compiled right. Well, either that…or you can read on and get it up in about 30 minutes. :-) As a side bonus, you'll also have PDNS managed by the coolest way ever invented to replace init.d: SMF.

Installing PDNS on OpenSolaris/Solaris 10 x64…

First thing you'll need to do is get Blastwave installed on your Solaris box. You could try and build the unholy abomination that is Boost on your own…but then you're a braver soul than I. As it's getting late, please excuse that the steps are brief and bulleted (feel free to harass me if you have questions):

  1. Make sure your path is set correctly. This path will do nicely: PATH=/usr/sbin:/usr/bin:/opt/csw/bin:/usr/sfw/bin:/usr/ccs/bin
  2. You'll need all the dev tools that come with a standard Solaris 10/OpenSolaris install…make, gcc, g++, ld etc. (You don't need Studio 11 installed. In fact, PDNS will really NOT like Studio 11 so please use gcc 3.3 or later).
  3. Run: pkg-get -i mysql5client
  4. Run: pkg-get -i mysql5devel
  5. Run: pkg-get -i boost_rt
  6. Run: pkg-get -i boost_devel
  7. Run: ln -s /opt/csw/mysql5/lib/mysql /usr/lib/mysql (This will make pathological configure scripts work a lot more smoothly.)
  8. Run: crle -l /lib:/usr/local/lib:/opt/csw/lib:/usr/lib:/opt/csw/mysql5/lib (This will help your compiled PDNS binaries find all the libraries they need at runtime. Run crle by itself first to see if there are any additional paths on your system that need to be present on this list. Caveat emptor..you run this command at your own risk as it can really bork your system if you don't know what you're doing.)
  9. Unpack the latest PDNS sources which you can get here (these instructions are known to work against 2.9.20).
  10. From within the PDNS source tree root run: ggrep -R "u_int8_t" *
  11. Manually change all the u_int8_t references that grep finds to uint8_t. If you don't do this, good ol' crotchety PDNS will not compile. (I know I should provide a patch. I'll try and do that in the next couple of days if possible).
  12. From the PDNS source tree root run: ./configure --localstatedir=/var/run --with-pic --enable-shared --with-mysql-includes=/opt/csw/mysql5/include/ CXXFLAGS="-I/opt/csw/include -DSOLARIS" LDFLAGS="-L/opt/csw/lib -lsocket -lnsl"
  13. Run: make install (This will use the prefix /usr/local/ to install everything. The SMF manifest later will expect your pdns.conf to be in /usr/local/etc/ as a result. For sanity purposes on our systems, we also symlink pdns.conf into /etc.)
  14. Bingo! Presto! You have a working PDNS server…hopefully.
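A quick way to sanity-check the build before handing it to SMF (paths assume the default /usr/local prefix from the make install step above):

# Make sure the runtime linker can find everything crle was pointed at;
# any "not found" lines mean the crle path list needs another entry.
ldd /usr/local/sbin/pdns_server | grep "not found"

# Run PDNS in the foreground with verbose logging to confirm it starts
# and can reach MySQL (Ctrl-C to stop).
/usr/local/sbin/pdns_server --daemon=no --guardian=no --loglevel=9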

Life support for PDNS…that is, running PDNS under SMF…

Service Management Facility (SMF) is a truly wonderful thing. It completely replaces init.d and inet.d, gives you a standard way of managing both types of services, understands dependencies, restarts dead services…and washes your car while you sleep. ;-) The only hiccough is you've got to write a manifest to run PDNS under SMF…or use the one below. :-D Again…caveat emptor…this SMF manifest comes with absolutely no warranty at all. Read the BSD license header at the top of the manifest for a complete description of how much it's your own darn fault if this manifest totals your system. The DigiTar SMF manifest for PDNS has a couple of neat integration features:

  • If PDNS is already started when you run svcadm enable powerdns, it will error out such that SMF puts the PDNS service into a maintenance state, and it will place an informative message in the PDNS SMF service log.
  • If you accidentally delete the pdns_server binary, SMF will not let you start the service and will place it into a maintenance state so you know something is wrong.
  • Running svcadm refresh powerdns will instruct PDNS to scan for new domains that have been added (pdns_control rediscover), as well as rescan for changes to records in existing domains (pdns_control reload).
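If you ever want to trigger those hooks by hand outside of SMF, the underlying calls are just pdns_control commands (the path assumes the /usr/local prefix used earlier):

# What 'svcadm refresh powerdns' boils down to:
/usr/local/bin/pdns_control rediscover   # scan the backend for newly added domains
/usr/local/bin/pdns_control reload       # re-read records for existing domains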

OK, enough jabbering. Here's how you install the SMF manifest:

  1. Download the DigiTar PowerDNS SMF package here.
  2. Unpack the package on your system.
  3. Change to the root of the unpacked package.
  4. Run: cp ./manifest/dns-powerdns.xml /var/svc/manifest/site/
  5. Run: cp ./method/dns-powerdns /lib/svc/method/
  6. Run: svccfg import /var/svc/manifest/site/dns-powerdns.xml
  7. You're done!

You should now be able to start your PDNS server with a simple svcadm enable powerdns. Stopping PDNS is similarly simple: svcadm disable powerdns. If you just want to see the state of the PDNS service, try svcs powerdns. That's it! You can sleep well at night knowing that if PDNS goes the way of all flesh, SMF will auto-restart it for you. Try a pkill pdns and watch the process IDs change. :-) If your PDNS service won't start, take a look at svcs -x to see why. Anywho…off to the sand man for me. If you have any questions, please feel free to contact me: williamsjj_@_digitar.com


Shenanigans with ZFS flushing and intelligent arrays…


Posted by Jason | Posted in DigiTar, Solaris | Posted on 12-14-2006


NOTE: ZFS has been enhanced to better address the situation described below by using ZFS configuration directives. This article is still accurate and provides decent background on the problem. However, an update has been posted with the newer, stronger, better way of resolving the problem: Back in the sandbox…ZFS flushing shenanigans revisited. :-)

Running operations for a start-up company is interesting…you learn a lot of things the hard way. Among the things you learn is how nuanced it is to deal with databases and storage under heavy traffic. Before I start my little diatribe, let me profusely thank Richard Elling and Roch Bourbonnais at Sun for saving our bacon. They are stellar engineers and we are more grateful than words can say for their help in resolving our ZFS roadblocks. They're another example of the people who make Sun the company we love to work with. Sun is blessed to have them.

About 6 months ago we moved to a new Sun StorageTek FC array, and used the opportunity to move to ZFS. We loved ZFS in development and frankly it kicks the pants off of UFS/SVM (Solaris Volume Manager). It is SO much easier to deal with for volume management, and the block checksums help you quickly eliminate your storage when tracking down corruption. That being said, ZFS really has some interesting quirks. One of them is that it is truly designed to deal with dumb-as-a-rock storage. If you have a box of SATA disks with firmware flakier than Paris Hilton on a coke binge, then ZFS has truly been designed for you.

As a result, ZFS doesn't trust that anything it writes to the ZFS Intent Log (ZIL) made it to your storage until it flushes the storage cache. After every write to the ZIL, ZFS issues a SCSI SYNCHRONIZE CACHE command to instruct the storage to flush its write cache to the disk. In fact, ZFS won't return on a synchronous write operation until the ZIL write and flush have completed. If the devices making up your zpool are individual hard drives…particularly SATA ones…this is great behavior. If the power goes kaput during a write, you don't have the problem that the write made it to drive cache but never to the disk.

The major problem with this strategy only occurs when you try to layer ZFS over an intelligent storage array with a decent battery-backed cache. Enter our issues with ZFS on StorageTek/Engenio arrays.

Most of these arrays have sizable 2GB or greater caches with 72-hour batteries. The cache gives a huge performance boost, particularly on writes. Since cache is so much faster than disk, the array can tell the writer really quickly, "I've got it from here, you can go back to what you were doing". Essentially, as fast as the data goes into the cache, the array can release the writer. Unlike the drive-based caches, the array cache has a 72-hour battery attached to it. So, if the array loses power and dies, you don't lose the writes in the cache. When the array boots back up, it flushes the writes in the cache to the disk. However, ZFS doesn't know that it's talking to an array, so it assumes that the cache isn't trustworthy, and still issues a cache flush after every ZIL write. So every time a ZIL write occurs, the write goes into the array write cache, and then the array is immediately instructed to flush the cache contents to the disk. This means ZFS doesn't get the benefit of a quick return from the array; instead it has to wait the amount of time it takes to flush the write cache to the slow disks. If the array is under heavy load and the disks are thrashing away, your write return time (latency) can be awful with ZFS. Even when the array is idle, your latency with flushing is typically higher than the latency under heavy load with no flushing. With our array honoring ZFS ZIL flushes, we saw idle latencies of 54ms, and heavy load latencies of 224ms. This crushed the MySQL database running on top of it. The InnoDB tables are particularly sensitive to this, because they issue 3x more writes than MyISAM tables. Also, since InnoDB tables use disk-based transactions, you can get write loads that are orders of magnitude greater than MyISAM. If the disk latency gets bad enough, InnoDB will completely lock up the MySQL process with deadlocks.
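If you want to watch this effect on your own gear, plain old iostat is enough…the asvc_t column is the average time in milliseconds the array is taking to complete an I/O, and it balloons when every ZIL write drags a cache flush behind it:

# Per-device I/O statistics every 5 seconds, skipping idle devices.
# Watch asvc_t (average service time, ms) on the LUNs backing the zpool.
iostat -xnz 5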

So where does this leave a hapless start-up? Fortunately, you don’t have to give up ZFS. You have two options to rid yourself of the bane of existence known as write cache flushing: *** Please check out the update to this article here! There’s a better way now! ***

  • Disable the ZIL. The ZIL is the way ZFS maintains consistency until it can get the blocks written to their final place on the disk. That's why the ZIL flushes the cache. If you don't have the ZIL and a power outage occurs, your blocks may go poof in your server's RAM…'cause they never made it to the disk, Kemosabe.
  • Tell your array to ignore ZFS’ flush commands. This is pretty safe, and massively beneficial.

The former option is really a no-go because it opens you up to losing data. The second option works really well and is darn safe. It ends up being safe because if ZFS is waiting for the write to complete, that means the write made it to the array, and if it's in the array cache you're golden. Whether famine or flood or a loose power cable comes, your array will get that write to the disk eventually. So it's OK to have the array lie to ZFS and release ZFS almost immediately after it issues the ZIL flush command. On our StorageTek FLX210 this took the idle latencies to 1ms and the heavy load latencies to 9ms. 9 bloody milliseconds! Our InnoDB problems disappeared like sand down a rat hole.
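For the record, the ZIL-disable route is also just a one-liner in /etc/system on builds of this vintage. I'm showing it only so you recognize it if you trip over it elsewhere…don't do this to data you care about:

# DANGEROUS: disables the ZIL entirely; synchronous writes can vanish on a
# power loss. Shown for recognition, not recommendation.
set zfs:zil_disable = 1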

So how do you get your array to ignore SCSI flush commands from ZFS? That differs depending on the array, but I can tell you how to do it on an Engenio array. If you've got any of the following arrays, it's made by Engenio and this may work for you:

  • Sun StorageTek FlexLine 200/300 series
  • Sun StorEdge 6130
  • Sun StorageTek 6140/6540
  • IBM DS4x00
  • many SGI InfiniteStorage arrays (you’ll need to check to make sure your array is actually OEM’d from Engenio)
  • (if you have another Engenio OEM’d array manufacturer, just let me know and I’ll update the list.)

Before I give you the instructions, I must warn you that the following instructions come with no warranty whatsoever. These instructions are from me alone and have no blessing conferred by, warranty from, acceptability by, or connection with my employer DigiTar. Neither I nor my employer can be held responsible for the consequences resulting from the use of these instructions, and if you use them you absolve us both individually and collectively from any responsibility for the accuracy of these instructions or the consequences of using these instructions. These instructions are potentially dangerous and may cause massive data loss. Caveat Emptor.

Okay, tush-covering mumbo jumbo over. On a StorageTek FLX210 with SANtricity 9.15, the following command script will instruct the array to ignore flush commands issued by Solaris hosts:

//Show Solaris ICS option

show controller[a] HostNVSRAMbyte[0x2, 0x21];

show controller[b] HostNVSRAMbyte[0x2, 0x21];

//Enable ICS

set controller[a] HostNVSRAMbyte[0x2, 0x21]=0x01;

set controller[b] HostNVSRAMbyte[0x2, 0x21]=0x01;

// Make changes effective

// Rebooting controllers

show “Rebooting A controller.”;

reset controller[a];

show “Rebooting B controller.”;

reset controller[b];

If you read carefully, I said the script will cause the array to ignore flush commands from Solaris hosts. So all Solaris hosts attached to the array will have their flush commands ignored. You can't turn this behavior on and off on a per-host basis. To run this script, cut and paste it into the script editor of the "Enterprise Management Window" of the SANtricity management GUI. That's it! A key note here is that you should definitely have your server shut down, or at minimum your ZFS zpool exported, before you run this. Otherwise, when your array reboots ZFS will kernel panic the server. In our experience, this will happen even if you only reboot one controller at a time, waiting for one controller to come back online before rebooting the other. For whatever reason, MPxIO, which normally works beautifully to keep a LUN available when losing a controller, fails miserably in this situation. It's probably the array's fault, but whatever the issue, that's the reality. Plan for downtime when you do this.

In the words of the French, c'est tout…that's all folks. This cleared up all of the ZFS latency problems we've been having. Hopefully, this experience will be helpful to other people. This behavior isn't well documented outside of the ZFS mailing lists, which is why we're documenting it here for the world to index and find. More importantly, public documentation on Engenio-based arrays is downright abysmal. If you search hard enough, you'll find an IBM Redbook that tells you the array can ignore flush commands, but happy hunting if you actually want to know how to enable the behavior.

Just a quick note before closing…ZFS rocks. It's that simple. So much arcane black magic disappears under the skirt of ZFS, but as always you can't make it all go away. If anyone has instructions on how to configure non-Engenio arrays to ignore flush commands, please let me know. Stay tuned for a diatribe…er…discussion on the kernel panic behavior of ZFS. G'night y'all.

SunFish Chum…er…Odds and ends.


Posted by Jason | Posted in DigiTar, Technology | Posted on 08-15-2006

Currently, we're putting the N1400Vs into production and there were two odds and ends that came to mind that I wanted to mention:

  1. No client/server settings per port! Hooray! The Alteons (even the 2424s) inherited from the Alteon AD4s and 184s the need to enable client and/or server processing per port. For those who are not familiar, server load balancing can basically be reduced to two operations:
    • Client processing: When a packet comes in from a web browser to the web switch, its header has a TO field that's the IP address of the web switch, and a FROM field that's the IP address of the web browser. Once the web switch gets the packet and decides which back-end server to send it to, it has to replace the packet's TO with the IP address of the back-end server. If the web switch didn't change the TO and simply sent the packet on, the server would ignore the packet. Sort of like receiving a letter addressed to somebody you don't know. So in a nutshell, client processing is simply replacing the web switch's IP address with the selected back-end server's IP address in packets from the client.
    • Server processing: When the back-end server decides to send a response packet back to the client, the reverse of client processing has to occur. If the web switch were to simply send the packet from the server back to the client without server processing, the client would ignore the packet. Why? Well, the client sent the packet to the IP address of the web switch and expects a reply from that IP address, not the server's IP. It's sort of like sending a letter to Aunt Gertie, but getting the reply from Aunt Gertie's nurse Josie. You don't know who Josie is, so you toss the reply thinking it's junk mail. Server processing fixes this by rewriting the FROM in the server's reply to the IP address of the web switch.
    • An Alteon is a bit unusual in that instead of one massive SLB processor it has 8…one per port (this is fixed in the 2424s, but they imitate the older behavior for backward compatibility). So if you have one port connected to your servers and a second port connected to the Internet, you have to enable Client processing on the Internet-facing port and Server processing on the server-facing port. The reason is that the 8 individual processors aren't beefy enough to do BOTH the client and server processing. As a result, the operation gets split between ports in the way you specify. So you have to remember which kind of processing is which, and set it appropriately on the right ports. This is a MAJOR pain in the butt. If you get client and server processing confused and set a port to the wrong one, load balancing just isn't gonna work for you today.
    • The SunFish don't have this limitation. They just make it work. Concentrate on creating your VIPs and RIPs and the rest is taken care of for you. It's really a spectacular change for us! It was so easy that it wasn't until I was driving home that it struck me I hadn't had to fool with client or server processing at all.
  2. XML-over-HTTP! As I was complaining about the lack of a heads-up display on the SunFish, I ran into a very cool feature! On most of the pages that list settings or statistics in the SunFish WebUI, there's a little button labeled "XML". If you click on it, you get the settings or stats you were looking at…but XML-encoded! This means you can write your own scripts to consume the status of the SunFish! All your program needs to be capable of is downloading pages via HTTP and consuming XML. The upshot is that this feature enables us to write our own stop-gap heads-up display. :-) It's much simpler than messing around with SNMP calls and the like, particularly given our familiarity with consuming web services. This is a terrific feature! Props to the SunFish team for providing an XML interface to the unit. Simply amazing.
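As a taste of what that makes possible, a status poller can be as dumb as a couple of shell commands. Everything below is hypothetical…the URL, the credentials and the XPath are placeholders, since the real path is whatever the page's "XML" button points at:

# Hypothetical poller: fetch a stats page as XML and pull one value out.
# Point the URL at the target of the "XML" button on the page you care about.
curl -k -u admin:PASSWORD "https://sunfish.example.com/stats/slb?format=xml" \
  | xmllint --xpath 'string(//virtualService/@currentConnections)' -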


Take Two: SunFish takes incoming fire…


Posted by Jason | Posted in Technology | Posted on 08-08-2006



The biggest toughest fish in the sea…and load-balancer on earth…

Might as well call it CryptoFish…

Up until this point, we've never had the pleasure of using SLB-based SSL-offload. We've just trudged along…scaling the ol' SSL proxies along with the app servers. While cost-effective, the customer doesn't get Speedy Gonzales latencies, and the number of moving parts goes up. As you might guess from yesterday's entry, a main design goal for us is reducing moving parts. DigiTar believes y'all can't break stuff that doesn't exist. Needless to say, moving to SSL-offload in our load-balancers is something we're pretty keen on. Among other things, we'll be able to re-purpose all those beefy SSL proxy servers for other things…like space heaters.

Enter today's first chore…migrate the HTTP SLB VIP to an HTTPS config leveraging the SSL-offload capabilities of the SunFish. While we've never configured SSL-offload before, given this review on the Alteon 2424-SSL, I was prepped for a beastly day. In case you don't want to read the link, on a 2424-SSL you're looking at a configuration task that involves redirection and 3 sets of filters on 3 different VLANs. This is all to get the traffic to and from the SSL card…which is more of a suckerfish (yeah, the ones that feed on sharks) than a real part of the box.



An Alteon 2424 with its SSL card in tow…

So what's all this leading up to? We had SSL up on the SunFish in about 5 minutes! That's right, four clicks of the mouse and bang, I was done. What's more…it worked! The hardest part was trying to pull the proxies' key and cert out of the Subversion repository. Here are the 4 steps to converting an existing SunFish HTTP SLB group to utilize SSL-offload goodness:

  1. Copy your existing key from its Apache PEM file, and paste it into the import page on the SunFish WebUI. Oh…you've also got to name it something clever. Alright, maybe not clever, but it's gotta have a name.
  2. Grab the certificate portion from the aforementioned PEM, and paste it into the import page on the SunFish WebUI…and select the name you assigned the key in the last step using the drop-down.
  3. Delete the existing HTTP virtual service.
  4. Re-create the HTTP virtual service as an HTTPS virtual service with all the same parameters as before, only this time select your cert and key from the “Certificate And Key Name” drop-down.

It's that easy. No redirection filters. No funky hidden VLANs. No nothin'…no kidding. There was so much time left over, I spent 20 minutes figuring out how to do on-the-fly HTTP header modification…which also worked perfectly. The SunFish is so easy to use for crypto, its purchase price is probably justified by what we'll save doing things other than configuring Apache SSL proxies. If the 2424 wasn't dead on our purchase list by now, this was certainly a double-tap to its mainboard.

You need GigE?

Well, we tried to set up redundancy…but the effort was a bit stillborn. One of the only complaints I've got about the physical aspects of the SunFish is that all of its ports are SFPs. As a result, to use RJ-45 cables you have to use GigE copper SFPs (which David C was very generous to provide). The problem is that an SFP has no concept of 10/100/1000. It's designed to be a gig optical port. As a result, we're pretty limited in our dev server room as to what we can hook it up to. In terms of the test, the X4100s are directly connected (they've got GigE ports), and the "Internet" port goes to a GigE-capable firewall we use. Yeah…I know. It's weird. We have a GigE firewall but no GigE switches in our dev lab. Alas, that means we can't test the redundancy…but that doesn't mean I can't pontificate about it. :-)

The SunFish definitely has Alteon-rising when it comes to redundancy. No elegant custom protocol that binds two units as one. Rather, they use VRRP to tie the interfaces of redundant SunFish together, and an unnatural progeny of VRRP they call VSRP to handle failover of SLB virtual services between units. All-in-all darn identical to the Alteon…almost.

Like the Alteon, VRRP has to be set up separately on each interface you want failover for. That's a royal pain in the keester. However, VSRP is a little different, and a lot better than the Alteon's way of hacking VRRP to handle virtual service failover. With the Alteon you have to configure a VRRP instance for each SLB virtual service…just like the interfaces. When you're running close to 15 virtual services per web switch, this is beyond a royal pain. It's excruciating. Rube Goldberg could design a better way.

The SunFish does it much better. You just turn VSRP on. Yes…that's just about it. Of course, you have to set which unit in a pair is the master and which is the backup…but that's all you're required to do. No matter how many virtual services across a disparate number of vSwitches might catch your fancy, you only have one VSRP instance to enable per web switch. For us that's incredible! As much as I would prefer a custom protocol that makes the entire process transparent, I'll take a SunFish thank you very much.

Feelings thus far…

So far I've been incredibly impressed by the SunFish. It is by far the most robust and powerful web switch a kid with a shiny nickel can buy (OK…500,000 shiny nickels)…without buying its big brother. So far there isn't one thing that would keep the SunFish from surpassing our needs in the following areas:

  • Consolidating a gaggle of web switches down to 2. Securely.
  • Putting a brigade of SSL proxy servers out of commission.
  • Giving us more power than we'll ever need in a single rack. Heck, this baby will run multiple processing silos for us without breaking a sweat.

The one area I can't really stress enough is the amount of flexibility the SunFish will give us. The raw power combined with the virtualization capabilities will only increase our ability to deliver mind-blowing solutions to our customers (and keep the price lower than our competitors'). Now the task is to put these boxes into a more production-like environment. More on this soon. :-)

P.S.
We still want our AlteonEMS for SunFish! This would make the SunFish far and away the best web switch, period. Yeah, F5 may have its iRules…but you put a few of those on one of their boxes and the gear keels over. To meet our needs with F5 gear we'd need to buy enough to finance a fleet of Ferraris. SunFish or Alteons, baby. Power over looks. :-)
