Riak vs. CouchDB for storing 100,000+ coupons…


Posted by Jason | Posted in DigiTar, NoSQL, Software Development | Posted on 03-07-2011


We’ve been using both CouchDB and Riak for some time now on a number of our APIs and user-facing services. When CouchDB wins out over Riak, it’s usually for two reasons:

  • secondary indexes
  • multi-data center replication (would be great if Basho would open source this)

Both Riak and CouchDB excel at storing records under a primary key, but sometimes you need to look up the data along a different axis. For example, all of our records are stored as JSON dictionaries, and at times we want to find all the records that match a particular field in those dictionaries. That’s the situation we’re in for a new service we’ll be standing up soon. We wanted to generate coupon codes that customers could redeem for service as an alternative to providing a credit card. The coupon codes can be sold through affiliates, so one of the axes we’ll need to query in addition to the coupon code is: what are all the coupon codes belonging to affiliate ID X?

One approach (without using secondary indexes) would be to try to encode both the code and the affiliate ID in the record’s key (e.g. prefix_code_affiliate_id). The main issue with that approach is that some of our access patterns don’t have access to the affiliate ID (e.g. a user signing up for service)…they only know the coupon code. So we need fast lookup of the record based on the code alone. That pretty much eliminated map/reducing to find the coupon code, and firmly established the code alone as the right choice for the data key. The perfect solution would be a secondary index on the “affiliate_id” field of the JSON dictionary…in other words, a map of affiliate ID to the data. Normally, this is where we’d turn to CouchDB’s views and call it a day. But we’re planning on having millions of coupons in the system, with thousands of parallel accesses that need to be evenly loaded across the datastore…not a scenario where CouchDB excels. Riak would be the perfect choice…except it has no native secondary indexes.

Doing indexes in Riak

There are a couple of ways you can do your own indexes in Riak. The simplest approach is to create an index key of the form idx_<field_name>_<field_value> and shove in a JSON list of the keys containing matching records. What you’ll run into very quickly is multiple clients trying to update that index key with new records, and overwriting each other. Since Riak keeps multiple versions of a key in the event of a conflict, you can code your clients to auto-merge the conflicted versions into one master list and re-post the index. But…that puts a lot of maintenance logic in your clients, and if one of those updates is deleting a key from the index, the merge process can put the deleted key right back into the index.
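To make the failure mode concrete, here’s a minimal Python sketch of the naive approach. The function and coupon key names are hypothetical, not from any real Riak client library:

```python
# Minimal sketch of the naive single-index-key approach and its merge
# problem. Names are illustrative only.

def merge_index_siblings(siblings):
    """Auto-merge conflicting versions of the index into one master list."""
    merged = set()
    for key_list in siblings:
        merged.update(key_list)
    return sorted(merged)

# Writer A removed coupon_123 from its copy of the index; writer B,
# which never saw the deletion, wrote a conflicting copy. The union
# merge quietly resurrects the deleted key.
sibling_a = ["coupon_456", "coupon_789"]
sibling_b = ["coupon_123", "coupon_456", "coupon_789"]
print(merge_index_siblings([sibling_a, sibling_b]))
```

A plain union has no way to distinguish “never added” from “deliberately removed”, which is exactly why the deleted key comes back.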

Since we don’t want index management to have that many moving parts we came up with a different approach:

  • For every field being indexed on a particular record, create a new separate index key in a dedicated index bucket.
  • Store the indexed field’s name in the bucket name for the index key and store the indexed value and data key’s name in the name of the index key.
  • MapReduce to get a list of matching index keys for any particular question by iterating over the keys of an index’s bucket and splitting the index key name apart to analyze the value.

The format of an index key becomes (we use key prefixes to namespace different data key types in the same bucket):

Bucket Name: idx=<field_name>=<data_key_prefix>
Key Name: <data_key_name>/<field_value>
Key Value: empty
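In Python, building and parsing keys under this scheme might look like the following sketch. The helper names and the example coupon key are made up for illustration; real code would also need to escape “=” and “/” appearing inside field names and values (an issue the library mentioned later has to handle):

```python
# Sketch of the index-key scheme described above.

def index_bucket(field_name, data_key_prefix):
    """Bucket Name: idx=<field_name>=<data_key_prefix>"""
    return "idx=%s=%s" % (field_name, data_key_prefix)

def index_key(data_key_name, field_value):
    """Key Name: <data_key_name>/<field_value> (the key's value stays empty)."""
    return "%s/%s" % (data_key_name, field_value)

def parse_index_key(key_name):
    """Recover the data key name and indexed value from an index key name."""
    data_key_name, field_value = key_name.split("/", 1)
    return data_key_name, field_value

bucket = index_bucket("redeemed_cnt", "coupon")   # idx=redeemed_cnt=coupon
key    = index_key("coupon_8F3A2C", 42)           # coupon_8F3A2C/42
print(bucket, key, parse_index_key(key))
```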

The immediate advantage of having an index key for every indexed field of a data key is the reduced chance of write conflicts when maintaining the index (you pretty much eliminate the chance a deleted index key is going to get resurrected). Asking the question “How many coupons have a redeemed count < 50?” becomes a simple MapReduce job that iterates over the idx=redeemed_cnt=coupon index bucket to find index keys where the field_value is < 50.

You might have noticed that we don’t store any data in the value of the index key… That’s on purpose, because it allows us to leverage a new feature of Riak 0.14…key filters for MapReduce jobs.

Key filters

The index system described so far would work fine on any key/value store that supports MapReduce. The problem is query performance: every key in the bucket has to be examined by the Javascript map and reduce phases to determine whether it matches the question (i.e. is this indexed value < 50?). It takes Riak far longer to run a user-supplied Javascript function against each key than it would to analyze the index key name itself.

Luckily the smart folks at Basho gave us a new tool to do just that: key filters. By encoding the indexed value in the key name, we can tell Riak via a key filter to:

  1. Tokenize the index key name using “/” as the separator.
  2. Look at the second token after the split (i.e. the indexed value).
  3. Treat that token as an integer.
  4. Only give the index key to the MapReduce job if the integer value is < 50.

In fact, with key filters we actually don’t have to write our own MapReduce phases to answer this question anymore. All we have to do is construct the key filter, and tell Riak to use the “identity reduce” reduce phase that’s built in (skip the map phase entirely). What we’ll get back is a list of index keys whose indexed value is < 50. We can then split those index key names in our client to get the key names of the data keys they map to.
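Concretely, that whole query can be expressed as a single MapReduce request with no custom Javascript at all. Here’s a sketch of what the request might look like; the bucket name follows the scheme above, and Riak’s built-in Erlang reduce_identity plays the role of the “identity reduce”:

```python
import json

# Key-filter chain: tokenize the index key name on "/", keep the 2nd
# token (the indexed value), read it as an integer, and pass the key
# through only if that integer is < 50.
job = {
    "inputs": {
        "bucket": "idx=redeemed_cnt=coupon",
        "key_filters": [
            ["tokenize", "/", 2],
            ["string_to_int"],
            ["less_than", 50],
        ],
    },
    # No map phase at all: Riak's built-in identity reduce just hands
    # back the matching bucket/key pairs.
    "query": [
        {"reduce": {"language": "erlang",
                    "module": "riak_kv_mapreduce",
                    "function": "reduce_identity"}}
    ],
}
print(json.dumps(job, indent=2))
```

POSTing that JSON to a node’s /mapred endpoint returns the matching index keys, which the client then splits to recover the data key names.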

Rubber meets the road…

So what does the performance look like with all of this? We wrote a couple of tests using Twisted Python to benchmark loading 100,000 coupons into CouchDB and Riak and then asking both how many of those coupons had a redeemed count < 50. Here’s a legend for what the different test titles mean:

  • CouchDB 1.0.2 (cold view): The amount of time it takes the view (index) to answer the question the first time the view is queried. This number is important because CouchDB doesn’t build the view until you query it the first time. From then on it just incrementally updates the view with changed values.
  • CouchDB 1.0.2 (computed/warm view): Amount of time it takes the view to answer the question on subsequent queries after the view has been computed initially.
  • Riak 0.14.1 (Raw MapReduce) 1-node: No indexes used. A brute-force MapReduce job that iterates over the data keys and examines the redeemed_count field in the JSON. 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Raw MapReduce) 1-node: Using index keys as described above, but using Javascript MapReduce phases on the index bucket to produce the matching key list…no key filters used. 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Key Filter MR) 1-node: Using index keys as described, but with key filters to reduce the input and a simple Javascript map phase to reformat the output (this would be a JS reduce phase except Riak has a bug right now with MapReduce jobs that have only a JS reduce phase). 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Raw MapReduce) 4-node: Same as “Indexed w/ Raw MapReduce” above except done on a 4-node Riak cluster.
  • Riak 0.14.1 (Indexed w/ Key Filter MR) 4-node: Same as “Indexed w/ Key Filter MR” above except done on a 4-node Riak cluster.

Before I show the numbers, you’d probably like to know what the test setup looked like. Each node was a SoftLayer Cloudlayer server with these specs (if you haven’t tried them, SoftLayer is really a phenomenal provider):

  • 1x 2.0GHz Xeon CPU
  • 2GB RAM
  • 25GB HDD
  • Gigabit NICs
  • Ubuntu 10.04.1 64-bit
  • Dallas 05 Datacenter
  • CouchDB 1.0.2 was built from source.
  • Riak 0.14.1 was installed from the .deb available from Basho.
  • Before each type of test the servers were rebooted to clear the filesystem cache.
  • Tests were run from a 5th node not running Riak or CouchDB. For the 4-node tests, the 5th node ran HAProxy 1.4.8 to round-robin client connections amongst the Riak nodes.

So without further ado…the numbers:

Test                                            Generate Keys (secs)   Show Keys w/ redeemed_count < 50 (secs)
CouchDB 1.0.2 (cold view)                               495                          74
CouchDB 1.0.2 (computed/warm view)                      495                          11
Riak 0.14.1 (Raw MapReduce) 1-node                      358                          82
Riak 0.14.1 (Indexed w/ Raw MapReduce) 1-node           692                          65
Riak 0.14.1 (Indexed w/ Key Filter MR) 1-node           692                          56
Riak 0.14.1 (Indexed w/ Raw MapReduce) 4-node          1025                          40
Riak 0.14.1 (Indexed w/ Key Filter MR) 4-node          1025                          34


Or if you’re more visual like me:


(If you’d like to run the tests yourself, we’ve put the code up: riak_perf_test.py, couchdb_perf_test.py).


Analyzing the outcome

One thing that’s very clear is how fast computed views can be in CouchDB (11 seconds flat is nothing to shake a stick at). However, what did we learn from the Riak numbers?

  • Indexed insertion is 91% slower than storing just the key data.
  • MapReduce with indexes is 20% faster than MR on the data keys alone.
  • MapReduce with indexes and key filters is 32% faster than MR on the data keys alone.
  • Adding Riak nodes substantially reduces query time. Adding 3 more nodes speeds up queries by 40%.
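For reference, the query-time gains quoted above can be recomputed from the table (a quick sketch; the post rounds the results):

```python
# Query times in seconds from the benchmark table, for
# "Show Keys w/ redeemed_count < 50".
raw_mr_1node     = 82
indexed_mr_1node = 65
keyfilter_1node  = 56
keyfilter_4node  = 34

def pct_faster(old, new):
    """Percentage improvement of `new` over `old`, to one decimal."""
    return round(100.0 * (old - new) / old, 1)

print(pct_faster(raw_mr_1node, indexed_mr_1node))    # indexed MR vs raw MR
print(pct_faster(raw_mr_1node, keyfilter_1node))     # key filters vs raw MR
print(pct_faster(keyfilter_1node, keyfilter_4node))  # 4-node vs 1-node key filters
```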

It’s not surprising that insertion time roughly doubles with indexes, since you’ve just doubled the number of keys you’re inserting. However, the gains you get at query time can be dramatic. Once the bug with Javascript reduce phases is ironed out, I expect the performance on this test to improve even further (since it will run only one reduce phase instead of running the map code multiple times).

What’s a little puzzling is why insertion of keys on a 4-node cluster is 40% slower than on a 1-node cluster. I had expected insertion to be the same speed or 25% faster. The reason I’d expected this is that Riak was set to use a write n-value of 3…meaning for every key inserted, 3 copies were stored throughout the cluster. Accounting for coordination latency on a 3-node cluster, I’d expect almost the same insertion speed as a 1-node Riak instance. With an extra node in the cluster, I’d expect slightly faster performance, since only 3 nodes out of the cluster are engaged in any given write.

Regardless, the query performance proves Riak is a good choice for our use case. While 34 seconds to answer the question is slower than the 11 seconds it took CouchDB, it’s clear that as we scale the cluster, our query performance will keep pace with the size of our dataset. Provided we can find a solution for the 50% slower insertion speed, Riak will definitely be our datastore of choice for this project. Once again, Riak is incredibly impressive in how well it handles large data sets and how adept its simple toolset is at answering complex questions.


Where we go from here…and nagging questions

The indexed key approach has worked so well for us that we’re currently writing a library to layer on top of txRiak to transparently handle writing/updating indexes. I’ll put up another blog entry on that once it’s posted to Github (there are a few issues we have to handle, like escaping the separators, and we intend to use Riak’s Links to provide an alternate connection between index and data keys). Even more exciting is the news that Basho is currently working on adding native secondary index support to Riak. No news on how that will take shape, but I expect it will blow the performance of our homegrown indexes out of the water. I think built-in indexes are a cleaner, more maintainable approach. Maintaining indexes in the client puts a lot of pressure on the clients not to screw up the index accidentally…especially if you’ve got code written in multiple languages accessing that data.

The only real nagging question right now is an issue we saw when we attempted to add a 5th node to the Riak cluster. I had originally intended to do an analysis of how much query performance improved with each node added. However, when the 5th node was added to the cluster, it took close to 1 hour for Riak to fully redistribute the keys…and even then 3 of the 5 nodes showed that they were still waiting to transfer one partition each to another node. When we attempted to run the MapReduce index query against the newly expanded cluster, we received MapReduce errors: Riak couldn’t find random index keys as it attempted to pass these “missing” keys into the map phase. I suspect the culprit may be some “failed node” testing we did before adding the 5th node.

Overall, the takeaway for me is that Riak is a phenomenally flexible data store, and that just because it’s missing a feature doesn’t mean you should shun it for that workload. More often than not, a little thought and chaining together of Riak’s other very powerful tools will give you the same result. To me, Riak vs CouchDB (or vs. SQL) is like a RISC chip vs. a CISC chip. It may not have one complex instruction you need, but you can build that instruction out of much simpler ones that you can run and scale at twice the speed.

Cloud-scale DBs in the cloud…just a quickie


Posted by Jason | Posted in DigiTar, Software Development | Posted on 03-17-2010


Just a quick set of thoughts…do cloud-scale DBs save money because they’re based on commodity/cheap servers? Tonight I did some rough back-of-the-pad calculations, and was kind of surprised…

Let’s assume we’ve got an 11TB working set of data. How could we store it redundantly?

(cloud servers in these examples are dedicated servers at a cloud provider)

Option 1: Two beefy storage servers running MySQL in a master/slave config

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 16GB
  • HDDs: 48x 250 GB SATA
    • Lose 2 for mirrored boot, and 2 for RAID-6 parity
  • Cost:
    • Buy Your Own Hardware (Sun X4500): $50,000 for the pair
    • Host It in the Cloud (SoftLayer): $4,700/month for the pair

Option 2: 28 commodity servers (2 replica copies for each piece of data) running HBase or Cassandra

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 4GB
  • HDDs: 4x 250 GB SATA
    • Lose 1 for RAID-5 parity (we’ll mingle boot data and data data on the same drive pool)
  • Cost:
    • Buy Your Own Hardware (Dell R410): $43,300 for set of 28
    • Host It in the Cloud (SoftLayer): $12,000/month for the set of 28

Option 3: 42 commodity servers (3 replica copies for each piece of data) running HBase or Cassandra

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 4GB
  • HDDs: 4x 250 GB SATA
    • Lose 1 for RAID-5 parity (we’ll mingle boot data and data data on the same drive pool)
  • Cost:
    • Buy Your Own Hardware (Dell R410): $64,900 for set of 42
    • Host It in the Cloud (SoftLayer): $18,000/month for the set of 42

Now, the thing that surprised me here isn’t the raw cost differential between stuffing your own hardware in your colo and using a cloud provider. And I’m not picking on SoftLayer…Rackspace and Voxel work out to the same cost scaling as SoftLayer (and in the case of those two vendors, worse).

What surprised me:

  • When you buy your own hardware, “cloud-scale” databases do cost you less (~$7K) than buying beefy storage servers and running MySQL for the same data set.
  • However, when you are at a cloud provider, using cloud-scale databases on “cheap” hardware costs you 3x more than using beefy storage cloud servers running MySQL.

As I said, I’m not comparing the cost of running Option 1 on your own hardware vs. Option 1 at a cloud provider. Yes those costs are more at the cloud provider, but it’s to be expected (they’re bundling in bandwidth, colo, power, and most importantly people to manage the hardware and network).

What’s stunning is that beefy servers at a cloud provider are much more cost efficient. Beefy cloud servers rent for roughly 1/10 of the cost of the hardware every month, whereas “cheap” commodity cloud servers rent for roughly 1/3 of the cost of the hardware every month. That’s a much higher markup on the cheaper volume servers.
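Plugging in the figures from the three options above makes the markup gap explicit (a quick sanity check, using only the numbers already listed):

```python
# Monthly cloud rent as a fraction of the buy-your-own hardware price,
# using the figures from the three options above.
options = {
    "Option 1: 2 beefy MySQL servers":     (50000, 4700),
    "Option 2: 28 commodity (2 replicas)": (43300, 12000),
    "Option 3: 42 commodity (3 replicas)": (64900, 18000),
}
for name, (hardware_cost, monthly_rent) in options.items():
    ratio = hardware_cost / float(monthly_rent)
    print("%s: rent is 1/%.1f of hardware cost per month" % (name, ratio))
```

The beefy pair works out to roughly 1/10.6 per month; both commodity options work out to roughly 1/3.6.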

Please comment and correct me if I’m wrong in my analysis…I would actually like to be.

Viva la storage.


Posted by Jason | Posted in DigiTar, Solaris, Technology | Posted on 11-10-2008

Coming soon… ;-)

Remember the Alamo…


Posted by Jason | Posted in DigiTar, Solaris, Technology | Posted on 05-28-2008

Tomorrow (05/28/2008) I'm giving a talk on moving to open storage (i.e. ethernet, OpenSolaris and SATA…in no particular order) at the Diocesan Information Systems Conference in San Antonio. It's a closed event, but here are the slides from the talk…including the talking notes which cover a lot more than I'll probably have time for:


Democratizing Storage


Posted by Jason | Posted in DigiTar, Solaris | Posted on 04-21-2008


As a company that was heavily populated with Linux zealots, it’s been surreal for us to watch OpenSolaris develop for the past 3 years. While technologies like DTrace and FMA are features we now use everyday, it was storage that brought Solaris into our environment and continues to drive it deeper into our services stack. Which begs the question: Why? Isn’t DTrace just as cool as ZFS? Haven’t Solaris Containers dramatically changed the way we provision and utilize systems? Sure…but storage is what drives our business and it doesn’t seem to me that we’re alone.

Everything DigiTar does manipulates or massages messaging in some way. When most people think of what drives our storage requirements they think of quarantining or archiving e-mail. But when you’re dealing with messages that can make or break folks’ businesses, logging the metadata is perhaps the most important thing we do.

Metadata is flooding in every second. It’s at the center of everything from proving a message was delivered to ensuring we meet end-to-end processing times and SLAs. If we didn’t quarantine any more messages, we’d still generate gigabytes of data every day that can’t be lost. Without reliable and scalable storage we wouldn’t exist.

Lost IOPs, Corruption and Linux…oh my!

What got us using OpenSolaris was Linux’s (circa 2005) unreliable SCSI and storage subsystems. I/Os erroring out on our SAN array would be silently ignored (not retried) by Linux, creating quiet corruption that would require fail-over events. It didn’t affect our customers, but we were going nuts managing it. When we moved to OpenSolaris, we could finally trust that no errors in the logs literally meant no errors. In a lot of ways, Solaris benefits from 15 years of making mistakes in enterprise environments. Solaris anticipates and safely handles all of the crazy edge cases we’ve encountered with faulty equipment and software that’s gone haywire.

When it comes to storing data, you’ll pry OpenSolaris (and ZFS) out of our cold dead hands. We won’t deploy databases on anything else.

Liberation Day

While we moved to Solaris to get our derrières out of a sling, being on OpenSolaris has dramatically changed the way we use and design storage.

When you’ve got rock-solid iSCSI, NFS, and I/O multipathing implementations, as well as a file system (ZFS) that loves cheap disks…and none of it requires licensing…you can suddenly do anything. Need to handle 3600 non-cached IOPs for under $60K? No problem. Have an existing array but can’t justify $10K for snapshotting? No problem. How ‘bout serving line-rate iSCSI with commodity storage and CPUs? No problemo.

That’s the really amazing thing about OpenSolaris as a storage platform. It has all of the features of an expensive array and because it allows you to build reliable storage out of commodity components, you can build the storage architecture you need instead of being held hostage by the one you can afford. But features like ZFS don’t mandate that you change your architecture. You can pick and choose the pieces that fit your needs and make any existing architecture better too.

So how has OpenSolaris changed the way DigiTar does storage? For one thing, it’s enabled us to move almost entirely off of our fibre-channel SAN. We get better performance for less money by putting our database servers directly on Thumpers (Sun Fire X4500) and letting ZFS do its magic. Also, because its ZFS, we’re assured that every block can be verified for correctness via checksumming. By doing application-level fail-over between Thumpers, we get shared-nothing redundancy that has increased our uptime dramatically.

One of the things that always has bugged me about traditional clustering is its reliance on shared storage. That’s great if the application didn’t trash its data while crashing to the ground. But what if it did? To replicate the level of redundancy we get with two X4500s, we’d have to install two completely separate storage arrays…not to mention also buy two very large beefy servers to run the databases. By using X4500s, we get the same reliability and redundancy for about 85% less cost. That kind of savings means we can deploy 6.8x more storage for the same price footprint and do all sorts of cool things like:

  • Create multiple data warehouses for data mining spam and mal-ware trends.
  • Develop and deploy new service features whenever we want without considering storage costs.
  • Be cost competitive with competitors 10x our size.
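The 6.8x figure above follows from the ~85% savings by simple arithmetic (a sketch; the exact savings percentage is assumed from the post’s own “about 85% less cost”):

```python
# If the X4500 approach costs ~85% less, the same budget buys roughly
# 1 / (1 - 0.85) times as much storage.
savings = 0.85                      # assumed from the ~85% figure above
multiplier = 1.0 / (1.0 - savings)
print("%.1fx" % multiplier)         # the post quotes 6.8x, implying
                                    # savings slightly above 85%
```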

Whether you’re storing pictures of your kids, or archiving business critical e-mail (or anything in between), it seems to me that being able to store massive amounts of data reliably is as fundamental to computing today as breathing is to living. OpenSolaris allows us as a company to stop worrying about what it’s going to cost to store the results of our services, and focus on what’s important: developing the services and features themselves. When you stop focusing on the cost of “air”, you’re liberated to actually make life incredible.

I could continue blathering about how free snapshotting (both in terms of cost and performance hit) can allow you to re-organize your backup priorities, or a bunch of other very cool benefits of using OpenSolaris as your storage platform. But you should give it a shot yourself, because OpenSolaris’ benefits are as varied and unique as your environment. Once you give it a try, I think you’ll be hard pressed to go back to vendor lock-in…but I’m probably a bit biased now. I think you’ll also find a community around OpenSolaris that is by far the friendliest and most mature open source group of folks you’ve ever dealt with.

Back in the sandbox…ZFS flushing shenanigans revisited.


Posted by Jason | Posted in DigiTar, Solaris | Posted on 10-31-2007

Nearly a year has passed since our descent into the 9th ring of latency Hades, and I wanted to make an update post on ZFS' interaction with SAN arrays containing battery-backed cache. (For the full details, please check out this older post.)

For one thing, the instructions I previously gave to ignore cache flushes on the STK FLX200/300 series (and similar LSI OEM'd products) don't seem to work very well on the new-generation Sun StorageTek 6x00 arrays. Not to mention it's kind of nasty to have to modify your array's NVRAM settings to get good write latency.

But thanks to the brilliant engineers on the ZFS team, you no longer have to modify your array (since circa May '07 in the OpenSolaris tree). Simply add this line to your Solaris /etc/system file and ZFS will no longer issue SYNCHRONIZE CACHE commands to your array:

set zfs:zfs_nocacheflush=1 

I can confirm that this works REALLY well on both the older (FLX200/300) and newer (6140/6540) Sun/Engenio arrays! It seems to me that since the new way is a ZFS configuration directive, it should be portable/functional against any array in existence. Please note that setting this directive will disable cache flushing for ALL zpools on the system, which would be dangerous for any zpools using local disks. As always, caveat emptor. Your mileage may vary so please do let others know through the comments what works/doesn't work for you.

We've tested the zfs:zfs_nocacheflush directive successfully in Build 72 of OpenSolaris. It should also work in Solaris 10 Update 4, though we haven't tested that ourselves.


Nagios Remote Plug-In Executor (NRPE) under SMF


Posted by Jason | Posted in DigiTar, Solaris | Posted on 02-22-2007

NRPE (Nagios Remote Plug-In Executor) is a critical part of a lot of IT environments. In ours, it provides Nagios with all sorts of interesting health info local to the host that NRPE is running on. Whether it's RAM, open connections, hard drive space or something else, NRPE helps alert you to strange happenings that simply interrogating a TCP port remotely won't reveal. Hence, it's a deal breaker to moving to OpenSolaris if you can't have it. Luckily, the benevolent gents at Blastwave provide a pre-packaged NRPE that's ready to go (Run: pkg-get -i nrpe). Unfortunately, the Blastwave NRPE package leaves the last step of placing it under init.d or SMF control as an exercise for the admin. Well, if you're like me and would like SMF to be able to manage NRPE, then you're in luck. Below are a manifest and installation instructions that will start, stop and refresh an NRPE daemon (as installed from the Blastwave package).

It's important to note that this NRPE manifest expects your NRPE configuration to be in /opt/csw/etc/nrpe.cfg and that it contains the line pid_file=/var/run/nrpe.pid. If your config file is in a different location, just edit method/nagios-nrpe in the manifest package to match where your nrpe.cfg lives. If for some reason you don't want to specify pid_file in your nrpe.cfg, then the refresh method will not operate properly. The start and stop methods will operate whether you specify a pid_file value or not. Technically, just restarting the NRPE daemon will accomplish the same thing as the refresh method, which just sends a SIGHUP to the NRPE daemon. Again, caveat emptor. This manifest and the installation instructions below are provided with absolutely no warranty whatsoever, as specified in the BSD license in the manifest header.

To install the manifest please follow these steps:

  1. Download the NRPE manifest package here.
  2. Unpack the package on your system.
  3. Change to the root of the unpacked package.
  4. Run: cp ./manifest/nagios-nrpe.xml /var/svc/manifest/network/
  5. Run: cp ./method/nagios-nrpe /lib/svc/method/
  6. Run: svccfg import /var/svc/manifest/network/nagios-nrpe.xml
  7. You're done!

If everything went smoothly, running svcadm enable nrpe should start the daemon without incident. Similarly, svcadm disable nrpe should kill it. As mentioned before, there's also svcadm refresh nrpe, which will send a SIGHUP to NRPE. That will cause NRPE to re-read its nrpe.cfg file. An interesting note on refresh is that NRPE will reliably crash on a second SIGHUP. If you were using standard init.d, this could really hose you, as NRPE would randomly terminate and you wouldn't know. With SMF however, it doesn't matter! If NRPE dies when you send it a SIGHUP, SMF will loyally restart the daemon for you. Another reason to use SMF with all of your critical services, where an automatic restart won't risk data corruption! Hope y'all find this of use!


OpenSolaris & SMF adventures with PowerDNS


Posted by Jason | Posted in DigiTar, Solaris | Posted on 02-21-2007

One of the quiet parts that powers our logistics infrastructure is PowerDNS. It's a very powerful way to serve DNS records that you need the ability to update programmatically. Unfortunately, OpenSolaris (or Solaris 10 for that matter) isn't exactly considered kosher over in PowerDNS-land. Like a lot of OSS projects, PDNS hasn't kept up with the times and treats OpenSolaris like a red-headed step-child. If you like red-headed step-children like we do, then you're in for about 8 hours of greasing, coaxing and pleading to get it compiled right. Well, either that…or you can read on and get it up in about 30 minutes. :-) As a side-bonus, you'll also have PDNS managed by the coolest way ever invented to replace init.d: SMF.

Installing PDNS on OpenSolaris/Solaris 10 x64…

First thing you'll need to do is get Blastwave installed on your Solaris box. You could try to build the unholy abomination that is Boost on your own…but then you're a braver soul than I. As it's getting late, please excuse that the steps are brief and bulleted (feel free to harass me if you have questions):

  1. Make sure your path is set correctly. This path will do nicely: PATH=/usr/sbin:/usr/bin:/opt/csw/bin:/usr/sfw/bin:/usr/ccs/bin
  2. You'll need all the dev tools that come with a standard Solaris 10/OpenSolaris install…make, gcc, g++, ld etc. (You don't need Studio 11 installed. In fact, PDNS will really NOT like Studio 11 so please use gcc 3.3 or later).
  3. Run: pkg-get -i mysql5client
  4. Run: pkg-get -i mysql5devel
  5. Run: pkg-get -i boost_rt
  6. Run: pkg-get -i boost_devel
  7. Run: ln -s /opt/csw/mysql5/lib/mysql /usr/lib/mysql (This will make pathological configure scripts work a lot more smoothly.)
  8. Run: crle -l /lib:/usr/local/lib:/opt/csw/lib:/usr/lib:/opt/csw/mysql5/lib (This will help your compiled PDNS binaries find all the libraries they need at runtime. Run crle by itself first to see if there are any additional paths on your system that need to be present on this list. Caveat emptor..you run this command at your own risk as it can really bork your system if you don't know what you're doing.)
  9. Unpack the latest PDNS sources which you can get here (these instructions are known to work against 2.9.20).
  10. From within the PDNS source tree root run: ggrep -R "u_int8_t" *
  11. Manually change all the u_int8_t references that grep finds to uint8_t. If you don't do this, good 'ol crotchety PDNS will not compile. (I know I should provide a patch. I'll try and do that in the next couple of days if possible).
  12. From the PDNS source tree root run: ./configure --localstatedir=/var/run --with-pic --enable-shared --with-mysql-includes=/opt/csw/mysql5/include/ CXXFLAGS="-I/opt/csw/include -DSOLARIS" LDFLAGS="-L/opt/csw/lib -lsocket -lnsl"
  13. Run: make install (This will use the prefix /usr/local/ to install everything. The SMF manifest later will expect your pdns.conf to be in /usr/local/etc/ as a result. For sanity purposes on our systems, we also symlink pdns.conf into /etc.)
  14. Bingo! Presto! You have a working PDNS server…hopefully.
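If you'd rather not do the edit in steps 10-11 by hand, a throwaway script along these lines would do it. This is purely illustrative, not an official patch from the PDNS project:

```python
import os

# Stand-in for steps 10-11: rewrite every u_int8_t reference in the
# PDNS source tree to uint8_t.

def fix_tree(root):
    """Replace u_int8_t with uint8_t in every text file under root.

    Returns the number of files changed."""
    changed = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r") as f:
                    text = f.read()
            except (IOError, UnicodeDecodeError):
                continue  # skip unreadable or binary files
            if "u_int8_t" in text:
                with open(path, "w") as f:
                    f.write(text.replace("u_int8_t", "uint8_t"))
                changed += 1
    return changed

# e.g. fix_tree("pdns-2.9.20")
```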

Life support for PDNS…that is running PDNS under SMF…

Service Management Facility (SMF) is a truly wonderful thing. It completely replaces init.d and inet.d, gives you a standard way of managing both types of services, understands dependencies, restarts dead services…and washes your car while you sleep. ;-) The only hiccough is you've got to write a manifest to run PDNS under SMF…or use the one below. :-D Again…caveat emptor…this SMF manifest comes with absolutely no warranty at all. Read the BSD license header at the top of the manifest for a complete description of how much it's your own darn fault if this manifest totals your system. The DigiTar SMF manifest for PDNS has a couple of neat integration features:

  • If PDNS is already started when you run svcadm enable powerdns, it will error out such that SMF will put the PDNS service into a maintenance state, and will place an informative message in the PDNS SMF service log.
  • If you accidentally delete the pdns_server binary, SMF will not let you start the service and will place it into a maintenance state so you know something is wrong.
  • Running svcadm refresh powerdns will instruct PDNS to scan for new domains that have been added (pdns_control rediscover), as well as rescan for changes to records in existing domains (pdns_control reload).

OK, enough jabbering. Here's how you install the SMF manifest:

  1. Download the DigiTar PowerDNS SMF package here.
  2. Unpack the package on your system.
  3. Change to the root of the unpacked package.
  4. Run: cp ./manifest/dns-powerdns.xml /var/svc/manifest/site/
  5. Run: cp ./method/dns-powerdns /lib/svc/method/
  6. Run: svccfg import /var/svc/manifest/site/dns-powerdns.xml
  7. You're done!
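If you're installing on more than one box, the copy/import steps above are easy to script. Here's a minimal sketch in Python; the package-root path, the dry-run mode, and the helper names are my own additions for illustration, not part of the DigiTar package:

```python
import subprocess

# Steps 4-6 above, expressed as argv lists. pkg_root is wherever you
# unpacked the DigiTar PowerDNS SMF package (path is hypothetical).
def install_cmds(pkg_root):
    return [
        ["cp", f"{pkg_root}/manifest/dns-powerdns.xml", "/var/svc/manifest/site/"],
        ["cp", f"{pkg_root}/method/dns-powerdns", "/lib/svc/method/"],
        ["svccfg", "import", "/var/svc/manifest/site/dns-powerdns.xml"],
    ]

def install(pkg_root=".", dry_run=True):
    for cmd in install_cmds(pkg_root):
        if dry_run:
            print(" ".join(cmd))            # just show what would run
        else:
            subprocess.run(cmd, check=True)  # actually run it (Solaris only)
```

Run it with dry_run=True first to eyeball the commands before letting it loose as root.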

You should now be able to start your PDNS server with a simple svcadm enable powerdns. Stopping PDNS is similarly simple: svcadm disable powerdns. If you just want to see the state of the PDNS service, try svcs powerdns. That's it! You can sleep well at night knowing that if PDNS goes the way of all flesh, SMF will auto-restart it for you. Try a pkill pdns and watch the process IDs change. :-) If your PDNS service won't start, take a look at svcs -x to see why. Anywho…off to the sand man for me. If you have any questions, please feel free to contact me: williamsjj_@_digitar.com


Shenanigans with ZFS flushing and intelligent arrays…


Posted by Jason | Posted in DigiTar, Solaris | Posted on 12-14-2006


NOTE: ZFS has been enhanced to better address the situation described below by using ZFS configuration directives. This article is still accurate and provides decent background on the problem. However, an update has been posted with the newer, stronger, better way of resolving the problem: Back in the sandbox…ZFS flushing shenanigans revisited. :-)

Running operations for a start-up company is interesting…you learn a lot of things the hard way. Among the things you learn is how nuanced it is to deal with databases and storage under heavy traffic. Before I start my little diatribe, let me profusely thank Richard Elling and Roch Bourbonnais at Sun for saving our bacon. They are stellar engineers and we are more grateful than words can say for their help in resolving our ZFS roadblocks. They’re another example of the people who make Sun the company we love
to work with. Sun is blessed to have them.

About 6 months ago we moved to a new Sun StorageTek FC array, and used the opportunity to move to ZFS. We loved ZFS in development and frankly it kicks the pants off of UFS/SVM (Solaris Volume Manager). It is SO much easier to deal with for volume management, and the block checksums help you quickly eliminate your storage when tracking down corruption. That being said, ZFS really has some interesting quirks. One of them is that it is truly designed to deal with dumb-as-a-rock storage. If you have a box of SATA
disks with firmware flakier than Paris Hilton on a coke binge, then ZFS has truly been designed for you.

As a result, ZFS doesn’t trust that anything it writes to the ZFS Intent Log (ZIL) made it to your storage until it flushes the storage cache. After every write to the ZIL, ZFS issues a cache flush command (a SCSI SYNCHRONIZE CACHE) to instruct the storage to flush its write cache to the disk. In fact, ZFS won’t return on a write operation until the ZIL write and flush have completed. If the devices making up your zpool are individual hard drives…particularly
SATA ones…this is a great behavior. If the power goes kaput during a write, you don’t have the problem that the write made it to drive cache but never to the disk.
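The durability pattern ZFS is enforcing here is the same one any careful application uses: don't acknowledge a write until the data is flushed out of volatile caches. A minimal sketch in Python (file-level fsync standing in for the device cache flush; the helper name is mine):

```python
import os

# Write-then-flush: don't return until the OS says the data has been
# pushed toward stable storage. This mirrors what ZFS does after each
# ZIL write -- acknowledge only once the cache flush completes.
def durable_write(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # the "flush your cache" step; the write isn't
                       # considered done until this returns
    finally:
        os.close(fd)
```

The cost is exactly the article's point: the caller is stalled for the full flush latency on every single write.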

The major problem with this strategy only occurs when you try to layer ZFS over an intelligent storage array with a decent battery-backed cache. Enter our issues with ZFS on StorageTek/Engenio arrays.

Most of these arrays have sizable 2GB or greater caches with 72-hour batteries. The cache gives a huge performance boost, particularly on writes. Since cache is so much faster than disk, the array can tell the writer really quickly, “I’ve got it from here, you can go back to what you were doing”. Essentially, as fast as the data goes into the cache, the array can release the writer. Unlike the drive-based caches, the array cache has a 72-hour battery attached to it. So, if the array loses power and dies, you
don’t lose the writes in the cache. When the array boots back up, it flushes the writes in the cache to the disk. However, ZFS doesn’t know that it’s talking to an array, so it assumes that the cache isn’t trustworthy, and still issues a cache flush after every ZIL write. So every time a ZIL write occurs, the write goes into the array write cache, and then the array is immediately instructed to flush the cache contents to the disk. This means ZFS doesn’t get the benefit of a quick return from the
array, instead it has to wait the amount of time it takes to flush the write cache to the slow disks. If the array is under heavy load and the disks are thrashing away, your write return time (latency) can be awful with ZFS. Even when the array is idle, your latency with flushing is typically higher than the latency under heavy load with no flushing. With our array honoring ZFS ZIL flushes, we saw idle latencies of 54ms, and heavy load latencies of 224ms. This crushed the MySQL database running on top of it.
The InnoDB tables are particularly sensitive to this, because they issue 3x more writes than MyISAM tables. Also, since InnoDB tables use disk-based transactions, you can get write-loads that are orders of magnitude greater than MyISAM. If the disk-latency gets bad enough, InnoDB will completely lock-up the MySQL process with deadlocks.

So where does this leave a hapless start-up? Fortunately, you don’t have to give up ZFS. You have two options to rid yourself of the bane of existence known as write cache flushing: *** Please check out the update to this article here! There’s a better way now! ***

  • Disable the ZIL. The ZIL is the way ZFS maintains consistency until it can get the blocks written to their final place on the disk. That’s why the ZIL flushes the cache. If you don’t have the ZIL and a power outage occurs, your blocks may go poof in your server’s RAM…’cause they never made it to the disk, Kemosabe.
  • Tell your array to ignore ZFS’ flush commands. This is pretty safe, and massively beneficial.

The former option is really a no-go because it opens you up to losing data. The second option really works well and is darn safe. It ends up being safe because if ZFS is waiting for the write to complete, that means the write made it to the array, and if it’s in the array cache you’re golden. Whether famine or flood or a loose power cable comes, your array will get that write to the disk eventually. So it’s OK to have the array lie to ZFS and release ZFS almost immediately after the ZIL flush command executes.
On our StorageTek FLX210 this took the idle latencies to 1ms and the heavy load latencies to 9ms. 9 bloody milliseconds! Our InnoDB problems disappeared like sand down a rat hole.
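To put those latency numbers in perspective, here's some back-of-the-envelope arithmetic (my own, using the measurements above) on what synchronous write latency does to a single writer that must wait for each write before issuing the next, as ZFS does for ZIL writes:

```python
# Max synchronous writes/sec for ONE writer, given per-write latency.
def writes_per_sec(latency_ms):
    return 1000.0 / latency_ms

honoring = writes_per_sec(224)  # heavy load, array honoring flushes
ignoring = writes_per_sec(9)    # heavy load, array ignoring flushes
print(round(honoring, 1), round(ignoring, 1))  # 4.5 111.1
print(round(ignoring / honoring))              # 25
```

Roughly a 25x improvement per writer just from having the array release ZFS out of cache; no wonder the InnoDB deadlocks evaporated.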

So how do you get your array to ignore SCSI flush commands from ZFS? That differs depending on the array, but I can tell you how to do it on an Engenio array. If you’ve got any of the following arrays, it’s made by Engenio and this may work for you:

  • Sun StorageTek FlexLine 200/300 series
  • Sun StorEdge 6130
  • Sun StorageTek 6140/6540
  • IBM DS4x00
  • many SGI InfiniteStorage arrays (you’ll need to check to make sure your array is actually OEM’d from Engenio)
  • (if you have another Engenio OEM’d array manufacturer, just let me know and I’ll update the list.)

Before I give you the instructions, I must warn you that the following instructions come with no warranty whatsoever. These instructions are from me alone and have no blessing conferred by, warranty from, acceptability by, or connection with my employer DigiTar. Neither I nor my employer can be held responsible for the consequences resulting from the use of these instructions, and if you use them you absolve us both individually and collectively from any responsibility for the accuracy of these instructions
or the consequences of using these instructions. These instructions are potentially dangerous and may cause massive data loss. Caveat Emptor.

Okay, tush-covering mumbo jumbo over. On a StorageTek FLX210 with SANtricity 9.15, the following command script will instruct the array to ignore flush commands issued by Solaris hosts:

//Show Solaris ICS option
show controller[a] HostNVSRAMbyte[0x2, 0x21];
show controller[b] HostNVSRAMbyte[0x2, 0x21];

//Enable ICS
set controller[a] HostNVSRAMbyte[0x2, 0x21]=0x01;
set controller[b] HostNVSRAMbyte[0x2, 0x21]=0x01;

//Make changes effective by rebooting controllers
show "Rebooting A controller.";
reset controller[a];
show "Rebooting B controller.";
reset controller[b];

If you notice carefully, I said the script will cause the array to ignore flush commands from Solaris hosts. So all Solaris hosts attached to the array will have their flush commands ignored. You can’t turn this behavior on and off on a per host basis. To run this script, cut and paste the script into the script editor of the “Enterprise Management Window” of the SANtricity management GUI. That’s it! A key note here is that you should definitely have your server shut down, or at minimum
your ZFS zpool exported before you run this. Otherwise, when your array reboots, ZFS will kernel panic the server. In our experience, this will happen even if you only reboot one controller at a time, waiting for one controller to come back online before rebooting the other. For whatever reason, MPXIO, which normally works beautifully to keep a LUN available when losing a controller, fails miserably in this situation. It’s probably the array’s fault, but whatever the issue, that’s the reality. Plan for downtime
when you do this.

In the words of the French, c’est tout…that’s all folks. This cleared up all of the ZFS latency problems we’ve been having. Hopefully, this experience will be helpful to other people. This behavior isn’t well documented outside of the ZFS mailing lists, which is why we’re documenting it here for the world to index and find. More importantly, public documentation on Engenio-based arrays is downright abysmal. If you search hard enough, you’ll
find an IBM Redbooks paper that tells you the array can ignore flush commands, but happy hunting if you actually want to know how to enable the behavior.

Just a quick note before closing…ZFS rocks. It’s that simple. So much arcane black magic disappears under the skirt of ZFS, but as always you can’t make it all go away. If anyone has instructions on how to configure non-Engenio arrays to ignore flush commands, please let me know. Stay tuned for a diatribe..er..discussion on the kernel panic behavior of ZFS. G’night y’all.

SunFish Chum…er…Odds and ends.


Posted by Jason | Posted in DigiTar, Technology | Posted on 08-15-2006

Currently, we're putting the N1400Vs into production and there were two odds and ends that came to mind that I wanted to mention:

  1. No client/server settings per port! Hooray! The Alteons (even the 2424s) inherited from the Alteon AD4s and 184s the need to enable client and/or server processing per port. For those who are not familiar, server load balancing basically can be reduced into two operations:
    • Client processing: When a packet comes in from a web browser to the web switch, its header has a TO field that's the IP address of the web switch, and a FROM field that's the IP address of the web browser. Once the web switch gets the packet and decides which back-end server to send it to, it has to replace the packet's TO with the IP address of the back-end server. If the web switch didn't change the TO and simply sent the packet on, the server would ignore the packet. Sort of like receiving a letter addressed to somebody you don't know. So in a nutshell, client processing is simply replacing the web switch's IP address with the selected back-end server's IP address in packets from the client.
    • Server processing: When the back-end server decides to send a response packet back to the client, the reverse of client processing has to occur. If the web switch were to simply send the packet from the server back to the client without server processing, the client would ignore the packet. Why? Well, the client sent the packet to the IP address of the web switch and expects a reply from that IP address, not the server's IP. It's sort of like sending a letter to Aunt Gertie, but getting the reply from Aunt Gertie's nurse Josie. You don't know who Josie is, so you toss the reply thinking it's junk mail. Server processing fixes this by rewriting the FROM in the server's reply to the IP address of the web switch.
    • An Alteon is a bit unusual in that instead of one massive SLB processor it has 8…one per port (this is fixed in the 2424s, but they imitate the older behavior for backward compatibility). So if you have one port connected to your servers and a second port connected to the Internet, you have to enable Client processing on the Internet-facing port and Server processing on the server-facing port. The reason is that the 8 individual processors aren't bulky enough to do BOTH the client and server processing. As
      a result, the operation gets split between ports in a way you specify. So you have to remember which kind of processing is which, and set it appropriately on the right ports. This is a MAJOR pain in the butt. If you get client and server processing confused and set a port to the wrong one, load balancing just isn't gonna work for you today.
    • The SunFish don't have this limitation. They just make it work. Concentrate on creating your VIPs and RIPs and the rest is taken care of for you. It's really a spectacular change for us! It was so easy that it wasn't until I was driving home that it struck me I hadn't had to fool with client or server processing at all.
  2. XML-over-HTTP! As I was complaining about the lack of a heads-up-display on the SunFish, I ran into a very cool feature! On most of the pages that list settings or statistics in the SunFish WebUI, there's a little button labeled “XML”. If you click on it, you get the settings or stats you were looking at…but XML encoded! This means you can write your own scripts to consume the status of the SunFish! All your program needs to be capable of is downloading pages via HTTP and consuming XML.
    The upshot is that this feature enables us to write our own stopgap heads-up-display. :-) It's much simpler than messing around with SNMP calls and the like, particularly given our familiarity with consuming web services. This is a terrific feature! Props to the SunFish team for providing an XML interface to the unit. Simply amazing.
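For the curious, the client/server processing split in item 1 boils down to two address rewrites. Here's a toy sketch; the IP addresses and dictionary-based "packets" are invented purely for illustration:

```python
VIP = "203.0.113.10"   # the web switch's virtual IP (documentation address)
RIP = "10.0.0.5"       # a back-end "real" server IP (invented for the example)

def client_processing(packet):
    # Packet arriving from a browser: rewrite TO from the VIP to the
    # chosen back-end server, or the server would ignore it.
    return {**packet, "to": RIP}

def server_processing(packet):
    # Reply leaving the back-end server: rewrite FROM back to the VIP,
    # or the browser would ignore the reply.
    return {**packet, "from": VIP}

request = {"from": "198.51.100.7", "to": VIP}
reply = {"from": RIP, "to": "198.51.100.7"}
print(client_processing(request))  # {'from': '198.51.100.7', 'to': '10.0.0.5'}
print(server_processing(reply))    # {'from': '203.0.113.10', 'to': '198.51.100.7'}
```

On an Alteon you had to assign each rewrite to the right port yourself; the SunFish just does both wherever they're needed.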
