Riak vs. CouchDB for storing 100,000+ coupons…


Posted by Jason | Posted in DigiTar, NoSQL, Software Development | Posted on 03-07-2011


We’ve been using both CouchDB and Riak for some time now on a number of our APIs and user-facing services. When CouchDB wins out over Riak it’s usually for two reasons:

  • secondary indexes
  • multi-data center replication (would be great if Basho would open source this)

Both Riak and CouchDB excel at storing records under a primary key, but sometimes you need to query the data along a different axis. For example, all of our records are stored in JSON dictionaries, and at times we want to know all the records that match a particular field in the dictionaries. That’s the situation we’re in for a new service we’ll be standing up soon. We wanted to generate coupon codes that customers could redeem for service as an alternative to providing a credit card. The coupon codes can be sold through affiliates, so one of the axes we’ll need to query, in addition to the coupon code, is: what are all the coupon codes belonging to affiliate ID X?

One approach (without using secondary indexes) would be to try and encode both the code and the affiliate ID in the record’s key (e.g. prefix_code_affiliate_id). The main issue with that approach is that some of our access patterns don’t have access to the affiliate ID (e.g. a user signing up for service), they only know the coupon code. So we need fast lookup of the record based on the code alone. That pretty much eliminated map/reducing for the coupon code, and firmly established the code alone as the right choice for the data key. The perfect solution would be a secondary index on the “affiliate_id” of the JSON dictionary…in other words, a map of affiliate ID to the data. Normally, this is where we’d turn to CouchDB’s views and call it a day. But we’re planning on having millions of coupons in the system with thousands of parallel accesses that need to be evenly loaded across the datastore…not a scenario where CouchDB excels. Riak would be the perfect choice…except there’s no native secondary indexes.

Doing indexes in Riak

There are a couple of ways you can do your own indexes in Riak. The simplest approach is to just create an index key of the form idx_<field_name>_<field_value> and shove in a JSON list of the keys containing matching records. What you’ll run into very quickly is multiple clients trying to update that index key with new records, and overwriting each other. Since Riak keeps multiple versions in the event of a conflict, you can code your clients to auto-merge conflicted versions into one master list and re-post the index. But…that puts a lot of maintenance logic in your clients, and in the event one of those updates is deleting a key from the index, the merge process can put the deleted key back into the index.
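For illustration, here’s a rough sketch (plain Python, with made-up key and coupon names) of the client-side merge that approach forces on you, and how a deleted entry gets resurrected:

# Hedged sketch of the "one big index key" approach: idx_affiliate_id_42 holds a
# JSON list of matching data keys, and every writer has to read/merge/rewrite it.
import json

def merge_index_siblings(siblings):
    # Union the conflicting versions Riak hands back after concurrent writes.
    # Problem: if one sibling reflects a delete, the union resurrects the key.
    merged = set()
    for body in siblings:
        merged.update(json.loads(body))
    return json.dumps(sorted(merged))

# Two clients updated idx_affiliate_id_42 at the same time:
sibling_a = json.dumps(["coupon_AAA", "coupon_BBB"])
sibling_b = json.dumps(["coupon_AAA", "coupon_CCC"])  # this writer had deleted coupon_BBB
print merge_index_siblings([sibling_a, sibling_b])    # ...and coupon_BBB is back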

Since we don’t want index management to have that many moving parts we came up with a different approach:

  • For every field being indexed on a particular record, create a new separate index key in a dedicated index bucket.
  • Store the indexed field’s name in the bucket name for the index key and store the indexed value and data key’s name in the name of the index key.
  • MapReduce to get a list of matching index keys for any particular question by iterating over the keys of an index’s bucket and splitting the index key name apart to analyze the value.

The format of an index key becomes (we use key prefixes to namespace different data key types in the same bucket):

Bucket Name: idx=<field_name>=<data_key_prefix>
Key Name: <data_key_name>/<field_value>
Key Value: empty
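To make that concrete, here’s a rough sketch of storing a coupon and its two index keys through Riak’s HTTP interface (0.14-era /riak/ URLs; the bucket, field and key names are purely illustrative):

# Hedged sketch: store a coupon, then one index key per indexed field.
import httplib, json

RIAK = ("localhost", 8098)

def put_key(bucket, key, body, content_type):
    conn = httplib.HTTPConnection(*RIAK)
    conn.request("PUT", "/riak/%s/%s" % (bucket, key), body,
                 {"Content-Type": content_type})
    resp = conn.getresponse()
    resp.read()
    conn.close()
    return resp.status

coupon_key = "coupon_H7G2K9"
coupon = {"affiliate_id": "aff_1001", "redeemed_cnt": 12}

# 1. The data key itself (primary lookups are always by coupon code).
put_key("coupons", coupon_key, json.dumps(coupon), "application/json")

# 2. One index key per indexed field: the bucket carries the field name, the key
#    name carries "<data_key_name>/<field_value>", and the value stays empty.
put_key("idx=affiliate_id=coupon",
        "%s/%s" % (coupon_key, coupon["affiliate_id"]), "", "text/plain")
put_key("idx=redeemed_cnt=coupon",
        "%s/%s" % (coupon_key, coupon["redeemed_cnt"]), "", "text/plain")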

The immediate advantage of having an index key for every indexed field of a data key is the reduced chance of write conflicts when maintaining the index (you pretty much eliminate the chance a deleted index key is going to get resurrected). Asking the question “How many coupons have a redeemed count < 50?” becomes a simple MapReduce job that iterates over the idx=redeemed_cnt=coupon index bucket to find index keys where the field_value is < 50.

You might have noticed that we don’t store any data in the value of the index key… That’s on purpose, because it allows us to leverage a new feature of Riak 0.14…key filters for MapReduce jobs.

Key filters

The index system described so far would work fine on any key/value store with support for MapReduce. However, there’s a query-performance problem: every key in the bucket has to be analyzed by the Javascript map and reduce phases to determine whether it matches the question (i.e. is this indexed value < 50?). It takes Riak more time to run a user-supplied Javascript function to check each key than it would take for Riak to analyze the index key name itself.

Luckily the smart folks at Basho gave us a new tool to do just that with key filters. By encoding the indexed value in the key name we can tell Riak via a key filter to:

  1. Tokenize the index key name using “/” as the separator.
  2. Look at the second token after the split (i.e. the indexed value).
  3. Treat that token as an integer.
  4. Only give the index key to the MapReduce job if the integer value is < 50.

In fact, with key filters we actually don’t have to write our own MapReduce phases to answer this question anymore. All we have to do is construct the key filter, and tell Riak to use the “identity reduce” reduce phase that’s built in (skip the map phase entirely). What we’ll get back is a list of index keys whose indexed value is < 50. We can then split those index key names in our client to get the key names of the data keys they map to.
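Here’s roughly what that job looks like when POSTed to Riak 0.14’s /mapred endpoint. This is a sketch: the bucket name follows the illustrative example above, and it assumes the built-in Erlang reduce_identity phase returns the surviving bucket/key pairs:

# Hedged sketch of the key-filter MapReduce job described above.
import httplib, json

job = {
    "inputs": {
        "bucket": "idx=redeemed_cnt=coupon",
        "key_filters": [
            ["tokenize", "/", 2],   # split the index key name on "/" and keep the 2nd token
            ["string_to_int"],      # treat the indexed value as an integer
            ["less_than", 50]       # only pass index keys whose value is < 50
        ]
    },
    "query": [
        {"reduce": {"language": "erlang",
                    "module": "riak_kv_mapreduce",
                    "function": "reduce_identity",
                    "keep": True}}
    ]
}

conn = httplib.HTTPConnection("localhost", 8098)
conn.request("POST", "/mapred", json.dumps(job),
             {"Content-Type": "application/json"})
results = json.loads(conn.getresponse().read())
conn.close()

# Each result is a [bucket, key] pair; the data key name is the part of the
# index key name before the "/".
print [key.split("/", 1)[0] for bucket, key in results]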

Rubber meets the road…

So what does the performance look like with all of this? We wrote a couple of tests using Twisted Python to benchmark loading 100,000 coupons into CouchDB and Riak and then asking both how many of those coupons had a redeemed count < 50. Here’s a legend for what the different test titles mean:

  • CouchDB 1.0.2 (cold view): The amount of time it takes the view (index) to answer the question the first time the view is queried. This number is important because CouchDB doesn’t build the view until you query it the first time. From then on it just incrementally updates the view with changed values.
  • CouchDB 1.0.2 (computed/warm view): Amount of time it takes the view to answer the question on subsequent queries after the view has been computed initially.
  • Riak 0.14.1 (Raw MapReduce) 1-node: No indexes used as described above. A brute force MapReduce job that iterates over the data keys and examines the redeemed_count field in the JSON. 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Raw MapReduce) 1-node: Using index keys as described above, but using Javascript MapReduce phases on the index bucket to produce the matching key list…no key filters used. 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Key Filter MR) 1-node: Using index keys as described, but with key filters to reduce the input and a simple Javascript map phase to reformat the output (this would be a JS reduce phase except Riak has a bug right now with MapReduce jobs that have only a JS reduce phase). 1-node Riak “cluster”.
  • Riak 0.14.1 (Indexed w/ Raw MapReduce) 4-node: Same as “Indexed w/ Raw MapReduce” above except done on a 4-node Riak cluster.
  • Riak 0.14.1 (Indexed w/ Key Filter MR) 4-node: Same as “Indexed w/ Key Filter MR” above except done on a 4-node Riak cluster.

Before I show the numbers, you’d probably like to know what the test setup looked like. Each node was a SoftLayer Cloudlayer server with these specs (if you haven’t tried them, SoftLayer is really a phenomenal provider):

  • 1x 2.0GHz Xeon CPU
  • 2GB RAM
  • 25GB HDD
  • Gigabit NICs
  • Ubuntu 10.04.1 64-bit
  • Dallas 05 Datacenter
  • CouchDB 1.0.2 was built from source.
  • Riak 0.14.1 was installed from the .deb available from Basho.
  • Before each type of test the servers were rebooted to clear the filesystem cache.
  • Tests were run from a 5th node not running Riak or CouchDB. For the 4-node tests, the 5th node ran HAProxy 1.4.8 to round-robin client connections amongst the Riak nodes.

So without further ado…the numbers:

                                                 Generate Keys (secs)   Show Keys w/ redeemed_count < 50 (secs)
CouchDB 1.0.2 (cold view)                        495                    74
CouchDB 1.0.2 (computed/warm view)               495                    11
Riak 0.14.1 (Raw MapReduce) 1-node               358                    82
Riak 0.14.1 (Indexed w/ Raw MapReduce) 1-node    692                    65
Riak 0.14.1 (Indexed w/ Key Filter MR) 1-node    692                    56
Riak 0.14.1 (Indexed w/ Raw MapReduce) 4-node    1025                   40
Riak 0.14.1 (Indexed w/ Key Filter MR) 4-node    1025                   34

 

Or if you’re more visual like me:

[Chart: the same generate/query timings as the table above]

(If you’d like to run the tests yourself, we’ve put the code up: riak_perf_test.py, couchdb_perf_test.py).

 

Analyzing the outcome

One thing that’s very clear is how fast computed views can be in CouchDB (11 seconds flat is nothing to shake a stick at). However, what did we learn from the Riak numbers?

  • Indexed insertion is 91% slower than storing just the key data.
  • MapReduce with indexes is 20% faster than MR on the data keys alone.
  • MapReduce with indexes and key filters is 32% faster than MR on the data keys alone.
  • Adding Riak nodes substantially reduces query time. Adding 3 more nodes speeds up queries by 40%.

It’s not surprising that insertion time doubles with indexes, since you’ve just doubled the number of keys you’re inserting. However, the gains you get can be dramatic. Once the bug with Javascript reduce phases is ironed out, I expect the performance on this test to improve even further (since it will run a single reduce phase instead of running the map code multiple times).

What’s a little puzzling is why insertion of keys on a 4-node cluster is 40% slower than on a 1-node cluster. I had expected insertion to be the same speed or 25% faster. The reason I’d expected this is that Riak was set to use a write n-value of 3…meaning for every key inserted, 3 copies were stored throughout the cluster. Accounting for coordination latency across those 3 nodes, I’d expect almost the same insertion speed as a 1-node Riak instance. With an extra node in the cluster, I’d expect slightly faster performance, since only 3 nodes out of the cluster are engaged in any given write.

Regardless, the query performance proves Riak is a good choice for our use case. While 34 seconds to answer the question is slower than the 11 seconds it took CouchDB, it’s clear that as we scale the cluster our query performance will scale with the size of our dataset. Providing we can find a solution for the 50% slower insertion speed, Riak will definitely be our datastore of choice for this project. Once again, Riak is incredibly impressive at how well it handles large data sets and how adept its simple toolset is at answering complex questions.

 

Where we go from here…and nagging questions

The indexed key approach has worked so well for us that we’re currently writing a library to layer on top of txRiak to transparently handle writing/updating indexes. I’ll put up another blog entry on that once it’s posted to Github (there are a few issues we have to handle like escaping the separators, and we intend to use Riak’s Links to provide an alternate connection between index and data keys). Even more exciting is the news that Basho is currently working on adding native secondary index support to Riak. No news on how that will take shape, but I expect it will blow the performance of our homegrown indexes out of the water. I think built-in indexes are a cleaner, more maintainable approach. Maintaining indexes in the client puts a lot of pressure on the clients not to screw up the index accidentally…especially if you’ve got code written in multiple languages accessing that data.

The only real nagging question right now is an issue we saw when we attempted to add a 5th node to the Riak cluster. I had originally intended to do an analysis of how much query performance improved with each node added. However, when the 5th node was added to the cluster it took close to 1 hour for Riak to fully redistribute the keys…and even then 3 of the 5 nodes showed that they were still waiting to transfer one partition each to another node. When we attempted to run the MapReduce index query against the newly expanded cluster, we received MapReduce errors that Riak couldn’t find random index keys as it attempted to pass these “missing” keys into the map phase. I suspect the culprit may be some “failed node” testing we did before adding the 5th node.

Overall, the takeaway for me is that Riak is a phenomenally flexible data store, and that just because it’s missing a feature doesn’t mean you should shun it for that workload. More often than not, a little thought and chaining together of Riak’s other very powerful tools will give you the same result. To me, Riak vs CouchDB (or vs. SQL) is like a RISC chip vs. a CISC chip. It may not have one complex instruction you need, but you can build that instruction out of much simpler ones that you can run and scale at twice the speed.

Cloud-scale DBs in the cloud…just a quickie


Posted by Jason | Posted in DigiTar, Software Development | Posted on 03-17-2010


Just a quick set of thoughts…do cloud-scale DBs save money because they’re based on commodity/cheap servers? Tonight I did some rough back-of-the-pad calculations, and was kind of surprised…

Let’s assume we’ve got an 11TB working set of data. How could we store it redundantly? (There’s a quick capacity check after the three options below.)

(cloud servers in these examples are dedicated servers at a cloud provider)

Option 1: Two beefy storage servers running MySQL in a master/slave config

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 16GB
  • HDDs: 48x 250 GB SATA
    • Lose 2 for mirrored boot, and 2 for RAID-6 parity
  • Cost:
    • Buy Your Own Hardware (Sun X4500): $50,000 for the pair
    • Host It in the Cloud (SoftLayer): $4,700/month for the pair

Option 2: 28 commodity servers (2 replica copies for each piece of data) running HBase or Cassandra

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 4GB
  • HDDs: 4x 250 GB SATA
    • Lose 1 for RAID-5 parity (we’ll mingle boot data and data data on the same drive pool)
  • Cost:
    • Buy Your Own Hardware (Dell R410): $43,300 for set of 28
    • Host It in the Cloud (SoftLayer): $12,000/month for the set of 28

Option 3: 42 commodity servers (3 replica copies for each piece of data) running HBase or Cassandra

  • CPU: 4-cores of your favorite CPU vendor
  • RAM: 4GB
  • HDDs: 4x 250 GB SATA
    • Lose 1 for RAID-5 parity (we’ll mingle boot data and data data on the same drive pool)
  • Cost:
    • Buy Your Own Hardware (Dell R410): $64,900 for set of 42
    • Host It in the Cloud (SoftLayer): $18,000/month for the set of 42
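As a sanity check, here’s the usable capacity behind each option in a few lines of Python (drive counts from the bullets above; “replicas” is how many copies of the data each setup keeps):

def usable_tb(servers, drives_per_server, drive_gb, lost_drives, replicas):
    usable_per_server = (drives_per_server - lost_drives) * drive_gb
    return servers * usable_per_server / float(replicas) / 1000.0

print "Option 1: %.1f TB" % usable_tb(2, 48, 250, 4, 2)    # master/slave pair = 2 copies
print "Option 2: %.1f TB" % usable_tb(28, 4, 250, 1, 2)
print "Option 3: %.1f TB" % usable_tb(42, 4, 250, 1, 3)
# All three land at roughly the 11TB working set.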

Now, the issue here that surprised me isn’t the raw cost differential between stuffing your own hardware in your colo and using a cloud provider. And I’m not picking on SoftLayer…Rackspace and Voxel both work out to the same cost scaling as SoftLayer (and in their cases, worse).

What surprised me:

  • When you buy your own hardware, “cloud-scale” databases do cost you less (~$7K) than buying beefy storage servers and running MySQL for the same data set.
  • However, when you are at a cloud provider, using cloud-scale databases on “cheap” hardware costs you 3x more than using beefy storage cloud servers running MySQL.
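The same back-of-the-pad arithmetic, using the price points listed above:

mysql_hw, mysql_cloud = 50000, 4700    # beefy MySQL pair: buy price vs. $/month
c2_hw, c2_cloud = 43300, 12000         # 28 commodity servers, 2 replicas
c3_hw, c3_cloud = 64900, 18000         # 42 commodity servers, 3 replicas

print "Own hardware, 2-replica cluster vs. MySQL pair: $%d cheaper" % (mysql_hw - c2_hw)
print "Cloud, 2-replica cluster: %.1fx the MySQL pair's monthly bill" % (c2_cloud / float(mysql_cloud))
print "Cloud, 3-replica cluster: %.1fx the MySQL pair's monthly bill" % (c3_cloud / float(mysql_cloud))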

As I said, I’m not comparing the cost of running Option 1 on your own hardware vs. Option 1 at a cloud provider. Yes those costs are more at the cloud provider, but it’s to be expected (they’re bundling in bandwidth, colo, power, and most importantly people to manage the hardware and network).

What’s stunning is that beefy servers at a cloud provider are much more cost efficient. Beefy cloud servers cost you roughly 1/15 of the cost of the hardware every month, whereas “cheap” commodity cloud servers cost you roughly 1/3 of the cost of the hardware every month. That’s a much higher markup on the cheaper volume servers.

Please comment and correct me if I’m wrong in my analysis…I would actually like to be.

Rabbits and warrens.


Posted by Jason | Posted in Software Development, Technology | Posted on 01-13-2009



 

If you like Rabbits and Warrens, check out RabbitMQ in Action in the sidebar.

 

The goal was simple enough: decouple a particular type of analysis out-of-band from mainstream e-mail processing. We started down the MySQL road…put the things to be digested into a table…consume them in another daemon…bada bing bada boom. But pretty soon, complex ugliness crept into the design phase… You want to have multiple daemons servicing the queue?…no problem we’ll just hard code node numbers…what? you want dynamic load re-assignment when daemons join and die?

You get the idea…what was supposed to be simple (decouple something) was spinning its own Gordian knot. It seemed like a good time to see if every problem was looking like a nail (table), because all we had were hammers (MySQL).

A short search later, and we entered the world of message queueing. No, no…we know obviously what a message queue is. Heck, we do e-mail for a living. We’ve implemented all sorts of specialized, high-speed, in-memory queues for e-mail processing. What we weren’t aware of was the family of off-the-shelf, generalized, message queueing (MQ) servers…a language-agnostic, no-assembly required way to wire routing between applications over a network. A message queue we didn’t have to write ourselves? Hold your tongue.

Open up your queue…

Cutting to the chase, over the last 4 years there has been no shortage of open-source message queueing servers written. Most of them are one-offs by folks like LiveJournal to scratch a particular itch. Yeah, they don’t really care what kind of messages they carry, but their design parameters are usually creator-specific (and message persistence after a crash usually isn’t one of them). However, there are three in particular that are designed to be highly flexible message queues for their own sake:

Apache ActiveMQ gets the most press, but it appears to have some issues not losing messages. Next.

ZeroMQ and RabbitMQ both support an open messaging protocol called AMQP. The advantage to AMQP is that it’s designed to be a highly-robust and open alternative to the two commercial message queues out there (IBM and Tibco). Muy bueno. However, ZeroMQ doesn’t support message persistence across crashes/reboots. No muy bueno. That leaves us with RabbitMQ. (That being said, if you don’t need persistence ZeroMQ is pretty darn interesting…incredibly low latency and flexible topologies).

That leaves us with the carrot muncher…

RabbitMQ pretty much sold me the minute I read “written in Erlang”. Erlang is a highly parallel programming language developed over at Ericsson for running telco switches…yeah, the kind with six bazillion 9s of uptime. In Erlang, it’s supposedly trivial to spin off processes and then communicate between them using message passing. Seems like the ideal underpinning for a message queue, no?

Also, RabbitMQ supports persistence. Yes Virginia, if your RabbitMQ dies, your messages don’t have to die an unwitting death…they can be reborn in your queues on reboot. Oh…and as is always desired @ DigiTar, it plays nicely with Python. All that being said, RabbitMQ’s documentation is well…horrible. Lemme rephrase: if you already understand AMQP, the docs are fine. But how many folks know AMQP? It’d be like MySQL docs assuming you knew some form of SQL…er…nevermind.

So, without further ado…here is a reduction of a week’s worth of reading up on AMQP and how it works in RabbitMQ…and how to play with it in Python:

Playing telephone

There are four building blocks you really care about in AMQP: virtual hosts, exchanges, queues and bindings. A virtual host holds a bundle of exchanges, queues and bindings. Why would you want multiple virtual hosts? Easy. A username in RabbitMQ grants you access to a virtual host…in its entirety. So the only way to keep group A from accessing group B’s exchanges/queues/bindings/etc. is to create a virtual host for A and one for B. Every RabbitMQ server has a default virtual host named “/”. If that’s all you need, you’re ready to roll.

Exchanges, Queues and bindings…oh my!

Here’s where my railcar went off the tracks initially. How do all the parts thread together?

Queues are where your “messages” end up. They’re message buckets…and your messages sit there until a client (a.k.a. consumer) connects to the queue and siphons them off. However, you can configure a queue so that if there isn’t a consumer ready to accept the message when it hits the queue, the message goes poof. But we digress…

The important thing to remember is that queues are created programmatically by your consumers (not via a configuration file or command line program). That’s OK, because if a consumer app tries to “create” a queue that already exists, RabbitMQ pats it on the head, smiles gently and NOOPs the request. So you can keep your MQ configuration in-line with your app code…what a concept.

OK, so you’ve created and attached to your queue, and your consumer app is drumming its fingers waiting for a message…and drumming…and drumming…but alas no message. What happened? Well you gotta pump a message in first! But to do that you’ve got to have an exchange…

Exchanges are routers with routing tables. That’s it. End stop. Every message has what’s known as a “routing key”, which is simply a string. The exchange has a list of bindings (routes) that say, for example, messages with routing key “X” go to queue “timbuktu”. But we get slightly ahead of ourselves.

Your consumer application should create your exchanges (plural). Wait? You mean you can have more than one exchange? Yes, you can, but why? Easy. Each exchange operates in its own userland process, so adding exchanges, adds processes allowing you to scale message routing capacity with the number of cores in your server. As an example, on an 8-core server you could create 5 exchanges to maximize your utilization, leaving 3 cores open for handling the queues, etc.. Similarly, in a RabbitMQ cluster, you can use the same principle to spread exchanges across the cluster members to add even more throughput.

OK, so you’ve created an exchange…but it doesn’t know what queues the messages go in. You need “routing rules” (bindings). A binding essentially says things like this: put messages that show up in exchange “desert” and have routing key “ali-baba” into the queue “hideout”. In other words, a binding is a routing rule that links an exchange to a queue based on a routing key. It is possible for two binding rules to use the same routing key. For example, maybe messages with the routing key “audit” need to go both to the “log-forever” queue and the “alert-the-big-dude” queue. To accomplish this, just create two binding rules (each one linking the exchange to one of the queues) that both trigger on routing key “audit”. In this case, the exchange duplicates the message and sends it to both queues. Exchanges are just routing tables containing bindings.

Now for the curveball: there are multiple types of exchanges. They all do routing, but they accept different styles of binding “rules”. Why not just create one type of exchange for all styles of rules? Because each rule style has a different CPU cost for analyzing if a message matches the rule. For example, a “topic” exchange tries to match a message’s routing key against a pattern like “dogs.*”. Matching that wildcard on the end takes more CPU than simply seeing if the routing key is “dogs” or not (e.g. a “direct” exchange). If you don’t need the extra flexibility of a “topic” exchange, you can get more messages/sec routed if you choose the “direct” exchange type. So what are the types and how do they route?

Fanout Exchange – No routing keys involved. You simply bind a queue to the exchange. Any message that is sent to the exchange is sent to all queues bound to that exchange. Think of it like a subnet broadcast. Any host on the subnet gets a copy of the packet. Fanout exchanges route messages the fastest.

Direct Exchange – Routing keys are involved. A queue binds to the exchange to request messages that match a particular routing key exactly. This is a straight match. If a queue binds to the exchange requesting messages with routing key “dog”, only messages labelled “dog” get sent to that queue (not “dog.puppy”, not “dog.guard”…only “dog”).

Topic Exchange – Matches routing keys against a pattern. Instead of binding with a particular routing key, the queue binds with a pattern string. The symbol # matches zero or more words, and the symbol * matches any single word (no more, no less). So “audit.#” would match “audit.irs.corporate”, but “audit.*” would only match “audit.irs”. Our friends at RedHat have put together a great image to express how topic exchanges work:

AMQP Stack Diagram

Source: RabbitMQ in Action (by me and a very cool Uruguayan dude…Mr. Alvaro)

Persistent little bugger…

You spend all that time creating your queues, exchanges and bindings, and then BANG!…the server fries faster than the griddle at McDonald’s. All your queues, exchanges and bindings are there right? Oh geez…what about the messages in the queues you hadn’t serviced yet?

Relax, providing you created everything with the default arguments, it’s all gone…poof…whoosh…nada…nil. That’s right, RabbitMQ rebooted as empty as a baby’s noggin. You gotta redo everything kemosabe. How do you keep this from happening in the future?

On your queues and your exchanges there’s a creation-time flag called “durable”. There’s only one thing durable means in AMQP-land…the queue or exchange marked durable will be re-created automatically on reboot. It does not mean the messages in the queues will survive the reboot. They won’t. So how do we make not only our config but messages persist through a reboot?

Well the first question is, do you really want your messages to persist? For a message to last through a reboot, it has to be written to disk, and even a simple checkpoint to disk takes time. If you value message routing speed more than the contents of the message, don’t make your messages persistent. That being said, for our particular needs @ DigiTar, persistence is important.

When you publish your message to an exchange, there’s a flag called “Delivery Mode”. Depending on the AMQP library you’re using there will be different ways of setting it (we’ll cover the Python library later). But the long and the short of it is you want the “Delivery Mode” set to the value 2, which means “persistent”. “Delivery Mode” usually (depending on your AMQP library) defaults to a value of 1, which means “non-persistent”. So the steps for persistent messaging are:

  1. Mark the exchange “durable”.
  2. Mark the queue “durable”.
  3. Set the message’s “delivery mode” to a value of 2

That’s it. Not really rocket science, but enough moving parts to make a mistake and send little Sally’s dental records into cyber-Nirvana.

There may be one thing nagging you though…what about the binding? We didn’t mark the binding “durable” when we created it. It’s alright. If you bind a durable queue to a durable exchange, RabbitMQ will automatically preserve the binding. Similarly, if you delete any exchange/queue (durable or not) any bindings that depend on it get deleted automatically.

Two things to be aware of:

  • RabbitMQ will not allow you to bind a non-durable exchange to a durable queue, or vice-versa. Both the exchange and the queue must be durable for the binding operation to succeed.
  • You cannot change the creation flags on a queue or exchange after you’ve created it. For example, if you create a queue as “non-durable”, and want to change it to “durable”, the only way to do this is to destroy the queue and re-create it. It’s a good reason to double check your declarations.

Food for snakes

A real empty area for AMQP usage is using it in Python programs. For other languages there are plenty of references:

But for little old Python, you need to dig it out yourself. So other folks don’t have to wander in the wilderness like I did, here’s a little primer on using Python to do the AMQP-tasks we’ve talked about:

First, you’ll need a Python AMQP library…and there are two:

  • py-amqplib – General AMQP library
  • txAMQP – An AMQP library that uses the Twisted framework, thereby allowing asynchronous I/O.

Depending on your needs, py-amqplib or txAMQP may be more to your liking. Being Twisted-based, txAMQP holds the promise of building super performing AMQP consumers that use async I/O. But Twisted programming is a topic all its own…so we’re going to use py-amqplib for clarity’s sake. UPDATE: Please check the comments for example code showing use of txAMQP from Esteve Fernandez.

AMQP supports pipelining multiple MQ communication channels over one TCP connection, where each channel is a communication stream used by your program. Every AMQP program has at least one connection and one channel:

from amqplib import client_0_8 as amqp

# One TCP connection to the broker, and one channel riding on it.
conn = amqp.Connection(host="localhost:5672", userid="guest",
                       password="guest", virtual_host="/", insist=False)
chan = conn.channel()

Each channel is assigned an integer channel number automatically by the .channel() method of the Connection() class. Alternately, you can specify the channel number yourself by calling .channel(x), where x is the channel number you want. More often than not, it’s a good idea to just let the .channel() method auto-assign the channel number to avoid collisions.

Now we’ve got a connection and channel to talk over. At this point, our code is going to diverge into two applications that use that same bit we’ve created so far: a consumer and the publisher. Let’s create the consumer app by creating a queue named “po_box” and an exchange named “sorting_room”:

# Durable queue and exchange: both get re-created automatically after a broker restart.
chan.queue_declare(queue="po_box", durable=True,
                   exclusive=False, auto_delete=False)
chan.exchange_declare(exchange="sorting_room", type="direct", durable=True,
                      auto_delete=False)

What did that do? First, it created a queue called “po_box” that is durable (will be re-created on reboot) and will not be automatically deleted when the last consumer detaches from it (auto_delete=False). It’s important to set auto_delete to false when making a queue (or exchange) durable, otherwise the queue itself will disappear when the last consumer detaches (regardless of the durable flag). Setting both durable and auto_delete to true would make a queue that would be re-created only if RabbitMQ died unexpectedly with consumers still attached.

(You may have noticed there’s another flag specified called “exclusive”. If set to true, only the consumer that creates the queue will be allowed to attach to it. It’s a queue that is private to the creating consumer.)

There’s also the exchange declaration for the “sorting_room” exchange. auto_delete and durable mean the same things as they do in a queue declaration. However, .exchange_declare() introduces an argument called type that defines what type of exchange you’re making (as described earlier): fanout, direct or topic.

At this point, you’ve got a queue to receive messages and an exchange to publish them to initially…but we need a binding to link the two together:

# Route messages hitting "sorting_room" with routing key "jason" into "po_box".
chan.queue_bind(queue="po_box", exchange="sorting_room",
                routing_key="jason")

The binding is pretty straightforward. Any messages arriving at the “sorting_room” exchange with the routing key “jason” get routed to the “po_box” queue.
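As an aside, binding with the topic-style patterns described earlier looks almost identical. Here’s a hedged sketch that reuses the chan from above; the exchange and queue names are invented for illustration:

# Hedged sketch: pattern bindings on a topic exchange (names are invented).
chan.exchange_declare(exchange="audit_hub", type="topic", durable=True,
                      auto_delete=False)
chan.queue_declare(queue="log_forever", durable=True,
                   exclusive=False, auto_delete=False)
chan.queue_bind(queue="log_forever", exchange="audit_hub",
                routing_key="audit.#")  # matches "audit.irs", "audit.irs.corporate", ...
chan.queue_bind(queue="po_box", exchange="audit_hub",
                routing_key="audit.*")  # matches "audit.irs" but not "audit.irs.corporate"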

Now, there are two methods of getting messages out of the queue. The first is to call chan.basic_get() to pull the next message off the queue (if there are no messages waiting on the queue, chan.basic_get() will return a None object, so the snippet below guards against that before touching msg.body):

msg = chan.basic_get("po_box")
if msg is not None:
    print msg.body
    # Acknowledge so RabbitMQ knows the message was handled and won't redeliver it.
    chan.basic_ack(msg.delivery_tag)

But what if you want your application to be notified as soon as a message is available for it? To do that, instead of chan.basic_get(), you need to register a callback for new messages using chan.basic_consume():

def recv_callback(msg):
    print 'Received: ' + msg.body

# Register the callback; chan.wait() delivers messages to it as they arrive.
chan.basic_consume(queue='po_box', no_ack=True,
                   callback=recv_callback, consumer_tag="testtag")
while True:
    chan.wait()
chan.basic_cancel("testtag")

chan.wait() is looped infinitely, which is what causes the channel to wait for the next message notification from the queue. chan.basic_cancel() is how you unregister your message notification callback. The argument specifies the consumer_tag you specified in the original chan.basic_consume() registration (that’s how it figures out which callback to unregister). In this case chan.basic_cancel() never gets called due to the infinite loop that precedes it…but you need to know about it, so it’s in the snippet.

The one additional thing you should pay attention to in the consumer is the no_ack argument. It’s accepted on both chan.basic_get() and chan.basic_consume() and defaults to false. When you grab a message off a queue, RabbitMQ needs you to explicitly acknowledge that you have it. If you don’t, RabbitMQ will re-assign the message to another consumer on the queue after a timeout interval (or on disconnect by the consumer that initially received it without ack’ing it). If you set the no_ack argument to true, then py-amqplib will add a “no_ack” property to your AMQP request for the next message. That will instruct the AMQP server to not expect an acknowledgement for that get/consume. However, in most cases, you probably want to send the acknowledgement yourself (e.g. you need to put the message contents in a database before you acknowledge). Acknowledgements are done by calling the chan.basic_ack() method, using the delivery_tag property of the message you’re acknowledging as the argument (see the chan.basic_get() code snippet above for an example).
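For completeness, here’s the consumer loop again with manual acknowledgements, reusing the same chan (this sketch assumes your amqplib version exposes msg.delivery_tag on consumed messages the same way it does for basic_get):

def recv_and_ack(msg):
    print 'Received: ' + msg.body
    # ...store the message somewhere durable first, then acknowledge it...
    chan.basic_ack(msg.delivery_tag)

chan.basic_consume(queue='po_box', no_ack=False,
                   callback=recv_and_ack, consumer_tag="acktag")
while True:
    chan.wait()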

That’s all she wrote for the consumer. (Download: amqp_consumer.py)

But what good is a consumer, if nobody is sending it messages? So you need a publisher. The code below will publish a simple message to the “sorting_room” exchange and mark it with the routing key “jason”:

msg = amqp.Message("Test message!")
# Delivery mode 2 == persistent (see the three persistence steps above).
msg.properties["delivery_mode"] = 2
chan.basic_publish(msg, exchange="sorting_room", routing_key="jason")

You may notice that we set the delivery_mode element of the message’s properties to “2”. Since the queue and exchange were marked durable, this will ensure the message is sent as persistent (i.e. will survive a reboot of RabbitMQ while it is in transit to the consumer).

The only other thing we need to do (and this needs to be done on both consumer and publisher apps), is close the channel and connection:

chan.close()
conn.close()

Pretty simple, no? (Download: amqp_publisher.py)

Giving it a shot…

Now we’ve written our consumer and publisher, so let’s give it a go. (This assumes you have RabbitMQ installed and running on localhost.)

Open up the first terminal, and run python ./amqp_consumer.py to get the consumer running and to create your queues, exchanges and bindings.

Then run python ./amqp_publisher.py “AMQP rocks.” in a second terminal. If everything went well, you should see your message printed by the consumer on the first terminal.

Taking it all in

I realize this has been a really fast run through AMQP/RabbitMQ and using it from Python. Hopefully, it will fill in some of the holes of how all the concepts fit together and how they get used in a real Python program. If you find any errors in my write-up, I’d very much appreciate it if you’d please let me know (williamsjj@digitar.com). Similarly, I’d be happy to answer any questions that I can. Next up….clustering! But I’ve got to figure it out first. :-)

NB: Special thanks to Barry Pederson and Gordon Sims for correcting my understanding of no_ack’s operation and for catching syntactically incorrect Python code I missed.

NB: My knowledge on the subject was distilled from these sources, which are excellent further reading:

Viva la storage.


Posted by Jason | Posted in DigiTar, Solaris, Technology | Posted on 11-10-2008

Coming soon… ;-)

HSPA (AT&T) vs. EV-DO (Verizon)


Posted by Jason | Posted in Technology | Posted on 05-30-2008

Some folks hate to be offline, and some folks can't afford to be. I suppose I fit somewhere in between. About a month ago, I realized I was going to be doing some significant traveling…probably nowhere near a decent WiFi access point. Thus arose the question…how do you connect back to the office regardless of where your derrière happens to be? There were only a couple of minor requirements:

  • (Good) National 3G (U.S.A.) coverage
  • Minimum top end throughput around 1Mb/s
  • ExpressCard form factor (nothing sexier than a wrist-sized dongle cantilevered off your USB port)
  • Support for Mac OS X

Folks that know me probably are stunned at the last one. As of April 29th I kicked the Dell habit. My regular target of abuse is now a MacBook Pro. But that's a whole other story…

Anywho, those req's really narrowed it down to two players: AT&T and Verizon. Both offer national 3G access at speeds of 1Mb/s or greater. But they take two different approaches to it…

HSPA (High Speed Packet Access)

High Speed Packet Access is really the joining of two different 3G GSM protocols: HSDPA and HSUPA (the D and the U are “downlink” and “uplink” respectively). On AT&T's network, HSPA should give you average speeds around 1.8 Mb/s down and 800 Kb/s up. My experience has been that this is true across their network…as long as you can get a 3G signal. In fact, in some areas (LA and San Antonio) it wasn't uncommon for me to get around 2.2-2.5 Mb/s down. With tower upgrades coming early next year, the downlink speed should boost further to about 7.2 Mb/s. Overall, pretty darn good for no leash. Factor in the fact that HSPA is a 3G GSM standard widely deployed across Europe/Japan and suddenly you've got a great data solution worldwide (an issue given some upcoming trips). Oh, I forgot to mention…some places in Europe have already deployed 14.4Mb/s HSDPA (HSUPA deployment is somewhat spottier).

Compared to EV-DO, HSPA also has some design advantages. For example, both EV-DO and HSPA time slice transmission to connected clients, but HSPA can transmit to 10 clients in a single time slice, whereas EV-DO can only transmit to one client per time slice. Also, HSPA towers possess the capability to figure out which clients have the best signal quality and will transfer bandwidth capacity from clients who can't use it (bad signal) to clients that can (excellent signal). Of course, even with all of its advantages, the HSPA network is being run by AT&T…and they could screw up implementation of a PB&J sandwich…

EV-DO (EVolution – Data Optimized)

Like HSPA, EV-DO is a CDMA-based 3G protocol. Unlike HSPA, however, it is not a GSM body standard and is instead the successor to CDMA2000. So, outside of the U.S.A., Korea, areas of Japan, and piroshki stands in the former Soviet bloc, you're pretty much out of luck for access. However, it does provide 1Mb/s speeds regularly. Upload speeds are in the 200-500 Kb/s range.

With that brief understanding, I motored down to the Verizon and AT&T stores and picked up service with both companies (AT&T and Verizon have 30-day refund and new service cancellation policies).

Behind door number 1…

For a couple of years now, I've heard phenomenal things about Verizon's BroadbandAccess (EV-DO) service. People seemed to rave about its coverage and reliability…and they're right. Verizon's biggest plus is its consistency. It may not be as fast or have as low latency as AT&T when AT&T is on the ball, but they'll deliver the same service levels every time you power up. I don't care if I was in Boise, LA, or San Antonio, Verizon delivered 800-1000 Kb/s throughput and 230ms latency like clockwork. Sometimes it was a bit better or a bit worse, but only by about 10% (exception was the trip up the coast to Malibu where Verizon dropped down to 2.5G service and AT&T was nowhere to be seen).

The other nice thing about Verizon is the Novatel V740 ExpressCard. It has excellent support in OS X. Pop it in and OS X's built-in WWAN manager configures the card, activates it with Verizon and away you go. No special software to install. You even get a nice little signal strength meter on the task bar (yeah…yeah…taskbar…Windows habits die hard ;-) ).

The gal you really wanna take home to Mom…

I wanted AT&T to be the best…honest. I'm a current AT&T Wireless voice customer and love the phones, the stores and the service. However, 3G service with AT&T lacks Verizon's consistency. Initially, 3G coverage could be hit and miss when I started 4 weeks ago. However, their big push for blanket 3G coverage in advance of the 3G iPhone launch has improved the 3G network dramatically in the last week. Although the coverage is spot on now, the service level is not, at least in terms of latency. Throughput, however, is phenomenal: 1500-1800 Kb/s downlink speeds over 90% of the time, with solid 2200-2500 Kb/s in areas with the latest tower gear. So for the majority of applications, AT&T LaptopConnect is a superior solution to Verizon. But…not for me.

Quite a bit of the remote work I do involves either SSH or Windows Remote Desktop over VPN. There are few things more annoying than mistyping a command and waiting for the refresh to catch up so you can go correct it. As a result, better latency means a happier camper 'round these parts. That's not to say that AT&T's latency is awful. In fact, it's better than Verizon about 80% of the time when you measure it. So why am I complaining about it? Well, 150ms latency is only good if it stays at 150ms. AT&T's deployment of HSPA causes latency spikes regularly, particularly under load. As a result, I started doing a combo test on both services…load a YouTube video and concurrently check the ping over a VPN tunnel. If you try it, you'll see both Verizon and AT&T's latency spike dramatically. Hmm…you're probably thinking, "so AT&T is better than Verizon both with and without heavy load…why won't you say it's better?". Because it doesn't feel faster. It was really hard to put a metric on this, because while the measurements were better on AT&T, the lag while typing on an SSH connection always felt a little (to a lot) bit slower. In fact, I kept reminding myself that this had to be in my head, because the ping measurements were better than Verizon. Then while in San Antonio I tried using Skype.

San Antonio expectedly has the best coverage of any AT&T area I've been in. Consistent throughput above 2200 Kb/s and latency below 150ms. So imagine my surprise when my SSH sessions seemed laggy, and the Skype calls would start great and then break down within about 3-4 minutes. You could hear the person on the other end of the call fine, but they started having issues hearing me and my video would lock up for them. If you turn the video off it'd buy you another 4-5 minutes before the call went haywire. So back in went the Verizon card. Bang. Perfect SSH sessions. Crystal clear call quality on Skype…and the folks on the other end said not only was the video smooth but the quality of the picture was better (Skype must adjust video quality based on connection quality). A 45-minute Skype call completed with no audio or video issues on Verizon.

I tried the Skype exercise about 3-4 times over a 48-hour period with the same results. Every time I'd give AT&T a shot, and every time I'd have to drop in the Verizon card to complete a decent conversation. This does not bode well for the rumor that the 3G iPhone will take advantage of HSUPA for video conferencing. On the positive side, those consistent 2200 Kb/s AT&T downlink speeds meant I was able to suck down the OS X 10.5.3 system update (420MB) in about 30 minutes (~1500 Kb/s sustained average).

The other major issue with AT&T is the Option GT Ultra Express card. On the positive side, it supports HSUPA so you can take advantage of fast uplink speeds. Unfortunately, it isn't supported natively by the OS X WWAN subsystem (unlike its unavailable predecessor, the Option GT 3.6 Express, which is natively supported). So you have to install Option's GlobeTrotter software, which isn't as slick as the native support and frankly feels poorly built. A lot of folks on the Apple and AT&T forums have also complained about GlobeTrotter frequently crashing for them. To some degree I suspect the inconsistent performance I get from AT&T (despite the metrics) might be due to GlobeTrotter. There's also Launch2Net by NovaMedia, which provides 3rd party drivers for the GT Ultra Express. Still not native, and amazingly Launch2Net axes the native WWAN utilities that the Verizon card leverages (Launch2Net got uninstalled faster than Vista on a 286). Supposedly, OS X 10.5.3 was going to include native support for the GT Ultra Express, but as of 10.5.3's release yesterday…no dice.

Lastly, there's price. Both AT&T and Verizon charge $60/month. However, AT&T's service is unlimited where Verizon's service is 5GB/month (and $0.49/MB over that).

End of the road…

So where does that leave us? If you need reliable latency and pretty darn good speed, Verizon is your best bet in my opinion. On the other hand, if the majority of your remote work involves the web, e-mail or anything else that's not latency sensitive, AT&T is far superior and will allow global roaming. Frankly, I'm kind of anxious to hear from someone who has the GT Ultra Express on a Windows machine to find out if the inconsistent performance I experienced was specific to GlobeTrotter for Mac. Personally, I'm going to keep both services. There were a handful of times that Verizon's latency was abysmal, but AT&T's was great. Enough that I realized in an emergency I'd really need to have the option of either service.

Here's hoping AT&T's 3G latency improves…and that Apple gets with the program and includes native support for the Option GT Ultra Express…the 3G ExpressCard of choice for Apple's carrier of choice. Sorry that this post blathered on a bit long. I hope this saves other folks from having to do this much evaluation legwork.

(Here is the XLS sheet with observed metrics for both services: AT&T vs. Verizon Benchmarks )

Remember the Alamo…


Posted by Jason | Posted in DigiTar, Solaris, Technology | Posted on 05-28-2008

Tomorrow (05/28/2008) I'm giving a talk on moving to open storage (i.e. ethernet, OpenSolaris and SATA…in no particular order) at the Diocesan Information Systems Conference in San Antonio. It's a closed event, but here are the slides from the talk…including the talking notes which cover a lot more than I'll probably have time for:

PDF
Slideshare

Democratizing Storage


Posted by Jason | Posted in DigiTar, Solaris | Posted on 04-21-2008

Opensolaris_logo_trans

As a company that was heavily populated with Linux zealots, it’s been surreal for us to watch OpenSolaris develop for the past 3 years. While technologies like DTrace and FMA are features we now use everyday, it was storage that brought Solaris into our environment and continues to drive it deeper into our services stack. Which begs the question: Why? Isn’t DTrace just as cool as ZFS? Haven’t Solaris Containers dramatically changed the way we provision and utilize systems? Sure…but storage is what drives our business and it doesn’t seem to me that we’re alone.

Everything DigiTar does manipulates or massages messaging in some way. When most people think of what drives our storage requirements they think of quarantining or archiving e-mail. But when you’re dealing with messages that can make or break folks’ businesses, logging the metadata is perhaps the most important thing we do.

Metadata is flooding in every second. It’s at the center of everything from proving a message was delivered to ensuring we meet end-to-end processing times and SLAs. If we didn’t quarantine any more messages, we’d still generate gigabytes of data every day that can’t be lost. Without reliable and scalable storage we wouldn’t exist.

Lost IOPs, Corruption and Linux…oh my!

What got us using OpenSolaris was Linux’s (circa 2005) unreliable SCSI and storage subsystems. I/Os erroring out on our SAN array would be silently ignored (not retried) by Linux, creating quiet corruption that would require fail-over events. It didn’t affect our customers, but we were going nuts managing it. When we moved to OpenSolaris, we could finally trust that no errors in the logs literally meant no errors. In a lot of ways, Solaris benefits from 15 years of making mistakes in enterprise environments. Solaris anticipates and safely handles all of the crazy edge cases we’ve encountered with faulty equipment and software that’s gone haywire.

When it comes to storing data, you’ll pry OpenSolaris (and ZFS) out of our cold dead hands. We won’t deploy databases on anything else.

Liberation Day

While we moved to Solaris to get our derrières out of a sling, being on OpenSolaris has dramatically changed the way we use and design storage.

When you’ve got rock-solid iSCSI, NFS, and I/O multipathing implementations, as well as a file system (ZFS) that loves cheap disks…and none of it requires licensing…you can suddenly do anything. Need to handle 3600 non-cached IOPs for under $60K? No problem. Have an existing array but can’t justify $10K for snapshotting? No problem. How ‘bout serving line-rate iSCSI with commodity storage and CPUs? No problemo.

That’s the really amazing thing about OpenSolaris as a storage platform. It has all of the features of an expensive array and because it allows you to build reliable storage out of commodity components, you can build the storage architecture you need instead of being held hostage by the one you can afford. But features like ZFS don’t mandate that you change your architecture. You can pick and choose the pieces that fit your needs and make any existing architecture better too.

So how has OpenSolaris changed the way DigiTar does storage? For one thing, it’s enabled us to move almost entirely off of our fibre-channel SAN. We get better performance for less money by putting our database servers directly on Thumpers (Sun Fire X4500) and letting ZFS do its magic. Also, because it’s ZFS, we’re assured that every block can be verified for correctness via checksumming. By doing application-level fail-over between Thumpers, we get shared-nothing redundancy that has increased our uptime dramatically.

One of the things that always has bugged me about traditional clustering is its reliance on shared storage. That’s great if the application didn’t trash its data while crashing to the ground. But what if it did? To replicate the level of redundancy we get with two X4500s, we’d have to install two completely separate storage arrays…not to mention also buy two very large beefy servers to run the databases. By using X4500s, we get the same reliability and redundancy for about 85% less cost. That kind of savings means we can deploy 6.8x more storage for the same price footprint and do all sorts of cool things like:

  • Create multiple data warehouses for data mining spam and mal-ware trends.
  • Develop and deploy new service features whenever we want without considering storage costs.
  • Be cost competitive with competitors 10x our size.

Whether you’re storing pictures of your kids, or archiving business critical e-mail (or anything in between), it seems to me that being able to store massive amounts of data reliably is as fundamental to computing today as breathing is to living. OpenSolaris allows us as a company to stop worrying about what it’s going to cost to store the results of our services, and focus on what’s important: developing the services and features themselves. When you stop focusing on the cost of “air”, you’re liberated to actually make life incredible.

I could continue blathering about how free snapshotting (both in terms of cost and performance hit) can allow you to re-organize your backup priorities, or a bunch of other very cool benefits of using OpenSolaris as your storage platform. But you should give it a shot yourself, because OpenSolaris’ benefits are as varied and unique as your environment. Once you give it a try, I think you’ll be hard pressed to go back to vendor lock-in…but I’m probably a bit biased now. I think you’ll also find a community around OpenSolaris that is by far the friendliest and most mature open source group of folks you’ve ever dealt with.

Pretty on the inside…working with the UI on the AX2200.


Posted by Jason | Posted in Technology | Posted on 04-11-2008

It's nice when you boot an appliance and the web user interface doesn't look like it was designed by a guy who thought Jurassic Park and The Net were the pinnacle of UI design. The A10 Advanced Core OS (ACOS) has an incredibly polished look to the WebUI. Frankly, it's beautiful. All chrome and glass so to speak…

 

Overall, the Web UI is very easy to navigate and options are not buried more than 2 clicks deep. However, there are two areas where the ACOS Web UI is absolutely a pain in the rear:

  • Grid-metaphor editing.
  • Heinous layout for the relationship between physical interfaces, VLANs and virtual interfaces.

Grid Editing 

One of the most common day-to-day tasks we end up doing with a load balancer is enabling/disabling a batch of real servers for upgrade. Generally, we want to:

  1. Disable real servers A, C, and E. (Leaving B & D enabled).
  2. Upgrade A, C, and E.
  3. Swap A, C, and E back into battery and take B & D out.
  4. Upgrade B & D.
  5. Put B & D back in.

This is a perfect application where you want to be able to pull up the settings for multiple entries in one edit table. With the settings for real servers A,B,C,D, and E up on the same page, you can change all of the applicable settings all at the same time, verify each server is correct, then bam!…slam the new settings into place all at once. Unfortunately, this is not possible with the ACOS Web UI. The only thing you can do to multiple entries at once is delete them:

 

But simple maintenance of real server status is not the only place where the table editing metaphor is helpful. It is indispensable when trying to balance which VLANs are on which physical ports. Having to drill into an entry, make the change, and then re-examine the grid view to see how it looks is very tedious. It's much easier to pull up all the necessary interface/VLAN assignments on one view, edit them in-place and then apply them with a single click once they look right. It seems that the goal of any good Web UI should be to minimize round trips and enable batch application as much as possible. This was an area where the Nauticus/Sun Web UI was phenomenal. Any grid view could be turned into an edit table. On the other hand, if you only selected one entry to edit, the Nauticus Web UI was smart enough to reformat the one entry into a single column of editable values (so it fit horizontally without scrolling). Quickly swapping batches of real servers in and out of service is not a task we're looking forward to with the AX2200.

Network Relationships & Just Being Friendly 

This is not an uncommon metaphor for dealing with VLANs and the IP interfaces that sit on them:

  • VLANs entries belong to physical interfaces.
  • Virtual IP interfaces are created and belong to specific VLANs entries.

To A10's credit, it's a familiar metaphor that is instantly accessible, and they even kept the ve0, ve1 virtual interface naming convention that's common to Cisco and Foundry equipment. Where they went wrong is not making it easy to tag a friendly name onto the VLAN and virtual interface entries.

What's the purpose of VLAN 1234? Well, it's attached to virtual interface ve0…that's helpful. What on God's Green Earth does ve0 serve? You can't tell easily from the VLAN page. You either have to dig out your documentation, or open the virtual interface list in a separate window:

 

The simple solution on Foundry and Nauticus/Sun gear was what you could call "friendly names": a simple user description for each VLAN, interface and virtual interface. Can't remember what VLAN 1234 does…no problem…its friendly name says "tier1_realservers". Oh! That's right, VLAN 1234 contains the application servers for tier 1 of our application and ve0 is the virtual interface that serves that subnet. Toggling back and forth between tabs in Firefox for VLANs and virtual interfaces while setting up the test AX2200 has been a barrel of monkeys. Frankly, "friendly" or "vanity" names should be able to be attached to any type of entry whether it's a real server, a physical interface, or an SSL certificate.

Other nits so far: 

  • Appliance will not boot if hard drives are not in exactly the same slots as shipped (not expected for a RAID-1 setup).
  • Can't find a mechanism in the Web UI to generate a CSR.
  • Can't find a way to import a PEM file (must import the certificate and key files separately).
  • There doesn't appear to be a way to load certificates and keys by pasting them into a text box.
  • Host name is not shown at the top of the GUI or in the page title at all times to help identify which box you're logged into.
  • Virtual interfaces that are already in use still show up in the VLAN creation screen as assignable. Only when “Apply” is clicked does a JavaScript alert box tell you there's an issue.
  • Physical front panel status light only blinks when there's a problem. It does not turn amber or red. Very easy to miss if you don't already know there's an issue.
  • Showing system interfaces via the CLI is “sh int”, rather than “sh sys int” as it is on Foundry gear.

The last one is not something you'd normally complain about. All networking vendors seem to do it differently. However, given that A10 is staffed with so many ex-Foundry Networks folks, and that the ACOS CLI is identical to Ironware in so many areas, it's an unwelcome surprise when “sh sys int” errors out while you're in the CLI.
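To illustrate the muscle-memory trap (the error text below is approximated, not captured off the box):

    AX2200# sh sys int        ! the Ironware habit - ACOS rejects it
    % Unrecognized command
    AX2200# sh int            ! the ACOS form - prints the interface status table

Trivial, yes, but it bites every single time you bounce between a Foundry box and the AX2200.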

Needless to say, we're still talking about the AX2200, so we're fairly happy with what we've seen so far. However, “friendly naming” and table editing really need to be fixed in an upcoming version of ACOS. The current way of doing things is probably only acceptable in very small environments where the boxes don't get touched very much. This weekend is dedicated to SLB testing…so hopefully more advanced configuration is where the Web UI really comes together.

That's all that's fit to print, as they say.

Technorati Tags: , , , ,

Lost in wonderland again…unboxing the AX2200

1

Posted by Jason | Posted in Technology | Posted on 04-07-2008

It's been nearly two years since we ventured into the wonderland of replacing our Alteon gear with the Sun N1216. It was a big risk because load balancers are tightly interlaced with our multi-phased mail logistics architecture. To say the least, we have not been disappointed. The Sun N1216 series is by far the best load balancer we've ever worked with. Almost limitless power (~3Gbps) for a $25K list price. (Its big brother, the N2120, was the Bugatti Veyron of the load balancer world.) But more than power, the N series provides an incredibly elegant and powerful virtualization model that is irreplaceable. It enabled us to reduce what were multiple pairs of Alteons down to a single pair of N1216s running multiple virtual load balancer instances.

But what blew us away was a very simple feature we'll call “assignable virtual IP address (VIP)”. Assignable VIP functionality allows you to create two virtual load balancers (internal and external) with no routing in common, attach your real servers to one (internal), and advertise the VIP on the other (external). Because there is no routing path between them (all traffic hitting the VIP is essentially memory copied to the internal load balancer for SLB processing), no servers sitting in your DMZ can compromise or talk directly to your real servers. They simply can't talk to something there's no routing path to. As a result, you have a separate, clean management path to your real servers that is entirely inside your trusted network, and it dramatically simplifies your topology (no ACLs!). It is by far the best application of virtualization in a network device we've ever seen. However, the halcyon days came to an end in April of 2007 when we were informed that Sun intended to EOL the entire N series and shut down the load balancing group they had acquired with Nauticus. Given that there were no other products on the market in April of 2007 that could even remotely drop seamlessly into our new topology, we decided to wait and see what Sun might do next.

A year later, not much has changed, and Sun still doesn't have a coherent strategy on load balancing to replace the N series. While our units will continue to be supported for the next 5 years, there won't be software updates, and definitely no updates to the phenomenal FPGAs that make the box scream. There are flaws in the N series that need bug fixes…things that would be livable if they were going to be fixed. But in a production environment, “no bug fixes” is simply not an acceptable strategy. So we're back in wonderland…

To cut to the chase, we talked with all the major vendors and settled down to F5, Citrix/NetScaler, and Cisco. Only Cisco, with their ACE platform, has any virtualization story whatsoever. Everyone else has no virtualization plans that they're telling their sales dudes about. All 3 can cobble together an inelegant and obfuscated configuration to allow us to maintain our topology and security stance, but none can do the “assignable VIP” magic that made Sun/Nauticus such an amazing application of virtualization and so clean to administer.

In the middle of all this, a trusted friend at Sun recommended we take a look at a new load balancing company, A10 Networks. Now A10 doesn't have virtualization in their platform today, and they definitely don't have “assignable VIP”. But they have a story and roadmap that will make any Sun/Nauticus customer get a big silly grin on their face. You'll have to talk to A10 to find out the particulars. ;-)  

What does A10 have? A phenomenal architecture on paper, and sane licensing. While FPGAs are what made the Nauticus design scream, being entirely FPGA and ASIC driven was also what drove the cost of bug fixes up. It was difficult for them to add L4/L7 features at the same rate that F5 and others were, because it usually required a modification of the FPGA layout. Enter what appears to be a brilliant design compromise and an excellent capitalization on the Intel/AMD race for core count. The A10 AX2200 and above have L2/L3 ASICs, SSL ASICs, and an L4/L7 traffic director FPGA. The FPGA dynamically assigns new connections to each of the box's 4-8 Xeon cores for full L4/L7 processing. Also, each core operates independently from the others. That is to say, there are no contention or synchronization penalties for using more cores. Add more connections and the traffic FPGA evenly distributes them among the cores, and stitches the results back together for the client. Near perfect parallelization. All of the heavy L4/L7 lifting is done entirely in software on generic Xeon cores. This allows A10 to quickly add the complex features (like F5 does) that would have required an FPGA modification on the Nauticus gear. The excellent parallelization model ensures that the performance hit of using generic CPUs instead of FPGAs can be made up linearly by adding buckets of cores. The FPGA is therefore much simpler in design than what Nauticus required. But as I said, this is all on paper.

However, the design is every bit as seductive as what Nauticus created. F5, NetScaler and Cisco all have L2/L3 ASICs in their boxes, but nothing really significant in terms of hardware acceleration in the L4/L7 areas (F5 does have an L4 ASIC that provides good acceleration of basic L4 TCP-termination load balancing). So we've decided to leap again and take a chance with A10. Also, A10 includes Global Server Load Balancing for free and does not engage in F5's hideous practice of licensing HTTP compression and SSL offload capacity by the MB/s…oh, and A10 has TCL-based aRules. ;-)

So we eagerly awaited the FedEx guy on Thursday to deliver our new pair of AX2200s for validation testing. With a 100lb thump they landed solidly on our testing table, and a couple flicks of a box cutter later…


A10 Networks AX2200 – Front Panel

What came out of the box looked like the unholy progeny of a Sega Master System and the portholes from a Buick Roadmaster. Needless to say, she ain't a looker. Frankly, at this price level the gear should be drop-dead sexy. Yes, it may be shallow, but it's a requirement when you're trying to justify an $80 grand list price. To add insult to injury, the portholes don't serve any utilitarian purpose like cooling…they're actually a solid piece of plastic. As a counter example, the N2120 and Sun's standard 2U server design are phenomenal:


Sun N2120 & Sun 2U Server – Front Panel

They exude the simplicity and power concealed inside…a little glimpse of the upper echelons of what you're spending the company's hard-earned bananas on. But what the AX2200 gets right is spot-on build quality. It's solid with no rattles. The power supplies slide smoothly and easily. Re-seating a supply gives a firm click and solidly locks it against removal. Overall, it's downright Teutonic in construction. Sort of like an older Audi S8: built to run forever like greased lightning, but not much to look at. A10 could take Audi's cue and start paying attention to creating looks that match the engineering.


A10 Networks AX2200 – Back Panel

One very nice feature of the AX2200 for a load balancer is the hot-swap fan tray. Not having to spirit the whole unit back to Boston because a fan went south is a nice change from the N1216. Also, the interior build quality is just as clean and professional as the exterior components. Hard edge connectors and system board traces are used almost exclusively, with nearly no ribbon cables cluttering up the interior. The only nit is that the front management NIC is run to the motherboard via an RJ-45 cable routed to the back. Don't let the server exterior of this box fool you, this is a purpose-built system with specialized ASICs and FPGAs inside.


A10 Networks AX2200 & Sun N2120 – Side View

As with any new appliance, this one has a couple of strange design foibles that go deeper than its looks. First, the box vents in from the sides and exhausts out the back. In that regard, it's not quite at home either in a rack with your servers or with your switching and routing gear. The strange intake flow means that if you rack the AX2200 above your side-to-side vented switching gear, you'll likely overheat the AX2200 as it sucks in the switches' side exhaust air. Luckily, we have some Juniper kit that vents front to back, so we will likely rack the AX2200s with them. Also, the locking drive carriers are a bit frustrating. It's a nice feature that they can be locked, but inserting the key with any more force than a gnat breaking wind pops out the removal handle. It's obviously an off-the-shelf carrier that no designer actually tried before spec'ing it out of the parts book.


A10 Networks AX2200 – Front Panel No Bezel

On the positive side, the serial connection is on the front and is a Cisco-style RJ-45. Yippeee! No RS-232-to-rollover adapters to hook it into our Dominion SX! It may seem like a small thing, but it really means fewer parts to lose, break and stock at the data center. I wish I could say they had the foresight to also put a sticker on the front with the box's serial number…but unfortunately not so much. You'd better note the serial number before you rack the AX2200, otherwise it's going to be crane-and-strain time to see the sticker on the bottom of the unit. Scratch that…the serial number is conveniently placed on the rear left of the unit as well. It's not as easy to see as a front sticker would be, but it beats reading the underside. Also, they did show the company's Foundry Networks pedigree by shipping a very Foundry Networks-esque self-test sheet with the unit:


AX2200 Self-Test Slip
 

Kudos for the self-test paperwork. If you keep that on file, you can probably forgive the serial number sticker's ill-fated position on the unit's underside.

Overall, first impressions…the construction and major design decisions are terrific. This box looks the part internally, and feels the part externally, as a major piece of core infrastructure. Next step is to rack her and beat the heck out of her with our test rig: a screaming UltraSPARC T2. :-) Will post more soon on how the AX2200 stands up to the scorching…

Technorati Tags: , , ,

patchadd wherefore art thou? (Installing Sun Studio 12 on Indiana DP2)

0

Posted by Jason | Posted in Solaris | Posted on 03-07-2008

So you've got a shiny new Indiana DP2 install and you want to load Sun Studio. Hypothetically, let's say you're looking to compile mod_python. Then, half-way through the Sun Studio 12 install you hit a wall: /usr/sbin/patchadd doesn't exist. What's a boy to do?

The problem is that the SUNWswmt package, which contains patchadd, was accidentally omitted from the DP2 repository. The solution is to grab SUNWswmt and a few other dependencies from SXDE 01/08. But downloading 3 gigs of ISO to retrieve 1MB of packages seems like a bit of overkill. So here is a tarball containing just the packages you need: <Temporarily Removed>

Install instructions:

  1. Unpack <Temporarily Removed> in the directory of your choice (and change to the root of that directory).
  2. Run: pfexec pkgadd -d . SUNWadmc
  3. Run: pfexec pkgadd -d . SUNWpkgcmdsr (ignore any overwrite warnings and continue)
  4. Run: pfexec pkgadd -d . SUNWpkgcmdsu
  5. Run: pfexec pkgadd -d . SUNWinstall-patch-utils-root
  6. Run: pfexec pkgadd -d . SUNWswmt

Bada bing, bada boom you should be in business. Just install Sun Studio 12 normally at this point.
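If you'd rather paste the whole thing in one go, here's the same sequence as a quick shell snippet. The unpack directory is just an example; point it at wherever you extracted the tarball:

    # assumes the tarball was unpacked into ~/dp2-pkgs (example path)
    cd ~/dp2-pkgs
    for pkg in SUNWadmc SUNWpkgcmdsr SUNWpkgcmdsu SUNWinstall-patch-utils-root SUNWswmt; do
        pfexec pkgadd -d . "$pkg"     # answer "y" to any overwrite warnings
    done
    ls -l /usr/sbin/patchadd          # sanity check: patchadd should be back

Then fire up the Sun Studio 12 installer and it should sail past the patchadd check.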