Posted on 13 Jan 2009 In: Software Development, Technology

Rabbits and warrens.

The goal was simple enough: decouple a particular type of analysis out-of-band from mainstream e-mail processing. We started down the MySQL road…put the things to be digested into a table…consume them in another daemon…bada bing bada boom. But pretty soon, complex ugliness crept into the design phase… You want to have multiple daemons servicing the queue?…no problem we’ll just hard code node numbers…what? you want dynamic load re-assignment when daemons join and die?

You get the idea…what was supposed to be simple (decouple something) was spinning its own Gordian knot. It seemed like a good time to see if every problem was looking like a nail (table), because all we had were hammers (MySQL).

A short search later, and we entered the world of message queueing. No, no…we know obviously what a message queue is. Heck, we do e-mail for a living. We’ve implemented all sorts of specialized, high-speed, in-memory queues for e-mail processing. What we weren’t aware of was the family of off-the-shelf, generalized, message queueing (MQ) servers…a language-agnostic, no-assembly required way to wire routing between applications over a network. A message queue we didn’t have to write ourselves? Hold your tongue.

Open up your queue…

Cutting to the chase, over the last 4 years there has been no shortage of open-source message queueing servers written. Most of them are one-offs by folks like LiveJournal to scratch a particular itch. Yeah, they don’t really care what kind of messages they carry, but their design parameters are usually creator-specific (and message persistence after a crash usually isn’t one of them). However, there are three in particular that are designed to be highly flexible message queues for their own sake:

Apache ActiveMQ gets the most press, but it appears to have some issues with losing messages. Next.

ZeroMQ and RabbitMQ both support an open messaging protocol called AMQP. The advantage to AMQP is that it’s designed to be a highly-robust and open alternative to the two commercial message queues out there (IBM and Tibco). Muy bueno. However, ZeroMQ doesn’t support message persistence across crashes or reboots. No muy bueno. That leaves us with RabbitMQ. (That being said, if you don’t need persistence ZeroMQ is pretty darn interesting…incredibly low latency and flexible topologies).

That leaves us with the carrot muncher…

RabbitMQ pretty much sold me the minute I read “written in Erlang”. Erlang is a highly parallel programming language developed over at Ericsson for running telco switches…yeah the kind with six bazillion 9s of uptime. In Erlang, it’s supposedly trivial to spin off processes and then communicate between them using message passing. Seems like the ideal underpinning for a message queue, no?

Also, RabbitMQ supports persistence. Yes Virginia, if your RabbitMQ dies, your messages don’t have to die an unwitting death…they can be reborn in your queues on reboot. Oh…and as is always desired @ DigiTar, it plays nicely with Python. All that being said, RabbitMQ’s documentation is well…horrible. Lemme rephrase, if you already understand AMQP, the docs are fine. But how many folks know AMQP? It’d be like MySQL docs assuming you knew some form of SQL…er…nevermind.

So, without further ado…here is a reduction of a week’s worth of reading up on AMQP and how it works in RabbitMQ…and how to play with it in Python:

Playing telephone

There are four building blocks you really care about in AMQP: virtual hosts, exchanges, queues and bindings. A virtual host holds a bundle of exchanges, queues and bindings. Why would you want multiple virtual hosts? Easy. A username in RabbitMQ grants you access to a virtual host…in its entirety. So the only way to keep group A from accessing group B’s exchanges/queues/bindings/etc. is to create a virtual host for A and one for B. Every RabbitMQ server has a default virtual host named “/”. If that’s all you need, you’re ready to roll.

Exchanges, Queues and bindings…oh my!

Here’s where my railcar went off the tracks initially. How do all the parts thread together?

Queues are where your “messages” end up. They’re message buckets…and your messages sit there until a client (a.k.a. consumer) connects to the queue and siphons them off. However, you can configure a queue so that if there isn’t a consumer ready to accept the message when it hits the queue, the message goes poof. But we digress…

The important thing to remember is that queues are created programmatically by your consumers (not via a configuration file or command line program). That’s OK, because if a consumer app tries to “create” a queue that already exists, RabbitMQ pats it on the head, smiles gently and NOOPs the request. So you can keep your MQ configuration in-line with your app code…what a concept.

OK, so you’ve created and attached to your queue, and your consumer app is drumming its fingers waiting for a message…and drumming…and drumming…but alas no message. What happened? Well you gotta pump a message in first! But to do that you’ve got to have an exchange…

Exchanges are routers with routing tables. That’s it. End stop. Every message has what’s known as a “routing key”, which is simply a string. The exchange has a list of bindings (routes) that say, for example, messages with routing key “X” go to queue “timbuktu”. But we get slightly ahead of ourselves.

Your consumer application should create your exchanges (plural). Wait? You mean you can have more than one exchange? Yes, you can, but why? Easy. Each exchange operates in its own userland process, so adding exchanges adds processes, allowing you to scale message routing capacity with the number of cores in your server. As an example, on an 8-core server you could create 5 exchanges to maximize your utilization, leaving 3 cores open for handling the queues, etc. Similarly, in a RabbitMQ cluster, you can use the same principle to spread exchanges across the cluster members to add even more throughput.

OK, so you’ve created an exchange…but it doesn’t know what queues the messages go in. You need “routing rules” (bindings). A binding essentially says things like this: put messages that show up in exchange “desert” and have routing key “ali-baba” into the queue “hideout”. In other words, a binding is a routing rule that links an exchange to a queue based on a routing key. It is possible for two binding rules to use the same routing key. For example, maybe messages with the routing key “audit” need to go both to the “log-forever” queue and the “alert-the-big-dude” queue. To accomplish this, just create two binding rules (each one linking the exchange to one of the queues) that both trigger on routing key “audit”. In this case, the exchange duplicates the message and sends it to both queues. Exchanges are just routing tables containing bindings.
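To make the routing-table idea concrete, here is a minimal in-memory sketch of a direct-style exchange in plain Python. It is a toy model and not how RabbitMQ is implemented: the ToyExchange class and its bind/publish methods are made up for illustration, and the "queues" are just Python lists.

```python
from collections import defaultdict


class ToyExchange(object):
    """Toy stand-in for a direct exchange: a routing table of bindings."""

    def __init__(self):
        # routing key -> list of queue names bound with that key
        self.bindings = defaultdict(list)
        # queue name -> message bodies delivered so far
        self.queues = defaultdict(list)

    def bind(self, queue, routing_key):
        # Binding the same queue/key pair twice changes nothing.
        if queue not in self.bindings[routing_key]:
            self.bindings[routing_key].append(queue)

    def publish(self, routing_key, body):
        # Copy the message into every queue bound with an exactly matching
        # key. A message whose key matches no binding is simply dropped.
        targets = self.bindings.get(routing_key, [])
        for queue in targets:
            self.queues[queue].append(body)
        return list(targets)


exchange = ToyExchange()
exchange.bind("log-forever", "audit")
exchange.bind("alert-the-big-dude", "audit")
exchange.bind("hideout", "ali-baba")

# Two bindings share the key "audit", so the message lands in both queues.
delivered_to = exchange.publish("audit", "someone opened the vault")
print(delivered_to)
```

The point of the sketch is only that a binding is a row in a lookup table, and that two rows with the same routing key duplicate the message.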

Now for the curveball: there are multiple types of exchanges. They all do routing, but they accept different styles of binding “rules”. Why not just create one type of exchange for all styles of rules? Because each rule style has a different CPU cost for analyzing if a message matches the rule. For example, a “topic” exchange tries to match a message’s routing key against a pattern like “dogs.*”. Matching that wildcard on the end takes more CPU than simply seeing if the routing key is “dogs” or not (e.g. a “direct” exchange). If you don’t need the extra flexibility of a “topic” exchange, you can get more messages/sec routed if you choose the “direct” exchange type. So what are the types and how do they route?

Fanout Exchange - No routing keys involved. You simply bind a queue to the exchange. Any message that is sent to the exchange is sent to all queues bound to that exchange. Think of it like a subnet broadcast. Any host on the subnet gets a copy of the packet. Fanout exchanges route messages the fastest.

Direct Exchange - Routing keys are involved. A queue binds to the exchange to request messages that match a particular routing key exactly. This is a straight match. If a queue binds to the exchange requesting messages with routing key “dog”, only messages labelled “dog” get sent to that queue (not “dog.puppy”, not “dog.guard”…only “dog”).

Topic Exchange - Matches routing keys against a pattern. Instead of binding with a particular routing key, the queue binds with a pattern string. Routing keys are treated as words separated by dots. The symbol # matches zero or more words, and the symbol * matches any single word (no more, no less). So “audit.#” would match “audit.irs.corporate”, but “audit.*” would only match “audit.irs”. Our friends at Red Hat have put together a great image to express how topic exchanges work:

Source: Red Hat Messaging Tutorial: 1.3 Topic Exchange
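Since the picture doesn't always come through, here is a small sketch of topic matching in plain Python. It assumes the AMQP specification's wording, where routing keys are dot-separated words, * stands for exactly one word and # stands for zero or more words. The function name topic_matches is made up for illustration, and this is not RabbitMQ's actual matching code.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a topic pattern."""
    pattern_words = pattern.split(".")
    key_words = routing_key.split(".") if routing_key else []

    def match(p, k):
        # p and k are positions in pattern_words and key_words.
        if p == len(pattern_words):
            return k == len(key_words)
        word = pattern_words[p]
        if word == "#":
            # "#" may swallow zero or more words: try every possible split.
            return any(match(p + 1, j)
                       for j in range(k, len(key_words) + 1))
        if k == len(key_words):
            return False
        if word == "*" or word == key_words[k]:
            return match(p + 1, k + 1)
        return False

    return match(0, 0)


for pattern, key in [("audit.#", "audit.irs.corporate"),
                     ("audit.*", "audit.irs"),
                     ("audit.*", "audit.irs.corporate"),
                     ("dogs.*", "dogs")]:
    print(pattern + " vs " + key + ": " + str(topic_matches(pattern, key)))
```

A direct exchange would replace all of this with a single string comparison, which is the CPU-cost difference mentioned earlier.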

 

Persistent little bugger…

You spend all that time creating your queues, exchanges and bindings, and then BANG!…the server fries faster than the griddle at McDonald’s. All your queues, exchanges and bindings are there right? Oh geez…what about the messages in the queues you hadn’t serviced yet?

Relax, providing you created everything with the default arguments, it’s all gone…poof…whoosh…nada…nil. That’s right, RabbitMQ rebooted as empty as a baby’s noggin. You gotta redo everything kemosabe. How do you keep this from happening in the future?

On your queues and your exchanges there’s a creation-time flag called “durable”. There’s only one thing durable means in AMQP-land…the queue or exchange marked durable will be re-created automatically on reboot. It does not mean the messages in the queues will survive the reboot. They won’t. So how do we make not only our config but messages persist through a reboot?

Well the first question is, do you really want your messages to persist? For a message to last through a reboot, it has to be written to disk, and even a simple checkpoint to disk takes time. If you value message routing speed more than the contents of the message, don’t make your messages persistent. That being said, for our particular needs @ DigiTar, persistence is important.

When you publish your message to an exchange, there’s a flag called “Delivery Mode”. Depending on the AMQP library you’re using there will be different ways of setting it (we’ll cover the Python library later). But the long and the short of it is you want the “Delivery Mode” set to the value 2, which means “persistent”. “Delivery Mode” usually (depending on your AMQP library) defaults to a value of 1, which means “non-persistent”. So the steps for persistent messaging are:

  1. Mark the exchange “durable”.
  2. Mark the queue “durable”.
  3. Set the message’s “delivery mode” to a value of 2.

That’s it. Not really rocket science, but enough moving parts to make a mistake and send little Sally’s dental records into cyber-Nirvana.

There may be one thing nagging you though…what about the binding? We didn’t mark the binding “durable” when we created it. It’s alright. If you bind a durable queue to a durable exchange, RabbitMQ will automatically preserve the binding. Similarly, if you delete any exchange/queue (durable or not) any bindings that depend on it get deleted automatically.
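If it helps to see those rules side by side, here is a toy model in plain Python of what a restart keeps. It only encodes the rules as described in this post (durable exchanges and queues come back, bindings between two survivors come back, and only delivery-mode-2 messages in durable queues come back). The function name and the data layout are made up for illustration; nothing here talks to a real broker.

```python
def survivors_after_restart(exchanges, queues, bindings, messages):
    """Toy model of what a broker restart keeps, per the rules above.

    exchanges, queues: dicts mapping name -> durable flag (True/False)
    bindings: list of (exchange, queue, routing_key) tuples
    messages: list of (queue, delivery_mode, body) tuples
    """
    kept_exchanges = dict((n, d) for n, d in exchanges.items() if d)
    kept_queues = dict((n, d) for n, d in queues.items() if d)
    # A binding survives only if both of its ends survive.
    kept_bindings = [b for b in bindings
                     if b[0] in kept_exchanges and b[1] in kept_queues]
    # A message survives only if it is persistent (delivery mode 2)
    # and the queue holding it is durable.
    kept_messages = [m for m in messages
                     if m[1] == 2 and m[0] in kept_queues]
    return kept_exchanges, kept_queues, kept_bindings, kept_messages


exchanges = {"sorting_room": True, "scratch": False}
queues = {"po_box": True, "temp_box": False}
bindings = [("sorting_room", "po_box", "jason"),
            ("sorting_room", "temp_box", "jason")]
messages = [("po_box", 2, "persistent letter"),
            ("po_box", 1, "transient letter"),
            ("temp_box", 2, "persistent, but in a non-durable queue")]

kept = survivors_after_restart(exchanges, queues, bindings, messages)
print(kept)
```

Miss any one of the three steps and the message falls out of the last filter, which is exactly how little Sally's dental records go missing.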

Two things to be aware of:

  • RabbitMQ will not allow you to bind a non-durable exchange to a durable queue, or vice-versa. Both the exchange and the queue must be durable for the binding operation to succeed.
  • You cannot change the creation flags on a queue or exchange after you’ve created it. For example, if you create a queue as “non-durable”, and want to change it to “durable”, the only way to do this is to destroy the queue and re-create it. It’s a good reason to double check your declarations.

Food for snakes

A real empty area for AMQP usage is using it in Python programs. For other languages there are plenty of references:

But for little old Python, you need to dig it out yourself. So other folks don’t have to wander in the wilderness like I did, here’s a little primer on using Python to do the AMQP-tasks we’ve talked about:

First, you’ll need a Python AMQP library…and there are two:

  • py-amqplib - General AMQP library
  • txAMQP - An AMQP library that uses the Twisted framework, thereby allowing asynchronous I/O.

Depending on your needs, py-amqplib or txAMQP may be more to your liking. Being Twisted-based, txAMQP holds the promise of building super performing AMQP consumers that use async I/O. But Twisted programming is a topic all its own…so we’re going to use py-amqplib for clarity’s sake. UPDATE: Please check the comments for example code showing use of txAMQP from Esteve Fernandez.

AMQP supports pipelining multiple MQ communication channels over one TCP connection, where each channel is a communication stream used by your program. Every AMQP program has at least one connection and one channel:

    from amqplib import client_0_8 as amqp

    # Open a TCP connection to the broker, then a channel on top of it.
    conn = amqp.Connection(host="localhost:5672", userid="guest",
                           password="guest", virtual_host="/", insist=False)
    chan = conn.channel()

Each channel is assigned an integer channel number automatically by the .channel() method of the Connection() class. Alternately, you can specify the channel number yourself by calling .channel(x), where x is the channel number you want. More often than not, it’s a good idea to just let the .channel() method auto-assign the channel number to avoid collisions.

Now we’ve got a connection and channel to talk over. At this point, our code is going to diverge into two applications that use that same bit we’ve created so far: a consumer and a publisher. Let’s create the consumer app by creating a queue named “po_box” and an exchange named “sorting_room”:

    # Declare the queue and the exchange; re-declaring existing ones is a no-op.
    chan.queue_declare(queue="po_box", durable=True,
                       exclusive=False, auto_delete=False)
    chan.exchange_declare(exchange="sorting_room", type="direct", durable=True,
                          auto_delete=False)

What did that do? First, it created a queue called “po_box” that is durable (will be re-created on reboot) and will not be automatically deleted when the last consumer detaches from it (auto_delete=False). It’s important to set auto_delete to false when making a queue (or exchange) durable, otherwise the queue itself will disappear when the last consumer detaches (regardless of the durable flag). Setting both durable and auto_delete to true would make a queue that would be recreated only if RabbitMQ died unexpectedly with consumers still attached.

(You may have noticed there’s another flag specified called “exclusive”. If set to true, only the consumer that creates the queue will be allowed to attach to it. It’s a queue that is private to the creating consumer.)

There’s also the exchange declaration for the “sorting_room” exchange. auto_delete and durable mean the same things as they do in a queue declaration. However, .exchange_declare() introduces an argument called type that defines what type of exchange you’re making (as described earlier): fanout, direct or topic.

At this point, you’ve got a queue to receive messages and an exchange to publish them to initially…but we need a binding to link the two together:

    chan.queue_bind(queue="po_box", exchange="sorting_room",
                    routing_key="jason")

The binding is pretty straightforward. Any message arriving at the “sorting_room” exchange with the routing key “jason” gets routed to the “po_box” queue.

Now, there are two methods of getting messages out of the queue. The first is to call chan.basic_get() to pull the next message off the queue. If there are no messages waiting on the queue, chan.basic_get() returns None, which would blow up the msg.body access below if it weren’t trapped:

    msg = chan.basic_get("po_box")
    if msg is not None:   # None means the queue was empty
        print(msg.body)
        chan.basic_ack(msg.delivery_tag)
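If you want to empty a queue rather than grab a single message, the same two calls can be wrapped in a loop. The sketch below is one way to do that. drain() is a made-up helper name, and FakeChannel/FakeMessage are hypothetical stand-ins so the snippet runs without a broker; against a real server you would pass the chan object created earlier, which provides the same basic_get() and basic_ack() methods.

```python
def drain(chan, queue):
    """Pull every message currently waiting on `queue`, acking each one."""
    bodies = []
    while True:
        msg = chan.basic_get(queue)
        if msg is None:      # queue is empty: stop instead of blowing up
            break
        bodies.append(msg.body)
        chan.basic_ack(msg.delivery_tag)
    return bodies


class FakeMessage(object):
    """Hypothetical message with only the two attributes drain() reads."""

    def __init__(self, body, delivery_tag):
        self.body = body
        self.delivery_tag = delivery_tag


class FakeChannel(object):
    """Hypothetical stand-in for a channel, for trying drain() offline."""

    def __init__(self, bodies):
        self._pending = [FakeMessage(b, i + 1) for i, b in enumerate(bodies)]
        self.acked = []

    def basic_get(self, queue):
        # Like the real call, hand back None once nothing is waiting.
        return self._pending.pop(0) if self._pending else None

    def basic_ack(self, delivery_tag):
        self.acked.append(delivery_tag)


fake = FakeChannel(["first", "second"])
drained = drain(fake, "po_box")
print(drained)
```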

But what if you want your application to be notified as soon as a message is available for it? To do that, instead of chan.basic_get(), you need to register a callback for new messages using chan.basic_consume():

    def recv_callback(msg):
        print('Received: ' + msg.body)

    chan.basic_consume(queue='po_box', no_ack=True,
                       callback=recv_callback, consumer_tag="testtag")
    while True:
        chan.wait()
    chan.basic_cancel("testtag")  # never reached; shown for reference

chan.wait() is looped infinitely, which is what causes the channel to wait for the next message notification from the queue. chan.basic_cancel() is how you unregister your message notification callback. The argument specifies the consumer_tag you specified in the original chan.basic_consume() registration (that’s how it figures out which callback to unregister). In this case chan.basic_cancel() never gets called due to the infinite loop that precedes it…but you need to know about it, so it’s in the snippet.

The one additional thing you should pay attention to in the consumer is the no_ack argument. It’s accepted on both chan.basic_get() and chan.basic_consume() and defaults to false. When you grab a message off a queue, RabbitMQ needs you to explicitly acknowledge that you have it. If you don’t, RabbitMQ will re-assign the message to another consumer on the queue after a timeout interval (or on disconnect by the consumer that initially received it without ack’ing it). If you set the no_ack argument to true, then py-amqplib will add a “no_ack” property to your AMQP request for the next message. That will instruct the AMQP server to not expect an acknowledgement for that get/consume. However, in most cases, you probably want to send the acknowledgement yourself (e.g. you need to put the message contents in a database before you acknowledge). Acknowledgements are done by calling the chan.basic_ack() method, using the delivery_tag property of the message you’re acknowledging as the argument (see the chan.basic_get() code snippet above for an example).
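For the callback style, a sketch of "do the work first, acknowledge second" might look like the following. It assumes, as py-amqplib's own demo scripts do, that a delivered message exposes msg.channel and msg.delivery_tag; make_acking_callback is a made-up helper name, and the Stub classes exist only so the snippet can be run without a broker.

```python
def make_acking_callback(handle):
    """Wrap `handle` so a message is acked only after it returns cleanly."""
    def callback(msg):
        handle(msg.body)                         # may raise: then no ack
        msg.channel.basic_ack(msg.delivery_tag)  # ack only on success
    return callback


class StubChannel(object):
    """Hypothetical channel that just records which tags were acked."""

    def __init__(self):
        self.acked = []

    def basic_ack(self, delivery_tag):
        self.acked.append(delivery_tag)


class StubMessage(object):
    """Hypothetical message carrying the attributes the callback reads."""

    def __init__(self, channel, body, delivery_tag):
        self.channel = channel
        self.body = body
        self.delivery_tag = delivery_tag


stored = []
stub_chan = StubChannel()
callback = make_acking_callback(stored.append)  # "store it" = append to list
callback(StubMessage(stub_chan, "hello", 7))
print(stored)
print(stub_chan.acked)

# With a real channel, registration would look roughly like this:
# chan.basic_consume(queue='po_box', no_ack=False,
#                    callback=callback, consumer_tag="testtag")
```

If the handler raises, the acknowledgement is never sent, so the broker still considers the message undelivered.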

That’s all she wrote for the consumer. (Download: amqp_consumer.py)

But what good is a consumer, if nobody is sending it messages? So you need a publisher. The code below will publish a simple message to the “sorting_room” exchange and mark it with the routing key “jason”:

    msg = amqp.Message("Test message!")
    msg.properties["delivery_mode"] = 2  # 2 = persistent
    chan.basic_publish(msg, exchange="sorting_room", routing_key="jason")

You may notice that we set the delivery_mode element of the message’s properties to “2”. Since the queue and exchange were marked durable, this will ensure the message is sent as persistent (i.e. will survive a reboot of RabbitMQ while it is in transit to the consumer).

The only other thing we need to do (and this needs to be done on both consumer and publisher apps), is close the channel and connection:

    chan.close()
    conn.close()

Pretty simple, no? (Download: amqp_publisher.py)

Giving it a shot…

Now we’ve written our consumer and publisher, so let’s give it a go. (This assumes you have RabbitMQ installed and running on localhost.)

Open up the first terminal, and run python ./amqp_consumer.py to get the consumer running and to create your queues, exchanges and bindings.

Then run python ./amqp_publisher.py "AMQP rocks." in a second terminal. If everything went well, you should see your message printed by the consumer on the first terminal.

Taking it all in

I realize this has been a really fast run through AMQP/RabbitMQ and using it from Python. Hopefully, it will fill in some of the holes of how all the concepts fit together and how they get used in a real Python program. If you find any errors in my write-up, I’d very much appreciate it if you’d please let me know (williamsjj@digitar.com). Similarly, I’d be happy to answer any questions that I can. Next up….clustering! But I’ve got to figure it out first. :-)

NB: Special thanks to Barry Pederson and Gordon Sims for correcting my understanding of no_ack’s operation and for catching syntactically incorrect Python code I missed.

NB: My knowledge on the subject was distilled from these sources, which are excellent further reading:

Posted on 10 Nov 2008 In: DigiTar, Solaris, Technology

Viva la storage.

Coming soon… ;-)

Posted on 30 May 2008 In: Technology

HSPA (AT&T) vs. EV-DO (Verizon)

Some folks hate to be offline, and some folks can't afford to be. I suppose I fit somewhere in between. About a month ago, I realized I was going to be doing some significant traveling…probably nowhere near a decent WiFi access point. Thus arose the question…how do you connect back to the office regardless of where your derrière happens to be? There were only a couple of minor requirements:

  • (Good) National 3G (U.S.A.) coverage
  • Minimum top end throughput around 1Mb/s
  • ExpressCard form factor (nothing sexier than a wrist-sized dongle cantilevered off your USB port)
  • Support for Mac OS X

Folks that know me probably are stunned at the last one. As of April 29th I kicked the Dell habit. My regular target of abuse is now a MacBook Pro. But that's a whole other story…

Anywho, those req's really narrowed it down to two players: AT&T and Verizon. Both offer national 3G access at speeds of 1Mb/s or greater. But they take two different approaches to it…

HSPA (High Speed Packet Access)

High Speed Packet Access is really the joining of two different 3G GSM protocols: HSDPA and HSUPA (the D and the U are “downlink” and “uplink” respectively). On AT&T's network, HSPA should give you average speeds around 1.8 Mb/s down and 800 Kb/s up. My experience has been that this is true across their network…as long as you can get a 3G signal. In fact, in some areas (LA and San Antonio) it wasn't uncommon for me to get around 2.2-2.5 Mb/s down. With tower upgrades coming early next year, the downlink speed should boost further to about 7.2 Mb/s. Overall, pretty darn good for no leash. Factor in the fact that HSPA is a 3G GSM standard widely deployed across Europe/Japan and suddenly you've got a great data solution worldwide (an issue given some upcoming trips). Oh, I forgot to mention…some places in Europe have already deployed 14.4Mb/s HSDPA (HSUPA deployment is somewhat spottier).

Compared to EV-DO, HSPA also has some design advantages. For example, both EV-DO and HSPA time slice transmission to connected clients, but HSPA can transmit to 10 clients in a single time slice, whereas EV-DO can only transmit to one client per time slice. Also, HSPA towers possess the capability to figure out which clients have the best signal quality and will transfer bandwidth capacity from clients who can't use it (bad signal) to clients that can (excellent signal). Of course, even with all of its advantages, the HSPA network is being run by AT&T…and they could screw up implementation of a PB&J sandwich…

EV-DO (EVolution - Data Optimized)

Like HSPA, EV-DO is a CDMA-based 3G protocol. Unlike HSPA however, it is not a GSM body standard and is instead the successor to CDMA2000. So, outside of the U.S.A, Korea, areas of Japan and piroshki stands in the former Soviet-bloc you're pretty much out of luck for access. However, it does provide 1Mb/s speeds regularly. Upload speeds are in the 200-500 Kb/s range.

With that brief understanding, I motored down to the Verizon and AT&T stores and picked up service with both companies (AT&T and Verizon have 30-day refund and new service cancellation policies).

Behind door number 1…

For a couple of years now, I've heard phenomenal things about Verizon's BroadbandAccess (EV-DO) service. People seemed to rave about its coverage and reliability…and they're right. Verizon's biggest plus is its consistency. It may not be as fast or have latency as low as AT&T's when AT&T is on the ball, but they'll deliver the same service levels every time you power up. I don't care if I was in Boise, LA, or San Antonio, Verizon delivered 800-1000 Kb/s throughput and 230ms latency like clockwork. Sometimes it was a bit better or a bit worse, but only by about 10% (exception was the trip up the coast to Malibu where Verizon dropped down to 2.5G service and AT&T was nowhere to be seen).

The other nice thing about Verizon is the Novatel V740 ExpressCard. It has excellent support in OS X. Pop it in and OS X's built-in WWAN manager configures the card, activates it with Verizon and away you go. No special software to install. You even get a nice little signal strength meter on the task bar (yeah…yeah…taskbar…Windows habits die hard ;-) ).

The gal you really wanna take home to Mom…

I wanted AT&T to be the best…honest. I'm a current AT&T Wireless voice customer and love the phones, the stores and the service. However, 3G service with AT&T lacks Verizon's consistency. Initially, 3G coverage could be hit and miss when I started 4 weeks ago. However, their big push for blanket 3G coverage in advance of the 3G iPhone launch has improved the 3G network dramatically in the last week. Although the coverage is spot on now, the service level is not in terms of latency. Throughput however is phenomenal. 1500-1800 Kb/s downlink speeds over 90% of the time, with solid 2200-2500 Kb/s in areas with the latest tower gear. So for the majority of applications, AT&T LaptopConnect is a superior solution to Verizon. But…not for me.

Quite a bit of the remote work I do involves either SSH or Windows Remote Desktop over VPN. There are few things more annoying than mistyping a command and waiting for the refresh to catch up so you can go correct it. As a result, better latency means a happier camper 'round these parts. That's not to say that AT&T's latency is awful. In fact, it's better than Verizon about 80% of the time when you measure it. So why am I complaining about it? Well, 150ms latency is only good if it stays at 150ms. AT&T's deployment of HSPA causes latency spikes regularly, particularly under load. As a result, I started doing a combo test on both services…load a YouTube video and concurrently check the ping over a VPN tunnel. If you try it, you'll see both Verizon and AT&T's latency spike dramatically. Hmm…you're probably thinking, “so AT&T is better than Verizon both with and without heavy load…why won't you say it's better?”. Because it doesn't feel faster. It was really hard to put a metric on this, because while the measurements were better on AT&T, the lag while typing on an SSH connection always felt a little (to a lot) slower. In fact, I kept reminding myself that this had to be in my head, because the ping measurements were better than Verizon. Then while in San Antonio I tried using Skype.

San Antonio expectedly has the best coverage of any AT&T area I've been in. Consistent throughput above 2200 Kb/s and latency below 150ms. So imagine my surprise when my SSH sessions seemed laggy, and the Skype calls would start great and then break down within about 3-4 minutes. You could hear the person on the other end of the call fine, but they started having issues hearing me and my video would lock up for them. If you turn the video off it'd buy you another 4-5 minutes before the call went haywire. So back in went the Verizon card. Bang. Perfect SSH sessions. Crystal clear call quality on Skype…and the folks on the other end said not only was the video smooth but the quality of the picture was better (Skype must adjust video quality based on connection quality). A 45-minute Skype call completed with no audio or video issues on Verizon.

I tried the Skype exercise about 3-4 times over a 48-hour period with the same results. Every time I'd give AT&T a shot, and every time I'd have to drop in the Verizon card to complete a decent conversation. This bodes not well for the rumor that the 3G iPhone will take advantage of HSUPA for video conferencing. On the positive side, those consistent 2200 Kb/s AT&T downlink speeds meant I was able to suck down the OS X 10.5.3 system update (420MB) in about 30 minutes (~1500 Kb/s sustained average).

The other major issue with AT&T is the Option GT Ultra Express card. On the positive side, it supports HSUPA so you can take advantage of fast uplink speeds. Unfortunately, it isn't supported natively by the OS X WWAN subsystem (unlike its unavailable predecessor, the Option GT 3.6 Express, which is natively supported). So you have to install Option's GlobeTrotter software, which isn't as slick as the native support and frankly feels poorly built. A lot of folks on the Apple and AT&T forums have also complained about GlobeTrotter frequently crashing for them. To some degree I suspect the inconsistent performance I get from AT&T (despite the metrics) might be due to GlobeTrotter. There's also Launch2Net by NovaMedia, which provides 3rd party drivers for the GT Ultra Express. Still not native, and amazingly Launch2Net axes the native WWAN utilities that the Verizon card leverages (Launch2Net got uninstalled faster than Vista on a 286). Supposedly, OS X 10.5.3 was going to include native support for the GT Ultra Express, but as of 10.5.3's release yesterday…no dice.

Lastly, there's price. Both AT&T and Verizon charge $60/month. However, AT&T's service is unlimited where Verizon's service is 5GB/month (and $0.49/MB over that).

End of the road…

So where does that leave us? If you need reliable latency and pretty darn good speed, Verizon is your best bet in my opinion. On the other hand, if the majority of your remote work involves the web, e-mail or anything else that's not latency sensitive, AT&T is far superior and will allow global roaming. Frankly, I'm kind of anxious to hear from someone who has the GT Ultra Express on a Windows machine to find out if the inconsistent performance I experienced was specific to GlobeTrotter for Mac. Personally, I'm going to keep both services. There were a handful of times that Verizon's latency was abysmal, but AT&T's was great. Enough that I realized in an emergency I'd really need to have the option of either service.

Here's hoping AT&T's 3G latency improves…and that Apple gets with the program and includes native support for the Option GT Ultra Express…the 3G ExpressCard of choice for Apple's carrier of choice. Sorry that this post blathered on a bit long. I hope this saves other folks from having to do this much evaluation legwork.

(Here is the XLS sheet with observed metrics for both services: AT&T vs. Verizon Benchmarks )

Posted on 28 May 2008 In: DigiTar, Solaris, Technology

Remember the Alamo…

Tomorrow (05/28/2008) I'm giving a talk on moving to open storage (i.e. ethernet, OpenSolaris and SATA…in no particular order) at the Diocesan Information Systems Conference in San Antonio. It's a closed event, but here are the slides from the talk…including the talking notes which cover a lot more than I'll probably have time for:

PDF
Slideshare

Posted on 21 Apr 2008 In: DigiTar, Solaris

Democratizing Storage


As a company that was heavily populated with Linux zealots, it’s been surreal for us to watch OpenSolaris develop for the past 3 years. While technologies like DTrace and FMA are features we now use every day, it was storage that brought Solaris into our environment and continues to drive it deeper into our services stack. Which begs the question: Why? Isn’t DTrace just as cool as ZFS? Haven’t Solaris Containers dramatically changed the way we provision and utilize systems? Sure…but storage is what drives our business and it doesn’t seem to me that we’re alone.

Everything DigiTar does manipulates or massages messaging in some way. When most people think of what drives our storage requirements they think of quarantining or archiving e-mail. But when you’re dealing with messages that can make or break folks’ businesses, logging the metadata is perhaps the most important thing we do.

Metadata is flooding in every second. It’s at the center of everything from proving a message was delivered to ensuring we meet end-to-end processing times and SLAs. If we didn’t quarantine any more messages, we’d still generate gigabytes of data every day that can’t be lost. Without reliable and scalable storage we wouldn’t exist.

Lost IOPs, Corruption and Linux…oh my!

What got us using OpenSolaris was Linux’s (circa 2005) unreliable SCSI and storage subsystems. I/Os erroring out on our SAN array would be silently ignored (not retried) by Linux, creating quiet corruption that would require fail-over events. It didn’t affect our customers, but we were going nuts managing it. When we moved to OpenSolaris, we could finally trust that no errors in the logs literally meant no errors. In a lot of ways, Solaris benefits from 15 years of making mistakes in enterprise environments. Solaris anticipates and safely handles all of the crazy edge cases we’ve encountered with faulty equipment and software that’s gone haywire.

When it comes to storing data, you’ll pry OpenSolaris (and ZFS) out of our cold dead hands. We won’t deploy databases on anything else.

Liberation Day

While we moved to Solaris to get our derrières out of a sling, being on OpenSolaris has dramatically changed the way we use and design storage.

When you’ve got rock-solid iSCSI, NFS, and I/O multipathing implementations, as well as a file system (ZFS) that loves cheap disks…and none of it requires licensing…you can suddenly do anything. Need to handle 3600 non-cached IOPs for under $60K? No problem. Have an existing array but can’t justify $10K for snapshotting? No problem. How ‘bout serving line-rate iSCSI with commodity storage and CPUs? No problemo.

That’s the really amazing thing about OpenSolaris as a storage platform. It has all of the features of an expensive array and because it allows you to build reliable storage out of commodity components, you can build the storage architecture you need instead of being held hostage by the one you can afford. But features like ZFS don’t mandate that you change your architecture. You can pick and choose the pieces that fit your needs and make any existing architecture better too.

So how has OpenSolaris changed the way DigiTar does storage? For one thing, it’s enabled us to move almost entirely off of our fibre-channel SAN. We get better performance for less money by putting our database servers directly on Thumpers (Sun Fire X4500) and letting ZFS do its magic. Also, because it’s ZFS, we’re assured that every block can be verified for correctness via checksumming. By doing application-level fail-over between Thumpers, we get shared-nothing redundancy that has increased our uptime dramatically.

One of the things that always has bugged me about traditional clustering is its reliance on shared storage. That’s great if the application didn’t trash its data while crashing to the ground. But what if it did? To replicate the level of redundancy we get with two X4500s, we’d have to install two completely separate storage arrays…not to mention also buy two very large beefy servers to run the databases. By using X4500s, we get the same reliability and redundancy for about 85% less cost. That kind of savings means we can deploy 6.8x more storage for the same price footprint and do all sorts of cool things like:

  • Create multiple data warehouses for data mining spam and mal-ware trends.
  • Develop and deploy new service features whenever we want without considering storage costs.
  • Be cost competitive with competitors 10x our size.
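
As a rough sanity check on the savings figures above, here is the arithmetic as a small Python sketch. The 85% and 6.8x numbers come from this post; the function names are made up for illustration.

```python
def capacity_multiplier(savings_fraction: float) -> float:
    """How many times more storage the same budget buys when each
    unit costs (1 - savings_fraction) of the original price."""
    if not 0 <= savings_fraction < 1:
        raise ValueError("savings_fraction must be in [0, 1)")
    return 1.0 / (1.0 - savings_fraction)


def savings_for_multiplier(multiplier: float) -> float:
    """Inverse of the above: the savings fraction implied by a multiplier."""
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    return 1.0 - 1.0 / multiplier


# Multiplier at exactly 85% savings, and the savings implied by 6.8x.
at_85_percent = round(capacity_multiplier(0.85), 2)
implied_savings = round(savings_for_multiplier(6.8) * 100, 1)
print(at_85_percent, implied_savings)
```

At exactly 85% the multiplier works out to roughly 6.7x, and 6.8x corresponds to savings of a little over 85%, so the two figures agree with "about 85%".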

Whether you’re storing pictures of your kids, or archiving business critical e-mail (or anything in between), it seems to me that being able to store massive amounts of data reliably is as fundamental to computing today as breathing is to living. OpenSolaris allows us as a company to stop worrying about what it’s going to cost to store the results of our services, and focus on what’s important: developing the services and features themselves. When you stop focusing on the cost of “air”, you’re liberated to actually make life incredible.

I could continue blathering about how free snapshotting (both in terms of cost and performance hit) can allow you to re-organize your backup priorities, or a bunch of other very cool benefits of using OpenSolaris as your storage platform. But you should give it a shot yourself, because OpenSolaris’ benefits are as varied and unique as your environment. Once you give it a try, I think you’ll be hard pressed to go back to vendor lock-in…but I’m probably a bit biased now. I think you’ll also find a community around OpenSolaris that is by far the friendliest and most mature open source group of folks you’ve ever dealt with.

It's nice when you boot an appliance and the web user interface doesn't look like it was designed by a guy who thought Jurassic Park and The Net were the pinnacle of UI design. The A10 Advanced Core OS (ACOS) has an incredibly polished look to the WebUI. Frankly, it's beautiful. All chrome and glass so to speak…

 

Overall, the Web UI is very easy to navigate and options are not buried more than 2 clicks deep. However, there are two areas where the ACOS Web UI is absolutely a pain in the rear:

  • Grid-metaphor editing.
  • Heinous layout for the relationship between physical interfaces, VLANs and virtual interfaces.

Grid Editing 

One of the most common day-to-day tasks we end up doing with a load balancer is enabling/disabling a batch of real servers for upgrade. Generally, we want to:

  1. Disable real servers A, C, and E. (Leaving B & D enabled).
  2. Upgrade A, C, and E.
  3. Swap A, C, and E back into battery and take B & D out.
  4. Upgrade B & D.
  5. Put B & D back in.
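
The five steps above amount to a two-batch rolling upgrade. A minimal Python sketch of that plan follows. The function name and the disable/upgrade/enable verbs are hypothetical and do not correspond to any ACOS API or CLI command; the sketch only builds the ordered plan.

```python
from typing import List, Tuple

Step = Tuple[str, List[str]]


def rolling_upgrade_plan(servers: List[str]) -> List[Step]:
    """Split servers into two alternating batches and return the ordered
    steps, so one batch always stays in service while the other is upgraded."""
    first = servers[0::2]   # e.g. A, C, E
    second = servers[1::2]  # e.g. B, D
    return [
        ("disable", first),
        ("upgrade", first),
        ("enable", first),    # swap the first batch back in...
        ("disable", second),  # ...and take the second batch out
        ("upgrade", second),
        ("enable", second),
    ]


plan = rolling_upgrade_plan(["A", "B", "C", "D", "E"])
for action, batch in plan:
    print(action, ", ".join(batch))
```

Every step acts on a whole batch at once, which is exactly the kind of operation a multi-row edit table makes painless.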

This is a perfect application where you want to be able to pull up the settings for multiple entries in one edit table. With the settings for real servers A,B,C,D, and E up on the same page, you can change all of the applicable settings all at the same time, verify each server is correct, then bam!…slam the new settings into place all at once. Unfortunately, this is not possible with the ACOS Web UI. The only thing you can do to multiple entries at once is delete them:

 

But simple maintenance of real server status is not the only place where the table editing metaphor is helpful. It is indispensable when trying to balance which VLANs are on which physical ports. Having to drill into an entry, make the change, and then re-examine the grid view to see how it looks is very tedious. It's much easier to pull up all the necessary interface/VLAN assignments on one view, edit them in-place and then apply them with a single-click once they look right. It seems that the goal of any good Web UI should be to minimize round trips and enable batch application as much as possible. This was an area where the Nauticus/Sun Web UI was phenomenal. Any grid view could be turned into an edit table. On the other hand, if you only selected one entry to edit, the Nauticus Web UI was smart enough to reformat the one entry into a single column of editable values (so it fit horizontally without scrolling). Quickly swapping batches of real servers in and out of service is not a task we're looking forward to with the AX2200.

Network Relationships & Just Being Friendly 

This is not an uncommon metaphor for dealing with VLANs and the IP interfaces that sit on them:

  • VLAN entries belong to physical interfaces.
  • Virtual IP interfaces are created and belong to specific VLAN entries.

To A10's credit, it's a familiar metaphor that is instantly accessible, and they even kept the ve0, ve1 virtual interface naming convention that's common to Cisco and Foundry equipment. Where they went wrong is not making it easy to tag a friendly name onto the VLAN and virtual interface entries.

What's the purpose of VLAN 1234? Well, it's attached to virtual interface ve0…that's helpful. What on God's Green Earth does ve0 serve? You can't tell easily from the VLAN page. You either have to dig out your documentation, or open the virtual interface list in a separate window:

 

The simple solution on Foundry and Nauticus/Sun gear was what you could call “friendly names”: A simple user description for each VLAN, interface and virtual interface. Can't remember what VLAN 1234 does…no problem…its friendly name says “tier1_realservers”. Oh! That's right, VLAN 1234 contains the application servers for tier 1 of our application and ve0 is the virtual interface that serves that subnet. Toggling back and forth between tabs in Firefox for VLANs and virtual interfaces while setting up the test AX2200 has been a barrel of monkeys. Frankly, “friendly” or “vanity” names should be able to be attached to any type of entry whether it's a real server, a physical interface, or an SSL certificate.

Other nits so far: 

  • Appliance will not boot if hard drives are not in exactly the same slots as shipped (not expected for a RAID-1 setup).
  • Can't find a mechanism in the Web UI to generate a CSR.
  • Can't find a way to import a PEM file (Must import certificate and key file separately.)
  • There doesn't appear to be a way to load certificates and keys by pasting them into a text box.
  • Host name is not notated at the top of the GUI and in the page title at all times to help identify which box you're in.
  • Virtual interfaces that are already in use still show up in the VLAN creation screen as assignable. Only when “Apply” is clicked does a JavaScript alert box tell you there's an issue.
  • Physical front panel status light only blinks when there's a problem. Does not turn amber or red. Very unnoticeable if you don't already know there's an issue.
  • Showing system interfaces via the CLI is “sh int” rather than the “sh sys int” used on Foundry gear.

The last one is not something normally you'd complain about. All networking vendors seem to do it differently. However, given the fact that A10 is staffed with so many ex-Foundry Networks folks, and the fact that the ACOS CLI is identical to Ironware in so many areas, it's an unwelcome surprise when “sh sys int” errors out while you're in the CLI.

Needless to say, we're still talking about the AX2200, so we're fairly happy with what we've seen so far. However, “friendly naming” and table editing really need to be fixed in an upcoming version of ACOS. The current way of doing things is probably only acceptable in very small environments where the boxes don't get touched very much. This weekend is dedicated to SLB testing…so hopefully more advanced configuration is where the Web UI really comes together.

That's all that's fit to print as they say.

It's been nearly two years since we ventured into the wonderland of replacing our Alteon gear with the Sun N1216. It was a big risk because load balancers are interlaced tightly with our multi-phased mail logistics architecture. To say the least, we have not been disappointed. The Sun N1216 series is by far the best load balancer we've ever worked with. Almost limitless power (~3Gbps) for a $25K list price. (Its big brother the N2120 was the Bugatti Veyron of the load balancer world.) But more than power, the N series provides an incredibly elegant and powerful virtualization that is irreplaceable. It enabled us to reduce what were multiple pairs of Alteons down to a single pair of N1216s running multiple virtual load balancer instances.

But what blew us away was a very simple feature we'll call “assignable virtual IP address (VIP)”. Assignable VIP functionality allows you to create two virtual load balancers (internal and external) with no routing in common, and attach your real servers to one (internal), while advertising the VIP on the other (external). Because there is no routing path between them (all traffic hitting the VIP is essentially memory copied to the internal load balancer for SLB processing), no servers sitting in your DMZ can compromise or talk directly to your real servers. They simply can't talk to something that there's no routing path to. As a result, you have a separate clean management path to your real servers that is entirely inside your trusted network, and your topology becomes incredibly simple (no ACLs!). It is by far the best application of virtualization in a network device we've ever seen.

However, the halcyon days came to an end in April of 2007 when we were informed that Sun intended to EOL the entire N series and shut down the load balancing group they had acquired with Nauticus. Given that there were no other products on the market in April of 2007 that could even remotely drop seamlessly into our new topology, we decided to wait and see what Sun might do next.

A year later not much has changed, and Sun still doesn't have a coherent strategy on load balancing to replace the N series. While our units would continue to be supported for the next 5 years, there won't be software updates, and definitely no updates to the phenomenal FPGAs that make the box scream. There are flaws in the N series that need bug updates…things that would be livable if they were going to be fixed. But in a production environment no bug fixes is simply not an acceptable strategy. So we're back in wonderland…

To cut to the chase, we talked with all the major vendors and settled down to F5, Citrix/NetScaler, and Cisco. Only Cisco, with their ACE platform, has any virtualization story whatsoever. Everyone else has no virtualization plans that they're telling their sales dudes about. All 3 can cobble together an inelegant and obfuscated configuration to allow us to maintain our topology and security stance, but none can do the “assignable VIP” magic that made Sun/Nauticus such an amazing application of virtualization and so clean to administer.

In the middle of all this, a trusted friend at Sun recommended we take a look at a new load balancing company, A10 Networks. Now A10 doesn't have virtualization in their platform today, and they definitely don't have “assignable VIP”. But they have a story and roadmap that will make any Sun/Nauticus customer get a big silly grin on their face. You'll have to talk to A10 to find out the particulars. ;-) 

What does A10 have? A phenomenal architecture on paper, and sane licensing. While FPGAs are what made the Nauticus design scream, being entirely FPGA and ASIC driven was also what drove the cost of bug fixes up. It was difficult for them to add L4/L7 features at the same rate that F5 and others were, because it usually required a modification of the FPGA layout. Enter what appears to be a brilliant design compromise and excellent capitalization on the Intel/AMD race for core count. The A10 AX2200 and above have L2/L3 ASICs, SSL ASICs, and a L4/L7 traffic director FPGA. The FPGA dynamically assigns new connections to each of the box's 4-8 Xeon cores for full L4/L7 processing. Also, each core operates independently from the others. That is to say, there are no contention or synchronization penalties for using more cores. Add more connections and the traffic FPGA evenly distributes them among the cores, and stitches the results back together for the client. Near perfect parallelization. All of the heavy L4/L7 lifting is done entirely in software on generic Xeon cores. This allows A10 to quickly add the complex features (like F5) that would have required an FPGA modification on the Nauticus gear. The excellent parallelization model ensures the performance hit encountered by using generic CPUs instead of FPGAs can be made up for linearly by adding buckets of cores. The FPGA is therefore much simpler in design than what Nauticus required. But as I said, this is all on paper.
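
To make the parallelization idea concrete, here is a toy Python sketch of shared-nothing connection distribution. It is not A10's design: the hash-based assignment, the class names, and the per-core connection table are all assumptions for illustration. The only thing taken from the description above is that a director hands each connection to one of several cores that share no state.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (src_ip, src_port, dst_ip, dst_port); a hypothetical flow identifier.
FlowKey = Tuple[str, int, str, int]


@dataclass
class Core:
    """One worker with private state; nothing is shared between cores."""
    core_id: int
    connections: Dict[FlowKey, int] = field(default_factory=dict)

    def handle_packet(self, flow: FlowKey) -> None:
        # Count packets per flow in this core's own table (no locks needed).
        self.connections[flow] = self.connections.get(flow, 0) + 1


class Director:
    """Stand-in for the traffic director: picks one core per flow."""

    def __init__(self, n_cores: int) -> None:
        self.cores: List[Core] = [Core(i) for i in range(n_cores)]

    def core_for(self, flow: FlowKey) -> Core:
        # Assumption: a deterministic hash keeps a flow pinned to one core.
        digest = zlib.crc32(repr(flow).encode("utf-8"))
        return self.cores[digest % len(self.cores)]

    def dispatch(self, flow: FlowKey) -> int:
        core = self.core_for(flow)
        core.handle_packet(flow)
        return core.core_id


director = Director(n_cores=8)
flow = ("10.0.0.5", 40000, "192.0.2.10", 25)
first = director.dispatch(flow)
second = director.dispatch(flow)
print(first == second)  # the same flow always lands on the same core
```

Because every flow is pinned to exactly one core, no core ever needs to look at another core's table, which is where the "no contention or synchronization penalties" claim would come from in a design like this.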

However, it is an equally seductive design to what Nauticus created. F5, NetScaler and Cisco all have L2/L3 ASICs in their boxes but nothing really significant in terms of hardware acceleration in the L4/L7 areas (F5 does have their L4 ASIC that does provide good acceleration of basic L4 TCP termination load balancing). So we've decided to leap again and take a chance with A10. Also, A10 includes Global Server Load Balancing for free and does not engage in F5's hideous practice of licensing HTTP compression and SSL offload capacity by the MB/s…oh and A10 has TCL-based aRules. ;-)

So we eagerly awaited the FedEx guy on Thursday to deliver our new pair of AX2200s for validation testing. With a 100lb thump they landed solidly on our testing table, and a couple flicks of a box cutter later…


A10 Networks AX2200 - Front Panel

What came out of the box looked like the unholy progeny of a Sega Master System and the portholes from a Buick Roadmaster. Needless to say she ain't a looker. Frankly, at this price level the gear should be drop-dead sexy. Yes, it may be shallow, but it's a requirement when you're trying to justify an $80 grand list price. To add insult to injury, the portholes don't serve any utilitarian purpose like cooling…they're actually a solid piece of plastic. As a counter example, the N2120 and Sun's standard 2U server design are phenomenal:


Sun N2120 & Sun 2U Server - Front Panel

They exude the simplicity and power that's concealed inside…a little glimpse to the upper echelons of what you're spending the company's hard earned bananas on. But what the AX2200 gets right is spot on build quality. It's solid with no rattles. The power supplies slide smoothly and easily. Re-seating a supply gives a firm click and solidly locks them from removal. Overall, it's downright Teutonic in construction. Sort of like an older Audi S8, built to run forever like greased lightning, but not much to look at. A10 could take Audi's cue and start paying attention to creating looks that match the engineering.


A10 Networks AX2200 - Back Panel

One very nice feature of the AX2200 for a load balancer is the hot swap fan tray. Not having to spirit the whole unit back to Boston because a fan went South is a nice change from the N1216. Also, the interior build quality is just as clean and professional as the exterior components. Hard edge connectors and system board tracings are used almost entirely, with nearly no ribbon cables cluttering up the interior. Only nit is the front management NIC is run to the motherboard via an RJ-45 cable routed to the back. Don't let the server exterior of this box fool you, this is a purpose built system with specialized ASICs and FPGAs inside.


A10 Networks AX2200 & Sun N2120 - Side View

As with any new appliance, this one has a couple of strange design foibles that go deeper than its looks. First, the box vents in from the sides and exhausts out the back. In that regard, it's neither quite at home in a rack with your servers or with your switching and routing gear. The strange intake flow means if you rack the AX2200 above your side-to-side vented switching gear, you'll likely overheat the AX2200 as it sucks in the switches' side exhaust air. Luckily, we have some Juniper kit that vents front to back, so we will likely rack the AX2200s with them. Also, the locking drive carriers are a bit frustrating. It's a nice feature that they can be locked, but inserting the key with any more force than a gnat breaking wind pops out the removal handle. It's obviously an off-the-shelf carrier that no designer actually tried before spec'ing it out of the part book.


A10 Networks AX2200 - Front Panel No Bezel

On the positive side, the serial connection is on the front and is a Cisco-style RJ-45. Yippeee! No RS-232-to-rollover adapters to hook it into our Dominion SX! It may seem like a small thing, but it really means fewer parts to lose, break and stock at the data center. I wish I could say they had the foresight to also put a sticker on the front with the box's serial number…but unfortunately not so much. You'd better note the serial number before you rack the AX2200, otherwise it's going to be crane and strain time to see the sticker on the bottom of the unit. Scratch that…the serial number is conveniently placed on the rear left of the unit as well. It's not as easy to see as the front given the… Also, they did show the company's Foundry Networks pedigree by shipping a very Foundry Networks-esque self-test sheet with the unit:


AX2200 Self-Test Slip
 

Kudos for the self-test paperwork. If you keep that on file, you can probably forgive the serial number sticker's ill fated position on the unit's underside.

Overall, first impressions…the construction and major design decisions are terrific. This box looks the part internally, and feels the part externally as a major piece of core infrastructure. Next step is to rack her and beat the heck out of her with our test rig: a screaming UltraSPARC T2. :-) Will post more soon on how the AX2200 stands the scorching…

So you've got a shiny new Indiana DP2 install and you want to load Sun Studio. Hypothetically, let's say you're looking to compile mod_python. Then, half-way through the Sun Studio 12 install you hit a wall: /usr/sbin/patchadd doesn't exist. What's a boy to do?

The problem is that the SUNWswmt package which contains patchadd was omitted accidentally from the DP2 repository. The solution is to grab SUNWswmt and a few other dependencies from SXDE 01/08. But downloading 3 gigs of ISO to retrieve 1MB of packages seems like a little bit of overkill. So here is a tarball containing just the packages you need: <Temporarily Removed>

Install instructions:

  1. Unpack <Temporarily Removed> in the directory of your choice (and change to the root of that directory).
  2. Run: pfexec pkg install SUNWadmc
  3. Run: pfexec pkgadd -d . SUNWpkgcmdsr (ignore any overwrite warnings and continue)
  4. Run: pfexec pkgadd -d . SUNWpkgcmdsu
  5. Run: pfexec pkgadd -d . SUNWinstall-patch-utils-root
  6. Run: pfexec pkgadd -d . SUNWswmt

Bada bing, bada boom you should be in business. Just install Sun Studio 12 normally at this point.

Posted on 22 Jan 2008 In: Solaris

OpenSolaris as Storage Server @ BizExpo

I'm giving a presentation today at the Boise BizExpo about using OpenSolaris as the foundation for your storage infrastructure. It's a partial re-work of my ZFS talk at FORUM 2007, but focuses more on building a NAS/SAN platform out of OpenSolaris and less on ZFS in particular. If you'd like to see it in person, it's at 11:00am MST (01/23/2008) at the Boise BizExpo (Boise Center on the Grove, Room 4). Hopefully it's helpful to someone…and folks will forgive the stuttering. ;-) The actual slides can be downloaded here: Liberating Storage with OpenSolaris

A while back (~July 2006) we moved our core MySQL clusters to ZFS in order to…among other things…simplify our backup regimen. Nearly two years later, I can honestly say I'm in love with ZFS mostly because of how much it's simplified and shored up our MySQL infrastructure. ZFS and MySQL go together like chips and salsa.

Now, backing up live MySQL servers is a bit like trying to take a picture of a 747's fan blades while it's in flight. There are basically three camps of MySQL backups:

 

  • Use mysqldump, which locks your tables longer than a Tokyo stoplight.
  • Quickly quiesce/lock your tables and pull a volume snapshot.
  • Or…use InnoBase Hot Backup. Outside of being licensed per server, ibbackup works very well and allows InnoDB activity to continue uninterrupted. The only downside is if you have to back up MyISAM tables (lock city!).

The mysqldump method worked best when our apps were in the development and testing phase…i.e. the datasets were tiny. However, once you get above a few hundred megabytes, the server had better be a non-production one. Frankly, for our infrastructure, the snapshot method was the most appealing. Among other reasons, it didn't require per-server licensing, and was the best performing option for our mixed bag of InnoDB and MyISAM tables. Initially, the plan was to use the snapshot feature in our shiny new STK arrays…however, just about the time our new arrays arrived, ZFS was released in a reasonably stable form. While snapshotting is not unique to ZFS (and it is a widely used approach to MySQL backups), there are a few benefits to relying on ZFS snapshots for MySQL backups:

  • ZFS snapshots are bleeding fast. When you're backing up a production master, the shorter lock time is critical. Our master backups typically complete within about 3-4 seconds during high load periods.
  • No network communication for the snapshot to complete. Using ZFS, snapshot commands don't have to travel over a network to a storage controller to be executed. Fewer moving parts mean greater reliability and fewer failure conditions to handle in your backup scripts.
  • 100GB database restores/rollbacks are lightning quick…typically under 30 seconds. (Unique to the snapshot approach…not ZFS).

However, settling on a backup approach was only part of the battle. Frankly, at the time, there was no commercial backup software that would do what we wanted. MySQL was the red-headed step-child of the backup software world…and to a large degree still is (Zmanda notwithstanding). So we rolled our own MySQL/ZFS backup program that we called SnapBack. It's not fancy, and you need to pair it with your own scheduler, but it is highly suited to fast and reliable production backups of MySQL in a ZFS environment. We use it for all sorts of tasks from the original purpose of backing up busy masters, to snapping datasets to build new slaves. SnapBack addresses quite a few of the issues we encountered with the existing open source MySQL backup scripts we found:

  • SnapBack understands how to quiesce MySQL for consistent backups of InnoDB tables (usually avoiding InnoDB recovery on a restore). Most of the open source scripts focus exclusively on MyISAM, and forget about disabling AUTOCOMMIT.
  • SnapBack records the current master log file name and position in the naming of the snapshot (to aid creating replication slaves). Frankly, you can take any SnapBack backup and create a slave from that point-in-time. You don't really need to know you want to do that at the time you pull the backup.
  • SnapBack is aware that InnoDB logs and table space are usually on different zpools for performance, and can snap both pools in a single backup.
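
For folks who haven't seen the lock-snapshot-unlock pattern before, here is a rough Python sketch of the general idea. This is not SnapBack's code: the function names, the snapshot naming scheme, the dataset names and the step ordering are all hypothetical. The SQL statements and the zfs snapshot command are real, but the sketch only builds the plan as strings and executes nothing.

```python
from typing import List


def snapshot_name(dataset: str, stamp: str, log_file: str, log_pos: int) -> str:
    """Embed the master log coordinates in the snapshot name so a slave
    can later be built from this point in time (naming scheme is made up)."""
    return f"{dataset}@{stamp}_{log_file}_{log_pos}"


def backup_plan(datasets: List[str], stamp: str,
                log_file: str, log_pos: int) -> List[str]:
    """Return the ordered steps for one consistent backup.

    In a real script log_file and log_pos would be read from
    SHOW MASTER STATUS while the read lock is held, and every SQL
    statement would run on one connection that stays open until
    UNLOCK TABLES (the global read lock dies with its session).
    """
    steps = [
        "sql: SET AUTOCOMMIT=0",
        "sql: FLUSH TABLES WITH READ LOCK",
        "sql: SHOW MASTER STATUS",
    ]
    # Snap every dataset (e.g. data pool and log pool) under the same lock.
    for dataset in datasets:
        name = snapshot_name(dataset, stamp, log_file, log_pos)
        steps.append(f"sh: zfs snapshot {name}")
    steps.append("sql: UNLOCK TABLES")
    return steps


plan = backup_plan(
    ["datapool/mysql", "logpool/innodb-logs"],  # hypothetical datasets
    "20080501T0200", "mysql-bin.000042", 1524,
)
for step in plan:
    print(step)
```

The thing to notice is how little happens between the lock and the unlock: just the snapshots, which is why the lock window stays short on a busy master.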

All this blathering is really just preface to the fact that we're releasing SnapBack to the world. Hopefully it will save other folks some time, and be a useful tool in their MySQL toolbox. Below you'll find the requirements for SnapBack and a download link. If there's interest in enhancing SnapBack, please feel free as we're releasing it under the BSD license. If there's enough interest, we'll try and post it in a format more conducive to community collaboration. So without further ado….

SnapBack Rev. 1524 Requirements 

Download: SnapBack Rev. 1524