Just a quick set of thoughts…do cloud-scale DBs save money because they’re based on commodity/cheap servers? Tonight I did some rough back-of-the-pad calculations, and was kind of surprised…
Let’s assume we’ve got an 11TB working set of data. How could we store this redundantly?
(cloud servers in these examples are dedicated servers at a cloud provider)
Option 1: Two beefy storage servers running MySQL in a master/slave config
Option 2: 28 commodity servers (2 replica copies for each piece of data) running HBase or Cassandra
Option 3: 42 commodity servers (3 replica copies for each piece of data) running HBase or Cassandra
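The server counts in Options 2 and 3 follow from simple arithmetic. Here is a rough sketch in Python. The 0.8TB of usable disk per commodity server is my assumption, picked because it reproduces the 14 servers per copy implied above; it is not a vendor spec.

```python
import math

def servers_needed(working_set_tb, replicas, usable_tb_per_server):
    """Servers required to hold `replicas` full copies of the working set."""
    # Round up: a partially filled server is still a whole server.
    per_copy = math.ceil(working_set_tb / usable_tb_per_server)
    return per_copy * replicas

WORKING_SET_TB = 11          # from the post
USABLE_TB_PER_SERVER = 0.8   # assumption, not from the post

print(servers_needed(WORKING_SET_TB, 2, USABLE_TB_PER_SERVER))  # Option 2
print(servers_needed(WORKING_SET_TB, 3, USABLE_TB_PER_SERVER))  # Option 3
```

Change the per-server capacity and the counts move accordingly, which is really the whole cost argument in miniature.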
Now the issue here that surprised me isn’t the raw cost differential between stuffing your own hardware in your colo and using a cloud provider. And the other thing is, I’m not picking on SoftLayer…Rackspace and Voxel both show the same cost scaling as SoftLayer (and if anything, those two come out worse).
What surprised me:
As I said, I’m not comparing the cost of running Option 1 on your own hardware vs. Option 1 at a cloud provider. Yes those costs are more at the cloud provider, but it’s to be expected (they’re bundling in bandwidth, colo, power, and most importantly people to manage the hardware and network).
What’s stunning is that beefy servers at a cloud provider are much more cost efficient. Each month, a beefy cloud server costs you roughly 1/15 of the price of the hardware, whereas a “cheap” commodity cloud server costs you roughly 1/3 of the price of the hardware. That’s a much higher markup on the cheaper volume servers.
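To put those two ratios side by side, here is a small sketch. It uses only the rough 1/15 and 1/3 figures above; the function name is mine and purely illustrative.

```python
from fractions import Fraction

def months_to_match_hardware_cost(monthly_ratio):
    """Months of rental after which you have paid the hardware's price."""
    return 1 / monthly_ratio

beefy = Fraction(1, 15)      # monthly cost / hardware cost, beefy server
commodity = Fraction(1, 3)   # monthly cost / hardware cost, commodity server

print(months_to_match_hardware_cost(beefy))      # months for the beefy box
print(months_to_match_hardware_cost(commodity))  # months for the commodity box
print(commodity / beefy)                         # relative markup
```

In other words, you pay off the commodity box’s sticker price five times faster than the beefy one’s.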
Please comment and correct me if I’m wrong in my analysis…I would actually like to be.
Jonathan Ellis
March 18th, 2010 at 9:12 am
You’re not getting apples to apples here. The amount of disk is similar but you have an order of magnitude more CPU and RAM in the scale-out hardware, which is huge for many workloads. That said, I also blogged recently about why cloud VMs aren’t usually the right fit vs bare metal.
Jason
March 18th, 2010 at 3:48 pm
The question is, how do you get 11TB of highly available database? And as noted in the article, the servers used at the cloud provider were dedicated servers, not cloud VMs per se. If you have a large dataset, something like Cassandra will cost you disproportionately more money than using MySQL vertically scaled and then partitioned (when deployed at a cloud provider). Alternately, you could vertically scale and use 3 beefy Cassandra nodes, but that obviates the point of the architecture. My real goal with this posting was to get cloud providers to realize they’ve got to cut the costs on their 1U servers if things like Cassandra are going to be affordable when deployed in the cloud with multi-TB datasets. The 1U costs are way out of line with what the same cloud provider charges for the beefier box. If you bundle colo space/power/bandwidth into the costs of the 1U and 4U servers deployed on your own, you’re still at a 1/5 ratio for 1Us vs a 1/11 ratio for 4Us (ratio being calculated as cost per month at the cloud provider / cost of the hardware itself plus colo/power/bandwidth). Ideally, the ratio would be the same whether the box is a 1U or a beefy 4U.
Alex Leverington
June 19th, 2010 at 4:43 am
I believe your analysis is accurate and it makes sense: if you have beefy data processing needs, it’s more efficient to use beefy servers. In answer to your query about why “cheap” (in bulk) servers cost more, it’s because they actually cost more! I’ll explain… For 28 commodity servers you’re looking at powering 2 racks (@110v) in a climate controlled room. At 3 amps, the electric bill to cool and power the servers would be around $1300/mo. What consumes 3 amps when a 100W processor is < 1 amp?
The power a server consumes goes to the CPU, fans, and disks. When you take all those disks and cram them into one machine, based on your example @28 servers, you’re eliminating 104 CPU cores, 26 CPU fans, and probably 104 system fans, bringing the power bill down to about $48/mo rather than $1300/mo.
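As a sanity check on the $1300/mo figure, here is a rough sketch of the arithmetic. It assumes the 3 amps is per server, treats volts × amps as watts, and uses 730 hours per month; those are all assumptions on top of the numbers quoted here, and the implied rate would have to cover cooling as well as the servers themselves.

```python
VOLTS = 110                 # from the comment
AMPS_PER_SERVER = 3         # from the comment, assumed to be per server
SERVERS = 28                # from the post
HOURS_PER_MONTH = 730       # assumption: average hours in a month
QUOTED_MONTHLY_BILL = 1300  # dollars, from the comment

def monthly_kwh(volts, amps, servers, hours):
    """Energy drawn per month, treating volts * amps as watts."""
    watts = volts * amps * servers
    return watts / 1000 * hours

kwh = monthly_kwh(VOLTS, AMPS_PER_SERVER, SERVERS, HOURS_PER_MONTH)
implied_rate = QUOTED_MONTHLY_BILL / kwh  # dollars per kWh, cooling included

print(f"{kwh:.0f} kWh/month, implied ${implied_rate:.2f} per kWh")
```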
If all you want to do is store 11TB of data, it’s not a matter of beefy server + RDBMS or many small servers + Cassandra. The benefit of those 28 servers + Cassandra is that Cassandra will not need more than a single core on each of those servers and you can probably use 2-3 of the other cores for distributing computations. If you don’t need that kind of process distribution, chances are you can do all your processing with an RDBMS and you’re better off with one beefy server. In most cases, if someone is spanning their 11TB of data across 28 servers, it’s because they can optimally split up their dataset into 14 segments.
In a nutshell, most folks who expand out from the “beefy database” infrastructure do so because their data processing needs exceed the CPU, memory, or network capacity of one server. In most cases, the limitation of a beefy server isn’t the amount of storage available but the latency of random seek access and ability of the system to quickly store+index new data.
Jason
July 23rd, 2010 at 7:24 pm
You’re right in your comments…my point is the hype is way overblown as to where that tipping point of outgrowing an RDBMS is. Also, even accounting for power the markup on the smaller servers is extremely high.