Posted by Jason | Posted in DigiTar, Solaris, Technology | Posted on 05-28-2008
Tomorrow (05/28/2008) I'm giving a talk on moving to open storage (i.e. ethernet, OpenSolaris and SATA…in no particular order) at the Diocesan Information Systems Conference in San Antonio. It's a closed event, but here are the slides from the talk…including the talking notes which cover a lot more than I'll probably have time for:
As a company that was heavily populated with Linux zealots, it's been surreal for us to watch OpenSolaris develop for the past 3 years. While technologies like DTrace and FMA are features we now use every day, it was storage that brought Solaris into our environment and continues to drive it deeper into our services stack. Which raises the question: Why? Isn't DTrace just as cool as ZFS? Haven't Solaris Containers dramatically changed the way we provision and utilize systems? Sure…but storage is what drives our business, and it doesn't seem to me that we're alone.
Everything DigiTar does manipulates or massages messaging in some way. When most people think of what drives our storage requirements they think of quarantining or archiving e-mail. But when you’re dealing with messages that can make or break folks’ businesses, logging the metadata is perhaps the most important thing we do.
Metadata is flooding in every second. It’s at the center of everything from proving a message was delivered to ensuring we meet end-to-end processing times and SLAs. If we didn’t quarantine any more messages, we’d still generate gigabytes of data every day that can’t be lost. Without reliable and scalable storage we wouldn’t exist.
Lost IOPs, Corruption and Linux…oh my!
What got us using OpenSolaris was Linux’s (circa 2005) unreliable SCSI and storage subsystems. I/Os erroring out on our SAN array would be silently ignored (not retried) by Linux, creating quiet corruption that would require fail-over events. It didn’t affect our customers, but we were going nuts managing it. When we moved to OpenSolaris, we could finally trust that no errors in the logs literally meant no errors. In a lot of ways, Solaris benefits from 15 years of making mistakes in enterprise environments. Solaris anticipates and safely handles all of the crazy edge cases we’ve encountered with faulty equipment and software that’s gone haywire.
When it comes to storing data, you’ll pry OpenSolaris (and ZFS) out of our cold dead hands. We won’t deploy databases on anything else.
While we moved to Solaris to get our derrières out of a sling, being on OpenSolaris has dramatically changed the way we use and design storage.
When you’ve got rock-solid iSCSI, NFS, and I/O multipathing implementations, as well as a file system (ZFS) that loves cheap disks…and none of it requires licensing…you can suddenly do anything. Need to handle 3600 non-cached IOPs for under $60K? No problem. Have an existing array but can’t justify $10K for snapshotting? No problem. How ‘bout serving line-rate iSCSI with commodity storage and CPUs? No problemo.
That’s the really amazing thing about OpenSolaris as a storage platform. It has all of the features of an expensive array and because it allows you to build reliable storage out of commodity components, you can build the storage architecture you need instead of being held hostage by the one you can afford. But features like ZFS don’t mandate that you change your architecture. You can pick and choose the pieces that fit your needs and make any existing architecture better too.
So how has OpenSolaris changed the way DigiTar does storage? For one thing, it's enabled us to move almost entirely off of our fibre-channel SAN. We get better performance for less money by putting our database servers directly on Thumpers (Sun Fire X4500) and letting ZFS do its magic. Also, because it's ZFS, we're assured that every block can be verified for correctness via checksumming. By doing application-level fail-over between Thumpers, we get shared-nothing redundancy that has increased our uptime dramatically.
One of the things that has always bugged me about traditional clustering is its reliance on shared storage. That's great if the application didn't trash its data while crashing to the ground. But what if it did? To replicate the level of redundancy we get with two X4500s, we'd have to install two completely separate storage arrays…not to mention also buy two very beefy servers to run the databases. By using X4500s, we get the same reliability and redundancy for about 85% less cost. That kind of savings means we can deploy 6.8x more storage for the same price footprint and do all sorts of cool things like:
- Create multiple data warehouses for data mining spam and malware trends.
- Develop and deploy new service features whenever we want without considering storage costs.
- Be cost competitive with competitors 10x our size.
Whether you're storing pictures of your kids, or archiving business-critical e-mail (or anything in between), it seems to me that being able to store massive amounts of data reliably is as fundamental to computing today as breathing is to living. OpenSolaris allows us as a company to stop worrying about what it's going to cost to store the results of our services, and focus on what's important: developing the services and features themselves. When you stop focusing on the cost of "air", you're liberated to actually make life incredible.
I could continue blathering about how free snapshotting (both in terms of cost and performance hit) can allow you to re-organize your backup priorities, or a bunch of other very cool benefits of using OpenSolaris as your storage platform. But you should give it a shot yourself, because OpenSolaris' benefits are as varied and unique as your environment. Once you give it a try, I think you'll be hard pressed to go back to vendor lock-in…but I'm probably a bit biased now. I think you'll also find a community around OpenSolaris that is by far the friendliest and most mature open source group of folks you've ever dealt with.
Posted by Jason | Posted in Solaris | Posted on 03-07-2008
So you've got a shiny new Indiana DP2 install and you want to load Sun Studio. Hypothetically, let's say you're looking to compile mod_python. Then, half-way through the Sun Studio 12 install you hit a wall: /usr/sbin/patchadd doesn't exist. What's a boy to do?
The problem is that the SUNWswmt package which contains patchadd was omitted accidentally from the DP2 repository. The solution is to grab SUNWswmt and a few other dependencies from SXDE 01/08. But downloading 3 gigs of ISO to retrieve 1MB of packages seems like a little bit of overkill. So here is a tarball containing just the packages you need: <Temporarily Removed>
- Unpack <Temporarily Removed> in the directory of your choice (and change to the root of that directory).
- Run: pfexec pkgadd -d . SUNWadmc
- Run: pfexec pkgadd -d . SUNWpkgcmdsr (ignore any overwrite warnings and continue)
- Run: pfexec pkgadd -d . SUNWpkgcmdsu
- Run: pfexec pkgadd -d . SUNWinstall-patch-utils-root
- Run: pfexec pkgadd -d . SUNWswmt
Bada bing, bada boom you should be in business. Just install Sun Studio 12 normally at this point.
Posted by Jason | Posted in Solaris | Posted on 01-22-2008
I'm giving a presentation today at the Boise BizExpo about using OpenSolaris as the foundation for your storage infrastructure. It's a partial re-work of my ZFS talk at FORUM 2007, but focuses more on building a NAS/SAN platform out of OpenSolaris and less on ZFS in particular. If you'd like to see it in person, it's at 11:00am MST (01/23/2008) at the Boise BizExpo (Boise Center on the Grove, Room 4). Hopefully it's helpful to someone…and folks will forgive the stuttering. The actual slides can be downloaded here: Liberating Storage with OpenSolaris
Posted by Jason | Posted in Solaris | Posted on 01-17-2008
A while back (~July 2006) we moved our core MySQL clusters to ZFS in order to…among other things…simplify our backup regimen. Nearly two years later, I can honestly say I'm in love with ZFS, mostly because of how much it's simplified and shored up our MySQL infrastructure. ZFS and MySQL go together like chips and salsa.
Now, backing up live MySQL servers is a bit like trying to take a picture of a 747's fan blades while it's in flight. There are basically three camps of MySQL backups:
- Use mysqldump, which locks your tables longer than a Tokyo stoplight.
- Quickly quiesce/lock your tables and pull a volume snapshot.
- Or…use InnoBase Hot Backup. Outside of being licensed per server, ibbackup works very well and allows InnoDB activity to continue uninterrupted. The only downside is if you have to back up MyISAM tables (lock city!).
The mysqldump method worked best when our apps were in the development and testing phase…i.e. the datasets were tiny. However, once you get above a few hundred megabytes, the server had better be a non-production one. Frankly, for our infrastructure, the snapshot method was the most appealing. Among other reasons, it didn't require per-server licensing, and it was the best-performing option for our mixed bag of InnoDB and MyISAM tables. Initially, the plan was to use the snapshot feature in our shiny new STK arrays…however, just about the time our new arrays arrived, ZFS was released in a reasonably stable form. While snapshotting is not unique to ZFS (and it is a widely used approach to MySQL backups), there are a few benefits to relying on ZFS snapshots for MySQL backups:
- ZFS snapshots are bleeding fast. When you're backing up a production master, the shorter lock time is critical. Our master backups typically complete within about 3-4 seconds during high load periods.
- No network communication for the snapshot to complete. Using ZFS, snapshot commands don't have to travel over a network to a storage controller to be executed. Fewer moving parts mean greater reliability and fewer failure conditions to handle in your backup scripts.
- 100GB database restores/rollbacks are lightning quick…typically under 30 seconds. (Unique to the snapshot approach…not ZFS).
However, settling on a backup approach was only part of the battle. Frankly, at the time, there was no commercial backup software that would do what we wanted. MySQL was the red-headed step-child of the backup software world…and to a large degree still is (Zmanda notwithstanding). So we rolled our own MySQL/ZFS backup program that we called SnapBack. It's not fancy, and you need to pair it with your own scheduler, but it is highly suited to fast and reliable production backups of MySQL in a ZFS environment. We use it for all sorts of tasks, from the original purpose of backing up busy masters to snapping datasets to build new slaves. SnapBack addresses quite a few of the issues we encountered with the existing open source MySQL backup scripts we found:
- SnapBack understands how to quiesce MySQL for consistent backups of InnoDB tables (usually avoiding InnoDB recovery on a restore). Most of the open source scripts focus exclusively on MyISAM, and forget about disabling AUTOCOMMIT.
- SnapBack records the current master log file name and position in the naming of the snapshot (to aid creating replication slaves). Frankly, you can take any SnapBack backup and create a slave from that point-in-time. You don't really need to know you want to do that at the time you pull the backup.
- SnapBack is aware that InnoDB logs and table space are usually on different zpools for performance, and can snap both pools in a single backup.
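The flow SnapBack implements can be sketched in shell (a hypothetical sketch only, not the real SnapBack source, which is Python; the pool names, snapshot-name scheme, and root MySQL login are assumptions, and it assumes the mysql client processes its built-in system command from a heredoc):

```shell
#!/bin/sh
# Encode the master binlog file and position in the snapshot name, so
# any backup can later be used to seed a replication slave.
snap_name() {
    # $1 = master binlog file, $2 = binlog position
    printf 'snapback_%s_%s' "$1" "$2"
}

# FLUSH TABLES WITH READ LOCK only holds while the session stays open,
# so the zfs snapshots run from inside the same mysql session via the
# client's "system" command, snapping both pools (data + InnoDB logs).
snapback() {
    name=$(snap_name "$1" "$2")
    mysql -u root <<EOF
FLUSH TABLES WITH READ LOCK;
system zfs snapshot tank/innodb-data@${name}
system zfs snapshot tank/innodb-logs@${name}
UNLOCK TABLES;
EOF
}

# Usage on a live master: snapback mysql-bin.000042 10764
```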
All this blathering is really just preface to the fact that we're releasing SnapBack to the world. Hopefully it will save other folks some time, and be a useful tool in their MySQL toolbox. Below you'll find the requirements for SnapBack and a download link. If there's interest in enhancing SnapBack, please feel free as we're releasing it under the BSD license. If there's enough interest, we'll try and post it in a format more conducive to community collaboration. So without further ado….
SnapBack Rev. 1524 Requirements
- Solaris 10 6/06 or later…or…any recent build of OpenSolaris that has ZFS.
- Python 2.4
- MySQL for Python (aka. MySQLdb)
- MySQL Client libraries
Download: SnapBack Rev. 1524
Nearly a year has passed since our descent into the 9th ring of latency Hades, and I wanted to make an update post on ZFS' interaction with SAN arrays containing battery-backed cache. (For the full details, please check out this older post.)
For one thing, the instructions I previously gave for ignoring cache flushes on the STK FLX200/300 series (and similar LSI OEM'd products) don't seem to work very well on the new-generation Sun StorageTek 6x00 arrays. Not to mention it's kind of nasty to have to modify your array's NVRAM settings to get good write latency.
But thanks to the brilliant engineers on the ZFS team, you no longer have to modify your array (since circa May '07 in the OpenSolaris tree). Simply add this line to your Solaris /etc/system file and ZFS will no longer issue SYNCHRONIZE CACHE commands to your array:
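The line in question is the zfs_nocacheflush tunable (standard /etc/system set syntax):

```
set zfs:zfs_nocacheflush = 1
```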
I can confirm that this works REALLY well on both the older (FLX200/300) and newer (6140/6540) Sun/Engenio arrays! It seems to me that since the new way is a ZFS configuration directive, it should be portable/functional against any array in existence. Please note that setting this directive will disable cache flushing for ALL zpools on the system, which would be dangerous for any zpools using local disks. As always, caveat emptor. Your mileage may vary so please do let others know through the comments what works/doesn't work for you.
We've tested the zfs:zfs_nocacheflush directive successfully in Build 72 of OpenSolaris. It should also work in Solaris 10 Update 4, though we haven't tested that ourselves.
Today (10/10), by the graciousness of Sun, I've been given the opportunity to speak on ZFS and all that it's meant to us as a company. Overall, ZFS is truly an amazing and unique technology that can transform any company in ways tailored to it. If you'd like to see the talk live, I'll be speaking again tomorrow (10/11) at Sun's FORUM 2007 located at the Adam's Mark Hotel in Denver, CO. The talk will be at 1:15PM MST. (I believe registration is required.)
If you happen to want to look at my blathering for free, I've uploaded my presentation (with talking notes) here.
NRPE (Nagios Remote Plug-in Executor) is a critical part of a lot of IT environments. In ours, it provides to Nagios all sorts of interesting health info local to the host that NRPE is running on. Whether it's RAM, open connections, hard drive space or something else, NRPE helps alert you to strange happenings that simply interrogating a TCP port remotely won't reveal. Hence, it's a deal breaker for moving to OpenSolaris if you can't have it. Luckily, the benevolent gents at Blastwave provide a pre-packaged NRPE that's ready to go (Run: pkg-get -i nrpe). Unfortunately, the Blastwave NRPE package leaves the last step of placing it under init.d or SMF control as an exercise for the admin. Well, if you're like me and would like SMF to be able to manage NRPE, then you're in luck. Below are a manifest and installation instructions that will start, stop and refresh an NRPE daemon (as installed from the Blastwave package).
It's important to note that this NRPE manifest expects your NRPE configuration to be in /opt/csw/etc/nrpe.cfg and that it contains the line: pid_file=/var/run/nrpe.pid. If your config file is in a different location, just edit method/nagios-nrpe in the manifest package to match where your nrpe.cfg lives. If for some reason you don't want to specify pid_file in your nrpe.cfg, then the refresh method will not operate properly. The start and stop methods will operate whether you specify a pid_file value or not. Technically, just restarting the NRPE daemon will accomplish the same thing as the refresh method, which just sends a SIGHUP to the NRPE daemon. Again, caveat emptor. This manifest and the installation instructions below are provided with absolutely no warranty whatsoever, as specified in the BSD license in the manifest header.
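For reference, a minimal SMF manifest along these lines looks roughly like the following (a sketch only; the service name, method path and timeouts are assumptions, so use the manifest from the package for the real thing):

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- Hypothetical minimal NRPE manifest sketch; not the DigiTar original. -->
<service_bundle type="manifest" name="nagios-nrpe">
  <service name="network/nrpe" type="service" version="1">
    <create_default_instance enabled="false"/>
    <single_instance/>
    <exec_method type="method" name="start"
        exec="/lib/svc/method/nagios-nrpe start" timeout_seconds="60"/>
    <exec_method type="method" name="stop"
        exec="/lib/svc/method/nagios-nrpe stop" timeout_seconds="60"/>
    <exec_method type="method" name="refresh"
        exec="/lib/svc/method/nagios-nrpe refresh" timeout_seconds="60"/>
  </service>
</service_bundle>
```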
To install the manifest please follow these steps:
- Download the NRPE manifest package here.
- Unpack the package on your system.
- Change to the root of the unpacked package.
- Run: cp ./manifest/nagios-nrpe.xml /var/svc/manifest/network/
- Run: cp ./method/nagios-nrpe /lib/svc/method/
- Run: svccfg import /var/svc/manifest/network/nagios-nrpe.xml
- You're done!
If everything went smoothly, running svcadm enable nrpe should start the daemon without incident. Similarly, svcadm disable nrpe should kill it. As mentioned before, there's also svcadm refresh nrpe, which will send a SIGHUP to NRPE. That will cause NRPE to re-read its nrpe.cfg file. An interesting note on refresh is that NRPE will reliably crash on a second SIGHUP. If you were using standard init.d,
this could really hose you, as NRPE would randomly terminate and you wouldn't know. With SMF however, it doesn't matter! If NRPE dies when you send it a SIGHUP, SMF will loyally restart the daemon for you. Another reason to use SMF with all of your critical services, where an automatic restart won't risk data corruption! Hope y'all find this of use!
One of the quiet parts that powers our logistics infrastructure is PowerDNS. It's a very powerful way to serve DNS records that you need the ability to update programmatically. Unfortunately, OpenSolaris (or Solaris 10 for that matter) isn't exactly considered kosher over in PowerDNS-land. Like a lot of OSS projects, PDNS hasn't kept up with the times and treats OpenSolaris like a red-headed step-child. If you like red-headed step-children like we do, then you're in for about 8 hours of greasing, coaxing and pleading to get it compiled right. Well, either that…or you can read on and get it up in about 30 minutes. As a side bonus, you'll also have PDNS managed by the coolest way ever invented to replace init.d: SMF.
Installing PDNS on OpenSolaris/Solaris 10 x64…
First thing you'll need to do is get Blastwave installed on your Solaris box. You could try and build the unholy abomination that is Boost on your own…but then you're a braver soul than I. As it's getting late, please excuse that the steps are brief and bulleted (feel free to harass me if you have questions):
- Make sure your path is set correctly. This path will do nicely: PATH=/usr/sbin:/usr/bin:/opt/csw/bin:/usr/sfw/bin:/usr/ccs/bin
- You'll need all the dev tools that come with a standard Solaris 10/OpenSolaris install…make, gcc, g++, ld etc. (You don't need Studio 11 installed. In fact, PDNS will really NOT like Studio 11 so please use gcc 3.3 or later).
- Run: pkg-get -i mysql5client
- Run: pkg-get -i mysql5devel
- Run: pkg-get -i boost_rt
- Run: pkg-get -i boost_devel
- Run: ln -s /opt/csw/mysql5/lib/mysql /usr/lib/mysql (This will make pathological configure scripts work a lot more smoothly.)
- Run: crle -l /lib:/usr/local/lib:/opt/csw/lib:/usr/lib:/opt/csw/mysql5/lib (This will help your compiled PDNS binaries find all the libraries they need at runtime. Run crle by itself first to see if there are any additional paths on your system that need to be present on this list. Caveat emptor..you run this command at your own risk as it can really bork your system if you don't know what you're doing.)
- Unpack the latest PDNS sources which you can get here (these instructions are known to work against 2.9.20).
- From within the PDNS source tree root run: ggrep -R "u_int8_t" *
- Manually change all the u_int8_t references that grep finds to uint8_t. If you don't do this, good ol' crotchety PDNS will not compile. (I know I should provide a patch. I'll try and do that in the next couple of days if possible).
- From the PDNS source tree root run: ./configure --localstatedir=/var/run --with-pic --enable-shared --with-mysql-includes=/opt/csw/mysql5/include/ CXXFLAGS="-I/opt/csw/include -DSOLARIS" LDFLAGS="-L/opt/csw/lib -lsocket -lnsl"
- Run: make install (This will use the prefix /usr/local/ to install everything. The SMF manifest later will expect your pdns.conf to be in /usr/local/etc/ as a result. For sanity purposes on our systems, we also symlink pdns.conf into /etc.)
- Bingo! Presto! You have a working PDNS server…hopefully.
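The u_int8_t hand-editing in the steps above can be scripted; here is a sketch (assuming GNU grep is available as ggrep and a stock perl, with .orig backups kept for safety):

```shell
# Rewrite every whole-word u_int8_t to uint8_t in place, keeping a
# .orig backup of each touched file. Run from the PDNS source tree root.
fix_uint8() {
    for f in "$@"; do
        perl -pi.orig -e 's/\bu_int8_t\b/uint8_t/g' "$f"
    done
}

# Usage: fix_uint8 $(ggrep -Rl 'u_int8_t' *)
```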
Life support for PDNS…that is running PDNS under SMF…
Service Management Facility (SMF) is a truly wonderful thing. It completely replaces init.d and inet.d, gives you a standard way of managing both types of services, understands dependencies, restarts dead services…and washes your car while you sleep. The only hiccough is you've got to write a manifest to run PDNS under SMF…or use the one below. Again…caveat emptor…this SMF manifest comes with absolutely no warranty at all. Read the BSD license
header at the top of the manifest for a complete description of how much it's your own darn fault if this manifest totals your system. The DigiTar SMF manifest for PDNS has a couple of neat integration features:
- If PDNS is already started when you run svcadm enable powerdns, it will error out so that SMF places the PDNS service into a maintenance state, and an informative message will be placed in the PDNS SMF service log.
- If you accidentally delete the pdns_server binary, SMF will not let you start the service and will place it into a maintenance state so you know something is wrong.
- Running svcadm refresh powerdns will instruct PDNS to scan for new domains that have been added (pdns_control rediscover), as well as rescan for changes to records in existing domains (pdns_control reload).
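The refresh behavior boils down to the two pdns_control calls just mentioned; a hypothetical sketch of the method's core (the binary path is assumed from the /usr/local prefix used above, and is overridable for testing):

```shell
# Re-scan for newly added domains, then reload records in existing
# ones. PDNS_CONTROL can be overridden (e.g. for testing); it defaults
# to the /usr/local install prefix.
refresh_pdns() {
    ctl="${PDNS_CONTROL:-/usr/local/bin/pdns_control}"
    "$ctl" rediscover && "$ctl" reload
}
```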
OK, enough jabbering. Here's how you install the SMF manifest:
- Download the DigiTar PowerDNS SMF package here.
- Unpack the package on your system.
- Change to the root of the unpacked package.
- Run: cp ./manifest/dns-powerdns.xml /var/svc/manifest/site/
- Run: cp ./method/dns-powerdns /lib/svc/method/
- Run: svccfg import /var/svc/manifest/site/dns-powerdns.xml
- You're done!
You should now be able to start your PDNS server with a simple svcadm enable powerdns. Stopping PDNS is similarly simple: svcadm disable powerdns. If you just want to see the state of the PDNS service, try svcs powerdns. That's it! You can sleep well at night knowing that if PDNS goes the way of all flesh, SMF will auto-restart it for you. Try a pkill pdns and watch the process IDs change. If your PDNS service won't start, take a look at svcs -x to see why. Anywho…off to the sand man for me. If you have any questions, please feel free to contact me: williamsjj_@_digitar.com