Amazon Cloud Runs Low on Disk Space

Something unthinkable (perhaps only to me) has happened: errors uploading files to S3 led me to the AWS status page, which reports that the US East Coast facilities are running low on disk space.

Am I the only one who assumed someone or something was checking constantly, or at least hourly, that a sufficient percentage of disk space remained available?
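The kind of check I had in mind is not complicated. Here is a minimal single-machine sketch using Python's standard `shutil.disk_usage`; the 15% threshold and the function names are my own hypothetical choices, and a storage service on Amazon's scale would obviously aggregate figures across many machines rather than look at one volume.

```python
import shutil

# Hypothetical alert threshold: act when less than 15% of the volume is free.
MIN_FREE_FRACTION = 0.15

def free_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / usage.total

def needs_capacity(path: str = "/", threshold: float = MIN_FREE_FRACTION) -> bool:
    """True when free space has dropped below the threshold."""
    return free_fraction(path) < threshold

if __name__ == "__main__":
    frac = free_fraction("/")
    status = "ADD CAPACITY" if needs_capacity("/") else "ok"
    print(f"{frac:.1%} free: {status}")
```

Scheduled hourly (from cron, say), something like this would raise the alarm long before uploads start failing.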

Apparently they consumed a whole lot more disk space than expected this past week, and they are now feverishly adding capacity. Surely, if capacity can be added within hours, they could have been adding it gradually throughout the week?
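The "add it gradually" argument boils down to simple arithmetic: project when the pool fills at the current growth rate and compare that with how long it takes to bring new capacity online. A rough sketch assuming linear growth; the function names and all figures are hypothetical, not Amazon's.

```python
def days_until_full(capacity_tb: float, used_tb: float, growth_tb_per_day: float) -> float:
    """Linear projection of days remaining before the storage pool is full."""
    if growth_tb_per_day <= 0:
        return float("inf")  # not growing: never fills under this simple model
    return (capacity_tb - used_tb) / growth_tb_per_day

def should_add_capacity_now(capacity_tb: float, used_tb: float,
                            growth_tb_per_day: float, lead_time_days: float) -> bool:
    """True if the pool would fill before newly ordered capacity could be online."""
    return days_until_full(capacity_tb, used_tb, growth_tb_per_day) <= lead_time_days
```

With made-up figures of 1,000 TB capacity, 900 TB used and 20 TB/day growth, `days_until_full` gives 5.0 days; if new capacity takes a week to deploy, it needed ordering already. A week of unexpectedly heavy consumption simply raises the growth rate, which a daily run of this check would have picked up.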

This is actually pretty serious. People’s database backup jobs might be failing because of this, although admittedly those jobs should be more resilient than that. But then so should Amazon.
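By "more resilient" I mean, at a minimum, retrying transient failures with backoff rather than giving up on the first error. A minimal sketch: `TransientError` is a hypothetical exception your upload code would raise for retryable failures (for example HTTP 5xx responses), and `operation` stands in for whatever performs the actual upload.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. HTTP 5xx)."""

def with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up and let the backup job report the failure
            # Waits of roughly 1s, 2s, 4s, ... plus random jitter so that many
            # clients don't all retry in lockstep against a struggling service.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Retries only paper over short blips, of course; an outage lasting hours still needs the job to fail loudly so someone notices.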

Amazon Cloud Computing Alternatives

Plenty of web sites and services have been affected by today’s big Amazon S3 outage, with SmugMug, Twitter, WordPress.com and JungleDisk among the casualties to varying degrees. Developers have been venting their frustration at seeing their applications fail because of a service they relied on.

So what are the alternatives?

Any CTO will tell you that moving parts are your IT department’s weakest link in reliability terms. If you build a company on a single server, will you have more or fewer moving parts than if you build it on a large computing farm such as Amazon provides? Such a simple comparison is a waste of time, of course, as that one server could die at any moment, leaving you wishing you’d relied on the cloud. Yet the cloud may also experience downtime.

Amazon does, however, have the advantage that it hides its redundancy from you. If you were to try to match it, you’d likely end up with RAID and hot standby servers. Trust me, you don’t want to rely on that setup without spending time and money testing your backup solutions.
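To put rough numbers on the moving-parts argument: components that must all work (in series) multiply their availabilities, while redundant copies (in parallel) only fail if every copy is down at once. A back-of-envelope sketch; the 99% figures are purely illustrative, and the parallel formula assumes failures are independent, which is exactly the assumption a shared fault like today’s breaks.

```python
from math import prod

def series(*availabilities: float) -> float:
    """Every part must work (e.g. disk, PSU, NIC in one server): availabilities multiply."""
    return prod(availabilities)

def parallel(a: float, n: int) -> float:
    """n independent redundant copies: the system is down only if all n are down."""
    return 1 - (1 - a) ** n

# One server built from three parts at 99% each: about 97.0% available.
print(f"{series(0.99, 0.99, 0.99):.4f}")
# Two such 99% units as hot standbys for each other: about 99.99%.
print(f"{parallel(0.99, 2):.4f}")
```

Redundancy buys a lot on paper, which is why it only pays off if the failover has actually been tested.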

So cloud computing might have occasional outages, but at least there are engineers on hand 24×7 to fix them on your behalf. All part of the service, Sir. With your own equipment, you are on call 24×7, shared with your colleagues. Assuming you have some.

Ultimately, money can only buy you the best commercially available solutions. Amazon are not the only cloud computing service provider, but as they happen to have financial muscle and experience on their side, I would go so far as to say they will likely be the best overall. Your mileage may vary, naturally.

Remember, Amazon use commodity hardware on the assumption that parts of their network will fail at random. They have built software that runs on top of this in a distributed manner to detect failures and try (as well as their programmers can code it) to mitigate issues as they arise. I am sure that once today’s failure has been analysed, the software will be updated to minimise disruption from it and from similar failures.
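For readers wondering what "detecting failures" looks like at its simplest, here is a generic heartbeat-timeout detector: each node reports in periodically and any node that goes quiet for too long is suspected dead. This is a textbook illustration under my own assumptions (class name, timeout, node names are all hypothetical), not a description of how Amazon’s systems actually work.

```python
import time

class HeartbeatMonitor:
    """Suspect a node has failed if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable clock so the logic can be tested
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str) -> None:
        """Record that `node` has just reported in."""
        self.last_seen[node] = self.clock()

    def suspected(self) -> list:
        """Nodes whose last heartbeat is older than the timeout, sorted by name."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

The hard part, as the next paragraph says, is everything around it: choosing timeouts, avoiding false alarms, and deciding what to do once a node is suspected.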

But seriously, even Amazon can only go so far. The human brain can only think up so many scenarios and code so many mitigation rules for them. Oh, and testing all these situations can be a real challenge too.

It is still a damned sight better than relying on your own company to build a similar system in-house.

Amazon Amateurs?

According to iehiapk: “I was under the apparently false impression that S3 was a high-availability service.  We may have to evaluate other services now.  This makes us look like a bunch of amateurs.”

I would like to ask precisely what he means by a “high-availability service”. Five nines? Sorry, the Amazon S3 SLA promises only three nines (99.9%). If they are in breach of that (which I suspect they may now be, although I’ve yet to do the calculation or read the fine print), your recourse is a partial refund.
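The calculation itself is easy enough to rough out. A crude sketch, assuming a month of total outage with no partial service; as I understand it the actual SLA measures uptime from error rates, so treat this as a back-of-envelope check rather than the contractual formula.

```python
MINUTES_PER_DAY = 24 * 60

def allowed_downtime_minutes(uptime_target: float, days_in_month: int) -> float:
    """Minutes of total outage a monthly uptime target tolerates."""
    return (1 - uptime_target) * days_in_month * MINUTES_PER_DAY

def monthly_uptime(downtime_minutes: float, days_in_month: int) -> float:
    """Crude uptime fraction, treating the outage as 100% errors for its duration."""
    return 1 - downtime_minutes / (days_in_month * MINUTES_PER_DAY)

# Three nines over a 30-day month allows about 43.2 minutes of total outage.
print(f"{allowed_downtime_minutes(0.999, 30):.1f} minutes")
```

On this crude model, any complete outage longer than about three quarters of an hour in a 30-day month breaks three nines.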

Either way, when you sign the service agreement you accept that there will be some risk to service and that, where the conditions are met, the supplier will compensate you. It is all documented and accepted when you sign up.

Amazon S3 Outage (Now Back)

Well, I returned to check on the giant photo upload that JungleDisk was sending to my Amazon S3 account, and it had stopped.

The log showed a whole pile of HTTP error codes, which any self-respecting technophile will realise means a serious fault is occurring. The S3 forums document the first errors from 0858PDT, although JungleDisk reported errors for me from 1642BST.
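Comparing those two timestamps means converting between zones. A small sketch using fixed offsets (PDT is UTC-7, BST is UTC+1); the helper name is my own, and since only the offsets matter here, the placeholder date that `strptime` fills in does not affect the result.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time, UTC-7
BST = timezone(timedelta(hours=1), "BST")   # British Summer Time, UTC+1

def to_bst(hhmm: str, tz: timezone) -> str:
    """Convert a 24-hour 'HHMM' wall-clock time in `tz` to 'HHMM' in BST."""
    t = datetime.strptime(hhmm, "%H%M").replace(tzinfo=tz)  # placeholder date
    return t.astimezone(BST).strftime("%H%M")

print(to_bst("0858", PDT))  # 1658
```

So the first forum reports at 0858PDT correspond to 1658BST, about a quarter of an hour after JungleDisk first logged errors for me.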

There are a few big customers affected, such as the photo-sharing web site SmugMug, which is displaying an outage page right now and also blogging about the incident. The Amazon status page does at least confirm what we already know: they’re down and painfully aware of it. SmugMug’s blog says it’s “only” their third outage in over two years, which is to be expected. Others affected include several Facebook apps that are loading slowly or displaying errors.

Still, this will hit the mainstream press and give cloud computing negative publicity. Hopefully Amazon will learn from this early experience and continue on the road to virtually bullet-proof hosting. Not many organisations are large enough to put in the resources necessary to build such a robust service and put their brand name behind it.

Incidentally, if you have an S3 account, please check their SLA for the procedure to obtain a partial refund…

Updated 2225BST: WordPress.com has broken images due to this, as does Twitter. Amazon report progress toward full restoration of service with internal network communications slowly coming to life.

Updated 2249BST: Amazon are bringing up their S3 web interfaces. Sites and services (like my Jungle Disk backup) should be back up soon. I look forward to their statement on what happened and how they will prevent recurrence.

Updated 2226BST: Amazon S3 EU is back… S3 USA is taking a little longer because of its larger size.

Updated 0017BST: It’s now Monday and Amazon S3 USA is online once more. Big, big outage.

Jungle Disk Monitor

Decided to check out Amazon S3 and its practical uses first. This is Jungle Disk.

Jungle Disk is like traditional software in that you download it and run it on your desktop computer. It gives you a point-and-click interface for selecting which files and folders to back up and on what schedule (if any). The above screenshot was taken during an (easy-peasy) initial backup of my Documents folder in version 2.02.

There are a number of tweaks, too, such as bandwidth limiting.
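I don’t know how Jungle Disk implements its bandwidth limit, but a common way to build one is a token bucket: tokens (bytes) refill at the permitted rate, and each chunk waits until enough have accumulated. A generic sketch under that assumption; the class and parameter names are hypothetical.

```python
import time

class TokenBucket:
    """Allow on average `rate` bytes/second, with bursts of up to `capacity` bytes."""

    def __init__(self, rate: float, capacity: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self.last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last refill, up to capacity."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, nbytes: int) -> None:
        """Block until `nbytes` may be sent; nbytes must not exceed capacity."""
        if nbytes > self.capacity:
            raise ValueError("chunk larger than bucket capacity")
        self._refill()
        while self.tokens < nbytes:
            # Sleep just long enough for the shortfall to refill at `rate`.
            self.sleep((nbytes - self.tokens) / self.rate)
            self._refill()
        self.tokens -= nbytes
```

An uploader would call `bucket.consume(len(chunk))` before sending each chunk, which keeps the average rate at or below the limit while still permitting short bursts.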

Interestingly, you can also “mount” the S3 service as a disk drive. In the above picture I can double-click the JungleDisk icon on the desktop and open my S3 storage account within Mac OS X Finder.

Jungle Disk is available for Linux, too, which means it can provide a mounted connection to S3 for your servers. Think of the possibilities…