Should Search Engines Break or Fix the Web?

Here’s an interesting thought. So many web pages are only ever accessed as a result of a search engine query. What would happen if search engines stopped including pages that contained broken code?

On the one hand, the most enormous scream from content providers would be heard. On the other hand, there would be a massive push to fix pages with broken code and bring much of the web into line with established public standards. In fact, some information might then be indexed that would have otherwise been unintelligible to the spiders.

And think of the extra impact. No longer would web browser programmers place such high emphasis on maintaining compatibility with broken sources. Web browser development might actually accelerate, with releases carrying fewer bugs and smaller download sizes. Software might even become faster.

Google, to take the largest market share holder, even has a site for web masters to receive feedback on their site’s availability for indexing. They really should point out which pages could not be indexed because they had invalid code.

But, I cannot see it happening. Google are doing the reverse and actually going backwards to support the older browsers. Oh well, nice dream.

Infrastructure frameworks

Something’s been bugging me for the past several weeks if not months. Ever since I heard about cloud computing and Amazon’s EC2 / S3 offerings I’ve become more and more convinced that this is going to take off in a big way. Of course, having the odd outage does not help but hopefully the vendor (Amazon) is learning from its mistakes and will not repeat them.

So what’s been bugging me? Well, I’ve seen things from both a PHP programmer’s perspective and that of a systems administrator. Admittedly neither perspective includes experience of operating sites with huge traffic levels, but high availability is extremely important regardless.

High availability means both performance and robustness. Performance can be measured by latency – how fast does the correct response get seen by the user/visitor? Robustness is the ability to restore service following a fault. Faults happen – commonly due to a disk failing or other hardware fault, but sometimes due to software not behaving correctly.

Either way, they are two important factors that cannot be underestimated. On the flip side I’m a great believer in the KISS principle – Keep It Simple, Stupid. The simpler something is, the less likely it is (on its own) to go wrong.

That said, something that’s simple may not be able to react to external environment changes which may affect it. Take a web page script that connects to a database. One might keep this simple with PHP using a PDO object to connect to a server IP address and perform a query. No heavy working out of case scenarios to work out whether access controls permit the logged in user to execute the query or anything.

But if the database connection fails, the KISS principle reaches its limitations.

Therefore, we operate some robustness business practices. For the databases, we have a master that replicates database write operations to one or more slave servers. If the master dies or needs to be taken out of service, we connect to one of the slaves and use that as the master instead.

OK so our script now needs to have an IP address adjusted. But surely we should be capable of having this automated such that the scenario can be performed multiple times. After all, the newly promoted master will likely eventually die, too.

And it’s not just database servers that die; anything running on a computer may suddenly be declared unfit for use.

Now we have three problems to deal with:

  1. Hardware assets (computers) that fail. We need to launch and maintain new ones to provide sufficient levels of redundancy.
  2. Software packages operate on these computers. They need to be configured and in some cases have roles. A slave must read from a master, for instance.
  3. Management of the above.

I want to take databases as a case in point of a specific area I am thinking about. We systems administrators have set up a master and slave(s) without too much fuss. Traditionally, we PHP coders have listed the master (and perhaps slave) within configuration files that our web scripts have loaded, parsed and connected to on each page load.

I argue that the PHP side of things is in reverse.

Imagine a somewhat more complex yet more robust scenario. You have three machines in circular replication (“master-master”) called A, B and C. For safety’s sake, your PHP script only ever performs write operations on A. The changes are fed to and processed by B, then C, before returning to A, which ignores them because it created the actions originally.

We make our PHP accept a list of possible masters. Well, A, B and C are each possible masters. So if our script cannot successfully connect and execute the query on A, it should repeat the process on B and if that also fails on C. If all three fail you can either try again from A, or fail with an error sent to your output.
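
The retry order described above can be sketched in a few lines. This is only an illustration, written in Python rather than PHP so it runs without a database. The function names, the host addresses and the `run_query` callable are all hypothetical stand-ins for whatever actually connects and executes the write.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllMastersFailed(Exception):
    """Raised when no candidate master accepted the write."""


def write_with_failover(
    hosts: Sequence[str],
    run_query: Callable[[str], T],
    max_rounds: int = 1,
) -> T:
    """Call run_query(host) on each host in order and return the first success.

    max_rounds > 1 means "try again from A" after every host has failed once.
    """
    errors = []
    for _ in range(max_rounds):
        for host in hosts:
            try:
                return run_query(host)
            except ConnectionError as exc:
                # Remember the failure and move on to the next candidate.
                errors.append((host, exc))
    raise AllMastersFailed(errors)


def fake_run_query(host: str) -> str:
    """Stand-in for a real connect-and-execute step: A and B are 'down'."""
    if host in {"10.0.0.1", "10.0.0.2"}:
        raise ConnectionError(f"{host} unreachable")
    return f"written via {host}"


if __name__ == "__main__":
    masters = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # A, B and C
    print(write_with_failover(masters, fake_run_query))
```

Passing the connect-and-execute step in as a callable keeps the ordering logic separate from the database driver, so the same loop works whichever server currently holds the master role.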

That solves one problem – which server is master is no longer a matter of concern for us. But that’s not a very scalable way of dealing with growth.

Imagine you’re a popular site experiencing explosive growth and you need to put your Apache web server log files into a database table. Say you’re seeing one million hits per hour. A lot of hits. You’ve set up a number of web servers and listed each’s public IP address as an A record in your domain’s DNS. Roughly load balanced, each server is generating a fair amount of log file activity. A background process is reading these lines from the files and feeding them into a database table.

And now your database table is getting big. By day’s end you’ve hit 50m records and your backup is going to take all day even while the chosen slave is offline.

We have, for the purposes of this example, five Apache web servers. For each web server we want to write into a database table of its own for logs. Now we need five copies of the servers A, B and C. Each set is called a cluster and each cluster is assigned logically to a web server. Web server #1 then gets configured with one set of three IP addresses, server #2 another set of three IPs, etc.

Now we don’t just store log file data in our databases. We have customer data too, lots of it. But the PHP scripts are the same across our five web servers and handle the same customers. Customer A might use server #2 initially but later on server #5. Both servers need access to that customer’s data. But the data is huge, larger than the log files.

So we need to split out customer data too. For this we need to decide what to split on, something like the initial character of the customer’s name. The detail of this is irrelevant; what’s important is that the data is split to provide better speed. But how does our PHP script know which cluster has what customer data?
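
To make the question concrete, here is a minimal sketch of one possible lookup, again in Python for illustration. Splitting on the initial character is the example from the paragraph above. The cluster names, the letter ranges and the fallback cluster are invented for the sketch, and a static table like this is exactly the kind of per-script configuration the post goes on to question.

```python
# Hypothetical split: (first letter, last letter, cluster name).
CUSTOMER_RANGES = [
    ("a", "m", "customers-1"),
    ("n", "z", "customers-2"),
]

# Names that do not start with a-z (digits, punctuation, empty) land here.
DEFAULT_CLUSTER = "customers-2"


def cluster_for_customer(name: str) -> str:
    """Return the name of the cluster assumed to hold this customer's data."""
    initial = name.strip()[:1].lower()
    for first, last, cluster in CUSTOMER_RANGES:
        if first <= initial <= last:
            return cluster
    return DEFAULT_CLUSTER


if __name__ == "__main__":
    for customer in ("Acme Ltd", "Zenith plc", "3M"):
        print(customer, "->", cluster_for_customer(customer))
```

Every web server would need an identical copy of this table, and moving a customer between clusters means updating all of them at once.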

At this point I will continue in a further article, but suffice it to say I’m thinking more of a persistent resource providing access to specific data sets within a database cluster.

When expire_logs_days has no effect

I’ve spent all morning setting up log rotation in MySQL, except some servers were not erasing old log files.

The idea is to set expire_logs_days to say 7. This means only the logs covering the last seven days will be kept on disk. If you need them, or a slave needs to read from them, they will be kept. But get rid of really old binary files.

You can set this in my.cnf and also at the MySQL console with:

SET GLOBAL expire_logs_days = 7;

Either the next time you restart mysql, or the next time you issue purge logs or flush logs, MySQL should rotate the current log file (create a new one with the index number incremented) and delete the files older than seven days.

Except while the rotation worked, the old files remained on some servers.

It turns out that the binary files had been deleted from the filesystem with rm, but the index file (it’s a text file) had not been updated. Issuing show binary logs listed all the old files no longer on the filesystem. As a result, the deletion was failing, as detailed by Baron in this bug (which has turned into a feature request).

The fix? Stop mysqld, edit the .index file to remove entries no longer present, and restart mysqld. At this point if you have the expire_logs_days entry in your my.cnf file, MySQL will delete the old files as it starts, otherwise you’ll need to issue the flush logs command yourself.
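
Before editing the index by hand, it helps to know which entries are actually stale. The following Python sketch lists them without changing anything. It assumes the index holds one log file path per line and that relative entries are relative to the directory containing the index file, so check both against your own server first. The function name is made up for the example.

```python
from pathlib import Path
from typing import List


def stale_index_entries(index_path: str) -> List[str]:
    """Return index entries whose binary log file is missing from disk."""
    index = Path(index_path)
    stale = []
    for line in index.read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        candidate = Path(entry)
        if not candidate.is_absolute():
            # Assumption: relative entries sit next to the index file.
            candidate = index.parent / candidate
        if not candidate.exists():
            stale.append(entry)
    return stale
```

Run it with mysqld stopped, compare the result with what show binary logs reported, and remove only those lines from the .index file.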

Admittedly the title of this post is misleading – it does have an effect, just not the entire effect as documented.

Second Gmail Outage in a week

Well, I just managed to log in using the Gmail https web interface, but for the past couple of hours both my accounts have been rejecting me with login failures via IMAP. The http interface was too slow and timed out minutes ago.

This is the second major problem in a week which for Google is a surprising downturn in reliability. Twitter is currently full of complaints as of this second.

Still, it is essentially free.

Amazon Cloud Runs Low on Disk Space

Another unthinkable (maybe in my mind only) has happened – errors uploading files to S3 led me to the AWS status page which reports the US East Coast facilities running low on drive space.

Am I the only one to have assumed someone or some thing was checking constantly, at least hourly, to ensure a sufficient percentage of drive space is available for use?

Apparently they consumed a whole lot more disk space than expected this past week and they are now feverishly adding more capacity. Surely if capacity can be added within hours they should have been gradually adding more during the week..?

This is actually pretty serious. People’s database backup jobs might be failing due to these issues although admittedly they need to be more resilient than that. But then so does Amazon.
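
On the customer side, "more resilient" can be as simple as retrying a failed upload with increasing delays rather than giving up on the first error. A minimal Python sketch follows. The `upload` callable is a hypothetical stand-in for whatever actually sends the backup to S3, and the attempt count and delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def upload_with_retry(
    upload: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call upload() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return upload()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A job that still fails after the last attempt should alert someone rather than exit silently, since a backup that quietly never ran is the real danger here.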

UK ENUM Conference

So I attended a conference today held in London to learn about and develop commercial ideas concerning ENUM.

ENUM allows businesses and individuals to publish their telephone number (fixed or mobile) within DNS records so that VoIP clients and providers may look them up and provide a more direct connection to number owners.

The initial goal is reasonably simple, and has to be in order to gain traction. Imagine the NHS has 500 telephone numbers that it operates as 0800 freephone numbers to allow customers (patients) to contact various local departments. The cost of each minute of every call is borne by the NHS so ultimately by the British taxpayer. Now the NHS also has VoIP connectivity and decides to advertise their 0800 numbers through DNS using ENUM. Subsequently, every time someone using VoIP decides to call any of those 0800 numbers their VoIP provider will find the 0800 number in the ENUM DNS listings for the NHS and will connect the caller to the medical department using VoIP alone – at no cost to either party (usually).

Clearly with this approach there is scope for financial savings. That said, there remains considerable work needed to achieve even this small goal, let alone the potential options further down the road.

In case you were wondering, ENUM is an international standard being implemented by individual countries separately through their respective Governments. The UK Government, through regulator OFCOM, has assigned the design, implementation and ongoing administration of the project to UKEC who, in turn, have contracted much of the work to Nominet. Nominet administer and maintain the .uk ccTLD – when you buy any domain ending .uk it is ultimately sold by Nominet although almost always through a reseller (“registrar”) like GoDaddy.

So we now have a basic goal with example and a non-profit company to drive it forward. Part of the reason Nominet were awarded the contract was their intentions to market the ENUM provisioning as a resellable product. And here’s where the majority of blank faces emerged. The audience consisted of any parties interested in becoming ENUM registrars, effectively reselling the service of adding your telephone number to the DNS system. To be more accurate, the audience actually consisted mainly of people in the telecoms and ISP industry wanting to know what ENUM was and whether there was any commercial potential for them or whether it might actually screw them out of their revenue.

The message from Nominet was very clear on one matter. The end is in sight for minute revenues. This means your current fixed line telephony bill of 10p per minute connected to someone with a different geographic area code will be reduced to nothing. Your mobile network tariffs will no longer give you minutes in your bundle as calls to your mates will be free. Don’t ask for a timescale on this although the impatient amongst you could always hook up with VoIP today and extend your reach to your mobile phone provided you can install a VoIP client and connect via WiFi.

To be honest, the Marketing Director of Nominet introduced the commercialisation of ENUM as a set of current ideas rather than anything more concrete. He was, literally, waiting for suggestions from the audience. The common thread was that registration of the number would likely end up free, with registrars making their profits from value-added services. It was suggested one way would be to operate publicly accessible directories of businesses with their advertising online and a simple click-to-call mechanism.

There are three current matters in my mind that restrict uptake and promotion by business (registrars).

  1. You can list more than just a VoIP endpoint with your telephone number, but what else is currently undefined and may be regulated for privacy reasons. This does have potential for more far reaching consequences.
  2. You still cannot obtain a telephone number for life, or extend it. The numbers you can register have to come from a Communications Provider (CP) like BT. If you move providers you cannot take your number; you’ll have to register your replacement number instead. And because the ENUM system converts a number into DNS (02071234567 becomes – the software will do this for you!) you should be able to extend this yourself by adding additional digits and sending these through to your local phone system just like an automatically dialed extension.
  3. Each registration must go through a verification agency to ensure the registrant really does own the telephone number being registered, so there will be an additional cost (read: higher bar to entry).

No doubt business models will emerge from this but for now ENUM remains in the cot after birth, ready for the world to sit up and really take notice and exploit its full potential.
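
For the curious, the number-to-DNS conversion mentioned in the list above is mechanical. Under the ENUM standard (RFC 6116) the number is written in full international form, its digits are reversed and separated by dots, and the e164.arpa suffix is appended. The Python sketch below assumes a UK national number with a leading trunk 0 and country code 44. The function name and the `extension` parameter are illustrative only, and whether extra digits are accepted is up to the registry, not this code.

```python
def enum_domain(national_number: str, country_code: str = "44",
                extension: str = "") -> str:
    """Build the ENUM domain name for a national-format telephone number."""
    digits = "".join(ch for ch in national_number if ch.isdigit())
    if digits.startswith("0"):
        digits = digits[1:]  # drop the national trunk prefix
    full_number = country_code + digits + extension
    # Reverse the digits, dot-separate them, and add the ENUM apex.
    return ".".join(reversed(full_number)) + ".e164.arpa"


if __name__ == "__main__":
    print(enum_domain("020 7123 4567"))
    print(enum_domain("020 7123 4567", extension="101"))
```

Because the digits are reversed, extension digits end up as the leftmost labels, so they behave like subdomains beneath the registered number.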

Asterisk and Amazon EC2

Given the clear advantages of cloud computing and the industry momentum (slowly) toward VoIP and complementary technologies (think XMPP) I thought it might prove an interesting exercise to install Asterisk on an Amazon EC2 instance.

My preferred operating system is Debian GNU/Linux. Instances are available with Debian (various versions) pre-installed. Theoretically it should be only a few steps to get Asterisk running.

Here’s where reality kicks in. Hard. Asterisk has certain features like conferencing that are attractive and in some cases necessary to have. These features require accurate timing as normally provided by hardware, except in this case we actually have a virtual machine with no telephony equipment connected. To provide a timing substitute Zaptel provide the ztdummy kernel driver.

Which means compiling Zaptel against your currently installed Linux kernel. This cannot be done under Debian. The version of the compiler (gcc) is different to that which compiled the kernel. To compile with the correct, older, gcc, you’ll need to boot the OS Amazon used to compile the kernel.

Over to Fedora Core 4 we head. Now, I managed to compile, install and actually run ztdummy on the Amazon developer image, however by this time I’d really had enough. Suffice it to say I was in no mood to start transferring kernel module files across to my Debian instance to pursue the matter.

There are a couple of people who have written up instructions on getting Asterisk to work on EC2. Neither, I believe, installs the ztdummy kernel module, so their setups are essentially crippled one way or another.

Amazon: If you are listening, let us sysadmins do what we do best. Let us build our O/S including our own Linux kernel! So much time has been wasted due to this restriction!

Amazon Cloud Computing Alternatives

So there have been plenty of web sites and services affected by today’s big Amazon S3 outage. SmugMug, Twitter, and JungleDisk were amongst the casualties to various degrees. Developers have been venting their frustration at seeing their applications fail because of something they relied on.

So what are the alternatives?

Any CTO will tell you that moving parts are your IT department’s weakest link in reliability terms. If you build a company on a single server will you have more, or fewer, moving parts than building it on a large computing farm as Amazon provides? Such an absolute measurement is of course a waste of time, as that one server could die at any moment, making you wish you’d relied on the cloud. Yet the cloud may also experience downtime.

Amazon does however have the advantage that it hides its redundancy from you. If you were to try to match it, you’d likely end up with RAID, and hot standby servers. Trust me, you don’t want to rely on that scenario without spending time and money testing your backup solutions.

So cloud computing might have occasional outages but at least there are engineers on hand 24×7 to fix them on your behalf. All part of the service, Sir. With your own equipment, you are on-call 24×7 shared with your colleagues. Assuming you have some.

Ultimately money can only buy you the best commercially available solutions. Amazon are not the only cloud computing service providers but as they happen to have financial muscle and experience on their side I would go so far as to say they will likely be the best overall. Your mileage may vary, naturally.

Remember, Amazon use commodity hardware under the assumption that bits of their network will fail at random. They have constructed software to operate on top of this in a distributed manner to detect failures and try (as best as their programmers can code) to mitigate issues as they arise. I am sure that once today’s failure has been analysed the software will be updated to minimise disruption caused by it and by similar ones.

But seriously, even Amazon can only go so far. The human brain can only think up so many scenarios and code so many mitigation rules. Oh, and testing all these situations can also be a real challenge.

It is still a damned sight better than relying on your own company to build a similar system in-house.

Amazon Amateurs?

According to iehiapk: “I was under the apparently false impression that S3 was a high-availability service.  We may have to evaluate other services now.  This makes us look like a bunch of amateurs.”

I would like to ask precisely what he defines as a “high-availability service”. Five-nines? Sorry, the Amazon S3 SLA says three nines only. If they are in breach of that (which I suspect they might be now although I’ve yet to calculate or read the fine print) your recourse is a partial refund.
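
To put the nines in perspective, here is the arithmetic as a short Python sketch. It assumes a 30-day month and treats availability as a plain percentage of elapsed time. The real SLA defines its own measurement method, so treat this as back-of-envelope only. The function names are made up for the example.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period at a given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


def achieved_availability(outage_minutes: float,
                          period_days: int = 30) -> float:
    """Availability percentage for the period after a single outage."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - outage_minutes / period_minutes)


if __name__ == "__main__":
    for nines in (99.9, 99.999):
        print(f"{nines}% allows {allowed_downtime_minutes(nines):.1f} minutes")
    # Hypothetical outage length, purely for illustration.
    print(f"a 7 hour outage leaves {achieved_availability(7 * 60):.2f}%")
```

Three nines works out at roughly 43 minutes a month and five nines at under half a minute, so any outage lasting hours sits well outside either figure.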

Either way, when you sign the service agreement you accept there will be some risk to service and where conditions are met the supplier will compensate you, all documented and accepted when you signed on.

Amazon S3 Outage (Now Back)

Well I returned to check my giant photos upload that JungleDisk was sending to my Amazon S3 account and it had stopped.

The log showed a whole pile of HTTP error codes which any self-respecting technophile will realise means a serious fault is occurring. The S3 forums document the first errors from 0858PDT although JungleDisk for me reported errors from 1642BST.

There are a few big customers impacted, like the photo sharing web site SmugMug, which is displaying an outage page right now and also blogging about the incident. The Amazon Status page does at least confirm what we already know – they’re down and painfully aware of it. SmugMug’s blog says it’s “only” their 3rd outage in over two years, which is to be expected. Other major brands affected include several Facebook apps, which are loading slowly or displaying errors.

Still, this will hit mainstream press and give cloud computing negative publicity. Hopefully Amazon will learn from these early experiences and continue on the road to virtually bullet-proof hosting. Not many organisations are large enough to put in the resources necessary to build such a robust service and put their brand name against it.

Incidentally, if you have an S3 account, please check their SLA for the procedure to obtain a partial refund…

Updated 2225BST: has broken images due to this, as does Twitter. Amazon report progress toward full restoration of service with internal network communications slowly coming to life.

Updated 2249BST: Amazon are bringing up their S3 web interfaces. Sites and services (like my Jungle Disk backup) should be back up soon. I look forward to their statement on what happened and how they will prevent recurrence.

Updated 2226BST: Amazon S3 EU is back… S3 USA taking a little longer due to larger size.

Updated 0017BST: It’s now Monday and Amazon S3 USA is online once more. Big, big outage.