Should Search Engines Break or Fix the Web?

Here’s an interesting thought. So many web pages are only ever reached through a search engine query. What would happen if search engines stopped including pages that contained broken code?

On the one hand, there would be an enormous scream from content providers. On the other hand, there would be a massive push to fix pages with broken code and bring much of the web into line with established public standards. In fact, some information might then be indexed that would otherwise have been unintelligible to the spiders.

And think of the extra impact. Web browser programmers would no longer need to place such high emphasis on maintaining compatibility with broken sources. Browser development might actually accelerate, with releases shipping fewer bugs and smaller downloads. Software might even become faster.

Google, to take the largest market share holder, even has a site where webmasters can receive feedback on their site’s availability for indexing. It really should point out which pages could not be indexed because they contained invalid code.

But, I cannot see it happening. Google are doing the reverse and actually going backwards to support the older browsers. Oh well, nice dream.

Infrastructure frameworks

Something’s been bugging me for the past several weeks if not months. Ever since I heard about cloud computing and Amazon’s EC2 / S3 offerings I’ve become more and more convinced that this is going to take off in a big way. Of course, having the odd outage does not help but hopefully the vendor (Amazon) is learning from its mistakes and will not repeat them.

So what’s been bugging me? Well, I’ve seen things from both a PHP programmer’s perspective and a systems administrator’s. Admittedly, neither perspective includes experience of operating sites with huge traffic levels, but high availability is extremely important regardless.

High availability means both performance and robustness. Performance can be measured by latency – how fast does the correct response get seen by the user/visitor? Robustness is the ability to restore service following a fault. Faults happen – commonly due to a disk failing or other hardware fault, but sometimes due to software not behaving correctly.

Either way, they are two important factors that should not be underestimated. On the flip side, I’m a great believer in the KISS principle: Keep It Simple, Stupid. The simpler something is, the less likely it is (on its own) to go wrong.

That said, something that’s simple may not be able to react to changes in its external environment. Take a web page script that connects to a database. One might keep this simple with PHP, using a PDO object to connect to a server IP address and perform a query. There is no elaborate logic to work out whether access controls permit the logged-in user to execute the query, or anything like that.

But if the database connection fails, the KISS principle reaches its limits.

Therefore, we operate some robustness business practices. For the databases, we have a master that replicates database write operations to one or more slave servers. If the master dies or needs to be taken out of service, we connect to one of the slaves and use that as the master instead.

OK so our script now needs to have an IP address adjusted. But surely we should be capable of having this automated such that the scenario can be performed multiple times. After all, the newly promoted master will likely eventually die, too.

And it’s not just database servers that die; anything running on a computer may suddenly be declared unfit for use.

Now we have three problems to deal with:

  1. Hardware assets (computers) that fail. We need to launch and maintain new ones to provide sufficient levels of redundancy.
  2. Software packages operate on these computers. They need to be configured and in some cases have roles. A slave must read from a master, for instance.
  3. Management of the above.

I want to take databases as a case in point, since that is the specific area I am thinking about. We systems administrators have set up a master and slave(s) without too much fuss. Traditionally, we PHP coders have listed the master (and perhaps a slave) within configuration files that our web scripts load, parse and connect to on each page load.

I argue that the PHP side of things is in reverse.

Imagine a somewhat more complex yet more robust scenario. You have three machines in circular replication (“master-master”) called A, B and C. For safety’s sake, your PHP script only ever performs write operations on A. The changes are fed to and processed by B, then C, before returning to A, which ignores them because it created the actions originally.
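To make the ring behaviour concrete, here is a toy in-memory model of it. It is written in Python rather than PHP purely to keep the sketch self-contained, and the `Node` class and its methods are my own illustrative names, not anything a real database exposes. Each change carries the name of the node that created it, and a node drops any change that comes back bearing its own name.

```python
class Node:
    """One server in the replication ring (illustrative model only)."""

    def __init__(self, name):
        self.name = name
        self.next = None    # the next node in the ring
        self.applied = []   # changes this node has applied, in order

    def write(self, change):
        """A client write: apply it locally, then send it round the ring."""
        self.applied.append(change)
        self.next.replicate(origin=self.name, change=change)

    def replicate(self, origin, change):
        """Apply a replicated change unless this node created it."""
        if origin == self.name:
            return  # the change has come full circle, so ignore it
        self.applied.append(change)
        self.next.replicate(origin, change)


a, b, c = Node("A"), Node("B"), Node("C")
a.next, b.next, c.next = b, c, a   # A -> B -> C -> A

a.write("INSERT 1")
print(a.applied, b.applied, c.applied)
```

Every node ends up with the change exactly once, and the loop stops when the change arrives back at A.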

We make our PHP accept a list of possible masters. Well, A, B and C are each possible masters. So if our script cannot successfully connect and execute the query on A, it should repeat the process on B and, if that also fails, on C. If all three fail, you can either try again from A or fail with an error sent to your output.
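The try-A-then-B-then-C logic can be sketched in a few lines. This is a minimal illustration in Python rather than PHP, under some assumptions of mine: `operation` is a hypothetical callable that connects to the given host, runs the query and raises `ConnectionError` on failure, and `rounds` controls how many times the whole list is retried before giving up.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllMastersFailed(Exception):
    """Raised when every candidate master has been tried without success."""


def run_on_first_available(
    hosts: Sequence[str],
    operation: Callable[[str], T],
    rounds: int = 1,
) -> T:
    """Try `operation` on each host in order and return the first success."""
    errors = []
    for _ in range(rounds):
        for host in hosts:
            try:
                return operation(host)
            except ConnectionError as exc:
                errors.append((host, exc))  # remember the failure, move on
    raise AllMastersFailed(errors)


# Stand-in operation for the example: A is down, B answers.
def fake_write(host: str) -> str:
    if host == "A":
        raise ConnectionError("A is unreachable")
    return f"written via {host}"


result = run_on_first_available(["A", "B", "C"], fake_write)
print(result)  # prints: written via B
```

Passing `rounds=2` gives the "try again from A" behaviour; leaving it at 1 gives the "fail with an error" behaviour once C has also failed.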

That solves one problem – which server is master is no longer a matter of concern for us. But that’s not a very scalable way of dealing with growth.

Imagine you’re a popular site experiencing explosive growth and you need to put your Apache web server log files into a database table. Say you’re seeing one million hits per hour. A lot of hits. You’ve set up a number of web servers and listed each one’s public IP address as an A record in your domain’s DNS. With the load roughly balanced, each server is generating a fair amount of log file activity. A background process is reading these lines from the files and feeding them into a database table.
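As a rough idea of what that background process does, here is a small sketch in Python. It assumes the logs are in Apache’s Common Log Format and uses an in-memory SQLite database as a stand-in for the real database server; the `access_log` table, its columns and the `load_log_lines` function are names I have made up for the example.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def load_log_lines(conn, lines):
    """Insert every parseable line into access_log; return how many went in."""
    rows = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip anything that is not in Common Log Format
        size = match["size"]
        rows.append((
            match["host"],
            match["time"],
            match["request"],
            int(match["status"]),
            0 if size == "-" else int(size),  # "-" means no body was sent
        ))
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log "
    "(host TEXT, time TEXT, request TEXT, status INTEGER, bytes INTEGER)"
)
sample = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    "not a log line",
]
inserted = load_log_lines(conn, sample)
print(inserted)  # prints: 1
```

At a million hits per hour a real loader would batch its inserts and keep track of how far through each file it had read, but the shape of the job is the same.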

And now your database table is getting big. By day’s end you’ve hit 50m records and your backup is going to take all day even while the chosen slave is offline.

For the purposes of this example, we have five Apache web servers. We want each web server to write its logs into a database table of its own. Now we need five copies of the servers A, B and C. Each set is called a cluster and each cluster is assigned logically to a web server. Web server #1 then gets configured with one set of three IP addresses, server #2 another set of three IPs, etc.
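Written down as configuration, that assignment might look something like the following. The sketch is in Python rather than PHP, and the 10.0.n.x addresses are invented for the example; the only point is that each web server number maps to its own set of three candidate masters.

```python
from typing import Dict, List

# Hypothetical private addresses: cluster n holds its A, B and C
# servers at 10.0.n.1, 10.0.n.2 and 10.0.n.3.
CLUSTERS: Dict[int, List[str]] = {
    n: [f"10.0.{n}.{host}" for host in (1, 2, 3)]
    for n in range(1, 6)
}


def masters_for_web_server(server_number: int) -> List[str]:
    """Return the three candidate masters assigned to one web server."""
    return CLUSTERS[server_number]


print(masters_for_web_server(2))  # prints: ['10.0.2.1', '10.0.2.2', '10.0.2.3']
```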

Now we don’t just store log file data in our databases. We have customer data too, lots of it. But the PHP scripts are the same across our five web servers and handle the same customers. Customer A might use server #2 initially but later on server #5. Both servers need access to that customer’s data. But the data is huge, larger than the log files.

So we need to split out customer data too. For this we need to decide what to split on, for example the initial character of the customer’s name. The detail of this is irrelevant; what’s important is that the data is split to provide better speed. But how does our PHP script know which cluster has what customer data?
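The most naive answer is a static lookup table shipped with the script, sketched below in Python rather than PHP. The letter ranges and cluster names are assumptions invented for the illustration, and this is only one possible approach, shown to make the question concrete.

```python
import bisect

# Hypothetical split: each entry is (last initial covered, cluster name),
# sorted by initial. "f" means names starting with a to f, and so on.
SHARD_BOUNDARIES = [
    ("f", "cluster-1"),
    ("l", "cluster-2"),
    ("r", "cluster-3"),
    ("z", "cluster-4"),
]
DEFAULT_CLUSTER = "cluster-5"  # names that do not start with a-z


def cluster_for_customer(name: str) -> str:
    """Pick the cluster that holds a customer, based on the name's initial."""
    initial = name.strip()[:1].lower()
    if not ("a" <= initial <= "z"):
        return DEFAULT_CLUSTER
    uppers = [upper for upper, _ in SHARD_BOUNDARIES]
    index = bisect.bisect_left(uppers, initial)  # first range ending at or after the initial
    return SHARD_BOUNDARIES[index][1]


print(cluster_for_customer("Acme Ltd"))     # prints: cluster-1
print(cluster_for_customer("Zenith GmbH"))  # prints: cluster-4
```

The weakness is plain enough: every web server carries its own copy of the table, so moving a range of customers to a new cluster means updating all of them at once.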

At this point I will continue in a further article, but suffice it to say I’m thinking more of a persistent resource providing access to specific data sets within a database cluster.