Should Search Engines Break or Fix the Web?

Here’s an interesting thought: so many web pages are only ever accessed as the result of a search engine query that you have to wonder what would happen if search engines stopped including pages that contained broken code.

On the one hand, an enormous scream from content providers would be heard. On the other, there would be a massive push to fix pages with broken code and bring much of the web into line with established public standards. In fact, some information might then be indexed that would otherwise have been unintelligible to the spiders.

And think of the knock-on effects. Web browser programmers would no longer need to place such a high emphasis on maintaining compatibility with broken pages. Browser development might actually accelerate, with releases containing fewer bugs and smaller downloads. Software might even become faster.

Google, to take the largest market-share holder, even has a site where webmasters can receive feedback on their site’s availability for indexing. It really should point out which pages could not be indexed because they contained invalid code.

But I cannot see it happening. Google is doing the reverse, actually going backwards to support older browsers. Oh well, nice dream.

Infrastructure frameworks

Something’s been bugging me for the past several weeks if not months. Ever since I heard about cloud computing and Amazon’s EC2 / S3 offerings I’ve become more and more convinced that this is going to take off in a big way. Of course, having the odd outage does not help but hopefully the vendor (Amazon) is learning from its mistakes and will not repeat them.

So what’s been bugging me? Well, I’ve seen things from both a PHP programmer’s perspective and a systems administrator’s. Admittedly, neither perspective includes experience of operating sites with huge traffic levels, but the high availability of those sites is extremely important regardless.

High availability means both performance and robustness. Performance can be measured by latency – how quickly does the correct response reach the user or visitor? Robustness is the ability to restore service following a fault. Faults happen – commonly due to a disk failing or some other hardware fault, but sometimes due to software not behaving correctly.

Either way, these are two important factors that should not be underestimated. On the flip side, I’m a great believer in the KISS principle – Keep It Simple, Stupid. The simpler something is, the less likely it is (on its own) to go wrong.

That said, something simple may not be able to react to changes in its external environment. Take a web page script that connects to a database. One might keep this simple in PHP by using a PDO object to connect to a server IP address and perform a query – no elaborate logic to work out whether access controls permit the logged-in user to execute the query, or anything of that sort.
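
A minimal sketch of that simple case might look like the following; the host address, credentials and query are invented purely for illustration:

<?php
// Simplest case: one hard-coded database server, one query, no failover.
// Host, credentials and table are placeholders, not real values.
$db = new PDO('mysql:host=192.0.2.10;dbname=app', 'webuser', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $db->prepare('SELECT id, name FROM customers WHERE id = ?');
$stmt->execute(array(42));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

If the single server at that address is unavailable, the constructor simply throws and the page fails.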

But if the database connection fails, the KISS principle reaches its limitations.

Therefore, we adopt some practices for robustness. For the databases, we have a master that replicates write operations to one or more slave servers. If the master dies or needs to be taken out of service, we connect to one of the slaves and use that as the master instead.

OK, so our script now needs to have an IP address adjusted. But surely we should be able to automate this so that the switch can be performed again and again – after all, the newly promoted master will likely die eventually, too.

And it’s not just database servers that die; anything running on a computer may suddenly be declared unfit for use.

Now we have three problems to deal with:

  1. Hardware assets (computers) that fail. We need to launch and maintain new ones to provide sufficient levels of redundancy.
  2. Software packages run on these computers. They need to be configured and in some cases given roles; a slave must read from a master, for instance.
  3. Management of the above.

I want to take databases as a case in point for the specific area I am thinking about. We systems administrators have set up a master and slave(s) without too much fuss. Traditionally, we PHP coders have listed the master (and perhaps a slave) in configuration files that our web scripts load, parse and connect with on each page load.

I argue that the PHP side of things is in reverse.

Imagine a somewhat more complex yet more robust scenario. You have three machines in circular replication (“master-master”) called A, B and C. For safety’s sake, your PHP script only ever performs write operations on A; the changes are fed to and processed by B, then C, before returning to A, which ignores them because it created the actions originally.

We make our PHP accept a list of possible masters. Well, A, B and C are each possible masters. So if our script cannot successfully connect and execute the query on A, it should repeat the process on B, and if that also fails, on C. If all three fail, you can either try again from A or fail with an error sent to your output.
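
A minimal sketch of that retry loop; the helper function, host addresses, credentials and query are all invented for illustration:

<?php
// Try each candidate master in turn until one accepts the connection and
// the query. Addresses and credentials below are placeholders only.
$masters = array('192.0.2.1', '192.0.2.2', '192.0.2.3'); // A, B and C

function writeQuery(array $masters, $sql, array $params)
{
    foreach ($masters as $host) {
        try {
            $db = new PDO("mysql:host=$host;dbname=app", 'webuser', 'secret');
            $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
            $stmt = $db->prepare($sql);
            $stmt->execute($params);
            return true;               // success - stop trying further hosts
        } catch (PDOException $e) {
            continue;                  // connection or query failed - try the next host
        }
    }
    return false;                      // A, B and C all failed
}

if (!writeQuery($masters, 'INSERT INTO hits (url) VALUES (?)', array($_SERVER['REQUEST_URI']))) {
    echo 'Database unavailable';
}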

That solves one problem – which server is master is no longer a matter of concern for us. But that’s not a very scalable way of dealing with growth.

Imagine you’re a popular site experiencing explosive growth and you need to put your Apache web server log files into a database table. Say you’re seeing one million hits per hour – a lot of hits. You’ve set up a number of web servers and listed each one’s public IP address as an A record in your domain’s DNS. Roughly load balanced, each server is generating a fair amount of log file activity, and a background process is reading these lines from the files and feeding them into a database table.
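
One pass of such a feeder might look like this sketch; the log path, table name and credentials are invented, and a real feeder would also remember how far it had read:

<?php
// Read a rotated-out Apache access log and push each line into a table.
// Path, table and credentials are placeholders for illustration.
$db = new PDO('mysql:host=192.0.2.1;dbname=logs', 'logger', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $db->prepare('INSERT INTO hits (line) VALUES (?)');

foreach (file('/var/log/apache2/access.log.1', FILE_IGNORE_NEW_LINES) as $line) {
    $stmt->execute(array($line));
}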

And now your database table is getting big. By day’s end you’ve hit 50m records, and your backup is going to take all day even while the chosen slave is offline.

We have, for the purposes of this example, five Apache web servers. For each web server we want to write into a database table of its own for logs. Now we need five copies of the servers A, B and C. Each set is called a cluster, and each cluster is assigned logically to a web server. Web server #1 then gets configured with one set of three IP addresses, server #2 with another set of three IPs, and so on.
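
In configuration terms that can amount to nothing more than a per-server list of the three hosts in its own cluster, which then feeds the same retry loop as before (all addresses invented):

<?php
// Each web server is assigned its own logging cluster of three candidate
// masters. Addresses are placeholders; in practice each machine's entry
// would live in that machine's local configuration file.
$webServerId = 1;   // set differently on each of the five web servers
$logClusters = array(
    1 => array('192.0.2.11', '192.0.2.12', '192.0.2.13'),
    2 => array('192.0.2.21', '192.0.2.22', '192.0.2.23'),
    // ...one entry per web server, up to #5
);
$masters = $logClusters[$webServerId];   // hand this to the retry loop above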

Now, we don’t just store log file data in our databases. We have customer data too – lots of it. But the PHP scripts are the same across our five web servers and handle the same customers. Customer A might use server #2 initially but later on server #5; both servers need access to that customer’s data. And the data is huge – larger than the log files.

So we need to split out the customer data too. For this we need to decide what to split on – something like the initial character of the customer’s name, perhaps. The detail is irrelevant; what’s important is that it is split to provide better speed. But how does our PHP script know which cluster holds which customer’s data?
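
However the split is chosen, the script ends up needing a small routing step before it can reuse the earlier failover logic. A sketch, using the first letter of the customer’s name purely as an example rule (the ranges and cluster addresses are invented):

<?php
// Route a customer to one of several clusters based on the first letter
// of their name. The ranges and addresses are illustrative only.
$customerClusters = array(
    'A-I' => array('192.0.2.31', '192.0.2.32', '192.0.2.33'),
    'J-R' => array('192.0.2.41', '192.0.2.42', '192.0.2.43'),
    'S-Z' => array('192.0.2.51', '192.0.2.52', '192.0.2.53'),
);

function clusterFor($customerName, array $clusters)
{
    $initial = strtoupper(substr($customerName, 0, 1));
    foreach ($clusters as $range => $hosts) {
        list($from, $to) = explode('-', $range);
        if ($initial >= $from && $initial <= $to) {
            return $hosts;    // the A/B/C set holding this customer's data
        }
    }
    return reset($clusters);  // fallback for anything outside A-Z
}

$masters = clusterFor('Smith', $customerClusters);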

At this point I will continue in a further article, but suffice it to say I’m thinking more of a persistent resource providing access to specific data sets within a database cluster.

When expire_logs_days has no effect

I’ve spent all morning setting up log rotation in MySQL, except some servers were not erasing old log files.

The idea is to set expire_logs_days to, say, 7. This means only the binary logs covering the last seven days will be kept on disk. If you need them, or a slave needs to read from them, they will be kept – but the really old binary log files get removed.

You can set this in my.cnf and also at the MySQL console with:

SET GLOBAL expire_logs_days = 7;
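
For reference, the equivalent my.cnf entry (under the [mysqld] section in a stock install) is simply:

[mysqld]
expire_logs_days = 7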

Either the next time you restart mysqld, or the next time you issue purge logs or flush logs, MySQL should rotate the current log file (creating a new one with the index number incremented) and delete any files older than seven days.

Except that, while the rotation worked, the old files remained on some servers.

It turns out that the binary log files had been deleted from the filesystem with rm, but the index file (a plain text file) had not been updated. Issuing show binary logs listed all the old files no longer on the filesystem, and as a result the deletion was failing, as detailed by Baron in this bug (which has since turned into a feature request).

The fix? Stop mysqld, edit the .index file to remove the entries no longer present, and restart mysqld. At this point, if you have the expire_logs_days entry in your my.cnf file, MySQL will delete the old files as it starts; otherwise you’ll need to issue the flush logs command yourself.

Admittedly the title of this post is misleading – it does have an effect, just not the entire effect as documented.

Review of Rare Restaurant in Norwich, Norfolk, UK

One week ago, it being my girlfriend’s birthday, I decided we had to visit Rare.

Rare is a grill and steakhouse restaurant in Norwich, Norfolk. It is essentially a restaurant within a hotel, located inside the Georgian House Hotel at the top of Unthank Road, not far from the city centre.

Inside there is a very small reception. To one side is a bar area for relaxing, which we did not visit; to the other, a larger seating area with branded chairs and modern wallpaper. As you walk through, the kitchen is to your right. Keep walking and you’ll soon reach the end of the room, with a door to the rest of the hotel.

There is no specific dress code, but smart casual appeared common; the odd suit was also present. The staff dressed in conservative black with aprons and were professional and courteous at all times. Menus (available on their web site) appeared quickly enough, revealing pricing in line with expectations: somewhat higher class without being out of reach for a gathering of friends.

We skipped the starters, which were priced around £4–£9. She chose the salmon while I went for the sirloin. The food did take a while to arrive, but slices of still-warm French stick with butter and sauce kept us going, together with a couple of J2Os.

The main course was actually rather nice, served on sturdy modern plates with good-quality cutlery. The taste and general quality of the food came across as high, justifying the list prices. We finished on sticky toffee pudding (her) and warm chocolate fudge cake (me).

Overall we liked the place, the total bill coming to £50 including tip. They use a portable card machine so cash is not necessary.

Screwfix Laser Fluid Extractor – Perfect for oil changes

One word: Awesome.

Earlier this year I decided to tackle the job of changing the oil in my girlfriend’s car myself. This involved making use of my Dad’s trolley jack and axle stands, sawing some small blocks of wood to spread the weight between the car body and the stands, actually getting the sump plug undone, and clearing up the mess afterwards. As this was the first time I’d ever done such a thing, the whole procedure took us (Dad and g/f helped) three-plus hours.

And yesterday based on mileage it was again necessary. But thanks to a forum thread I decided to check out the Screwfix Laser Fluid Extractor.

This was in stock at the local retail shop on my way home from the office earlier in the week. In fact the guy said he didn’t even need to look up the shelf position because it was so popular he remembered where it was.

How does it work? Well, when you’re ready, with the engine nice and hot, remove the oil dipstick and replace it with the flexible tube connected to the bottle in this kit. You then pump the used oil out. I used an engine oil flush fluid first to get as much out as possible, but that’s up to you.

Constructing it is really easy. Screw the handle into the pump. Place the pump onto the valve located in the centre of the top of the bottle. Push the long tube onto the side hole. Take out your oil dipstick and feed the tube in instead until you feel the bottom of the engine. Begin pumping. It’s a nice workout for your arms for 2–3 minutes; when you spot air bubbles in the transparent tubing, you know you’re done.

Change the oil filter. Make sure you have a tray or something under the filter, as you’ll get a slick of hot oil when you unscrew it, and hold it upright once it’s off as plenty of oil will remain inside. Fit the new oil filter and simply refill the car with fresh oil to replace what you removed.

Ta-da. Really, really simple, and I’m told this is essentially what the garages do anyway. The whole job, including reading the instructions, ten minutes of warming the engine up with the oil flush in, doing the job and refilling it all: 45 minutes to 1 hour. It’ll be a lot quicker next time and I won’t need to read the instructions.

Second Gmail Outage in a week

Well, I just managed to log in using the Gmail https web interface, but for the past couple of hours both my accounts have been rejecting me with login failures via IMAP. The http interface was too slow and timed out minutes ago.

This is the second major problem in a week, which for Google is a surprising downturn in reliability. Twitter is currently full of complaints as of this second.

Still, it is essentially free.

Amazon Cloud Runs Low on Disk Space

Another unthinkable thing (maybe only in my mind) has happened – errors uploading files to S3 led me to the AWS status page, which reports the US East Coast facilities running low on drive space.

Am I the only one to have assumed someone, or something, was checking constantly – at least hourly – to ensure a sufficient percentage of drive space was available for use?

Apparently they consumed a whole lot more disk space than expected this past week and are now feverishly adding more capacity. Surely, if capacity can be added within hours, they should have been gradually adding more during the week?

This is actually pretty serious. People’s database backup jobs might be failing due to these issues, although admittedly those jobs need to be more resilient than that. But then, so does Amazon.