How we migrated a PHP website with hundreds of thousands of urls and 17 years worth of data

Millions of database entries, hundreds of thousands of users, lots of lessons learned.

Image for blog post How we migrated a PHP website with hundreds of thousands of urls and 17 years worth of data

What do you do when you have a website that receives a large volume of traffic from search engines but desperately needs moving to a modern web framework and in the process would need hundreds of thousands of urls re-writing?

You probably look into the options for an hour or so, then eventually decide it's just not worth the risk and try to forget about the fact that the longer you leave it the bigger the job will be.

We recently took on the task of moving the popular website All In London to a modern web framework.

All In London logo

About All In London

The website was launched in 2003 as a city guide to London. It started out with listings for attractions, restaurants and events and quickly started adding more and more listing types over the next few years. There's actually a great gallery on the website showing the history of the design, it'll give you a good idea of the kind of site it is, informal but packed with information.

The site received over 6 million page reads per month at its peak and is still getting large numbers of readers now. The rise of social media ate into the website's traffic but it still has a loyal user base with over 180,000 user accounts and consistent return visitor usage, for anyone living in London it's a known brand.

The old website architecture

The previous version of the site was built entirely in PHP with a MySQL database backend. There was no framework code and the site was actually a kind of hybrid static/dynamic setup. The majority of the listings data was held locally on a linux box at the All In London office and at regular intervals the website was 'milled' by a PHP script that built 90% of the website as .php files. The milled pages were copied into the site root which consisted of the various images, css, js and php includes for running the dynamic parts of the site.

This kind of architecture was popular back in the late 90s when dynamic websites were quite new and the best way to build a site with thousands of pages was to actually create the pages! This was all created on PHP4 and some of the scars from that period can still be seen in the code in workarounds and hacks, including the infamous PHP magic quotes!

This system held up well for 17 years and dealt with some huge traffic volumes, lots of businesses (listings) were very successful on the website (infact some were built on top of the success they had on All In London) and lots of people got paid for working on the project. That said, it was clear that the site could do with being moved to a platform that would allow it to keep going into the future. Recent changes in the PHP language had led to some hacks being introduced into the code and it was at the point where the owners were nervous about making changes.

The new architecture

We looked at a few options for building the new version of All In London. The main aim of the upgrade was to remove the local database requirement and have the entire site running from the production database. This would allow it to react very quickly to requests for updates. It would also remove the bottleneck of needing somebody within the office to make changes to some of the listing data.

We first looked at using a simple framework like Slim PHP to allow us to handle all of the routing and then create templates for each of the site areas, this would be coupled with doctrine to get the data out of the production db and then map it into some new models. Whilst this looked like a promising idea the fact was that the data wasn't in great condition and there would be a need to add lots more tables to hold the migrated local data. This started to look like we might be creating a monster rather than a solution. We eventually decided that what was needed was a migration 'layer' that the data would pass through so that it could be mapped into new database tables and models. The data could be modified, cleansed and keyed more efficiently as part of the move, this was easily the most attractive option. With this as the starting point any PHP framework would have been on the table, so we reached for our framework of choice, Laravel. 

The new website is a Laravel project with all of the front-end handled by blade and Bootstrap 4 (without jQuery!). All forms on the site are now handled by Vue JS so that the process is smooth and reliable. The admin side of the site is entirely built in Vue JS with Laravel handling the api.

The challenges in migrating lots of data

There are lots of challenges in migrating 17 years worth of data into a fresh new application. Missing data because of a lack of foreign key constraints was a big issue, we had to weed out lots of 'orphan' data. Then there's the issue of a lack of consistent date stamps or formats across records, this all needed to be addressed before we could think about the structure of the new database. There's also the issue of importing data that is still changing in production, so new inconsistent data is being created every day.

Our approach was to run the entire database through a 'cleansing' process before converting it into its new format. In a nutshell this process involved creating a fresh new MySQL database for the new Laravel application then having the old tables sat in an 'import' database available on another db connection. The old data would first be cleansed by a series of SQL queries written through trial and error as the import process evolved, the import would hit an issue we'd then write the SQL to clean that issue, eventually you reach a level of consistency in the data. After this the process would loop over every existing piece of data, run it through a converter then feed the converter output into the Laravel model to store in the new application.

This worked very well, whilst it was slow to run the entire import (9 hours) it was providing a reliable state for the new models. Dates were corrected, relationships stored correctly, slugs were generated and similar concepts/types were combined together. The end result was data from 17 years of chopping and changing tables converted into nicely structured, consistent utf-8 tables. This change alone has opened up massive opportunities for leveraging the site's data.

We ran into a few memory issues with these conversions, turns out that looping over 2.5 million records and building Eloquent models for each uses a lot of memory! The solution to most of this is just chunking the data, PHP CLI seems to be pretty efficient at this, the 9 hour import process happily ran inside a VM on a MacBook Pro. We did hit a weird issue when trying to speed up the import using bulk inserts, about 20k records in to a 2.5m record database the VM started consuming large amounts of disk space, as if a huge log file was being written very quickly, about 100M every few seconds. We couldn't get to the bottom of this issue, interestingly it seemed to start happening in PHP 7.4 only.

The entire import process was built using TDD, we mocked the old data structures and built the entire process up before importing any real data, this worked really well and for most of the tables there weren't any issues when switching between mocked and real data.

 

The new website

With the cleansed data in place the interesting part of the project could begin...

The new site design relies heavily on Laravel's blade templating system, using components, stacks etc. It was a pleasure to build the website, it has some nice features including a listing finder service that unifies all of the site's listings in a common format for searching. It also has a page data system that sits behind the controllers and allows a fluent interface to build items that should appear on the page, it also allows fine-grained control of the ads that apppear on any page.

The biggest change for the people running the website has been the new submissions system. Every piece of data that's submitted to All In London is now given a ticket number and held in a queue until it's processed by a member of the team. It can handle lots of different types of submission, user reviews, events, listing updates etc. Team members can reject submissions with a reason, they can also approve data in different ways. Once a data submission is accepted the system will create new records and bind the submission to it, any future submissions for this record will also be logged against it giving the listing a full history of additions and changes. The team have said that this has taken hours off the weekly admin.

A quick note about the admin panel: around 70% of the development of the new site after the data was imported was building the admin panel. The entire panel was built using TDD, a simple process when using Laravel as an api backend to a Vue JS front end. The admin panel is the core of All In London, this had to be 100% right before we created a single page of the website.

What about all the existing urls indexed in Google?

This was the major dilemma with this project from day one. Around 90% of the old site's content was served via .php files, so it would have urls like allinlondon.co.uk/directory/1111/2222.php. Lots of these pages had very good rankings in Google and other search engines so any risk of disrupting the seo had to mitigated.

The decision was made to re-direct these urls to new dynamic routed versions such as allinlondon.co.uk/directory/bakeries/a-bakery. This would provide a much better user experience in that the url would have meaning when read out and also you could navigate through the directory structures just by changing parts of the url.

We built a test suite that tested all of the existing url types and asserted that they were correctly redirected to the new versions. This involved looking at every part of the old site to ensure that all urls were covered. The challenges here came from the fact that the areas of the site that were dynamic were using urls like /whats-on.php?type=exhibition and whats-on.php?region=123, so we had lots of different potential urls that would need to be effectively remapped. All of the redirects returned 301s to ensure that the various search bots updated correctly. 

 

Impact of the migration on SEO and Page Speed

The site went live early in June 2020, it's now July 2020 so we've had a few weeks of monitoring the changes to see what the impacts have been. 

This migration has been a tectonic shift for All In London as it's never had such a change in it's 17 year history, the architecture has remained broadly the same in all that time. With this in mind the number 1 priority was not to disturb the existing traffic patterns coming from search engines. If you take a look at the Google analytics chart below you'll see the point of migration.

Chart showing migration day impact

As you can see after the launch day we had an initial spike, this looks like an anomally and unfortunately not an instant increase in page reads! The main point in this chart is that the traffic remained consistent, the redirects clearly worked and people were still getting the the pages they clicked in search results.

Looking at Google webmaster tools and various log files we can see that the only new 404s being generated were those we expected, some of the pages on the old site were considered low quality by the owners and they opted to drop them from the new site. We picked up on a few 500 errors but were able to find the cause and fix them very quickly, it affected only around 200 pages from 500k+. 

So overall we were very happy with the results of the migration, over 500 unit tests gave us a fair bit of confidence but there's always that risk you missed something, although with good monitoring and logging you can quickly spot issues.

Another change after the migration is the effect on Google's page speed analysis. We spent some time on optimising this as Google uses the page speed analysis results from lighthouse as a ranking marker, especially on mobile. We stripped lots of code out of the initial build, this included removing jQuery, building only the parts of Bootstrap 4 that we're actually using and replacing the Bootstrap jQuery with a pure JS replacement. We also optimised the page load as much as possible for mobile, this mainly involved getting as much code out of the initial execution as possible, the page needs to start painting as quickly as possible and be fully interactive in a relatively small window to be considered acceptable on mobile. If you look at the chart below you'll see that the impact of the changes has started to show:

Page speed analysis changesWhat this shows is a flip of over 100,000 tested urls on mobile flipping from 'poor' (red) to 'needing improvement' (orange). Whilst 'needing improvement' doesn't sound great it's actually close to impossible to reach 'good' when you have ads running on the page. It's relatively easy to hit 100 across the board if your website is simple, as soon as you start adding in features and functionality that require javascript staying out of the 'poor' performance is the really the best you can do. We optimised page speed to the point where if we disable ad serving on the website we get mobile results that are close to 100 (good), as soon as ad serving is enabled these drop to 50+. This shows that there's really very little else we can do, the poor performance is actually caused by Google's own javascript.

 

Do you have a website like ALl In London that you need to migrate?

In conclusion we're very happy with the results of this site build and migration. It's still relatively early days but the stats we're seeing daily show that it's looking promising . The owners of All In London have some big plans for moving the site in new directions, we can't talk about those :), but without this migration to Laravel none of it would have been possible.

This project was completed in 5 months, on time and to budget. If you're looking at migrating a large site like All In London to a new PHP platform get in touch as we've probably solved most of the problems you're facing.


Article Category Tags

PHP Web Design Web Development SEO Business