Max Rosenbaum
Software developer
Monday, 29 November 2021
Pitfalls of the Drupal maintenance page
Working with Drupal at scale can be a tricky beast: managing state, database load and complex caches all have their challenges. One challenge that completely blind-sided the team I was working on was the Drupal maintenance page.
The Drupal maintenance page isn't exactly that
I ran a deployment for a large site with very high levels of traffic. The flow was to trigger the maintenance page for all requests using drush, run the deployment (config import etc.) and then remove the maintenance page when all the work was completed. The only problem was, it didn't quite go like that this time.
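For reference, a sketch of that deployment flow using standard drush commands. The exact steps and the dry-run stub are my approximation, not our actual script; swap the stub for the real drush binary to run it for real.

```shell
#!/bin/sh
set -e

# Dry-run stub so the sketch can run anywhere; swap in the real binary
# (e.g. DRUSH="vendor/bin/drush") for an actual deployment.
DRUSH=${DRUSH:-"echo drush"}

# 1. Put the site into maintenance mode for all requests
$DRUSH state:set system.maintenance_mode 1 --input-format=integer

# 2. Run the deployment work: import config, apply DB updates, rebuild caches
$DRUSH config:import -y
$DRUSH updatedb -y
$DRUSH cache:rebuild

# 3. Take the site out of maintenance mode again
$DRUSH state:set system.maintenance_mode 0 --input-format=integer
```

Note that every one of those drush commands still bootstraps Drupal against the live database, which is exactly where things went sideways for us.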
The deployment script was hanging for a little longer than usual, which was odd but nothing to be alarmed by, until this not-so-friendly warning popped up in the console:
SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
The site was broken: CSS and JS were not compiled (we were running the aggregation module), so we attempted another cache rebuild, this time manually from the command line of our containers. Again we got the error and nothing was looking better; we were in big trouble. We decided to run a drush cr one last time, with the database lock occurring yet again. Fortunately, this time everything looked to be working as far as we could tell, so we began urgently figuring out what went wrong.
The maintenance page lies to you
If you're running a site with enough load, the maintenance page isn't a guarantee that things are going to work
correctly. The idea is that the maintenance page will isolate the database while the dance of various drush
commands happens in the background. This is to ensure the database state isn't crazy or unexpected during an upgrade,
which makes perfect sense. Drupal is quite sensitive to bad state during this period.
The root of the problem? Despite the maintenance page being the recommended way to run upgrades on a Drupal site, Drupal still needs to bootstrap and figure out its state on every request. This means the maintenance page still opens a database connection, and in our configuration it was returning a 500 HTTP status as well.
This caused a two-fold problem. Our CDNs were getting 500 responses, which busted the CDN cache. This had the knock-on effect of forwarding much more traffic than normal to our origin servers, and these requests to the maintenance page ended up overwhelming the database. The database was already under heavy load from the upgrade, and the number of open connections caused a deadlock on a row, forcing drush to abandon its work.
Solutions to the problem during the outage
Hope and pray that your database recovers and Redis has cached enough for you to attempt the upgrade again. Honestly, we just got lucky that our database recovered quickly enough to run the drush commands again and that they worked well enough to leave us with a functional site.
The outage lasted about 40 minutes. We floated the idea of sinking traffic at the CDN level, but fortunately we didn't have to do this. The situation is very similar to being under a DDoS attack, and it is just as hard to mitigate.
Solutions to the problem after the fact
There are realistically five layers we need to think about to solve this problem:
- CDN
- Redis implementation
- Database (usually MySQL in Drupal land)
- Drupal itself (code base)
- nginx/Apache/whatever HTTP reverse proxy you're running
The first problem is that (in our configuration) the maintenance page was issuing 500s back to the CDNs. As soon as this starts to happen, the CDNs (rightfully) stop caching the given URL and forward traffic to origin. There are a few solutions to mitigate this problem:
- Cache the 500s with a very short lifetime (maybe 5-10 minutes)
  - We decided not to do this, as it would prolong outages, and the cache headers are forwarded to clients as standard content, which is also sub-optimal. But if your content is slow-moving enough and your tolerance for downtime is a bit higher than ours was, it is a legitimate solution.
- Return a 200 from the maintenance page at the Drupal level
  - This has a massive problem associated with it: if you're running a high cache lifetime (let's say a day), the maintenance page will be cached as a legitimate result in your CDN and presented to users as such. Since our cache lifetimes were about a day, we were 100% unable to do this. If you have a very short cache lifetime, like 10 minutes, this is again a possible solution.
- Run a larger, more powerful database
  - We were already running a very large MySQL instance in a master/read-slave configuration. We did actually look at this as the first port of call, but we didn't want to just scale up and risk another deadlock in production; it could have been a deadlock for not-load-related reasons as well.
We finally settled on a fairly custom solution using nginx.
The basic idea is that if we could short-circuit the traffic at the nginx level, we would be able to isolate the Drupal instance during an upgrade so no traffic reached it. A kill switch of sorts. We coded up a maintenance page in pure HTML, built into our nginx containers, that would be served whenever a particular file was present on disk. Part of our Drupal deployment process would create a specific file in the shared EBS volume (where the public files were stored), and nginx would check for this file on each request. If the file was present, nginx would return the maintenance page with a 500 HTTP result (using an nginx Lua script). This way, we:
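As a rough illustration of the idea: we used a Lua script, but the same kill switch can be approximated with plain nginx directives. The flag path, document root and upstream here are hypothetical, not our actual configuration.

```nginx
upstream drupal_backend {
    server 127.0.0.1:9000;  # hypothetical Drupal app server
}

server {
    listen 80;

    # Kill switch: the deployment script drops this flag file on the
    # shared volume, and nginx checks for it on every request.
    if (-f /mnt/shared-files/maintenance.flag) {
        return 500;
    }

    # 500s are answered with the static maintenance page baked into
    # the nginx container, so Drupal is never touched.
    error_page 500 /maintenance.html;
    location = /maintenance.html {
        root /usr/share/nginx/html;
        internal;
    }

    location / {
        proxy_pass http://drupal_backend;
    }
}
```

Removing the flag file at the end of the deployment flips traffic back to Drupal with no nginx reload required.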
- Didn't touch the drupal servers during an upgrade
- Didn't pollute the CDN with drupal maintenance pages
- Didn't require custom code to be bolted onto the drupal maintenance page (which we figured could regress easily and is pretty hard to test properly)
Final thoughts
It is a massive caveat of the maintenance page that it still initiates a connection to the database while it is active. Given that the Drupal docs say the maintenance page is a hard requirement when upgrading a site with traffic, the implication is that it will short-circuit all traffic to the database. This isn't true.
Our solution was pretty unorthodox, and it's a bit crazy that there isn't a more isolated mechanism for the Drupal maintenance page, using a Redis cache or something similar. If you do find yourself in this scenario, your solutions are (generally) limited to isolating traffic away from Drupal and the database during an upgrade, so the maintenance page does not bring down your site while the database is under heavy load.