Focal Fossa Infrastructure Upgrade Event Timeline and Summary

Last updated: May 21st 2020

Introduction

As some of you probably noticed, we had some issues on the platform following the Focal Fossa upgrade on Monday. Here we describe the bugs that were identified and how we resolved them, outline any lingering impact they may have, and explain how we will proceed in the near future. There is also a specific advisory for Runcloud users (please see below).

Timeline

After the upgrade, which completed Monday evening, everything looked as if it was running normally and we happily called it a day. The following morning, however, things took a turn for the worse as we started getting a steady stream of reports. They fell roughly into 3 categories:

  1. Users were unable to SSH into their server or at least experienced severe delays while connecting
  2. Runcloud users saw Runcloud report that it had no connection to their server
  3. Newly provisioned servers were slow to come up and Certbot would fail

Later we hit a fourth error condition: on one of our hosts, servers could not be restarted, and attempting a restart triggered a weird SQL error response.

These issues were ultimately identified and resolved as follows:

1. SSH Connection issue: Tied to a Load Average virtualization component in LXD (our hypervisor)

This was the worst issue to debug and was the core cause of the up to 3 system reboots our users experienced in the 36-hour period following the upgrade. The issue was creeping in nature: the busier a system was, the quicker it surfaced, but it took at least 2-3 hours to really kick in. Because of this, we at first erroneously thought that the fix we had implemented, which consisted of disabling a kernel module, had worked. It turned out that it was in fact just the reboot itself that had unstuck the systems temporarily, and the issue resurfaced for all customers after a while.

The same effect led us to believe that our second fix had also worked, when in fact it hadn't.

It wasn't until we discovered the real cause of the problem on Wednesday morning that we finally got this resolved. We chose to wait a full 24 hours before sending out this bulletin in order to make absolutely sure that the issue was fixed.

Impact:

The impact of the fix is negligible. However, as we had to disable the load average virtualization component, if you look at the load average for your server now, you will see the load average for the entire host system and not just for your VPS server. For regular Webdock customers this is purely cosmetic and does not matter: when we check for high load, we do not look at this system-level figure but rely on other mechanisms.

Where this has a minor impact is for users of third-party control panels such as Runcloud, who might now periodically see "high load" warnings because Runcloud is seeing the load average for the entire host system and not just your VPS. This may lead you to think there is anomalous load on your server when there is in fact nothing anomalous going on.
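If you want to see the raw figures for yourself, the short Python sketch below reads the three load averages that Linux exposes in /proc/loadavg. It is only an illustration: the function names are made up for this example, it assumes Python 3 is available on your server, and we are assuming (not asserting) that this is the same figure your control panel reports. While load average virtualization is disabled, expect these numbers to reflect the whole host rather than your VPS.

```python
import os
from typing import Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Parse the contents of /proc/loadavg into (1 min, 5 min, 15 min) averages."""
    # The file looks like "0.42 0.36 0.30 1/123 4567"; only the first
    # three fields are load averages.
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def read_loadavg(path: str = "/proc/loadavg") -> Tuple[float, float, float]:
    """Read and parse the load averages visible inside this server."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


if __name__ == "__main__":
    if os.path.exists("/proc/loadavg"):
        one, five, fifteen = read_loadavg()
        # Hypothetical wording; the values are host-wide while the
        # virtualization component is disabled.
        print(f"load average (1/5/15 min): {one} {five} {fifteen}")
```

A high value here is therefore not, on its own, a sign that anything is wrong with your server.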

Beyond this, our disabling of our ZFS I/O subsystem optimizations and the aforementioned kernel module during our initial phase of resolution is not of a critical nature and does not impact your VPS in any way, other than overall I/O system performance being "as it was" before the Focal Fossa upgrade. We would like to re-enable these "quality-of-life" enhancements in the future, but we will bundle this with a system upgrade that we need to do anyway somewhere down the line. We do not want to incur any more system reboots or downtime for our customers at this point.

Ongoing resolution:

The load average virtualization is a really nice thing to have, so we are working directly with the LXD team at Canonical to resolve the underlying issue. Once it has been resolved we will do thorough testing and due diligence before this component is re-enabled on our production systems.

2. Runcloud Control Panel Connection Issue: Tied to incorrect Kernel patch detection in Go. System kernel limit increase fixes the issue.

This issue was found with the kind assistance of the Runcloud support team and a couple of our patient customers. It was traced to a problem in Go, the programming language powering the Runcloud service which runs on your server and communicates with their control panel.

The Runcloud service was unable to start up properly because a kernel detection routine in Go was erroneously failing on our shiny new Focal Fossa kernel. This is really a Go issue and does not indicate a problem with either Webdock or Runcloud. The solution was to raise the kernel limit on memlock in your VPS, which enables Go to run through a workaround routine it thinks is needed, as it thinks our kernel is insecure (it is not).

Impact:

This only impacts Runcloud users. If you see that Runcloud cannot communicate with your Webdock VPS server, please restart the server in the Webdock control panel. It should then come up and Runcloud should be able to communicate with it again.

A way to check if the fix is active for you is to SSH in to your server and execute "ulimit -l". It should print a number of 16384 or larger. If it prints something like 64, the fix is not active for you and you need to reboot your server for it to take effect. As always, if problems continue, please get in touch with us.
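If you would rather script this check, here is a minimal Python sketch of the same test. The function names are hypothetical and chosen just for this example, and it assumes Python 3 is installed on your server. It relies on the fact that "ulimit -l" prints the soft memlock limit in KiB (or the word "unlimited"), while Python's resource module reports the same limit in bytes.

```python
import resource

# Threshold from this bulletin: "ulimit -l" should print 16384 or larger.
REQUIRED_KIB = 16384


def memlock_fix_active(ulimit_output: str, required_kib: int = REQUIRED_KIB) -> bool:
    """Interpret the text printed by "ulimit -l" (a KiB count or "unlimited")."""
    value = ulimit_output.strip()
    if value == "unlimited":
        return True
    return int(value) >= required_kib


def current_memlock_limit() -> str:
    """Return this process's soft memlock limit in the form "ulimit -l" prints."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports bytes; "ulimit -l" reports KiB.
    return str(soft // 1024)


if __name__ == "__main__":
    limit = current_memlock_limit()
    if memlock_fix_active(limit):
        print(f"memlock limit is {limit}: the fix is active")
    else:
        print(f"memlock limit is {limit}: please reboot your server")
```

Run it in the same SSH session in which you would run "ulimit -l", since the limit is inherited from the shell that starts the script.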

Ongoing resolution:

None, this issue has been completely resolved and at most requires a reboot of your Webdock server to take effect.

3. Newly Provisioned servers were slow to come up

This issue was very quickly resolved and was a simple matter of cloud-init behaving differently on Focal Fossa and causing network issues in new servers. Any existing servers were not affected.

Ongoing resolution:

None, this issue has been completely resolved.

4. SQL Error on restart of servers

We identified this issue with the help of the LXD team at Canonical as an undocumented kernel parameter that needed to be modified. This only impacted a single host of ours as the kernel parameter was already correctly set on all other hosts. The fix was in the end quick and without any further impact.

Ongoing resolution:

None, this issue has been completely resolved.

Summary for Runcloud users

Runcloud users were especially hard hit by the upgrade, as they suffered from two distinct error conditions which prevented communication with and access to their Webdock server. For this we apologize. Furthermore, Runcloud users are pretty much the only users who have any lingering impact from all of this, which consists of:

1. You will see higher than normal load on your server as reported by Runcloud until we re-enable load average virtualization

As mentioned above, this is nothing to worry about: we always monitor your server independently here at Webdock and will be sure to warn you if truly anomalous load occurs. A fix will be forthcoming once we resolve the load average reporting issue with the LXD team at Canonical (see details above).

2. Your server may be unable to communicate with Runcloud: To fix this, reboot your server

A reboot is required, as otherwise the kernel limit update we made does not take effect. Only do this if you see that Runcloud cannot communicate with your server. As mentioned above:

A way to check if the fix is active for you is to SSH in to your server and execute "ulimit -l". It should print a number of 16384 or larger. If it prints something like 64, the fix is not active for you and you need to reboot your server for it to take effect. As always, if problems continue, please get in touch with us.

Thank you all for your patience

We would like to thank all our awesome customers for their patience while we worked through these issues and especially the ones who kept us in the loop on what they found out and helped us test out various things: You are all very awesome.

We now look forward to a period of consolidation where we focus on eliminating any and all wrinkles on the platform and ensuring rock-solid stability for a good long while before we introduce the many new enhancements and upgrades we have planned for you all.

We look forward to providing you with many more years of awesome hosting here at Webdock :)