How we optimized provision time from 4 minutes to 25 seconds

Last updated: November 28th 2024

Introduction

Webdock has always had fast provisioning times, but after rolling out our new infrastructure in Denmark and making a number of changes along the way, we were not happy that provisioning could in some cases take a full 4-5 minutes. Here we detail how we got provisioning time down to our best case of 25 seconds on an Epyc instance.

Analyzing the issue; finding the first minute

When we started working on this, the very first thing we noticed was how we wrote network configuration: we first shut down the instance, did some work on the host, started the instance again, wrote the network config and finally did one more reboot. Shutting down and rebooting an instance is an expensive operation, so we looked at whether it was possible to apply networking after the very first boot of an instance, without having to restart it twice. With some magic on the host and with cloud-init, we achieved just that.

Once we had optimized that part of the workflow, we had shaved a full minute off provisioning time - now we were down to 2-3 minutes on average. Nice!

Using optimized images

After our launch of the Denmark DC - and rather recently in fact - we switched our hypervisor from LXD to Incus. Incus is a truly open source (and much better maintained) fork of LXD. Despite this change, for compatibility reasons we were still using LXD images. After speaking with the people who built Incus, we found out that the LXD images were poorly optimized and included all sorts of junk that wasn't really required. This made every boot at least 10-15 seconds slower.

Switching to Incus native images, which are much better optimized, saved another 20-30 seconds, getting us down to 1.5-2.5 minutes.

Making sure images were cached

Next, we noticed that images were often being downloaded from our image remote instead of being properly cached on our hosts. An image download could take 30-40 seconds, so making sure images were cached on hosts reduced provisioning time even further - we were now down to 1-2 minutes.
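
The caching check described above can be sketched roughly as follows. This is a minimal illustration in Python, not a description of what actually runs on the hosts or inside Incus: `ensure_image_cached`, `prewarm`, the `.img` file layout and the `download` callback are all hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def ensure_image_cached(
    fingerprint: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
) -> Tuple[Path, bool]:
    """Return (path, was_cached) for the image with the given fingerprint.

    `download` is a hypothetical callback that fetches the image bytes
    from the image remote. It is only called on a cache miss.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    image_path = cache_dir / f"{fingerprint}.img"

    # Cache hit: nothing to download, provisioning can start right away.
    if image_path.exists():
        return image_path, True

    # Cache miss: write to a temporary file first, then rename, so a
    # half-written download is never mistaken for a cached image.
    data = download(fingerprint)
    partial_path = image_path.with_suffix(".partial")
    partial_path.write_bytes(data)
    partial_path.replace(image_path)
    return image_path, False


def prewarm(
    fingerprints: Iterable[str],
    cache_dir: Path,
    download: Callable[[str], bytes],
) -> int:
    """Cache every listed image ahead of time.

    Returns how many images actually had to be downloaded.
    """
    downloaded = 0
    for fingerprint in fingerprints:
        _, was_cached = ensure_image_cached(fingerprint, cache_dir, download)
        if not was_cached:
            downloaded += 1
    return downloaded
```

Running something like `prewarm` on each host before instances are requested moves the download out of the provisioning path, so the request itself only ever sees the cache-hit branch.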

But wait: why were instances booting twice when first spun up?

The next thing we noticed was that whenever a new instance was spun up, it restarted after having completed almost the entire boot process. Why? After conferring with the people who built our hypervisor, Incus, we found out that this was due to some race conditions and to cloud-init needing to initialize some things through templates passed through the hypervisor. We found that we could eliminate this step, ultimately getting us down to our current "record" time of about 25 seconds for a base Noble image on Epyc.

Your results may still vary

Although our current best provisioning time - a base Noble image on an Epyc instance - is about 25 seconds, your results may still vary a bit. Images are now generally cached, but sometimes our system needs to re-download an image, adding about 20-30 seconds. In addition, our LAMP/LEMP stacks are a bit slower to provision, as we do a lot of additional configuration steps to set up users for FTP, MySQL etc. The profile and platform you provision on also affect provisioning time: obviously a pico4 profile on Xeon, which only has 1 CPU thread to work with, will provision slower than a beastly Epyc instance.
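
As a rough back-of-the-envelope illustration, the figures above can be combined like this. The 25-second best case and the 20-30 second re-download penalty come from this article; treating them as simply additive is an assumption, and `estimate_provision_seconds` and its `extra_setup_seconds` parameter are hypothetical, since the extra time for LAMP/LEMP stacks or slower profiles is not quantified here.

```python
from typing import Tuple

BEST_CASE_SECONDS = 25.0            # base Noble image on Epyc, image cached
REDOWNLOAD_SECONDS = (20.0, 30.0)   # extra time when the image must be re-downloaded


def estimate_provision_seconds(
    image_cached: bool = True,
    extra_setup_seconds: float = 0.0,
) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds.

    Assumes the delays simply add up. `extra_setup_seconds` is a
    placeholder for stack setup or a slower profile, which you would
    have to measure yourself.
    """
    low = high = BEST_CASE_SECONDS + extra_setup_seconds
    if not image_cached:
        low += REDOWNLOAD_SECONDS[0]
        high += REDOWNLOAD_SECONDS[1]
    return low, high


print(estimate_provision_seconds())                    # (25.0, 25.0)
print(estimate_provision_seconds(image_cached=False))  # (45.0, 55.0)
```

Even with a cache miss, the estimate for a base image stays under a minute.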

With all that said, in real life you should see sub-1-minute provisioning most of the time, and 1-2 minutes in the worst case.

We are pretty happy with that result :)

Arni Johannesson, CEO Webdock