Where we came from
So, how did it all begin? We kicked off the Glodo party in 2019 as a spin-off from our sister company Glo Networks, and we were fortunate enough to be able to borrow the experience of their team, giving us an awesome lineup from the get-go. We knew we wanted a modern platform to build Glodo on, and we had identified a strong contender.
The traditional way to deploy Odoo (what is Odoo?) is as a service/daemon on a physical or virtual machine. Our goal was to use spare capacity on our sister company's existing on-premise Hyper-V cluster.
The problem here was that we couldn't rapidly spin up VMs ourselves: we were looking at hours rather than minutes for a new deployment. For demo environments, testing, and the like, this was a real bottleneck.
We were already using Docker for development and felt that it could be our ticket to a more efficient workflow.
Version 1: Docker Swarm
After a lot of research and testing, our initial infrastructure was entirely based on Docker Swarm. We deployed a small cluster of Ubuntu 18.04 VMs, managed the underlying OS and the Swarm deployment through Ansible, and deployed instances of Odoo manually through Swarm-compatible Docker Compose files.
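At the time, a project deployment looked something like the sketch below. This is illustrative rather than our actual stack file: the image tag, service names, and volume layout are assumptions.

```yaml
version: "3.7"

services:
  odoo:
    image: odoo:12            # illustrative tag; each project pinned its own image
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      HOST: db                # PostgreSQL host (later moved outside the Swarm)
      USER: odoo
      PASSWORD: odoo          # placeholder; real deployments used Swarm secrets
    volumes:
      - odoo-data:/var/lib/odoo   # Odoo filestore (attachments, sessions)
    networks:
      - odoo-net

volumes:
  odoo-data:

networks:
  odoo-net:
    driver: overlay           # Swarm's multi-host network driver
```

A stack like this goes live with `docker stack deploy -c docker-compose.yml <project-name>`.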
Swarm was incredibly easy to deploy, worked well for small clusters, and allowed us to get on with the business of developing and deploying Odoo in the early days of our company.
We made minor tweaks over time, such as moving PostgreSQL out of the Swarm as we grew.
After the Honeymoon
By this time we were now operating 2 Docker Swarms (swarms? swarm?): ours and an on-premises deployment for a customer.
It was around this time that it became apparent we were hitting issues in Swarm. Most were minor irritations with workarounds (e.g. we needed something like CronJobs in-cluster), but 2 were really giving us trouble:
Extra containers running that shouldn't be.
Swarm managers failing to communicate.
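For context on that CronJobs gap: Kubernetes ships scheduled jobs as a first-class resource, which Swarm has no equivalent of. A minimal sketch of what we were missing (the name, image, schedule, and command are illustrative):

```yaml
apiVersion: batch/v1          # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: odoo-nightly-job      # illustrative name
spec:
  schedule: "0 2 * * *"       # run every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job
              image: odoo:12                                # illustrative image
              command: ["echo", "run scheduled work here"]  # placeholder command
          restartPolicy: OnFailure
```

In Swarm, the usual workarounds are host-level cron triggering `docker exec` against a running container, or a dedicated scheduler container in the stack.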
Normally we'd have dedicated more engineering time to finding solutions and submitting patches back upstream. Unfortunately, at the time Docker Inc.'s future was in question, and Swarm had not received updates for months.
By February 2020, Mirantis (who had purchased Docker Enterprise) had announced they would continue maintaining Swarm. By then, however, all the uncertainty had left us actively looking at replacing Swarm.
Version 2: Kubernetes
Wait. Do we have Big Cloud Company Problems?
As we were building out Version 1 of our infrastructure, Kubernetes had come up. We dismissed it because we weren't a Big Cloud Company, and we didn't have Big Cloud Company problems.
In retrospect I feel that's a bad take on Kubernetes.
It's certainly a very complex beast.
It's certainly not a solution you should just reach for without really thinking about it.
The learning curve can be steep; especially if you are doing it on your own infrastructure.
Despite that, after several months of testing (as a background task) we did opt for Kubernetes for several reasons:
Kubernetes has been eating mindshare in this space.
HashiCorp Nomad was considered, but it's a very different beast. We felt we would need to add so much on top to get what we needed that we would effectively be building a discount Kubernetes.
Apache Mesos was also considered. Again, different, but more importantly it seemed to be in active decline. We didn't want to be creating Version 3 in a few months' time.
It allowed us to keep relatively close to our existing workflow, whilst allowing us to automate more.
By switching to Kubernetes we can move to any managed Kubernetes offering down the road with minimal changes to our manifests.
We get a unified workflow where the majority of our staff do not need to know about the underlying VMs, worry about capacity, etc.
The pragmatic approach
Realistically, we're still a small company. We don't host a vast number of Odoo installations yet. However, we do plan to. We take this seriously.
To achieve the right balance of engineering effort and benefit we ended up at a middle ground with Kubernetes by taking the bits we need now and ignoring the bits we don't.
Our setup currently looks like this:
NFS for shared storage across nodes.
Teleport for remote access (to both the nodes and Kubernetes).
FluxCD2 (GitOps toolkit) to deploy core infrastructure (we use MetalLB for load balancing, Traefik 2 for Ingress, cert-manager for TLS certificates, external-dns to manage DNS entries, nfs-client-provisioner, and kube-prometheus-stack for monitoring).
SOPS for secret management (integrated with FluxCD2).
GitHub Actions to deploy projects (for now - eventually we plan to replace this with FluxCD2).
PostgreSQL still runs outside the cluster (with WAL-G for continuous backups to off-site storage).
Duplicity for periodic, per-project off-site backups - we opted not to use something like Kasten or Velero so that the data within those backups isn't tied to Kubernetes.
VM level backups are taken by our sister company Glo Networks.
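To make the FluxCD2 + SOPS integration above concrete: a Flux Kustomization can be pointed at a path in a Git repository and told to decrypt SOPS-encrypted Secrets before applying them. The sketch below is illustrative; the path, resource names, and key Secret are assumptions, not our actual repo layout.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure        # illustrative name
  namespace: flux-system
spec:
  interval: 10m               # how often Flux reconciles this path against the cluster
  path: ./infrastructure      # illustrative repo layout
  prune: true                 # remove resources that are deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops            # decrypt SOPS-encrypted manifests in-cluster
    secretRef:
      name: sops-key          # illustrative: Secret holding the decryption key
```

The nice property of this arrangement is that encrypted secrets can live in the same Git repository as everything else, with decryption happening only inside the cluster.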
Is this perfect? No. There are certainly holes in it and things we'd like to change. But is it what we need right now? Yes.
Is it rock solid? Absolutely. We've had zero problems relating to Kubernetes since moving to it (we have had issues with Traefik, but that's a separate story).
On-premises vs Cloud
One great benefit of switching to Kubernetes is that we can now easily move our cluster to any managed Kubernetes offering. Our manifests and deployments should just work with minimal tweaks. In the event we grow beyond our sister company's capacity we can just move.
What would Version 3 look like?
Honestly, our hope is that there isn't a Version 3 for quite some time.
Don't get us wrong, we're still making minor changes. Let's say we're on something like Version 2.3 right now. However, the core underlying infrastructure is still the same.
Eventually I expect we'll either outgrow k3s, or k3s will grow as we grow. We're actively considering replacing MetalLB with kube-vip. But these are all "minor" changes, in that we're still just deploying Kubernetes.
Now that the Kubernetes Honeymoon is over, would you use it again?
I don't see why not. In fact, we now operate 2 almost identical clusters: one for ourselves and another for a larger on-premises customer.
Let us know your thoughts and experiences!