Our Journey to Kubernetes
27 August 2018
Filed under Infrastructure and Cloud.
This is very much a work in progress; we’re not there yet, so what better time to explore our current experience and write a story about it?
We grumbled back in 2016 that we wanted to write and deploy modern applications, and that our technology stack was starting to limit us. These grumbles eventually gained enough momentum for us to take this seriously, and our movement towards modernising our infrastructure began.
The “Why Kubernetes?” is pretty obvious. I won’t bore you with the details, but lots of us love containers, and we need good software to run our containers — Kubernetes is pretty much the industry standard and comes with all the functionality we require. That might not be true, but that’s what we thought when we started :)
October 2016 — Step 1: The Start.
At this point, we didn’t really know much about containers, let alone Kubernetes. Our initial struggles were more about balancing these new ideas with the reality that we had existing systems to maintain and focus on.
We certainly didn’t have the ability to fire up a new team to work on this or to re-allocate existing resources, at least in the short term.
What we could do was find motivated individuals to research the area, and as it turns out, this step was pretty easy.
At Mintel, we like to understand how things work. It’s software: it’s going to have bugs and flaws, and it’s going to need upgrading and customising one way or another. I don’t feel comfortable maintaining software without knowing some level of detail about how its various components work, and this is a view shared by many of my peers.
We went on a two-day workshop to discover more about Kubernetes — I wouldn’t say it helped us learn Kubernetes overnight. Far from it. It did something great though: it inspired and motivated us, and it helped to validate our decision. You sit in a room with people telling you about all the awesome software they are running, and you think to yourself, “I want what you have”.
In terms of getting hands-on with Kubernetes, we give thanks to kubernetes-the-hard-way by Kelsey Hightower.
This covers all the main concepts. It makes you configure Kubernetes from scratch and forces you to understand the processes that are running, as well as exposing you to the configuration options.
What it doesn’t do though, is tell you how you should run Kubernetes in your organisation. What underlying OS do I use? How do I deploy it? What networking options? What Kubernetes version? How do I secure this specific feature? RBAC what? Where do my private images come from? How do I monitor and alert? How do I keep sane with all the non-stop version updates from every single thing I deploy?
But ignore all that — you will find answers to those questions as you progress forward, from your own experiences, from your peers, from reading online blogs, from going to workshops. Oh, I also fully recommend going to meetups — you learn a lot, and often get free pizza :)
August 2017 — Step 2: A trial deployment.
Yes, nearly a whole year has passed — that problem of finding time to focus on something new is very real. In fact, I’d say it’s our biggest issue.
Do you hire contractors to come in and deliver a new tech stack? Do you build up internal knowledge whilst juggling the existing workload? Do you devise a plan to re-allocate resources over time? Do you forget about it and hope the problems go away on their own?
Our decision was to build up knowledge over time, re-allocate resources internally, and hire externally when the right opportunities presented themselves.
Once our initial team (made up of two people) was established, we sat down and started to plan, prototype, and learn. This was a rather crazy period that lasted many months: lots of new technologies, and a large chunk of them not even Kubernetes-related.
We already manage a lot of VMs on-premise (over 1,000), so our immediate thought was to deploy an experimental Kubernetes cluster utilising existing hardware. It was the quickest and cheapest way to get started (or so we thought).
So that’s what we did: we got some cheap Dell servers from eBay, put them in our racks, and ended up going with CoreOS. Why CoreOS? After running CentOS 6, anything is better. No, seriously: probably because it felt minimal and simple, and we’re running containers, right? So why not an OS geared to that use case.
How do we configure these things? It’s all container-based! That’s where Matchbox from CoreOS comes into play. OK… lots more learning, and perhaps a little too much YAML and JSON for my liking. Get used to it; there’s plenty more.
We actually ended up rolling our own Ignition files to deploy the Kubernetes components. We tried really hard to follow best practice at this point: you know, TLS everywhere, separating out the etcd instances, compiling lists of the many guides online, starting to build good documentation, and joining various Kubernetes Slack channels.
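To give a flavour of what that looked like, here’s a stripped-down sketch of a Container Linux Config (the YAML that ct transpiles into an Ignition file). The kubelet unit, version tag and flags below are illustrative placeholders rather than our real configuration:

```yaml
# Hypothetical Container Linux Config, transpiled to an Ignition file with ct.
# The version tag, flags and paths are placeholders, not our real setup.
systemd:
  units:
    - name: kubelet.service
      enable: true
      contents: |
        [Unit]
        Description=Kubelet (via the CoreOS kubelet-wrapper)

        [Service]
        Environment=KUBELET_IMAGE_TAG=v1.9.6
        ExecStart=/usr/lib/coreos/kubelet-wrapper \
          --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
          --client-ca-file=/etc/kubernetes/ssl/ca.pem \
          --cluster-dns=10.3.0.10 \
          --cluster-domain=cluster.local
        Restart=always
        RestartSec=10

        [Install]
        WantedBy=multi-user.target
```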
At this point, things were pretty cool — we actually had real immutable infrastructure. We could kill CoreOS instances, and have them come back from scratch and rejoin the cluster. Upgrading the underlying CoreOS instances was simple — bump the version, reboot the instance, wait for it to come up (auto-updating the OS felt scary as the docker version was kinda important to us if we were going to run Kubernetes…).
However, we still didn’t have anything useful running on our Kubernetes cluster… just the core components (and a pastebin).
We needed a way to deploy things, and we chose Ansible for this. After plenty more YAML, we could configure Kubernetes components and deploy various “core” applications such as Prometheus for monitoring, a Docker registry backed by Minio, and Dex for authentication against our LDAP backend.
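As a rough sketch of the pattern (the file layout and paths here are hypothetical, not our actual playbooks), it boils down to “template a manifest, then apply it”:

```yaml
# Illustrative Ansible play for deploying "core" applications to the cluster.
# The application names are the ones mentioned above; the paths are made up.
- hosts: localhost
  connection: local
  vars:
    core_apps:
      - prometheus
      - docker-registry
      - dex
  tasks:
    - name: Ensure a scratch directory for rendered manifests exists
      file:
        path: /tmp/manifests
        state: directory

    - name: Render manifests for the core applications
      template:
        src: "manifests/{{ item }}.yaml.j2"
        dest: "/tmp/manifests/{{ item }}.yaml"
      with_items: "{{ core_apps }}"

    - name: Apply the rendered manifests to the cluster
      command: "kubectl apply -f /tmp/manifests/{{ item }}.yaml"
      with_items: "{{ core_apps }}"
```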
We spent a long time looking into other areas as well — Ceph is a good example. We already use Ceph for storage, but Ceph in containers is pretty cool and Kubernetes has some great support for this. Plus, the version of Ceph we could run using containers was bleeding edge!
March 2018 — Step 3: Hey boss, can we start again? We don’t like on-premise.
That’s essentially what happened, and we think it’s still the right decision.
Maintaining infrastructure is hard and expensive (at least for production-ready clusters), and it doesn’t really enable us to scale out quickly — not unless you live next to a shop where you can buy disks and CPUs! A key issue for us was that we wanted to isolate the cluster (at a hardware level) from anything else on-premise. This would have really increased the costs, especially when you want to provide fast persistent storage.
It was a process that really helped us learn an awful lot though — the very fact that our cheap servers from eBay would randomly die due to hardware faults was fantastic — that’s exactly what you want when trying to gain confidence in software that promises to be resilient.
So we did. We started again, and this time we looked to a cloud provider (GCP, in our case).
April 2018 — Step 4: To the cloud!
A few things to note here — this really was a case of throwing everything away and starting again (in terms of code).
At the same time, we also transitioned from SVN to Git, and we adopted a policy of a firm break between existing and new infrastructure. This meant no VPN between our on-premise sites and our cloud-hosted instances (at least for now). We believed this would force us to do things better.
Onwards...
Managed Solutions
We weren’t really convinced by the managed offerings. GKE (Google Kubernetes Engine) and the others are great, but also opinionated. For example, GKE doesn’t let you run alpha clusters for more than 30 days. Whilst we’re not going to turn on every available alpha feature, there are some that really interest us (or that we think are actually required). If you’re grown up about things and accept the risks, running alpha features is OK. You just need to know that they’re all subject to change, and maybe not all that reliable. The Kubernetes documentation is very transparent about this, plus being able to submit issues upstream is neat.
A lot of the managed solutions also force you to upgrade when it suits them, or push you towards certain solutions — maybe we want to upgrade earlier or later, maybe we want to try another fancy Ingress or Networking layer. I’m pretty sure the list goes on and on, but you get the idea.
Kubernetes in a one-liner
So we looked at tools like kops, kubespray and typhoon. Again, these are amazing tools, but also opinionated. Maybe we are greedy, but after seeing all the possibilities available, we like our freedom.
We did try them though, and in many cases things just don’t work as expected — then you’re left raising issues, diving into other people’s code, forking, etc. We really didn’t want to fork these tools just to remove the things we didn’t require and then have to add in our own requirements…
Provisioning
We still went with CoreOS, but this time we chose Terraform to provision our instances, configure our networking, and handle a few other bits and bobs. Again, a pretty easy decision given how widely adopted it is. Terraform alone, though, didn’t feel like enough — we struggled with maintaining multiple versioned modules, so we looked to Terragrunt.
I’m not saying it’s perfect (I’m sure Packer could help simplify things), but it feels like we can maintain our modules without any real issues. It’s also well supported — well, Google searches provide good hits at least!
We’ve grown to like Ansible (especially me, given my Python background), so it still plays a similar role for us as before — configuring the additional Kubernetes applications that we consider critical onto a new cluster.
And the really cool thing? We actually use CI (Continuous Integration) to spin up entire clusters when we tag a new release of our modules. A tip: don’t use Google’s preemptible instances for CI; they will die during the run, and that might annoy you ;)
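To give a rough idea of the shape of it (this is a hypothetical, GitLab-CI-flavoured job, not our actual pipeline): on a tagged module release, build a throwaway cluster, check it, and tear it down again:

```yaml
# Hypothetical CI job triggered by tagging a module release; the script name
# is a placeholder for whatever smoke tests you run against the fresh cluster.
spin-up-test-cluster:
  stage: test
  only:
    - tags
  script:
    - terragrunt apply-all --terragrunt-non-interactive
    - ./scripts/smoke-test-cluster.sh        # hypothetical checks
    - terragrunt destroy-all --terragrunt-non-interactive
```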
Access Control
Oh shoot, everything’s public!
Behind the scenes we’ve adopted good security practices for Kubernetes — RBAC, pod security policies, network policies, TLS everywhere, no unwanted open ports, etc.
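As one small example of what that involves, a per-namespace default-deny NetworkPolicy looks something like this (the namespace name is a placeholder):

```yaml
# Deny all ingress traffic to pods in the namespace unless another policy
# explicitly allows it. "team-a" is an illustrative namespace name.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}        # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```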
But we still needed to authenticate our public services. This is where tools like CoreOS Dex or Keycloak can really help. There was a missing component though — how do our users (devs / cluster-ops) authenticate against the cluster? We ended up writing a utility called dex-k8s-authenticator to support the generation of kubectl commands for users. Together, this set of tools enabled us to authenticate via OAuth2 using our existing SSO backend.
This meant all our users (or roles) stayed in our existing IDP, and anyone with a valid set of credentials could generate a kubeconfig to work with one or more of our clusters.
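The kubectl commands it generates essentially write a standard OIDC auth-provider stanza into the user’s kubeconfig; roughly like the following, where every value is a placeholder:

```yaml
# Sketch of the resulting kubeconfig user entry; the issuer, client and tokens
# below are placeholders, not real values.
users:
  - name: jane.doe@example.com
    user:
      auth-provider:
        name: oidc
        config:
          idp-issuer-url: https://dex.example.com/dex
          client-id: kubernetes
          client-secret: <client-secret>
          id-token: <id-token>
          refresh-token: <refresh-token>
```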
Storage
A revelation! Buckets are easy to come by, especially when provisioned by Terraform. Persistent volumes, dynamic provisioning — all of this is super easy. Kinda… it takes time to write the module, but the concepts are solid and it’s not magic.
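For example, dynamic provisioning of fast persistent disks on GCP boils down to a StorageClass plus a claim against it (the names and size below are illustrative):

```yaml
# A GCE persistent-disk StorageClass and a claim against it; names and size
# are made up for illustration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  storageClassName: fast-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```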
We briefly looked into the Kubernetes Service Catalog. I’m keeping a close eye on this as it’s really cool (buckets, databases, custom-apps, all on demand for your devs!).
GitOps
From the start, we wanted to embrace automation. That’s why we use tools like Terraform and Ansible right?
GitOps builds on similar principles — your code is the single source of truth. What this really means for us is: tear down your applications (or your entire infrastructure) and ensure they can come back without requiring manual intervention (or at least with minimal intervention).
There are often subtle things you can do here to make life easier. One example might be storing Grafana dashboards in Git and loading them from the repo on startup, as sketched below.
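With Grafana 5+, that can be as small as a provisioning file pointing at a directory of dashboards pulled from the repo (the folder and path below are illustrative):

```yaml
# Illustrative Grafana dashboard provisioning config; it tells Grafana to load
# every dashboard JSON found under the given path at startup.
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ''
    type: file
    disableDeletion: true
    options:
      path: /var/lib/grafana/dashboards
```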
Another is using tools or writing code to watch for modifications on Kubernetes resources, and then take some desired action. For example, we dynamically modify the Dex configuration using dex-k8s-ingress-watcher (via gRPC) when some of our public facing sites are deployed — automated authentication for free, centrally managed.
Resource Quotas
Hey, developers! Resources are not free!
We have quota limits, applied per namespace, and we copied concepts from various cloud providers by allocating each namespace a number of “Gold Coins”. Essentially, this enables us to specify how much resource each namespace can use. All of this is configured by an Ansible role. If a namespace (think: team) needs more quota, we just update and re-apply the Ansible role (version controlled, of course).
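Under the hood it’s just a ResourceQuota object per namespace, rendered by the Ansible role; the numbers below are made-up placeholders rather than our actual “gold coin” allocations:

```yaml
# Illustrative per-namespace quota; the limits are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
```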
Cost Control
Cloud is expensive too, and costs can rocket super quickly. We’ll need better monitoring in this area, but there are some nice Grafana visualisations that work against the billing information exported from our Google account, so that’s good enough for now (maybe…). We also tried Google’s Data Studio, but I’m not a big fan (we love the freedom that dumping data into Elasticsearch provides).
Another thing we’re doing right now to save costs is shutting down instances overnight. It’s easy enough to implement using some scripts calling gcloud commands, and it’s nice to have our visualisations confirm that we really are saving money by doing this.
August 2018 — Back to now
We’re not in production yet, but we’re comfortable deploying our cluster(s). We have in place a lot of the core components that we think are required to run a cluster. We’ve gone from Kubernetes 1.5 when we started back in March 2017 to 1.9 (with 1.11 now in progress).
Meanwhile, we’re starting developer training sessions, we’re focusing on CI/CD, and we’re trying to figure out which deployment tool(s) we want to use for our applications — Helm, ksonnet and kustomize, to name a few.
We need a better way to work with secrets. Vault is perhaps the industry leader here, but we believe it requires additional tooling to manage policies.
Another option might be sealed-secrets from Bitnami Labs — a super simple, Kubernetes-native approach.
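The appeal is that a SealedSecret is just another Kubernetes resource that’s safe to commit to Git, because only the controller running in the cluster can decrypt it. Something like this, where the ciphertext is a placeholder:

```yaml
# Illustrative SealedSecret; in reality the encrypted value is produced by the
# kubeseal CLI using the cluster's public key.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: registry-credentials
  namespace: team-a
spec:
  encryptedData:
    password: AgBy3i...<output-of-kubeseal>...
```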
I didn’t cover it in the above steps, but as a background task, lots of investigation into cloud-native solutions has been going on (clustered databases, external monitoring). Understanding this is really important: are they really cloud-native? Can you read/write to any node? What happens when they fail over? How do backups/restores work? There are so many more questions, and each technology is often different from the next — hence building a team with Kubernetes as the foundation (essentially a platform) is really important. That’s something we’re also doing (but hiring is a whole other subject).
Finally, we figured out that Kubernetes is great for Bitcoin mining! Not really boss! :)
