Kubernetes has dramatically shifted the trade-offs of on-prem versus SaaS deployments. Thanks to the rich abstractions Kubernetes provides, deploying software on-premises can be significantly easier than it used to be. Because Kubernetes has achieved such high market penetration (and is still growing), it is now a viable target environment for many software products. Nevertheless, Kubernetes requires external tools to be production ready, especially in an on-prem deployment.
The purpose of this article is to list the tools everyone should be aware of when it is time to move an on-prem Kubernetes cluster to production. By on-prem, we mean not in a cloud environment; in the cloud, it is usually better to rely on the services offered by the provider.
Use the right container engine
First, forget about Docker Engine: it is overkill for what Kubernetes needs. Kubernetes talks to the container runtime through the Container Runtime Interface (CRI), while networking and storage are handled by separate plugins through the Container Network Interface (CNI) and the Container Storage Interface (CSI). Focus on a simpler CRI-compatible runtime such as containerd. It will probably become the new standard, as it has already proven its efficiency and maturity.
Distribute your data
Storage is probably one of the most critical parts. Without storage, containers are limited to stateless workloads such as serverless functions or cron jobs. In the cloud, the best option is usually to use the default storage engine proposed by the cloud provider. On prem, a distributed storage engine is required to dynamically create volumes based on local disks.
Many applications exist today to easily manage volumes on Kubernetes. Rook is probably the most widely used way to deploy a distributed Ceph cluster. Ceph has many advantages, as it is able to manage block, file and object storage. Installing Ceph can be complicated, but Rook makes it easy to do on a Kubernetes cluster. This is definitely a solution to consider when evaluating how to distribute storage on a Kubernetes cluster.
Externalize sensitive data
The built-in management of secrets in Kubernetes is not the most secure way to handle sensitive data. By default, secret values are encoded in base64, not encrypted, meaning anyone who has access to the resource can decode the information.
Two approaches can be used in production today: centralize the sensitive data in a Vault cluster and inject it into the container with the Vault agent, or encrypt the data in Git with Mozilla SOPS and decrypt it during deployment.
Both approaches work well, especially when the infrastructure is based on the GitOps methodology. SOPS has the advantage of not requiring a cluster to manage, as the data is encrypted and pushed to a Git repository.
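To make the point concrete, here is a minimal Python sketch showing that a base64-encoded value can be recovered without any key. The field name and the secret value are invented for this example; only the standard library is used.

```python
import base64

# Hypothetical "data" field of a Secret manifest; the value is invented for this example.
secret_data = {"password": base64.b64encode(b"s3cr3t-pa55").decode("ascii")}


def decode_secret(data: dict[str, str]) -> dict[str, str]:
    """Decode every base64 value, the way any reader of the resource could."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


decoded = decode_secret(secret_data)
print(decoded["password"])
```

No key or passphrase is involved at any point, which is exactly why base64 offers no confidentiality.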
Monitor the workloads
Monitoring is the key to the success of a Kubernetes deployment. Being able to observe what happens to every object deployed in the cluster is a top priority.
Prometheus is the commonly recommended tool, as it integrates deeply with Kubernetes and complements the liveness and readiness probes. It is highly recommended to deploy the Prometheus stack with its Helm chart to quickly get metrics, alerts and graphs for each piece of the cluster.
Prometheus is great, but when it comes to long-term storage, archiving and multi-cluster setups, it can quickly reach its limits. Enter Thanos, which provides a highly available Prometheus setup with long-term storage capabilities. Thanos is well suited to centralizing the metrics of multiple clusters and distributing queries across them to quickly observe the resources.
Centralize the logs
Observability is not only monitoring; logging is another important part. Extracting the logs of each container is critical for effective debugging after an issue.
Fluentd is probably the most widely used agent. It is also used by companies specialized in log management, like Sumo Logic, to aggregate and centralize the logs on their remote platform.
Vector is a newer project that might be interesting for enhancing log management. Compared to Fluentd, Vector is not just an aggregator. It is designed to be a real ETL tool, making it a really good candidate to improve log management pipelines.
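The extract, transform, load idea behind such a pipeline can be sketched in a few lines of Python. This is not Vector's configuration language or API; the log line format, the field names and the cluster name are all assumptions made up for illustration.

```python
import json


def extract(line: str) -> dict:
    """Parse a line of the form '<level> <service> <message>' (an assumed format)."""
    level, service, message = line.split(" ", 2)
    return {"level": level, "service": service, "message": message}


def transform(event: dict) -> dict:
    """Normalize the level and tag the event with a hypothetical cluster name."""
    return {**event, "level": event["level"].lower(), "cluster": "on-prem-1"}


def load(events: list[dict]) -> list[str]:
    """Serialize events as JSON lines, ready to be shipped to a log backend."""
    return [json.dumps(event, sort_keys=True) for event in events]


raw_lines = ["ERROR payments connection refused", "INFO frontend request served"]
shipped = load([transform(extract(line)) for line in raw_lines])
```

Each stage is a plain function, so stages can be added, reordered or tested in isolation, which is the main appeal of treating logs as an ETL problem rather than simple forwarding.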
Observe security events
Another piece of the observability of a Kubernetes cluster is the management of the events. These messages are really important to troubleshoot issues. They can be of different kinds: Kubernetes events, container events, application events and security events.
Falco is a nice tool to have on a Kubernetes cluster to manage security events. It is a threat detection engine that can detect unexpected application behavior and alert on threats at runtime. It can be combined with the log management pipeline to enhance observability of the threats and the actions required.
Another piece of security that every Kubernetes cluster should have is a policy agent to control resource management on Kubernetes.
Kyverno is a policy engine designed for Kubernetes. Unlike the Open Policy Agent, it allows using familiar tools such as kubectl or GitOps to manage policies as Kubernetes resources, without the requirement of learning a new language (Rego) to write policies.
Certificate management is critical in production, especially when communications have to be encrypted end-to-end. The renewal of certificates is also a critical task to ensure the continuity of the service. Having to manage your own certificate authority can be tricky in certain situations, even more so in a dynamic environment like an orchestration platform.
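As a rough idea of what a policy engine checks at admission time, here is a minimal Python sketch of a "required labels" rule. It is not Kyverno's implementation or policy syntax; the function names, the label keys and the sample pod are hypothetical.

```python
def missing_labels(resource: dict, required: list[str]) -> list[str]:
    """Return the required label keys that the resource does not carry."""
    labels = resource.get("metadata", {}).get("labels") or {}
    return [key for key in required if key not in labels]


def admit(resource: dict, required: list[str]) -> tuple[bool, str]:
    """Allow the resource only if every required label is present."""
    missing = missing_labels(resource, required)
    if missing:
        return False, "missing required labels: " + ", ".join(missing)
    return True, "allowed"


# Hypothetical pod manifest, reduced to the fields the rule inspects.
pod = {"kind": "Pod", "metadata": {"name": "web", "labels": {"app": "web"}}}
decision = admit(pod, ["app", "team"])
```

The rule is expressed as data (a list of label keys) rather than code, which is the property that lets policies be stored and reviewed like any other resource.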
Cert-manager makes it easy to automate certificate management. It makes it possible to provide ‘certificates as a service’ to developers working within a Kubernetes cluster.
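The kind of check that renewal automation performs can be sketched as follows. This is an illustration only, not cert-manager's code; the expiry date and the 30-day renewal window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, renew_before: timedelta, now: datetime) -> bool:
    """Return True once 'now' falls inside the renewal window before expiry."""
    return now >= not_after - renew_before


expiry = datetime(2025, 6, 30, tzinfo=timezone.utc)  # assumed certificate expiry
window = timedelta(days=30)  # assumed renewal window

early = needs_renewal(expiry, window, datetime(2025, 5, 1, tzinfo=timezone.utc))
due = needs_renewal(expiry, window, datetime(2025, 6, 1, tzinfo=timezone.utc))
```

Running such a check continuously, rather than relying on a calendar reminder, is what removes expired certificates as a source of outages.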
Load balance the traffic
Kubernetes does not ship an implementation of network load balancers for on-prem clusters. Bare metal cluster operators are left with two lesser tools to bring user traffic into their clusters: “NodePort” and “externalIPs” services. For various reasons, it is recommended to use “LoadBalancer” resources in a production cluster.
MetalLB is a load-balancer implementation for bare metal Kubernetes clusters, using standard routing protocols. Based on an IP range resource definition, it can automatically assign an IP to a local LoadBalancer resource and thus follow the best practices of Kubernetes.
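The idea of assigning an address from a configured range can be sketched with Python's standard `ipaddress` module. This is a simplified illustration, not MetalLB's allocation logic; the function name and the address range are hypothetical.

```python
import ipaddress


def allocate_ip(pool_cidr: str, in_use: set[str]) -> str:
    """Return the first usable address of the pool that is not already assigned."""
    for host in ipaddress.ip_network(pool_cidr).hosts():
        if str(host) not in in_use:
            return str(host)
    raise RuntimeError(f"address pool {pool_cidr} is exhausted")


# Hypothetical pool reserved for LoadBalancer services on the local network.
assigned = {"192.168.10.241"}
next_ip = allocate_ip("192.168.10.240/29", assigned)
```

The sketch also shows the failure mode to plan for: once every address in the range is taken, new LoadBalancer services cannot receive an IP until the pool is extended.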
Expose and access your applications
A service mesh and an ingress controller are different things. A service mesh is not required on a bare metal cluster, but it can be nice to have, as it comes with a lot of features and a potential integration with external components. At a minimum, an ingress controller is needed to manage access to the microservices.
Istio is probably the best-known service mesh on Kubernetes, for all its features (ingress controller, tracing, observability, mTLS, etc.) but also for the complexity of its management. Istio is definitely a tool that requires a proof of concept to weigh the features really required by the workload on the Kubernetes cluster against the complexity of the implementation.
Speaking about external integration, Istio integrated with Flagger is a powerful combination to manage the lifecycle of every application on a Kubernetes cluster.
Automate the deployment of containers
GitOps is a great methodology to automate the deployment of applications on a Kubernetes cluster based on a single source of truth: Git.
Tools such as JenkinsX, ArgoCD and Flux all have the same goal: ensure that the desired state is applied to the Kubernetes cluster each time an update is made to the remote Git repository.
The difference is in the implementation of the checks. On the one hand, JenkinsX and ArgoCD adopt a push logic, which means that they require a trigger to apply the changes. On the other hand, Flux adopts a pull behavior, as it has an agent continuously checking that the state is as expected.
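Whichever logic is used, the core operation is the same comparison between the desired state in Git and the actual state in the cluster. The following Python sketch illustrates it under simplifying assumptions: state is modeled as a plain mapping from a resource name to an image tag, and all names are hypothetical. It does not reflect the internals of any of the tools above.

```python
def diff_state(desired: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare desired (Git) and actual (cluster) state, keyed by resource name."""
    return {
        "create": sorted(name for name in desired if name not in actual),
        "update": sorted(
            name for name in desired if name in actual and desired[name] != actual[name]
        ),
        "delete": sorted(name for name in actual if name not in desired),
    }


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Apply the diff so that the actual state converges on the desired state."""
    plan = diff_state(desired, actual)
    for name in plan["create"] + plan["update"]:
        actual[name] = desired[name]
    for name in plan["delete"]:
        del actual[name]
    return plan


git_state = {"web": "image:v2", "api": "image:v1"}
cluster_state = {"web": "image:v1", "worker": "image:v1"}
plan = reconcile(git_state, cluster_state)
```

A push-style tool would call `reconcile` when a trigger fires, while a pull-style agent would call it on a timer; in both cases a second run with no changes produces an empty plan.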
Backup your data
On premises, a disaster recovery (DR) plan is even more important than in the cloud. Nobody will create a magic automated replication of a bucket to another region for you. Thus, backing up Kubernetes resources is essential to ensure a quick recovery in case of a failure.
Kubernetes resources are not only YAML file definitions but also the data stored on every volume created by the container storage engine (like Ceph).
Velero is an open source tool to safely back up and restore, perform disaster recovery, and migrate Kubernetes cluster resources and persistent volumes. It’s definitely a tool that should be part of the disaster recovery plan of a Kubernetes infrastructure.
Test your cluster
Managing an on-prem Kubernetes cluster usually means that the operations team owns the deployment of the resources needed to set up the cluster. As Kubernetes has multiple moving parts, it can baffle system administrators, many of whom wonder whether their cluster is configured the way it should be.
Sonobuoy is a diagnostic tool that makes it easier to understand the state of a Kubernetes cluster by running a choice of configuration tests in an accessible and non-destructive manner. Basically, Sonobuoy can run CNCF conformance tests, performance tests and some security tests to ensure that the current configuration of the Kubernetes cluster meets production standards. The tests (which can take more than an hour to run) cover a lot of different areas to highlight misconfigurations and possible performance improvements.
Troubleshoot live
Troubleshooting is probably what everyone does every day, so it’s important to use the right tool to be efficient while debugging. The way to do it depends on multiple factors, but an easy way is to use a local IDE or a remote central graphical interface.
Lens is a simple IDE that can be installed on your laptop to give you a quick overview of each resource deployed on Kubernetes. Think of it as an interface for kubectl. It saves a lot of time when debugging an application, as it gives easy access to the metrics, the logs, the config and a shell in the container itself to run commands. For command line lovers, K9s exists.
Kubevious, by contrast, can be deployed inside Kubernetes to share a central graphical interface with almost the same features as Lens.
At least one of these tools is definitely recommended for managing any Kubernetes cluster, on-prem or cloud-based.
Kubernetes gives an extended life to on-prem architecture by bringing some cloud native aspects to static resources. Obviously, deploying a production-ready cluster requires more work than the equivalent on a cloud provider. Automation, security, access and testing are all aspects that must be validated internally before opening the cluster to real usage.
Doing this job means applying cloud native concepts to your local environment, with all their advantages: observability, flexibility, high availability, automation, etc. In some cases, it could also be the first step to a cloud migration!
About the authors
Hicham Bouissoumer - Site Reliability Engineer (SRE) - DevOps
Nicolas Giron - Site Reliability Engineer (SRE) - DevOps