Businesses of all sizes are shifting some or all of their data and applications to cloud computing environments to take advantage of the benefits they offer: availability, flexibility, scalability, accessibility, and more. For the migration to succeed, both the migration itself and the tools used to carry it out have to be assessed correctly.
The purpose of this article is to list tools that everyone should be aware of when the time comes to manage a cloud infrastructure. In the cloud, relying on the services offered by the provider is often the natural choice, but sometimes open-source projects make the management easier.
This article is not dedicated to a single cloud provider. The idea is to list tools that can be used on one or multiple clouds to help manage resources and, sometimes, optimize the infrastructure.
Let's start with the command line
The command line is probably something everyone has used at least once, and it can be hard to understand how to use it. Most beginners start with the AWS Console, the default GUI, while operations engineers generally prefer a command-line interface. The problem is that the AWS CLI, for example, is not user friendly. Because it integrates the entire AWS API, it exposes an enormous number of commands, flags, and options.
An open-source project named Awless was born from the need for a fast, powerful, and easy-to-use CLI to manage AWS. Awless is designed to manage an AWS infrastructure from scratch, always return readable output (for both humans and programs), explore and query all cloud resources (even offline), connect to instances, and create, update, and delete cloud resources.
Automation is the key
The command line is losing relevance today with the emergence of the DevOps methodology, where automation is a key component. Infrastructure as Code (IaC) is the practice of writing code that takes on the task of creating and maintaining a cloud-based infrastructure.
Unfortunately, the IaC service offered by the cloud provider is not necessarily the best one to use. Compared to open-source tools, its code can be more complicated to maintain, which is why multiple tools exist today to easily manage an existing or completely new architecture. Terraform is probably one of the best tools to manage cloud architecture. It is an open-source infrastructure as code tool that provides a consistent CLI workflow to manage hundreds of cloud services, and it codifies cloud APIs into declarative configuration files.
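To make the declarative idea concrete, here is a minimal Python sketch of what a "plan" step does conceptually: compare the state you declared with the state that exists, and derive the actions needed. This is not Terraform's API or its actual algorithm; the plan function and the resource names and attributes are hypothetical and only serve the illustration.

```python
# Toy illustration of a declarative "plan". This is NOT Terraform's API.
# Resource names and attributes below are hypothetical.

def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and actual state and return the actions to take."""
    return {
        # Declared but missing in the cloud: must be created.
        "create": sorted(desired.keys() - actual.keys()),
        # Present in the cloud but no longer declared: must be deleted.
        "delete": sorted(actual.keys() - desired.keys()),
        # Present on both sides with different attributes: must be updated.
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


desired_state = {
    "web-server": {"type": "instance", "size": "small"},
    "assets": {"type": "bucket", "versioning": True},
}
actual_state = {
    "web-server": {"type": "instance", "size": "medium"},
    "old-queue": {"type": "queue"},
}

changes = plan(desired_state, actual_state)
print(changes)
# {'create': ['assets'], 'delete': ['old-queue'], 'update': ['web-server']}
```

The point of the declarative approach is that you only describe the target state; working out the create, update, and delete steps is left to the tool.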
Terraform probably deserves an entire article to explain all the possibilities it offers. If you are interested in it, take some time to look at these related projects too: Terragrunt, Terraspace, Terratest, and Atlantis.
Another famous automation tool that can be used to manage a cloud service is Ansible. It is an agentless tool that provides simple but powerful automation. Ansible is used for many purposes: application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis.
Depending on the infrastructure in place, the time needed to write the infrastructure as code can be daunting. Fortunately, some open-source projects make it easier, like Former2 or AWSConsoleRecorder. They allow you to generate infrastructure as code outputs from the existing resources in your AWS account. They are definitely something to look at if you don't know where to start.
Customize your base image
Managing customized images has many benefits and one major drawback: the images have to be maintained to ensure a minimum level of security. Automation is again the key. Automating the creation of the machine images used on a cloud provider (AMIs on AWS, for example) via infrastructure as code principles is important to get a good level of maintainability.
Packer is an open-source tool for creating identical machine images for multiple platforms from a single source configuration. Packer is lightweight, runs on every major operating system, and is highly performant, creating machine images for multiple platforms in parallel. Like Terraform, Packer requires learning the HashiCorp Configuration Language (HCL) to develop the template, and it benefits from almost the same concepts (versioning, testing, automation, etc.). Deployments are much quicker and more consistent when the images are completely pre-built.
Manage the cost
Cloud Financial Management (aka FinOps) is probably the most controversial aspect of the cloud. FinOps is a cultural practice where everyone should be involved in the cost management of their own cloud infrastructure. It is an efficient way for teams to manage their cloud costs: cross-functional teams work together to enable faster delivery while gaining more financial and operational control.
This is definitely not the easiest piece, and it should not be the last step of a cloud migration. Like security, it is good practice to optimize the cost as soon as possible. Fortunately, some open-source tools can help with the cost management of a cloud architecture. The first one would probably be the Komiser project, a cloud environment inspector that helps you stay under budget by uncovering hidden costs, monitoring increases in spending, and making impactful changes based on custom recommendations. This tool is really useful to get a broad overview of all the resources created and to easily detect potentially unused resources. It is definitely a first step in cost management. The cloud is based on a "pay-per-use" model, but way too often it ends up meaning paying for the resources teams forget to delete.
For Terraform users, another interesting tool to follow is the Infracost project. Its purpose is to estimate the potential cost of the architecture, and of any update to it, based on the Terraform code. It helps DevOps engineers, SREs, and developers quickly see a cost breakdown and compare different options upfront.
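The arithmetic behind such an estimate can be sketched in a few lines of Python. This is only an illustration of the idea of comparing options upfront, not how Infracost is implemented: the hourly prices are made up, the instance sizes are hypothetical, and a month is approximated as 730 hours.

```python
# Toy cost comparison. Prices are hypothetical and for illustration only.
HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year / 12

HOURLY_PRICE_USD = {"small": 0.02, "medium": 0.04, "large": 0.08}


def monthly_cost(instances: dict) -> float:
    """Estimate the monthly cost of a mapping {instance size: count}."""
    total = sum(
        HOURLY_PRICE_USD[size] * count * HOURS_PER_MONTH
        for size, count in instances.items()
    )
    return round(total, 2)


current = {"small": 2, "medium": 1}
proposed = {"small": 2, "large": 1}  # the medium instance becomes a large one

delta = round(monthly_cost(proposed) - monthly_cost(current), 2)
print(f"current:  {monthly_cost(current):.2f} USD/month")
print(f"proposed: {monthly_cost(proposed):.2f} USD/month")
print(f"delta:    {delta:+.2f} USD/month")
```

Seeing the delta before the change is applied is what turns cost into a review criterion, in the same way as a failing test.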
FinOps obviously cannot be reduced to that, but good observability of used and unused resources and a quick estimate of the cost of any update are a good start. For AWS users, these two projects can help clean up resources based on different parameters (tag, name, TTL, etc.): awsweeper and aws-auto-cleanup.
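As an illustration of TTL-based cleanup, the following Python sketch selects resources whose time to live has elapsed. It does not reflect how awsweeper or aws-auto-cleanup work internally: the ttl_hours tag name, the inventory structure, and the resource names are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def expired(resources, now):
    """Return the names of resources whose TTL tag has elapsed."""
    stale = []
    for res in resources:
        ttl_hours = res.get("tags", {}).get("ttl_hours")  # hypothetical tag name
        if ttl_hours is None:
            continue  # no TTL tag: never clean up automatically
        if res["created_at"] + timedelta(hours=int(ttl_hours)) <= now:
            stale.append(res["name"])
    return stale


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
inventory = [
    {"name": "demo-vm", "created_at": now - timedelta(hours=30), "tags": {"ttl_hours": "24"}},
    {"name": "test-db", "created_at": now - timedelta(hours=2), "tags": {"ttl_hours": "24"}},
    {"name": "prod-api", "created_at": now - timedelta(days=90), "tags": {}},
]

print(expired(inventory, now))  # ['demo-vm']
```

Treating untagged resources as permanent is a deliberately conservative choice here: a cleanup job should only delete what was explicitly marked as temporary.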
Audit to optimize security policies
Like FinOps, security should be everyone's concern. The purpose and intent of DevSecOps is to build on the mindset that everyone is responsible for security. IaC introduced a new, automated way to manage the lifecycle of every resource, and that also means that the policies applied to secure the infrastructure need to be adapted. Audits should no longer be done once in a while, but in a more systematic way, even as a simple step in an automated production pipeline.
Checkov and Regula are two different projects that can help automate the audit of potential security issues in the code itself before it is applied. The idea is to scan cloud infrastructure configurations to find misconfigurations before they are deployed. Checkov, for example, can scan infrastructure as code written for platforms such as Terraform, CloudFormation, Kubernetes, Helm, ARM Templates, and the Serverless Framework.
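The principle of such a static scan can be sketched in Python as a set of rules applied to resource definitions before anything is deployed. This is not Checkov's or Regula's rule format or API: the check_bucket rule, the attribute names, and the bucket definitions are hypothetical.

```python
# Toy policy scan over declared resources. Rule and attribute names are hypothetical.

def check_bucket(name, config):
    """Return a list of findings for one storage bucket definition."""
    findings = []
    if config.get("acl") in ("public-read", "public-read-write"):
        findings.append(f"{name}: bucket is publicly accessible")
    if not config.get("encrypted", False):
        findings.append(f"{name}: encryption at rest is disabled")
    return findings


def scan(resources):
    """Apply the rule to every declared resource and collect the findings."""
    findings = []
    for name, config in resources.items():
        findings.extend(check_bucket(name, config))
    return findings


declared = {
    "logs": {"acl": "private", "encrypted": True},
    "uploads": {"acl": "public-read", "encrypted": False},
}

findings = scan(declared)
for finding in findings:
    print(finding)

# In a pipeline, a non-empty list of findings would typically fail the build.
```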
Looking for compliance issues in the code before applying it is not enough; auditing the platform itself is also important. Running periodic audits with external tools like Scout Suite is useful to detect potential manual misconfigurations. Scout Suite is a multi-cloud security-auditing tool that enables security posture assessment of cloud environments. Basically, it uses the APIs of the cloud providers to gather configuration data for manual inspection and highlights risk areas.
Another tool that can help and deserves a mention is the Cloud Custodian project. Cloud Custodian is an efficient tool to manage compliance in real time, as it uses compliance rules to compare the desired and actual state of cloud resources. Organizations can use Custodian to manage their cloud environments by ensuring compliance with security policies and tag policies, garbage collecting unused resources, and managing costs, all from a single tool. This is definitely a project to follow.
Improve observability with a central logging platform
Observability can be defined as the process of measuring the internal state of a system using its external outputs. It provides the ability to explore the data exposed by different components to understand what happened and why and, more importantly, to predict the future behavior of a system using data analysis and other technologies. Logging is part of observability and requires good integration in a cloud-native environment.
OpenTelemetry is an open-source observability framework. It offers vendor-neutral APIs, software development kits (SDKs), and other tools for collecting telemetry data from cloud-native applications and their supporting infrastructure to understand their performance and health. It is a good fit for centralizing logs in a cloud-native environment without being locked into the default service of the cloud provider.
Control the availability of the resources
Cloud providers typically commit to keeping their services up and running 99.99% of the time, but you are responsible for monitoring availability and failing over "automatically" when an issue occurs. Continuous monitoring of the availability and performance of the cloud components is required to achieve that.
Cloudprober is monitoring software that makes it easy to monitor various components of a system. By running active checks, it can, for example, ensure that the frontend deployed in the cloud can reach the backend deployed on-premises, and extract metrics to improve performance. Cloudprober provides a historical view of the data collected by probes, which allows for correlation when issues occur.
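Active probing can be sketched in Python as follows: run a check repeatedly, record whether it succeeded and how long it took, then summarize. This is a simplified illustration and not Cloudprober's configuration or API; the function names are hypothetical, and the host in the final comment is a placeholder.

```python
import socket
import time


def tcp_check(host, port, timeout=2.0):
    """Build a check that succeeds if a TCP connection can be opened."""
    def check():
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return check


def run_probe(check, attempts=3):
    """Run `check` several times, recording success and latency for each run."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            check()
            ok = True
        except Exception:
            ok = False
        latency_ms = (time.perf_counter() - start) * 1000
        results.append({"ok": ok, "latency_ms": latency_ms})
    return results


def summarize(results):
    """Aggregate probe results into a success ratio and a worst-case latency."""
    successes = [r for r in results if r["ok"]]
    return {
        "success_ratio": len(successes) / len(results),
        "max_latency_ms": max(r["latency_ms"] for r in results),
    }


# Demo with a check that always succeeds, so the sketch runs without a network.
print(summarize(run_probe(lambda: None)))

# Against a real backend, one would pass something like:
#   run_probe(tcp_check("backend.example.internal", 443))
```

Keeping the results over time, rather than only the latest one, is what makes it possible to correlate a failure with what changed around it.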
Use Chaos Engineering to break things productively
Cloud architecture comes with modern problems that require modern solutions. The complexity of a cloud architecture makes it impossible to predict and prevent every failure scenario. Companies like Netflix popularized the concept of chaos engineering to be proactively prepared for failure. Chaos engineering aims to discover failure points in production systems before they become disasters.
Following their cloud migration, Netflix developed Chaos Monkey to test system stability by forcing failures through the pseudo-random termination of instances and services within Netflix's architecture. Thanks to this tool, they were able to "break things on purpose" in order to learn how to build more resilient systems and to identify and fix failures before they became public-facing outages.
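The idea of pseudo-random termination can be sketched in Python as below. This is not Chaos Monkey's implementation: the pick_victims function, the group and instance names, and the probability parameter are hypothetical, and the sketch only selects targets without terminating anything.

```python
import random


def pick_victims(groups, probability, rng):
    """Pick at most one instance per group, each group with the given probability."""
    victims = []
    for group, instances in sorted(groups.items()):
        # Empty groups are skipped; otherwise the group is hit with `probability`.
        if instances and rng.random() < probability:
            victims.append((group, rng.choice(instances)))
    return victims


groups = {
    "api": ["api-1", "api-2", "api-3"],
    "worker": ["worker-1", "worker-2"],
    "cache": [],
}

# A seeded generator makes a run reproducible, which helps when replaying an experiment.
rng = random.Random(42)
victims = pick_victims(groups, probability=1.0, rng=rng)
for group, instance in victims:
    print(f"would terminate {instance} in group {group}")
```

Limiting the blast radius to one instance per group is a common safety choice: the experiment tests whether the group survives the loss of a member, not whether the whole service can disappear.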
Another project to follow is Litmus, a CNCF sandbox project for practicing chaos engineering in cloud native environments. Litmus provides a chaos-operator, a large set of chaos experiments in its hub, detailed documentation, quick demo, and a friendly community.
Cloud-native applications are, by definition, highly distributed, elastic, resistant to failure, and loosely coupled. That is easy to say, but it is better to verify that it is true, and chaos engineering principles can help with that.
Dynamically provision the infrastructure
One strength of the cloud is the ability to dynamically create the resources needed by an application (instance, queue, load balancer, etc). The infrastructure can be managed by an automation tool like Terraform but it can also be managed by the continuous delivery application to centralize the management of the resources and the application.
Spinnaker is an open-source, multicloud, continuous delivery platform. It can build flexible pipelines made out of stages to deploy the application the way it needs. The pipeline can have a deployment stage, which orchestrates the creation and cleanup of new infrastructure using a blue/green strategy for zero downtime. The flexibility of pipelines, combined with a comprehensive set of built-in stages, makes Spinnaker a very good fit for DevOps teams.
Crossplane is an open-source multicloud control plane that enables engineers to provision infrastructure from the Kubernetes API. It can be used by organizations to build and operate an internal platform-as-a-service (PaaS) across a variety of infrastructures and cloud vendors. Combined with ArgoCD or FluxCD, it makes it possible to apply GitOps principles to constantly monitor the infrastructure and fix any drift between the current and desired states.
Did you know that you can test cloud integration offline?
Yes, that's true, you can reduce the cost of the development environment by testing the integration with the cloud services offline. LocalStack provides an easy way to mock cloud services to develop cloud native applications. It spins up a testing environment in a local container that provides the same functionality and APIs as the real AWS cloud environment.
Forget about managing an entire infrastructure to test the implementation of an application: LocalStack can spin up a local test environment for you. Focus your time on writing application code instead of setting up the environment to access AWS services.
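In practice, pointing an application at LocalStack mostly means overriding the endpoint used by the AWS SDK. The Python sketch below assumes LocalStack is listening on its default edge port (4566) and that the boto3 library is installed; the USE_LOCALSTACK environment variable and both function names are hypothetical choices made for this example.

```python
import os

# LocalStack's default edge endpoint when run locally.
LOCALSTACK_ENDPOINT = "http://localhost:4566"


def client_kwargs(env=None):
    """Return extra SDK client arguments, depending on the target environment."""
    env = os.environ if env is None else env
    if env.get("USE_LOCALSTACK") == "1":  # hypothetical switch for this sketch
        return {
            "endpoint_url": LOCALSTACK_ENDPOINT,
            "region_name": "us-east-1",
            # Dummy credentials: the local emulator does not need real ones.
            "aws_access_key_id": "test",
            "aws_secret_access_key": "test",
        }
    return {}  # real AWS: rely on the default configuration


def make_s3_client(env=None):
    """Create an S3 client for LocalStack or AWS (requires boto3)."""
    import boto3  # third-party; imported lazily so the sketch loads without it
    return boto3.client("s3", **client_kwargs(env))


print(client_kwargs({"USE_LOCALSTACK": "1"})["endpoint_url"])  # http://localhost:4566
```

With this kind of switch, the application code that calls the client stays identical between the local test environment and the real cloud.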
Learn, test and share
Cloud providers continuously improve their resources, add new services, add new features to existing ones, and so on. In such a changing environment, it is important to keep your knowledge up to date and to share it with your colleagues and the community, or simply to keep a trace of what you have done to accomplish a goal.
Obviously, there are many ways to achieve this. One of them is an open-source project from Google named Cloud Ops Sandbox. The purpose of this project is to help practitioners learn Site Reliability Engineering practices from Google and apply them to cloud services using the Cloud Operations suite of tools. The main goal is to experiment with various Cloud Operations tools to solve problems and accomplish standard SRE tasks in a sandboxed environment.
In most companies, systems were built around different technology stacks, approaches and cloud platforms. The responsibility of the infrastructure has been attributed to CloudOps engineers, and that team is expected to successfully run the systems for years. Using the right tool is probably a real challenge today. It is the responsibility of everyone to identify the right tool (open-source, vendor or cloud service) depending on the context and the use case.
One article is not enough to list all the awesome open-source tools available to make our lives easier. Feel free to add a comment to improve this list!
About the authors
Hicham Bouissoumer - Site Reliability Engineer (SRE) - DevOps
Nicolas Giron - Site Reliability Engineer (SRE) - DevOps