This article gives an opinionated SRE view of an open source stack for querying, graphing, auditing and securing access to data spread across multiple data sources. It is the first part of a series of articles dedicated to MLOps topics. So, let’s start with the theory!

What Is Trino?

Trino is an open-source distributed SQL query engine that can be used to run ad hoc and batch queries against multiple types of data sources. Trino is not a database; it is an engine that aims to run fast analytical queries on big data storage systems (such as Hadoop, AWS S3 or Google Cloud Storage), but also on various other data sources (such as MySQL, MongoDB, Cassandra, Kafka or Druid). One of the great advantages of Trino is its ability to query different datasets and join the results in a single query, which facilitates access to data.
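To make the idea of joining several sources concrete, here is a minimal Python sketch of a federated query. It assumes the `trino` Python client is installed and that two catalogs named `mysql` and `mongodb` are configured; the catalog, schema, table, column and host names are all hypothetical and only serve the illustration.

```python
"""Sketch: one SQL statement that joins tables living in two Trino catalogs."""

# Hypothetical catalog.schema.table names, chosen only for this example.
FEDERATED_QUERY = """
SELECT c.country, count(*) AS orders
FROM mysql.shop.customers AS c
JOIN mongodb.sales.orders AS o ON o.customer_id = c.id
GROUP BY c.country
ORDER BY orders DESC
"""


def run_query(connection, sql):
    """Run `sql` on any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    cursor.execute(sql)
    return cursor.fetchall()


def connect_to_trino(host="trino.example.internal", port=8080, user="analyst"):
    """Open a connection with the `trino` client (assumed: pip install trino)."""
    # Imported lazily so the rest of the sketch has no external dependency.
    from trino.dbapi import connect

    return connect(host=host, port=port, user=user)
```

Against a real cluster, the call would look like `rows = run_query(connect_to_trino(), FEDERATED_QUERY)`. The application only talks to Trino, which in turn reaches both backends.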

By offering a centralized entry point to the different database systems, Trino allows:

Developers to avoid developing, duplicating and maintaining the code needed to connect to each database management system
Administrators to maintain the various database systems more easily thanks to the infrastructure abstraction that Trino offers. Applications no longer connect directly to a group of data servers but to a pool of resources that can be dynamically configured (more details in the article).

Trino is supported by the Trino Software Foundation, an independent nonprofit organization with a dedicated mission to maintain, promote and manage the advancement of Trino’s distributed SQL query engine for Big Data.

What Is Apache Superset?

Apache Superset is a Cloud Native Business Intelligence tool that collects and processes large volumes of data to produce visualized results like graphs. Superset is a web-based application that allows users to generate dashboards and reports to help businesses grow.

Superset is often compared to Tableau, Looker or Metabase because it can efficiently connect to several data source types such as Druid, Google BigQuery, ClickHouse, PostgreSQL and, of course, Trino, to graph data and easily embed the charts in external applications.
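Superset declares its databases through SQLAlchemy URIs, and the Trino dialect uses the `trino://` scheme. The helper below is a small sketch that builds such a URI; the function name, the host and the `hive` catalog are assumptions made for the example.

```python
from urllib.parse import quote


def trino_sqlalchemy_uri(user, host, port=8080, catalog=None, schema=None):
    """Build a `trino://` SQLAlchemy URI of the kind Superset asks for."""
    # Percent-encode each part so special characters cannot break the URI.
    uri = f"trino://{quote(user, safe='')}@{host}:{port}"
    if catalog:
        uri += f"/{quote(catalog, safe='')}"
        if schema:
            uri += f"/{quote(schema, safe='')}"
    return uri


print(trino_sqlalchemy_uri("superset", "trino.example.internal", catalog="hive"))
# trino://superset@trino.example.internal:8080/hive
```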

Superset is a scalable, secure and user-friendly central interface for anyone in the company who wants to query and visualize data.

What Is Apache Ranger?

Apache Ranger is probably the leading open source project to enable, monitor and manage data access governance for Big Data environments. It is a well-known tool in the Hadoop/Hive ecosystem (and for related query engines like Trino) to improve the granularity of data access management on these distributed systems.

Its integration with an external user directory (such as LDAP) and its intuitive web interface make it easier to define and manage security policies, and to carry out the security audits that control access to data.
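To give an intuition of what a fine-grained access policy expresses, here is a toy Python model. It is not Ranger's policy schema or API, only a simplified sketch of the "which group may perform which action on which table" idea; the `Policy` class, the group names and the table names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Toy allow-policy (an illustration, not Ranger's data model)."""

    resource: str        # "catalog.schema.table"; "*" matches any one segment
    groups: frozenset    # groups the policy applies to
    accesses: frozenset  # allowed actions, e.g. {"select", "insert"}


def matches(pattern, resource):
    """Compare dot-separated names segment by segment, honouring "*"."""
    p_parts, r_parts = pattern.split("."), resource.split(".")
    return len(p_parts) == len(r_parts) and all(
        p in ("*", r) for p, r in zip(p_parts, r_parts)
    )


def is_allowed(policies, user_groups, resource, access):
    """Deny by default; allow when a policy covers group, resource and access."""
    return any(
        access in policy.accesses
        and bool(policy.groups & set(user_groups))
        and matches(policy.resource, resource)
        for policy in policies
    )


POLICIES = [
    Policy("hive.sales.*", frozenset({"analysts"}), frozenset({"select"})),
    Policy("hive.sales.orders", frozenset({"etl"}), frozenset({"select", "insert"})),
]

print(is_allowed(POLICIES, ["analysts"], "hive.sales.orders", "select"))  # True
print(is_allowed(POLICIES, ["analysts"], "hive.sales.orders", "insert"))  # False
```

In a real deployment the groups would typically come from the user directory and every decision would be written to an audit log.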

Why This Stack?

When a project becomes mature, it very often happens that its architecture becomes complex through the usage of different database management systems. This complexity is felt both in terms of infrastructure maintenance and data accessibility management. The operations teams must manage the various platforms on a daily basis (maintenance, updates, security, automation, observability, etc.) and the development teams must maintain (and very often duplicate) the code needed for data access. Added to this is the need for teams to communicate and work together to keep systems up to date while minimizing impacts on applications in production.

The purpose of this stack is to facilitate access to data stored in different database systems by minimizing the necessary maintenance while keeping in mind a certain level of security.

To benefit from all the advantages a central database system offers, important guidelines need to be followed:

The stacks used to manage the data must be fully automated. No more manual actions on a database server to quickly fix something. Everything must be automated, from the installation of the database engine to the creation of the database, the backups, the permissions, and so on.
Automation also covers processes such as onboarding and offboarding people, two processes that must be mastered to ensure a minimum level of security. Centralizing access in one tool obviously makes this easier to operate.
Manage the granularity of permissions. Read, write, administration roles must be defined per application and user to easily audit the platform and ensure compliance with security policies.
Centralize the permission management with a central user directory like LDAP or an identity provider like Keycloak to enforce authentication policies such as password format, retention period, rotation, etc. Centralizing access makes day-to-day operations easier.
Facilitate performance and security audits with native features that expose, in real time, the executed queries with their metrics (query execution duration, owner of the query, etc).
Minimize the stack that people need to query the data of different data sources. Having one central secured entry point to access the data is easier to manage than having multiple tools to maintain.
Minimize the effort to get access to the data. Developers can benefit from a common shared library minimizing the code required to access the different data sources. All the applications can use the same connector to access the database engines, minimizing duplication and the effort needed to maintain the code.
Cache data and query results to potentially improve performance and reduce requests on remote database engines.
Break the link between a client and the remote server or cluster. Having an abstraction of the resources used by any database system allows operators to easily run administrative tasks (like a rolling update of a cluster) with minimal impact on production.
Distribute the load with a stack that can be easily scaled to improve the availability of the data.
Facilitate the daily life of everybody by using one “language” (SQL) to query different types of data sources.
Run queries on several data engines and aggregate the data before returning the result to the application.
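The caching guideline above can be sketched in a few lines of Python. This is an application-side illustration with a simple time-to-live, not a description of a feature built into any tool of the stack; the `QueryCache` class and the fake backend are hypothetical.

```python
import time


class QueryCache:
    """Keep query results in memory for `ttl_seconds`, keyed by the SQL text."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock    # injectable so the expiry logic can be tested
        self._entries = {}    # sql -> (expires_at, rows)

    def get_or_run(self, sql, run):
        """Return cached rows for `sql`, calling `run(sql)` only on a miss."""
        now = self.clock()
        entry = self._entries.get(sql)
        if entry is not None and entry[0] > now:
            return entry[1]
        rows = run(sql)
        self._entries[sql] = (now + self.ttl_seconds, rows)
        return rows


backend_calls = []


def fake_backend(sql):
    """Stand-in for a remote engine; records every query it receives."""
    backend_calls.append(sql)
    return [(1,)]


cache = QueryCache(ttl_seconds=30)
cache.get_or_run("SELECT 1", fake_backend)
cache.get_or_run("SELECT 1", fake_backend)  # served from the cache
print(len(backend_calls))  # 1
```

The trade-off is freshness: a longer time-to-live removes more load from the remote engines but may serve stale results.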

How To Deploy The Stack On Kubernetes?

This section is not intended to detail the deployment and configuration of the stack, which will be the subject of a later article. The goal is to list several points to take into account when managing this type of application. The main objective is to deploy a scalable, highly available and secure architecture.

Kubernetes is obviously an interesting platform in this kind of situation in the sense that it brings a number of advantages by its structure and integration. Kubernetes is now mature and makes it possible, in the MLOps world, to facilitate the management of distributed data systems such as Spark, Hadoop, Kafka and obviously Trino and Superset.

Kubernetes provides several benefits in machine learning and data management to reduce risks:

Network access control through network policies. Data access management is primarily about network accessibility,
The observability (metrics, logs and traces) of this type of platform, which makes it possible to profile the stack usage and adjust the allocated resources accordingly (memory, processor, disk but also worker nodes),
Following the previous point, the definition of horizontal and vertical auto-scaling rules to ensure continuity of service while controlling the budget. Remember that a data processing stack can very quickly become expensive so don’t deploy any application without resource allocation limits,
Take advantage of cert-manager to manage your certificates and secure the endpoints,
Secure direct access to the stack and its behavior through the Open Policy Agent,
Get the benefits of GitOps by moving your source of truth to your Git source code manager to ensure that stack management is coded, versioned, and validated before and after deployment,
Dynamically and securely manage sensitive data such as passwords or configurations with a Vault agent deployed on the cluster or Mozilla SOPS,
Minimize the impact of updates thanks to canary deployment by combining the principles of rolling update and service mesh,
Distribute the workload on one or more clusters according to demand with Istio and a distributed storage service,
Benefit from the automatic encryption of the volumes used by the stack to ensure data security (depends obviously on the container storage interface used on the cluster).
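As an illustration of the network policy point, the Python sketch below generates a Kubernetes NetworkPolicy manifest that only lets Superset pods reach Trino. The namespace, the `app` labels and the port are assumptions about how the workloads might be labelled, not values taken from an actual deployment.

```python
import json


def trino_network_policy(namespace="data", client_app="superset", port=8080):
    """Return a NetworkPolicy manifest admitting only `client_app` pods to Trino."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"trino-allow-{client_app}", "namespace": namespace},
        "spec": {
            # Pods protected by the policy (hypothetical label convention).
            "podSelector": {"matchLabels": {"app": "trino"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods carrying the client label may connect.
                    "from": [{"podSelector": {"matchLabels": {"app": client_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


print(json.dumps(trino_network_policy(), indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed manifest can be piped to `kubectl apply -f -` or committed to Git as part of a GitOps workflow.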

This is just a small list of the many advantages of Kubernetes in the world of data management. There are many others, which obviously depend on the context in which the platform is used.

Nevertheless, a highly recommended practice is to have one stack per environment. Even if it is possible to manage several environments at once on the same cluster, it is advisable to separate them to minimize the impact of one environment on the other (for example, the impact of developments or tests on a production environment).

Next?

No more theory, the next article will show how to put all this into practice. Stay tuned!

For more information on topics mentioned in this article, please refer to this documentation:

About The Authors

Hicham Bouissoumer — Site Reliability Engineer (SRE) — DevOps

Nicolas Giron — Site Reliability Engineer (SRE) — DevOps

KumoMind's Blog