AEP-4 Astro CLI and SQL CLI integration
Authors:
Tatiana Al-Chueyr (tatiana.alchueyr@astronomer.io)
Pankaj Koti (pankaj.koti@astronomer.io)
Status
| State | Accepted |
|---|---|
| Discussion | |
| Created | 2022-09-29 |
Motivation
A requirement for the SQL CLI 0.1 delivery is for it to be available from within the Astro CLI. This document discusses the approaches for doing this integration.
Scope
There are three main aspects of the integration:
Packaging
Interface
Add-ons
1. Packaging
Astronomer uses the Go programming language to develop the Astro CLI, whereas SQL CLI is developed using Python.
There are a few options for calling Python code from Go:
a) Docker container
b) System shared library (.dll / .so)
c) (Native) operating system package (Brew recipe / .dmg / .deb / .rpm)
We’ll analyse each of these options.
a) Docker container
Currently, the Astro CLI uses this approach for multiple commands, including pytest. The idea is that we create a Python library, make it available on PyPI, and have a Docker image that installs it.
By implementing this approach, we would run commands (forwarding the parameters passed to the Astro CLI) from within Docker. This pattern is widely used within the Astro CLI project itself, and Astronomer developers are familiar with maintaining it.
Pros
Reliability: Docker isolates the environment from the operating system. This gives us control of the Python version and any other system dependencies the SQL CLI may need.
Maintenance: the SQL CLI team only needs to worry about maintaining the Python library and docker image. It’s easy from a maintenance perspective.
Decoupling: from a packaging perspective, this approach makes the Astro CLI and the SQL CLI decoupled and independent.
Cons
System-dependencies: it relies on the user having Docker pre-installed
Execution speed: this approach is slower from a user's perspective (pulling the Docker image)
Troubleshooting: in the past, Astro CLI developers reported challenges when debugging issues, due to the extra layer between the user and the system
Memory-consumption: there is a memory-consumption overhead (from Docker)
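To make approach (a) concrete, below is a minimal Go sketch (not the actual Astro CLI code) of how a command could forward its arguments to a SQL CLI Docker image. The image name, mount layout, and entrypoint command are illustrative assumptions.

```go
// Hypothetical sketch: the CLI shells out to Docker, forwarding the user's
// arguments to a SQL CLI image. Image name and command layout are assumed.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func runSQLCLIInDocker(args []string) error {
	// Mount the current project directory so the SQL CLI can read and
	// write the project files.
	cwd, err := os.Getwd()
	if err != nil {
		return err
	}
	dockerArgs := append([]string{
		"run", "--rm",
		"-v", cwd + ":/app",
		"-w", "/app",
		"example/sql-cli:latest", // hypothetical image name
		"flow",                   // hypothetical entrypoint command
	}, args...)

	cmd := exec.Command("docker", dockerArgs...)
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
	return cmd.Run()
}

func main() {
	// e.g. `astro flow validate` would forward ["validate"]
	if err := runSQLCLIInDocker(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```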
c) (Native) OS package
This strategy involves building an operating system package, which may be native (e.g. .deb for Ubuntu and Debian, or .dmg for macOS) or not (such as a Homebrew recipe for macOS).
There are many tools nowadays that make this process straightforward (the Homebrew and Debian packaging docs, for example).
In this case, we'd install a SQL CLI client in the end-user's operating system, and the Astro CLI would invoke this binary.
Pros
Reliability: the package declares all of its dependencies and resolves them at install time, giving us control over the Python version and other requirements
System-dependencies: it doesn't rely on anything being pre-installed
Memory-consumption: less memory overhead than Docker
Execution speed: faster than Docker, though probably slower than a shared library
Troubleshooting: no additional layers in between - simplest to troubleshoot
Decoupling: from a packaging perspective, this approach makes the Astro CLI and the SQL CLI very decoupled.
Cons
Maintenance: the SQL CLI team would need to maintain the Python library and, for each supported operating system, a dedicated package. There is an overhead in the initial development, but once the first package is shipped, it is easy and cheap to maintain. We also need to ensure the Astro CLI installs the correct binary for the user's version and operating system.
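As a rough illustration of approach (c), the sketch below shows how the Astro CLI could locate a natively installed SQL CLI binary and check its version before delegating to it. The binary name, the --version flag, and the version check are assumptions for illustration.

```go
// Hypothetical sketch of approach (c): look up a natively installed SQL CLI
// binary and verify its version before invoking it.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func findSQLCLI(minVersion string) (string, error) {
	path, err := exec.LookPath("sql-cli") // hypothetical binary name
	if err != nil {
		return "", fmt.Errorf("sql-cli not installed: %w", err)
	}
	out, err := exec.Command(path, "--version").Output() // assumed flag
	if err != nil {
		return "", err
	}
	version := strings.TrimSpace(string(out))
	// A real implementation would do a proper semver comparison here.
	if version < minVersion {
		return "", fmt.Errorf("sql-cli %s is older than required %s", version, minVersion)
	}
	return path, nil
}

func main() {
	path, err := findSQLCLI("0.1.0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using", path)
}
```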
Overview
| | Docker | OS Shared Library | OS Package |
|---|---|---|---|
| Reliability | ★★★★★ | ★★ | ★★★★★ |
| System-dependencies | ★★ | ★★ | ★★★★★ |
| Memory-consumption | ★★ | ★★★★★ | ★★★★ |
| Troubleshooting | ★★ | ★★★★★ | ★★★★ |
| Decoupling | ★★★ | ★★★ | ★★★★★ |
| Maintenance | ★★★★★ | ★★ | ★★★ |
Recommendation
From a maintenance and implementation perspective, it is easier and quicker to deliver the SQL CLI using approach 1(a), by implementing a Python library and using Docker to integrate with the Astro CLI.
From a user’s perspective, approach 1(c) gives them a better experience - allowing them to not worry about system dependencies and having a lightweight way of installing and running commands related to SQL workflows.
Given that time-to-market is important, for release 0.1 of the SQL CLI, we will adopt approach 1(a). After that, we can discuss if we want to implement 1(c), which could improve the overall user experience.
2. Interface
After several sessions with the Astro CLI team, we identified five ways of aligning the interface of the SQL CLI and the Astro CLI:
a) New sub-command
b) One new context
c) Two new contexts
d) Feature flag
e) Command flag
Below are more details about each of them, followed by a recommendation.
a) New sub-command
All the SQL CLI commands would be available from within a sub-command of the Astro CLI (the name is to be defined - it could be sql, workflow, or something else).
This approach is the least aligned with the product expectations because it creates a non-unified experience between the SQL commands and the existing Astro CLI interface.
Users would follow these steps:
$ astro flow init
$ astro flow validate
$ astro flow run # pre-generate the DAG from the SQL dir & run locally
$ astro flow deploy # do we want to pre-generate the DAG from the SQL dir?
Pros
SQL-interface consistency: All SQL-related commands are grouped, so the end-user only needs to worry about one sub-command
Integration: Easy to implement - the coupling with the Astro CLI can be as lightweight as we want (some arguments can be hard-coded in the Astro CLI, others may just be forwarded). This would allow us to have additional features without the need to change the Astro CLI code.
Context-aware: works in conjunction with any context, including the existing software and cloud contexts
Isolation: we'd include in the project structure only what is needed for the SQL CLI, without adding Python-specific files
Maintenance: easy for the DAG authoring team, since most of the code will be in Python and we will have full control of the SQL CLI itself. The implementation could be similar to pytest:
Initially we would have our own Dockerfile; in the future it could become part of Runtime
Container: https://github.com/astronomer/astro-cli/blob/3c18bac4e378ebce3949af3cac2a0874009bc74b/airflow/container.go#L27
Docker: https://github.com/astronomer/astro-cli/blob/3c18bac4e378ebce3949af3cac2a0874009bc74b/airflow/docker.go#L349-L459
Cons
Astro-interface consistency: we would have a new subset of commands which differ from those existing Astro CLI users are used to
Re-work: we wouldn’t be reusing parts of the Astro CLI which we could benefit from (including, but not limited to, the deployment to Astro-cloud)
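For illustration, here is a minimal sketch of approach (a) using Cobra, the command framework the Astro CLI is built on: a flow sub-command that forwards everything after `astro flow` to the SQL CLI. The sql-cli binary name and helper are assumptions; the Docker variant from the earlier sketch would slot in the same way.

```go
// Hypothetical sketch of approach 2(a): a Cobra `flow` sub-command that
// forwards all arguments to the SQL CLI.
package main

import (
	"os"
	"os/exec"

	"github.com/spf13/cobra"
)

// runSQLCLI forwards the raw arguments to the SQL CLI.
func runSQLCLI(args []string) error {
	cmd := exec.Command("sql-cli", args...) // hypothetical binary name
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
	return cmd.Run()
}

func newFlowCommand() *cobra.Command {
	return &cobra.Command{
		Use:                "flow",
		Short:              "Run SQL CLI commands (init, validate, run, ...)",
		DisableFlagParsing: true, // pass flags through to the SQL CLI untouched
		RunE: func(cmd *cobra.Command, args []string) error {
			return runSQLCLI(args)
		},
	}
}

func main() {
	root := &cobra.Command{Use: "astro"}
	root.AddCommand(newFlowCommand())
	_ = root.Execute()
}
```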
b) One new context
The Astro CLI allows developers to create dedicated contexts to override the behaviour of existing commands. Currently, this feature allows the same command to do different things, including deploying to different installations of Astronomer. The existing contexts are cloud (used by most Astronomer customers) and software (used by JPMC, for example).
By adopting this approach, we’d add a new context (e.g. flow), which would change the behaviour of the existing commands.
Users would follow these steps:
$ astro context switch sql.astronomer.io
$ astro dev init # overridden
$ astro flow validate
$ astro flow run
$ astro deploy # overridden
Pros
Astro-interface consistency: users who are already used to the Astro CLI would be able to reuse some of the existing commands, including astro dev init and astro deploy.
Re-work: we can reuse current features of the Astro CLI, such as deployment, without the need for any rework
Integration: already done before, simple to integrate with
Isolation: we could include in the project structure only what is needed for the SQL CLI, without adding Python-specific files. However, this may confuse users accustomed to the CLI in the Python context.
Maintenance: medium. The following parts of the code would have to be maintained:
A folder (similar to software and cloud) with all the commands, linking to either cloud or software (https://github.com/astronomer/astro-cli/tree/main/cmd/software)
Tests written in Go for these commands
Cons
SQL-interface consistency: The SQL-related commands are not grouped. The user needs to use different subcommands for different features
Context-aware: since it is a new context and the Astro CLI only supports one context at a time, we would need to choose between supporting either the cloud or the software context, leaving a subset of Astronomer customers unable to use it.
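A minimal sketch of how a context could change which commands are registered follows; the context name and lookup function are illustrative assumptions, not the real Astro CLI internals (which read the context from a config file).

```go
// Hypothetical sketch of approach 2(b): the CLI registers a different command
// set depending on the active context.
package main

import "github.com/spf13/cobra"

// currentContext stands in for the Astro CLI's context lookup.
func currentContext() string {
	return "flow" // e.g. after `astro context switch sql.astronomer.io`
}

func newRootCommand() *cobra.Command {
	root := &cobra.Command{Use: "astro"}
	switch currentContext() {
	case "flow":
		// SQL context: override commands such as `dev init` and `deploy`,
		// and register the SQL-specific ones.
		root.AddCommand(&cobra.Command{Use: "flow", Short: "SQL CLI commands"})
	default:
		// The cloud and software contexts keep their existing command sets.
	}
	return root
}

func main() {
	_ = newRootCommand().Execute()
}
```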
c) Two new contexts
Very similar to (b), except we’d have a dedicated SQL context for each of the other contexts.
Users would follow these steps:
$ astro context switch flow-cloud.astronomer.io
$ astro dev init # overridden
$ astro flow validate
$ astro flow run
$ astro deploy # overridden
whereas users of the software context would use:
$ astro context switch flow-software.astronomer.io
Pros
Astro-interface consistency: users who are already used to the Astro CLI would be able to reuse some of the existing commands, including astro dev init and astro deploy.
Re-work: we can reuse current features of the Astro CLI, such as deployment, without the need for any rework
Integration: already done before, simple to integrate with
Isolation: we could include in the project structure only what is needed for the SQL CLI, without adding Python-specific files. However, this may confuse users accustomed to the CLI in the Python context.
Context-aware: Would allow both software and cloud users to use the SQL feature
Cons
SQL-interface consistency: The SQL-related commands are not grouped. The user must use different subcommands for different features
Maintenance: double the work; it feels like a hack
d) Feature flag
We would define a feature flag which, if enabled, would make the Astro CLI internally invoke SQL CLI commands.
Users would follow these steps:
$ export ASTRO_ENABLE_SQL=1
$ astro dev init # enriched
$ astro flow validate
$ astro flow run
$ astro deploy # enriched
Pros
Astro-interface consistency: users who are already used to the Astro CLI would be able to reuse some of the existing commands, including astro dev init and astro deploy.
Re-work: we can reuse current features of the Astro CLI, such as deployment, without the need for any rework
Integration: already done before
Context-aware: Supports all contexts
Cons
Isolation: high coupling between the Astro CLI and SQL CLI. The SQL CLI could break Astro CLI users’ experience.
SQL-interface consistency: The SQL-related commands are not grouped. The user must use different subcommands for different features
Maintenance: this approach is more coupled to the Astro CLI. The maintenance burden would be smaller than with the context approaches, but we would still have duplication. For instance, we would have to add an if branch to all the available contexts (https://github.com/astronomer/astro-cli/blob/3c18bac4e378ebce3949af3cac2a0874009bc74b/cmd/cloud/deploy.go#L110) and the respective tests
Prone to error: the SQL behaviour is lost if the user opens a new terminal or forgets to set the environment variable
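A minimal sketch of the feature-flag branching described above, using the ASTRO_ENABLE_SQL variable from the example; the scaffolding step shown is an illustrative assumption.

```go
// Hypothetical sketch of approach 2(d): existing commands branch on the
// ASTRO_ENABLE_SQL environment variable to enrich their behaviour.
package main

import (
	"fmt"
	"os"
)

func sqlFeaturesEnabled() bool {
	return os.Getenv("ASTRO_ENABLE_SQL") == "1"
}

func devInit() {
	fmt.Println("running the existing `astro dev init` steps")
	if sqlFeaturesEnabled() {
		// Enriched path: also scaffold the SQL workflow directories.
		fmt.Println("scaffolding workflows/ and config/ for SQL flows")
	}
}

func main() {
	devInit()
}
```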
e) Command flag
We would define a command flag which, if passed, would make the Astro CLI internally invoke SQL CLI commands.
Users would follow these steps:
$ astro dev init --flow # enriched
$ astro flow validate
$ astro flow run
$ astro deploy --flow # enriched, this could be confusing (if the flag is not passed, would we still be deploying the DAGs which were pre-generated before?)
Pros
Astro-interface consistency: users who are already used to the Astro CLI would be able to reuse some of the existing commands, including astro dev init and astro deploy.
Re-work: we can reuse current features of the Astro CLI, such as deployment, without the need for any rework
Integration: already done before
Context-aware: Supports all contexts
Cons
Isolation: high coupling between the Astro CLI and SQL CLI. The SQL CLI could break Astro CLI users’ experience.
SQL-interface consistency: The SQL-related commands are not grouped. The user must use different subcommands for different features
Maintenance: this approach is more coupled to the Astro CLI. The maintenance burden would be smaller than with the context approaches, but we would still have duplication. For instance, we would have to add an if branch to all the available contexts (https://github.com/astronomer/astro-cli/blob/3c18bac4e378ebce3949af3cac2a0874009bc74b/cmd/cloud/deploy.go#L110) and the respective tests
Prone to error: if the user forgets to add the flag to deploy, for instance, they won't deploy the SQL DAGs
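A minimal sketch of the command-flag variant using Cobra; again, the enriched scaffolding step is an illustrative assumption.

```go
// Hypothetical sketch of approach 2(e): a --flow flag on an existing command
// enables the SQL CLI enrichment.
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func newDevInitCommand() *cobra.Command {
	var flow bool
	cmd := &cobra.Command{
		Use: "init",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Println("running the existing `astro dev init` steps")
			if flow {
				// Enriched path: also scaffold the SQL workflow directories.
				fmt.Println("scaffolding workflows/ and config/ for SQL flows")
			}
			return nil
		},
	}
	cmd.Flags().BoolVar(&flow, "flow", false, "also initialise SQL flow files")
	return cmd
}

func main() {
	root := &cobra.Command{Use: "astro"}
	dev := &cobra.Command{Use: "dev"}
	dev.AddCommand(newDevInitCommand())
	root.AddCommand(dev)
	_ = root.Execute()
}
```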
Overview
| | Sub-command | One context | Two contexts | Feature flag / command flag |
|---|---|---|---|---|
| Astro-interface consistency | ★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| SQL-interface consistency | ★★★★★ | ★★ | ★★ | ★★ |
| Integration | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Isolation | ★★★★★ | ★★ | ★★ | ★★ |
| Re-work | ★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Context-aware | ★★★★★ | ★★ | ★★★ | ★★★★★ |
| Maintenance | ★★★★★ | ★★★ | ★★ | ★★★ |
Recommendation
Our recommendation is to start by implementing the sub-command approach 2(a). This keeps the development more decoupled, which is useful particularly during alpha/beta development. This approach gives SQL users a more unified experience, with the downside of being less connected to the existing Astro CLI experience.
We will revisit this for Beta. The feature flag/command flag approaches 2(d) and 2(e) may prove to be more suitable for a unified Astro CLI user experience.
3. Add-ons
The Astro CLI already creates several directories and files. It also supports building Docker containers using those files and directories and deploying them to the Astronomer cloud.
The current directory structure created when we use the astro dev init command is:
.
├── airflow_settings.yaml
├── dags
│ ├── example_dag_advanced.py
│ └── example_dag_basic.py
├── Dockerfile
├── include
├── packages.txt
├── plugins
├── README.md
├── requirements.txt
└── tests
└── dags
└── test_dag_integrity.py
Most of these are not meaningful to a SQL workflow author, since they are very Python-focused.
The files and directories that matter from a SQL perspective are:
Environment-specific configuration files which contain only properties that the SQL analysts care about (e.g. connections)
Directory which will contain the SQL files
Sample SQL statements
Output directory where the materialised Python DAGs will be saved, which could even be a "hidden" directory set as the Airflow home directory, so SQL analysts don't have to worry about it
Whether or not to keep the files from the Python project is a decision related to (2). So this section focuses on proposing the minimal files we need for delivering the SQL CLI 0.1.
Recommendation
This is a recommendation of what the SQL CLI should be creating:
sql-test/
├── config/
│   ├── default/
│   │   └── configuration.yml
│   └── dev/
│       └── configuration.yml
├── workflows/
│   └── workflow_example1/
│       ├── example_1a.sql
│       └── example_1b.sql
└── .dags/
The initial configuration file could look like:
connections:
  - conn_id: postgres_conn
    conn_type: postgres
    host: localhost
    schema:
    login: postgres
    password: postgres
    port: 5433
    extra:
It should also support referencing environment variables, for example:
connections:
  - conn_id: postgres_conn
    conn_type: postgres
    host: localhost
    schema:
    login: postgres
    password: ${POSTGRES_CONN_PASSWORD}
    port: 5433
    extra:
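To illustrate the environment-variable referencing, here is a sketch of how a configuration loader could expand ${VAR} references before parsing the YAML. It is written in Go for consistency with the other sketches (the actual SQL CLI is Python), and the struct fields are assumptions mirroring the example file above.

```go
// Hypothetical sketch: load a configuration file and expand ${VAR}
// references from the environment before parsing.
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Connection mirrors the example configuration.yml; the schema is assumed.
type Connection struct {
	ConnID   string `yaml:"conn_id"`
	ConnType string `yaml:"conn_type"`
	Host     string `yaml:"host"`
	Schema   string `yaml:"schema"`
	Login    string `yaml:"login"`
	Password string `yaml:"password"`
	Port     int    `yaml:"port"`
	Extra    string `yaml:"extra"`
}

type Configuration struct {
	Connections []Connection `yaml:"connections"`
}

func loadConfiguration(path string) (*Configuration, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	// Expand ${POSTGRES_CONN_PASSWORD}-style references before parsing.
	expanded := os.ExpandEnv(string(raw))
	var cfg Configuration
	if err := yaml.Unmarshal([]byte(expanded), &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfiguration("config/default/configuration.yml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, c := range cfg.Connections {
		fmt.Println(c.ConnID, c.ConnType)
	}
}
```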