Developing the package
Prerequisites
Python 3.7, 3.8, or 3.9
(Optional but highly recommended) pyenv
On Apple M1, you currently need to install the postgresql package. Once compatible wheels are released, you can remove it.
Set up a development environment
To set up your local environment, run:
make local target=setup
You will see that there are a series of AWS- and Snowflake-based environment variables. You only need to set these if you want to test Snowflake or AWS functionality.
Finally, let's set up a toy Postgres instance to run queries against.
We've created a Docker image that uses the sample pagila database for testing and experimentation. To use this database, run the following Docker image in the background. Note that we use port 5433 so that this Postgres instance does not interfere with other running Postgres instances.
docker run --rm -it -p 5433:5432 dimberman/pagila-test &
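To confirm the container is up, you can run a quick query against it. Below is a minimal sketch; the postgres/postgres credentials and the pagila database name are assumptions about the image's defaults, so adjust them as needed:

import psycopg2

# Connect to the pagila container on the non-default port 5433.
# The credentials and database name are assumptions about the image.
connection = psycopg2.connect(
    host="localhost",
    port=5433,
    user="postgres",
    password="postgres",
    dbname="pagila",
)
with connection, connection.cursor() as cursor:
    cursor.execute("SELECT count(*) FROM actor;")
    print(cursor.fetchone())  # the pagila sample data ships with 200 actors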
Set up IDE and editor support
nox -s dev
Once completed, point the Python environment to .nox/dev in your IDE or editor of choice.
Set up pre-commit hooks
If you do not have pre-commit installed, run the following command to get a copy:
nox --install-only lint
Then find the pre-commit command in .nox/lint.
After locating the pre-commit command, run:
path/to/pre-commit install
Run linters manually
nox -s lint
Run tests
On all supported Python versions:
nox -s test
On only 3.9 (for example):
nox -s test-3.9
Please also note that you can reuse an existing environment by running nox with the -r argument (or -R if you don't want to attempt to reinstall packages). This can significantly speed up repeat test runs.
Build documentation
nox -s build_docs
Check code coverage
To run code coverage locally, you can either use pytest in one of the test environments or run nox -s test with coverage arguments. We use pytest-cov for our coverage reporting.
Below is an example of running a coverage report on a single test. In this case the relevant file is src/astro/sql/operators/sql_decorator.py, since we are testing the postgres transform decorator.
nox -R -s test -- --cov-report term --cov-branch --cov=src/astro/sql/operators tests/operators/test_postgres_decorator.py
===================================================== test session starts =====================================================
platform darwin -- Python 3.9.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/dimberman/code/astronomer/astro-project/plugins/astro, configfile: pyproject.toml
plugins: anyio-3.5.0, requests-mock-1.9.3, split-0.6.0, dotenv-0.5.2, cov-3.0.0
collected 12 items
tests/operators/test_postgres_decorator.py ............ [100%]
====================================================== warnings summary =======================================================
---------- coverage: platform darwin, python 3.9.10-final-0 ----------
Name Stmts Miss Branch BrPart Cover Missing
-------------------------------------------------------------------------------------------------
src/astro/sql/operators/__init__.py 0 0 0 0 100%
src/astro/sql/operators/agnostic_aggregate_check.py 46 32 16 0 26% 61-89, 100-138, 162
src/astro/sql/operators/agnostic_boolean_check.py 66 45 16 0 30% 19-21, 24, 27, 51-65, 80-95, 98-105, 109, 115-128, 131, 149
src/astro/sql/operators/agnostic_load_file.py 56 35 10 0 35% 61-67, 76-101, 106-110, 118-140, 166-167
src/astro/sql/operators/export_file.py 65 43 14 0 30% 64-70, 79-95, 98-109, 112-152, 162-182, 188-190, 220-224
src/astro/sql/operators/agnostic_sql_append.py 50 36 20 0 23% 45-56, 67-85, 90-117
src/astro/sql/operators/agnostic_sql_merge.py 43 28 12 0 31% 48-59, 69-118
src/astro/sql/operators/agnostic_sql_truncate.py 20 11 2 0 50% 32-40, 55-60
src/astro/sql/operators/agnostic_stats_check.py 110 86 32 0 21% 24-27, 32-33, 36-49, 52-73, 76-92, 95-98, 103-119, 122, 125-134, 146-169, 196-216, 231-260, 280
src/astro/sql/operators/sql_dataframe.py 76 13 22 2 79% 83, 130, 160-174
src/astro/sql/operators/sql_decorator.py 201 45 78 16 72% 107-110, 126->128, 137, 166, 175, 194-196, 206->210, 223-224, 228-243, 247-248, 259, 277, 280, 287, 291-293, 296, 311, 315, 322-327, 330-335, 340, 346-363, 380-392
-------------------------------------------------------------------------------------------------
TOTAL 733 374 222 18 46%
Release a new version
Build new release artifacts:
nox -s build
Publish a release to PyPI:
nox -s release
Nox tips
Pass -R to skip environment setup, e.g. nox -Rs lint
Pass -r to skip environment creation but re-install packages, e.g. nox -rs dev
Find more automation commands with nox -l
Using a container to run Airflow DAGs
You can configure the Docker-based testing environment to test your DAGs.
Install the latest versions of the Docker Community Edition and Docker Compose and add them to the PATH.
Run:
make container target=build-run
Put the DAGs you want to run in the dev/dags directory.
If you want to add Connections, create a connections.yaml file in the dev directory.
See the Connections Guide for more information.
Example:
druid_broker_default: conn_type: druid extra: '{"endpoint": "druid/v2/sql"}' host: druid-broker login: null password: null port: 8082 schema: null airflow_db: conn_type: mysql extra: null host: mysql login: root password: plainpassword port: null schema: airflow
The following commands are available to run from the root of the repository:
make container target=logs - To view the logs of all the containers
make container target=stop - To stop all the containers
make container target=clean - To remove all the containers along with volumes
make container target=help - To view the available commands
make container target=build-run - To build the Docker image and then run the containers
make container target=docs - To build the docs using Sphinx
make container target=restart - To restart the Scheduler & Triggerer containers
make container target=restart-all - To restart all the containers
make container target=shell - To run bash/shell within a container (allows an interactive session)
make tilt-up - To run Tilt (https://tilt.dev/) for local development
make tilt-down - To stop Tilt
The following ports are accessible from the host machine:
8080 - Webserver
5555 - Flower
5432 - Postgres
Dev directories:
dev/dags/ - DAG files
dev/logs/ - Log files of the Airflow containers
Adding support for a new database
You can use a test-driven approach to add support for a new database to the Astro Python SDK. You can cover all the Python SDK operators by adding parameters for the new database to the existing tests, and by adding further database-specific implementation tests. You can take a look at this PR and this PR to see how to add parameters to the existing tests for all the operators.
To start with, you can take the following steps for the initial configuration (a rough sketch of steps 1 and 2 follows the list):
1. Add the database name constant to the Database class in the constants.py module.
2. Add the database schema constant to the settings.py module. The default schema name is tmp_astro, so if you do not specify your own schema name, your tests will create tables in that schema in your database.
3. Create a test-connections.yaml file in the python-sdk directory to add the connections to the database which will be used by the tests. This file is ignored by git, as listed in .gitignore, so you need not worry about your secrets getting checked in accidentally. A sample test-connections.yaml file would look like the below:

   - conn_id: gcp_conn
     conn_type: google_cloud_platform
     description: null
     extra: null
   - conn_id: aws_conn
     conn_type: aws
     description: null
     extra: null
   - conn_id: redshift_conn
     conn_type: redshift
     schema: "dev"
     host: <YOUR_REDSHIFT_CLUSTER_HOST_URL>
     port: 5439
     login: <YOUR_REDSHIFT_CLUSTER_USER>
     password: <YOUR_REDSHIFT_CLUSTER_PASSWORD>

4. Add a mapping of the database name to the connection ID in conftest.py.
5. Add the environment variables needed for the implementation to work by creating a .env file and adding them to it. The environment variables could be related to cloud access credentials, development-specific Airflow config variables, etc. A .env file could look like the below (you can ask your team to share team-level credentials for accessing specific sources, if any):

   AIRFLOW__CORE__ENABLE_XCOM_PICKLING=True
   GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_GOOGLE_SERVICE_ACCOUNT_JSON>
   AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True
   AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
   AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
   ASTRO_CHUNKSIZE=10
   GCP_BUCKET=astro-sdk
   REDSHIFT_NATIVE_LOAD_IAM_ROLE_ARN=<REDSHIFT_NATIVE_LOAD_IAM_ROLE_ARN>
   ASTRO_PUBLISH_BENCHMARK_DATA=True

6. For the tests to run in CI, you will need to create appropriate secrets in the GitHub Actions configuration for the repository. Contact the repository admins @kaxil (Kaxil Naik, kaxil@astronomer.io) or @tatiana (Tatiana Al-Chueyr, tatiana.alchueyr@astronomer.io) to add the secrets you need to GitHub Actions. To make the connections available, create them in .github/ci-test-connection.yaml similar to step 3, and reference the secrets from the environment variables, which need to be created in .github/workflows/ci-python-sdk.yaml.
7. Add the database as a Python SDK supported database in the test_constants.py module's test_supported_database() method.
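As a rough illustration of steps 1 and 2, the additions might look like the sketch below. The "duckdb" database and the constant names are hypothetical; follow the existing entries in constants.py and settings.py rather than this sketch.

import os
from enum import Enum

# constants.py (sketch): add the new database name to the Database class.
# DUCKDB is a hypothetical new entry; the other members are for context only.
class Database(Enum):
    POSTGRES = "postgres"
    SNOWFLAKE = "snowflake"
    DUCKDB = "duckdb"  # new entry

# settings.py (sketch): a schema constant that falls back to the default
# tmp_astro schema; the environment variable name is an assumption.
DEFAULT_SCHEMA = "tmp_astro"
DUCKDB_SCHEMA = os.getenv("DUCKDB_SCHEMA", DEFAULT_SCHEMA)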
With the above configuration in place, you can now proceed to implement support for all SDK operators in the new database.
The purpose of each of the operators can be found in the Astro SDK Python - Operators document.
As described before, you can use test-driven development and run the tests for the operators one by one; they are located in the tests directory.
By default, the base class implementation methods will be used for the database in the tests. You will need to override some of these methods to make the tests run successfully. You can create a module for your database in the databases directory.
You can start by running the tests for the load_file operator. The relevant tests can be found in the tests/sql/operators directory; tests for load_file are kept in test_load_file.py.
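For instance, adding your database to an existing parametrized test might look like the sketch below. The database_table_fixture name and parameter shape follow a common pattern in the test suite, but treat them as assumptions and copy the exact form from the test you are extending.

import pytest

from astro.constants import Database

# Sketch: Database.DUCKDB is the hypothetical new entry from the steps above.
@pytest.mark.parametrize(
    "database_table_fixture",
    [
        {"database": Database.POSTGRES},
        {"database": Database.DUCKDB},  # the newly added database
    ],
    indirect=True,
    ids=["postgres", "duckdb"],
)
def test_load_file(database_table_fixture):
    database, test_table = database_table_fixture
    ...  # exercise load_file against the table and assert on the results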
You might need to override a few base class methods to establish a connection to the database based on its semantics, e.g. sql_type(), hook(), sqlalchemy_engine(), default_metadata(), schema_exists(), table_exists(), load_pandas_dataframe_to_table().
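A skeleton of such a module might look like the sketch below. The hook import, method signatures, and duckdb naming are illustrative assumptions; mirror an existing module in the databases directory and the exact signatures in astro/databases/base.py.

import pandas as pd

from astro.databases.base import BaseDatabase
from astro.table import Metadata


class DuckdbDatabase(BaseDatabase):
    @property
    def sql_type(self) -> str:
        return "duckdb"

    @property
    def hook(self):
        # Return the Airflow hook for this database (hypothetical provider import).
        from duckdb_provider.hooks.duckdb_hook import DuckDBHook

        return DuckDBHook(duckdb_conn_id=self.conn_id)

    @property
    def default_metadata(self) -> Metadata:
        # Fall back to the default tmp_astro schema described above.
        return Metadata(schema="tmp_astro")

    def load_pandas_dataframe_to_table(
        self, source_dataframe: pd.DataFrame, target_table, if_exists="replace", chunk_size=1000
    ) -> None:
        # Write the dataframe through the SQLAlchemy engine from the base class.
        source_dataframe.to_sql(
            target_table.name,
            con=self.sqlalchemy_engine,
            schema=target_table.metadata.schema,
            if_exists=if_exists,
            chunksize=chunk_size,
            index=False,
        )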
The following are important pointers for implementing the operators (a sketch of a merge_table override follows this list):
Investigate how to write a Pandas dataframe to the database and use that in the load_pandas_dataframe_to_table implementation.
Check which file types the database supports for load and implement support for those. In general, the file types and file stores supported by the Astro SDK can be found here.
Check which merge strategies the database supports and use them in the merge_table implementation. Example PR: Merge implementation for Redshift.
The default approach to loading a file into a database table is to load the file into a Pandas dataframe first and then load the dataframe into the table. However, this is generally the slower approach. For optimised loads, try to support native loads: check which file types and object stores the database supports for native load and provide support for those. Also, handle/retry the native load exceptions thrown by the corresponding connector library. Example PR: Native load support for Redshift.
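As an example of the merge pointer above, a merge_table override built on Postgres-style INSERT ... ON CONFLICT syntax might look like the following sketch; verify the method signature against astro/databases/base.py and swap in your database's own merge dialect.

from astro.databases.base import BaseDatabase


class DuckdbDatabase(BaseDatabase):  # continuing the sketch above
    def merge_table(
        self,
        source_table,
        target_table,
        source_to_target_columns_map,
        target_conflict_columns,
        if_conflicts="exception",
    ):
        source_cols = ", ".join(source_to_target_columns_map.keys())
        target_cols = ", ".join(source_to_target_columns_map.values())
        conflict_cols = ", ".join(target_conflict_columns)
        update_clause = ", ".join(
            f"{col} = EXCLUDED.{col}" for col in source_to_target_columns_map.values()
        )
        # Map the SDK's conflict strategies onto the database's merge syntax.
        on_conflict = {
            "exception": "",  # plain INSERT; conflicting rows raise an error
            "ignore": f"ON CONFLICT ({conflict_cols}) DO NOTHING",
            "update": f"ON CONFLICT ({conflict_cols}) DO UPDATE SET {update_clause}",
        }[if_conflicts]
        statement = (
            f"INSERT INTO {self.get_table_qualified_name(target_table)} ({target_cols}) "
            f"SELECT {source_cols} FROM {self.get_table_qualified_name(source_table)} "
            f"{on_conflict}"
        )
        self.run_sql(statement)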
We recommend providing example DAGs in the example_dags directory to guide usage relevant to your database, and adding those example DAGs to the integration test suite run in test_example_dags.py.
Additionally, once you have implemented support for all the Astro SDK Python operators, you also need to benchmark the performance of the file loads. You can refer to the benchmarking guide to generate results and publish them for the database, as in this PR.