Configuration

Configuring the database default schema

If users don’t define a specific Table (metadata) schema, the Astro SDK will fall back to the global default schema configuration.

There are two options to define the default schema:

1. At a global level, for all databases
2. At a database level, for each specific database

If the user does not configure the database-specific option, the Astro SDK uses the global default schema (which defaults to tmp_astro if undefined). Example, using an environment variable:

AIRFLOW__ASTRO_SDK__SQL_SCHEMA="tmp"

or by updating Airflow’s configuration:

[astro_sdk]
schema = "tmp"

We can also configure the default schema for a specific database type (for example, Snowflake, BigQuery, or Postgres). If both the global default and a database-specific schema are defined, the database-specific value takes precedence. As environment variables:

AIRFLOW__ASTRO_SDK__POSTGRES_DEFAULT_SCHEMA="postgres_tmp"
AIRFLOW__ASTRO_SDK__BIGQUERY_DEFAULT_SCHEMA="bigquery_tmp"
AIRFLOW__ASTRO_SDK__SNOWFLAKE_DEFAULT_SCHEMA="snowflake_tmp"
AIRFLOW__ASTRO_SDK__REDSHIFT_DEFAULT_SCHEMA="redshift_tmp"

or by updating Airflow’s configuration:

[astro_sdk]
postgres_default_schema = "postgres_tmp"
bigquery_default_schema = "bigquery_tmp"
snowflake_default_schema = "snowflake_tmp"
redshift_default_schema = "redshift_tmp"
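
To illustrate the precedence, a minimal Python sketch (the connection ID and table names here are hypothetical):

from astro.table import Metadata, Table

# No schema given: the Astro SDK falls back to postgres_default_schema,
# or to the global sql_schema if no database-specific value is set.
scratch_table = Table(conn_id="postgres_conn")

# An explicit Metadata schema always takes precedence over the defaults.
reporting_table = Table(
    name="daily_report",
    conn_id="postgres_conn",
    metadata=Metadata(schema="reporting"),
)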

Configuring the unsafe dataframe storage

Dataframes (generated by the dataframe or transform operators) are stored in the XCom table of Airflow’s metadata database using pickling. Since these dataframes are defined by the user, a sufficiently large one can break Airflow’s metadata DB by consuming all the available resources. Hence, unsafe dataframe storage should only be set to True once you are aware of this risk and are OK with it. Alternatively, you could use a custom XCom backend to store the XCom data.

AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True

or by updating Airflow’s configuration:

[astro_sdk]
dataframe_allow_unsafe_storage = True
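
To illustrate what this setting guards against, a minimal sketch of a dataframe task whose return value is pickled into XCom (the task itself is hypothetical):

import pandas as pd

from astro import sql as aql

@aql.dataframe
def build_report() -> pd.DataFrame:
    # The returned dataframe is pickled into Airflow's XCom table, so a
    # large result can exhaust the metadata database's resources.
    return pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})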

Configuring the storage integration for Snowflake

A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of allowed or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure). Cloud provider administrators in your organization grant permissions on the storage locations to the generated entity. This option allows users to avoid supplying credentials when creating stages or when loading or unloading data.

Read more at: Snowflake storage integrations

AIRFLOW__ASTRO_SDK__SNOWFLAKE_STORAGE_INTEGRATION_AMAZON="aws_integration"
AIRFLOW__ASTRO_SDK__SNOWFLAKE_STORAGE_INTEGRATION_GOOGLE="gcp_integration"

or by updating Airflow’s configuration:

[astro_sdk]
snowflake_storage_integration_amazon = "aws_integration"
snowflake_storage_integration_google = "gcp_integration"
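
With a storage integration configured, loads from external storage into Snowflake do not need explicit cloud credentials. A minimal sketch (the bucket, table, and connection ID are hypothetical):

from astro import sql as aql
from astro.files import File
from astro.table import Table

# When the Astro SDK stages this file for a native Snowflake load, it can
# use the configured storage integration (e.g. "aws_integration" for S3)
# instead of explicit cloud credentials.
load = aql.load_file(
    input_file=File("s3://my-bucket/data.csv"),
    output_table=Table(name="sample_table", conn_id="snowflake_conn"),
)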

Configuring the table autodetect row count

The following configuration controls how many rows of a file are loaded to infer the table column types. It defaults to 1000 rows.

AIRFLOW__ASTRO_SDK__LOAD_TABLE_AUTODETECT_ROWS_COUNT=1000

or by updating Airflow’s configuration:

[astro_sdk]
load_table_autodetect_rows_count = 1000
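
This setting only matters when load_file has to create the target table and infer its column types from the file contents. A minimal sketch (the file path, table, and connection ID are hypothetical):

from astro import sql as aql
from astro.files import File
from astro.table import Table

# Column types for the auto-created table are inferred from the first
# load_table_autodetect_rows_count rows of the file (1000 by default).
events = aql.load_file(
    input_file=File("gs://my-bucket/events.csv"),
    output_table=Table(name="events", conn_id="bigquery_conn"),
)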

Configuring the RAW SQL maximum response size

Limits the size of the responses returned by aql.run_raw_sql, to avoid overloading the Airflow metadata database when the default BaseXCom backend is used.

AIRFLOW__ASTRO_SDK__RUN_RAW_SQL_RESPONSE_SIZE=1

or by updating Airflow’s configuration:

[astro_sdk]
run_raw_sql_response_size = 1
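
For context, run_raw_sql results reach XCom when a handler returns them; this setting caps the size of that stored response. A minimal sketch (the table and query are hypothetical):

from astro import sql as aql
from astro.table import Table

@aql.run_raw_sql(handler=lambda result: result.fetchall())
def count_orders(orders: Table):
    # The handler's return value is pushed to XCom, which is what this
    # setting limits when the default BaseXCom backend is used.
    return "SELECT COUNT(*) FROM {{orders}}"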

Configuring the Dataset inlets/outlets

The Astro SDK automatically adds inlets and outlets for all operators if Datasets are supported (Airflow >= 2.4).

While users can override this at the task level by adding inlets and outlets explicitly, doing so is inconvenient for users who do not want to leverage data-aware scheduling at all. Such users can set the following config to False to disable the automatic addition of inlets and outlets:

AIRFLOW__ASTRO_SDK__AUTO_ADD_INLETS_OUTLETS=False

or by updating Airflow’s configuration:

[astro_sdk]
auto_add_inlets_outlets = False
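
For the task-level override mentioned above, a hedged sketch, assuming the decorators forward standard operator keyword arguments such as outlets (the Dataset URI and query here are hypothetical):

from airflow.datasets import Dataset

from astro import sql as aql
from astro.table import Table

@aql.transform(outlets=[Dataset("postgres://analytics/reporting_daily")])
def recent_orders(orders: Table):
    # Overrides the outlets the Astro SDK would otherwise add automatically.
    return "SELECT * FROM {{orders}} WHERE created_at > CURRENT_DATE - 1"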

Configuring the emission of temporary table events in OpenLineage

The Astro SDK has the ability to create temporary tables; see Tables.

By default, we emit temporary table events in OpenLineage.

This may not be useful for users who do not want such events emitted in OpenLineage. Those users can set the following config to False to disable it:

AIRFLOW__ASTRO_SDK__OPENLINEAGE_EMIT_TEMP_TABLE_EVENT=False

or by updating Airflow’s configuration:

[astro_sdk]
openlineage_emit_temp_table_event = False
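
For context, the Astro SDK treats an unnamed Table as temporary; a minimal sketch (the connection ID is hypothetical):

from astro.table import Table

# An unnamed table gets an auto-generated name and is treated as a
# temporary table; this setting controls whether an OpenLineage event is
# emitted for it.
temp_table = Table(conn_id="postgres_conn")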

Configuring the native fallback mechanism

The LoadFileOperator has a fallback mechanism when loading data into the database from file storage, as explained in How load_file Works.

This fallback can be configured at the task level using the enable_native_fallback param.

Users can also control this setting and override the default at a global level (for all tasks) by setting the following config. Set it to True to allow falling back to the “pandas” path:

AIRFLOW__ASTRO_SDK__LOAD_FILE_ENABLE_NATIVE_FALLBACK=False

or by updating Airflow’s configuration:

[astro_sdk]
load_file_enable_native_fallback = False
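
For comparison, the task-level override mentioned above looks like this (the file path, table, and connection ID are hypothetical):

from astro import sql as aql
from astro.files import File
from astro.table import Table

# Overrides the global load_file_enable_native_fallback setting for this
# task only, allowing a fallback to the "pandas" load path.
load = aql.load_file(
    input_file=File("s3://my-bucket/data.csv"),
    output_table=Table(name="sample_table", conn_id="postgres_conn"),
    enable_native_fallback=True,
)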