Configuration
Configuring the database default schema
If users don’t define a specific Table (metadata) schema, the Astro SDK will fall back to the global default schema configuration.
There are two options to define the default schema:
1. At a global level, for all databases
2. At a database level, for each specific database
If the user does not set a database-specific configuration, the Astro SDK uses the global default schema (which is tmp_astro if undefined). Example:
Environment variable:
AIRFLOW__ASTRO_SDK__SQL_SCHEMA="tmp"
or by updating Airflow’s configuration
[astro_sdk]
schema = "tmp"
The default schema can also be configured per database type (for example, Snowflake, BigQuery, or Postgres). If both the global default and a database-specific schema are defined, the database-specific value takes precedence.
AIRFLOW__ASTRO_SDK__POSTGRES_DEFAULT_SCHEMA="postgres_tmp"
AIRFLOW__ASTRO_SDK__BIGQUERY_DEFAULT_SCHEMA="bigquery_tmp"
AIRFLOW__ASTRO_SDK__SNOWFLAKE_DEFAULT_SCHEMA="snowflake_tmp"
AIRFLOW__ASTRO_SDK__REDSHIFT_DEFAULT_SCHEMA="redshift_tmp"
or by updating Airflow’s configuration
[astro_sdk]
postgres_default_schema = "postgres_tmp"
bigquery_default_schema = "bigquery_tmp"
snowflake_default_schema = "snowflake_tmp"
redshift_default_schema = "redshift_tmp"
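To illustrate how these settings interact with task code, here is a minimal sketch using the SDK's load_file operator (it assumes recent astro-sdk-python import paths; the DAG ID, connection IDs, task IDs, and file path are placeholders). A Table created without an explicit Metadata schema lands in the configured default schema, while a Table with an explicit Metadata schema ignores these settings.

from datetime import datetime

from airflow import DAG

from astro import sql as aql
from astro.files import File
from astro.table import Metadata, Table

with DAG(dag_id="default_schema_example", start_date=datetime(2023, 1, 1), schedule=None):
    # No schema given: resolved from postgres_default_schema, falling back to
    # the global schema setting (tmp_astro if neither is set).
    aql.load_file(
        task_id="load_orders_default_schema",
        input_file=File(path="s3://my-bucket/orders.csv", conn_id="aws_conn"),
        output_table=Table(name="orders", conn_id="postgres_conn"),
    )

    # Explicit schema: the default-schema settings above are ignored for this table.
    aql.load_file(
        task_id="load_orders_explicit_schema",
        input_file=File(path="s3://my-bucket/orders.csv", conn_id="aws_conn"),
        output_table=Table(
            name="orders",
            conn_id="postgres_conn",
            metadata=Metadata(schema="analytics"),
        ),
    )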
Configuring the unsafe dataframe storage
Dataframes generated by the dataframe or transform operators are stored in the XCom table of the Airflow metadata database using pickling. Because these dataframes are defined by the user, a very large one can consume all available resources and break Airflow's metadata database. Hence, unsafe dataframe storage should only be set to True once you are aware of this risk and accept it. Alternatively, you can use a custom XCom backend to store the XCom data. See the sketch after the configuration below for where such dataframes come from.
AIRFLOW__ASTRO_SDK__DATAFRAME_ALLOW_UNSAFE_STORAGE=True
or by updating Airflow’s configuration
[astro_sdk]
dataframe_allow_unsafe_storage = True
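For context, the dataframes in question are the return values of tasks like the following minimal sketch (the DAG ID, table, column names, and connection ID are placeholders). With the default XCom backend, the returned pandas DataFrame is pickled into the metadata database, which is why this setting is opt-in.

from datetime import datetime

import pandas as pd
from airflow import DAG

from astro import sql as aql
from astro.table import Table

@aql.dataframe
def summarize_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # The returned DataFrame is pushed to XCom; with the default XCom backend
    # it is pickled into the Airflow metadata database.
    return orders.groupby("customer_id", as_index=False)["amount"].sum()

with DAG(dag_id="dataframe_example", start_date=datetime(2023, 1, 1), schedule=None):
    summarize_orders(orders=Table(name="orders", conn_id="postgres_conn"))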
Configuring the storage integration for Snowflake
A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of allowed or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure). Cloud provider administrators in your organization grant permissions on the storage locations to the generated entity. This option allows users to avoid supplying credentials when creating stages or when loading or unloading data.
Read more at: Snowflake storage integrations
AIRFLOW__ASTRO_SDK__SNOWFLAKE_STORAGE_INTEGRATION_AMAZON="aws_integration"
AIRFLOW__ASTRO_SDK__SNOWFLAKE_STORAGE_INTEGRATION_GOOGLE="gcp_integration"
or by updating Airflow’s configuration
[astro_sdk]
snowflake_storage_integration_amazon = "aws_integration"
snowflake_storage_integration_google = "gcp_integration"
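As an illustration (the DAG ID, connection IDs, and S3 path are placeholders), loading a file from S3 into Snowflake is where such an integration applies: with the setting above, the SDK can reference the named storage integration when staging the file in Snowflake, rather than passing cloud credentials through.

from datetime import datetime

from airflow import DAG

from astro import sql as aql
from astro.files import File
from astro.table import Table

with DAG(dag_id="s3_to_snowflake_example", start_date=datetime(2023, 1, 1), schedule=None):
    # With snowflake_storage_integration_amazon set, the SDK can use the
    # "aws_integration" storage integration while staging this S3 file in
    # Snowflake, so no AWS credentials need to be handed to Snowflake itself.
    aql.load_file(
        input_file=File(path="s3://my-bucket/orders.csv", conn_id="aws_conn"),
        output_table=Table(name="orders", conn_id="snowflake_conn"),
    )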
Configuring the table autodetect row count
The following configuration indicates how many file rows should be loaded to infer the table column types. It defaults to 1000 rows.
AIRFLOW__ASTRO_SDK__LOAD_TABLE_AUTODETECT_ROWS_COUNT=1000
or by updating Airflow’s configuration
[astro_sdk]
load_table_autodetect_rows_count = 1000
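As a sketch of when this sampling happens (the DAG ID, task IDs, paths, and connection IDs are placeholders, and the explicit-columns variant assumes the Table columns field accepts SQLAlchemy Column objects): type inference is only needed when the output table's columns are not declared.

from datetime import datetime

from airflow import DAG
from sqlalchemy import Column, Integer, String

from astro import sql as aql
from astro.files import File
from astro.table import Table

with DAG(dag_id="autodetect_example", start_date=datetime(2023, 1, 1), schedule=None):
    # No columns declared: the SDK samples the file (up to
    # load_table_autodetect_rows_count rows) to infer the column types.
    aql.load_file(
        task_id="load_events_inferred",
        input_file=File(path="gs://my-bucket/events.csv", conn_id="gcp_conn"),
        output_table=Table(name="events", conn_id="bigquery_conn"),
    )

    # Columns declared explicitly: no sampling-based type inference is needed.
    aql.load_file(
        task_id="load_events_typed",
        input_file=File(path="gs://my-bucket/events.csv", conn_id="gcp_conn"),
        output_table=Table(
            name="events_typed",
            conn_id="bigquery_conn",
            columns=[Column("id", Integer), Column("name", String(50))],
        ),
    )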
Configuring the RAW SQL maximum response size
Limit the size of the responses returned by aql.run_raw_sql to avoid overloading the Airflow metadata database when the default BaseXCom backend is used.
AIRFLOW__ASTRO_SDK__RUN_RAW_SQL_RESPONSE_SIZE=1
or by updating Airflow’s configuration
[astro_sdk]
run_raw_sql_response_size = 1
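For reference, the response being limited is whatever the run_raw_sql handler returns, as in this minimal sketch (the DAG ID, table, query, and connection ID are placeholders):

from datetime import datetime

from airflow import DAG

from astro import sql as aql
from astro.table import Table

def fetch_all(result):
    # The handler receives the database result; whatever it returns is pushed
    # to XCom, which is the response that run_raw_sql_response_size constrains.
    return result.fetchall()

@aql.run_raw_sql(handler=fetch_all, conn_id="postgres_conn")
def top_orders(orders: Table):
    return "SELECT * FROM {{ orders }} ORDER BY amount DESC LIMIT 10"

with DAG(dag_id="raw_sql_example", start_date=datetime(2023, 1, 1), schedule=None):
    top_orders(orders=Table(name="orders", conn_id="postgres_conn"))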
Configuring the Dataset inlets/outlets
The Astro SDK automatically adds Dataset inlets and outlets to all operators when Datasets are supported (Airflow >= 2.4). Users can override this at the task level by supplying their own inlets and outlets, but doing so everywhere is inconvenient for users who do not want to leverage data-aware scheduling at all. Such users can set the following configuration to False to disable the automatic addition of inlets and outlets:
AIRFLOW__ASTRO_SDK__AUTO_ADD_INLETS_OUTLETS=False
or by updating Airflow’s configuration
[astro_sdk]
auto_add_inlets_outlets = False
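To show what the automatically added outlets enable when this setting is left at its default of True, here is a minimal sketch of data-aware scheduling (the DAG IDs, connection IDs, and path are placeholders, and it assumes an SDK/Airflow combination where a Table can be used directly as a Dataset in a DAG's schedule):

from datetime import datetime

from airflow import DAG

from astro import sql as aql
from astro.files import File
from astro.table import Table

orders_table = Table(name="orders", conn_id="postgres_conn")

# Producer: with auto_add_inlets_outlets enabled, the output table is
# registered automatically as an outlet (Dataset) of the load_file task.
with DAG(dag_id="orders_producer", start_date=datetime(2023, 1, 1), schedule=None):
    aql.load_file(
        input_file=File(path="s3://my-bucket/orders.csv", conn_id="aws_conn"),
        output_table=orders_table,
    )

# Consumer: data-aware scheduling runs this DAG whenever the producer updates
# the table (the Table behaves as a Dataset on Airflow >= 2.4).
with DAG(dag_id="orders_consumer", start_date=datetime(2023, 1, 1), schedule=[orders_table]):
    ...  # downstream tasks that consume orders_table go here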