Configuring the database default schema

If users don’t define a specific Table (metadata) schema, the Astro SDK will fall back to the global default schema configuration.

There are two options to define the default schema:

1. At a global level, for all databases
2. At a database level, for each specific database

If the user does not set a database-specific schema, the Astro SDK uses the global default schema (which has the value tmp_astro if undefined). Example, using an environment variable:


or by updating Airflow’s configuration

schema = "tmp"

The default schema can also be configured per database type (for example, Snowflake, BigQuery, or Postgres). If both the global default and a database-specific schema are defined, the database-specific value takes precedence.


or by updating Airflow’s configuration

postgres_default_schema = "postgres_tmp"
bigquery_default_schema = "bigquery_tmp"
snowflake_default_schema = "snowflake_tmp"
redshift_default_schema = "redshift_tmp"
mssql_default_schema = "mssql_tmp"
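The precedence described above can be sketched as a small lookup. This is only an illustration, not the SDK’s implementation: the function name resolve_default_schema and the plain dictionary of settings are hypothetical, while the key names and the tmp_astro fallback come from the text above.

```python
from typing import Mapping

# Value the text gives for the global default when nothing is configured.
GLOBAL_FALLBACK_SCHEMA = "tmp_astro"


def resolve_default_schema(settings: Mapping[str, str], database: str) -> str:
    """Pick the default schema for `database` (hypothetical helper, not SDK code)."""
    # 1. A database-specific key such as "postgres_default_schema" wins if set.
    specific = settings.get(f"{database}_default_schema")
    if specific:
        return specific
    # 2. Otherwise use the global "schema" key; 3. else fall back to tmp_astro.
    return settings.get("schema") or GLOBAL_FALLBACK_SCHEMA


settings = {"schema": "tmp", "postgres_default_schema": "postgres_tmp"}
print(resolve_default_schema(settings, "postgres"))   # postgres_tmp
print(resolve_default_schema(settings, "snowflake"))  # tmp
print(resolve_default_schema({}, "bigquery"))         # tmp_astro
```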

Configuring whether schema existence is checked and whether the SDK creates schemas

By default, during aql.load_file and aql.transform, the SDK checks if the schema of the target table exists, and if not, it tries to create it. This type of check can be costly.

The configuration AIRFLOW__ASTRO_SDK__ASSUME_SCHEMA_EXISTS allows users to inform the SDK that the schema already exists, skipping this check for all load_file and transform tasks.

Users can also exercise more granular control by setting the load_file argument assume_schema_exists on a per-task basis (see load_file).

Example of how to disable the schema existence check using an environment variable:

AIRFLOW__ASTRO_SDK__ASSUME_SCHEMA_EXISTS = True

Or using Airflow’s configuration file:

assume_schema_exists = True
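Airflow reads a configuration option from an environment variable named AIRFLOW__{SECTION}__{KEY}, in upper case. The sketch below shows how the variable name relates to the configuration key; the helper airflow_env_var is hypothetical, and placing the option in an astro_sdk section is an inference from the AIRFLOW__ASTRO_SDK__ASSUME_SCHEMA_EXISTS variable named above.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow reads for a config option."""
    # Airflow's convention: AIRFLOW__<SECTION>__<KEY>, with double underscores.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumption: the option lives in an "astro_sdk" section of the configuration.
print(airflow_env_var("astro_sdk", "assume_schema_exists"))
# AIRFLOW__ASTRO_SDK__ASSUME_SCHEMA_EXISTS
```

Not every option is guaranteed to follow its configuration-file key exactly, so check the variable name for each setting rather than deriving it blindly.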

Configuring the unsafe dataframe storage

Dataframes generated by the dataframe or transform operators are pickled and stored in the XCom table of the Airflow metadata database. Because the dataframe is defined by the user, a very large one can exhaust the available resources and break Airflow’s metadata DB. For this reason, set unsafe dataframe storage to True only once you are aware of this risk and accept it. Alternatively, you can use a custom XCom backend to store the XCom data.


or by updating Airflow’s configuration

dataframe_allow_unsafe_storage = True

Configuring the storage integration for Snowflake

A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of allowed or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure). Cloud provider administrators in your organization grant permissions on the storage locations to the generated entity. This option allows users to avoid supplying credentials when creating stages or when loading or unloading data.

Read more at: Snowflake storage integrations


or by updating Airflow’s configuration

snowflake_storage_integration_amazon = "aws_integration"
snowflake_storage_integration_google = "gcp_integration"

Configuring the table autodetect row count

The following configuration sets how many rows of a file are loaded to infer the table’s column types. The default is 1000 rows.


or by updating Airflow’s configuration

load_table_autodetect_rows_count = 1000

Configuring the RAW SQL maximum response size

Reduce the size of the responses returned by aql.run_raw_sql to avoid overloading the Airflow DB when BaseXCom is used.


or by updating Airflow’s configuration

run_raw_sql_response_size = 1

Configuring the Dataset inlets/outlets

The Astro SDK automatically adds inlets and outlets for all operators if Datasets are supported (Airflow >= 2.4).

Users can override this at the task level by adding inlets and outlets, but that can be inconvenient for those who do not want to use data-aware scheduling. Such users can set the following config to False to disable the automatic addition of inlets and outlets.


or by updating Airflow’s configuration

auto_add_inlets_outlets = True

Configuring the emission of temporary table events in OpenLineage

The Astro SDK can create temporary tables; see Tables.

By default, the SDK emits events for temporary tables in OpenLineage.

Users who do not want these events emitted in OpenLineage can set the following config to False to disable them.


or by updating Airflow’s configuration

openlineage_emit_temp_table_event = True

Configuring the native fallback mechanism

The LoadFileOperator has a fallback mechanism when loading data to the database from file storage as explained in How load_file Works.

This fallback can be configured at the task level using the enable_native_fallback parameter.

Users can also override the default at a global level (for all tasks) with the following config. Set it to True to allow falling back to the “pandas” path.


or by updating Airflow’s configuration

load_file_enable_native_fallback = False
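The effect of this flag can be pictured with a minimal sketch. It is not the LoadFileOperator’s code: load_with_fallback and the two loader callables are hypothetical stand-ins for the native and “pandas” paths, and catching every exception is a simplification.

```python
from typing import Callable


def load_with_fallback(
    native_load: Callable[[], str],
    pandas_load: Callable[[], str],
    enable_native_fallback: bool,
) -> str:
    """Try the native path first; use the pandas path only if allowed."""
    try:
        return native_load()
    except Exception:
        if not enable_native_fallback:
            # Fallback disabled: surface the native error to the caller.
            raise
        return pandas_load()


def failing_native() -> str:
    raise RuntimeError("native load failed")


def pandas_path() -> str:
    return "loaded via pandas"


print(load_with_fallback(failing_native, pandas_path, enable_native_fallback=True))
# loaded via pandas
```

With enable_native_fallback=False, the same call re-raises the RuntimeError instead of switching paths.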

Configuring the max memory limit for a Dataframe to be stored in XCom table

If you are using Astro SDK with Airflow >= 2.5, you no longer need to use pickling or a Custom XCom backend to store Astro SDK’s dataset class or dataframes. Airflow will take care of serializing and deserializing them if you have set the following:


or by updating airflow.cfg

allowed_deserialization_classes = airflow\.* astro\.*

Dataframes generated by dataframe, transform, and other functions or operators where you don’t pass output_table are stored in the XCom table if you are not using a custom XCom backend.

Because the dataframe is defined by the user, a very large one can exhaust the available resources and break Airflow’s metadata DB.

The SDK therefore limits the amount of data (in kilobytes) stored in that table. This is controlled by the following setting:


or by updating airflow.cfg

max_dataframe_mem_for_xcom_db = 100

The value is expressed in kilobytes; the default limit is 100 KB. A dataframe smaller than the limit is stored in the XCom table. A larger one is stored in the object store defined by xcom_storage_conn_id and xcom_storage_url, as shown below:

xcom_storage_conn_id = gcp_conn_id
xcom_storage_url = gs://astro_sdk/temp
max_dataframe_mem_for_xcom_db = 100


AIRFLOW__ASTRO_SDK__XCOM_STORAGE_URL = gs://astro_sdk/temp

If all of Airflow’s components run on a single machine, xcom_storage_url defaults to the temp directory on the host, and you can omit xcom_storage_conn_id.
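The size rule for max_dataframe_mem_for_xcom_db can be sketched as a simple threshold check. This is an illustration under stated assumptions, not the SDK’s logic: xcom_destination is a hypothetical helper, one kilobyte is taken as 1024 bytes, and a dataframe exactly at the limit is sent to the object store because the text only defines the “less than” and “greater than” cases.

```python
def xcom_destination(dataframe_bytes: int, max_dataframe_mem_for_xcom_db: int = 100) -> str:
    """Decide where a dataframe of the given size would be stored."""
    # Assumption: the limit is in kilobytes of 1024 bytes each.
    limit_bytes = max_dataframe_mem_for_xcom_db * 1024
    if dataframe_bytes < limit_bytes:
        return "xcom_table"
    # Assumption: sizes equal to the limit also go to the object store.
    return "object_store"


print(xcom_destination(50 * 1024))   # xcom_table
print(xcom_destination(500 * 1024))  # object_store
```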