astro.files.base

Module Contents

Classes

File

Handle all file operations, and abstract away the details related to location and file types.

Functions

resolve_file_path_pattern(path_pattern[, conn_id, ...])

Get file objects by resolving path_pattern from local or object stores.

class astro.files.base.File(context=None)

Bases: airflow.utils.log.logging_mixin.LoggingMixin, astro.airflow.datasets.Dataset

Handle all file operations, and abstract away the details related to location and file types. Intended to be used within the library.

Parameters:
  • path – Path to a file in the filesystem/Object stores

  • conn_id – Airflow connection ID

  • filetype – constant to provide an explicit file type

  • normalize_config – parameters in dict format of pandas json_normalize() function.

property load_options
Getter for all the load_options. load_options is a container for the custom options passed by the user for third-party integrations such as pandas, Azure, etc.

property location: astro.files.locations.base.BaseFileLocation
Return type:

astro.files.locations.base.BaseFileLocation

property load_options_list
property type: astro.files.types.FileType
Return type:

astro.files.types.FileType

property size: int

Return the size in bytes of the given file.

Returns:

File size in bytes

Return type:

int

property openlineage_dataset_namespace: str

Returns the open lineage dataset namespace as per https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

Return type:

str

property openlineage_dataset_name: str

Returns the open lineage dataset name as per https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

Return type:

str

path :str
conn_id :str | None
filetype :constants.FileType | None
normalize_config :dict | None
is_dataframe :bool = False
is_bytes :bool = False
uri :str
extra :dict | None
template_fields = ['path', 'conn_id']
is_binary()

Return whether the file is binary, given the filepath. Uses a naive strategy based on the file extension.

Returns:

True or False

Return type:

bool

is_local()

Return a boolean showing whether this file is stored locally or in cloud storage.

Returns:

A boolean for whether the file is local

Return type:

bool
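
One way to picture the local versus cloud distinction is to look at the URI scheme of the path. The sketch below is a standalone illustration only: the helper `is_local_path` and the rule "no scheme or `file` means local" are assumptions, not the library's implementation.

```python
from urllib.parse import urlparse

# Assumption: paths with no scheme, or with the "file" scheme, count as local.
LOCAL_SCHEMES = {"", "file"}


def is_local_path(path: str) -> bool:
    """Hypothetical helper: classify a path as local by its URI scheme."""
    return urlparse(path).scheme in LOCAL_SCHEMES


checks = {
    "/tmp/data.csv": is_local_path("/tmp/data.csv"),
    "file:///tmp/data.csv": is_local_path("file:///tmp/data.csv"),
    "s3://bucket/data.csv": is_local_path("s3://bucket/data.csv"),
    "gs://bucket/data.csv": is_local_path("gs://bucket/data.csv"),
}
```

This simple rule does not handle Windows drive letters such as `C:\data.csv`, which `urlparse` would read as a scheme.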

is_pattern()

Returns True when the file path is a pattern (e.g. s3://bucket/folder or /folder/sample_*).

Returns:

True or False

Return type:

bool
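
To make the two example patterns concrete, here is a minimal standalone sketch that treats a path as a pattern when it contains a glob wildcard or has no file extension (so it looks like a folder or prefix). The helper `looks_like_pattern` and both rules are hypothetical and may differ from what `File.is_pattern()` actually checks.

```python
from pathlib import PurePosixPath

# Assumption: these characters mark a glob wildcard in a path.
GLOB_CHARS = set("*?[")


def looks_like_pattern(path: str) -> bool:
    """Hypothetical helper: guess whether a path names many files."""
    if any(ch in GLOB_CHARS for ch in path):
        return True
    # No extension on the last path component: treat it as a folder/prefix.
    return PurePosixPath(path).suffix == ""


prefix_result = looks_like_pattern("s3://bucket/folder")
glob_result = looks_like_pattern("/folder/sample_*")
single_file_result = looks_like_pattern("/folder/sample.csv")
```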

create_from_dataframe(df, store_as_dataframe=True)

Create a file in the desired location using the values of a dataframe.

Parameters:
  • store_as_dataframe (bool) – Whether the data should later be deserialized as a dataframe or as a file containing delimited data (e.g. csv, parquet, etc.).

  • df (pandas.DataFrame) – pandas dataframe

Return type:

None

is_directory()
Returns:

A boolean representing whether this path is a directory or not.

Return type:

bool

export_to_dataframe(**kwargs)

Read files from all supported locations and convert them into dataframes.

Return type:

pandas.DataFrame

export_to_dataframe_via_byte_stream(**kwargs)

Read files from all supported locations and convert them into dataframes. Due to noted issues with using smart_open with pandas (like https://github.com/RaRe-Technologies/smart_open/issues/524), we create a BytesIO or StringIO buffer before exporting to a dataframe. We’ve found a sizable speed improvement with this optimization.

Return type:

pandas.DataFrame
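
The buffering idea described for `export_to_dataframe_via_byte_stream` can be sketched without any third-party dependency: copy the whole stream into an in-memory `BytesIO` first, then hand the buffer to the parser. In this illustration the standard-library `csv` module stands in for pandas, and `read_rows_via_byte_stream` is a hypothetical name; with pandas, the buffer would instead be passed to a reader such as `pandas.read_csv`.

```python
import csv
import io


def read_rows_via_byte_stream(stream) -> list:
    """Hypothetical helper: buffer a binary stream in memory, then parse it."""
    # One bulk read into memory, so the parser never touches the slow source.
    buffer = io.BytesIO(stream.read())
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    return list(csv.DictReader(text))


# A BytesIO object stands in here for a remote file handle.
remote_like = io.BytesIO(b"id,name\n1,ada\n2,grace\n")
rows = read_rows_via_byte_stream(remote_like)
```

The trade-off is memory: the whole file is held in the buffer, which suits files that fit comfortably in RAM.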

exists()

Check whether the file exists.

Return type:

bool

to_json()
classmethod from_json(serialized_object)
Parameters:

serialized_object (dict) –
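
The `to_json()` / `from_json()` pair suggests a dict-based round trip. The following sketch shows that general shape with a hypothetical `FileRecord` dataclass; the field names and the layout of the serialized dict are assumptions, not the format `File` actually produces.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a serializable file description."""

    path: str
    conn_id: str | None = None

    def to_json(self) -> dict:
        # Serialize to a plain dict of JSON-compatible values.
        return asdict(self)

    @classmethod
    def from_json(cls, serialized_object: dict) -> FileRecord:
        # Rebuild the object from the dict produced by to_json().
        return cls(**serialized_object)


original = FileRecord(path="s3://bucket/data.csv", conn_id="aws_default")
restored = FileRecord.from_json(original.to_json())
```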

astro.files.base.resolve_file_path_pattern(path_pattern, conn_id=None, filetype=None, normalize_config=None, load_options=None)

Get file objects by resolving path_pattern from local or object stores. path_pattern can be:

  1. local location - glob pattern
  2. S3/GCS location - prefix

Parameters:
  • path_pattern (str) – path/pattern to a file in the filesystem/Object stores, supports glob and prefix pattern for object stores

  • conn_id (str | None) – Airflow connection ID

  • filetype (constants.FileType | None) – constant to provide an explicit file type

  • normalize_config (dict | None) – parameters in dict format of pandas json_normalize() function

  • load_options (list[LoadOptions] | None) –

Return type:

list[File]
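
The two resolution modes (glob for local paths, prefix for object stores) can be illustrated with the standard library alone. This is a sketch under stated assumptions: `resolve_local_pattern` and `resolve_prefix` are hypothetical helpers that return plain strings, whereas `resolve_file_path_pattern` returns `File` objects and talks to real storage backends.

```python
import glob
import os
import tempfile


def resolve_local_pattern(path_pattern: str) -> list:
    """Hypothetical helper: expand a local glob pattern to matching files."""
    return sorted(p for p in glob.glob(path_pattern) if os.path.isfile(p))


def resolve_prefix(keys: list, prefix: str) -> list:
    """Hypothetical helper: keep object-store keys that start with a prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_a.csv", "sample_b.csv", "other.json"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("x\n")
    pattern = os.path.join(tmp, "sample_*")
    local_matches = [os.path.basename(p) for p in resolve_local_pattern(pattern)]

# An in-memory key list stands in for an object-store listing.
bucket_keys = ["folder/a.csv", "folder/b.csv", "elsewhere/c.csv"]
prefix_matches = resolve_prefix(bucket_keys, "folder/")
```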