astro.files.base

Module Contents

Classes

File

Handle all file operations, and abstract away the details related to location and file types.

Functions

resolve_file_path_pattern(path_pattern[, conn_id, ...])

Get file objects by resolving path_pattern from local or object stores.

class astro.files.base.File(path, conn_id=None, filetype=None, normalize_config=None)

Handle all file operations, and abstract away the details related to location and file types. Intended to be used within the library.

Parameters
  • path (str) –

  • conn_id (str | None) –

  • filetype (constants.FileType | None) –

  • normalize_config (dict | None) –

template_fields = ['location']
property path
Return type

str

property conn_id
Return type

str | None

property size

Return the size in bytes of the given file.

Returns

File size in bytes

Return type

int

is_binary()

Return whether the file is in a binary format. Uses a naive strategy based on the file extension.

Returns

True or False

Return type

bool
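The extension-based strategy mentioned above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `guess_is_binary` and `BINARY_EXTENSIONS` are hypothetical names, and the set of extensions treated as binary is an assumption.

```python
from pathlib import Path

# Hypothetical set of extensions treated as binary; not the library's actual list.
BINARY_EXTENSIONS = {".parquet"}


def guess_is_binary(path: str) -> bool:
    """Return True if the path's extension is in BINARY_EXTENSIONS."""
    # Only the final suffix is inspected; the file contents are never read.
    return Path(path).suffix.lower() in BINARY_EXTENSIONS
```

Because only the name is checked, a mislabeled file (for example, a Parquet file saved with a .csv extension) would be misclassified, which is why the strategy is described as naive.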

create_from_dataframe(df)

Create a file in the desired location using the values of a dataframe.

Parameters

df (pandas.DataFrame) – pandas dataframe

Return type

None

export_to_dataframe(**kwargs)

Read the file from any supported location and convert it into a dataframe.

Due to noted issues with using smart_open with pandas (like https://github.com/RaRe-Technologies/smart_open/issues/524), we create a BytesIO or StringIO buffer before exporting to a dataframe. We’ve found a sizable speed improvement with this optimization.

Return type

pandas.DataFrame
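The buffering idea described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `read_csv_via_buffer` is a hypothetical helper, an in-memory `BytesIO` object stands in for the remote stream, and the standard-library `csv` module is swapped in for pandas to keep the sketch dependency-free.

```python
import csv
import io


def read_csv_via_buffer(stream) -> list[dict]:
    """Drain a binary stream into memory, then parse CSV rows from the buffer."""
    # Read everything once so the parser never touches the original stream.
    buffer = io.BytesIO(stream.read())
    # Wrap the bytes buffer as text for the csv module.
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    return list(csv.DictReader(text))


# A BytesIO object plays the role of the remote stream in this sketch.
remote = io.BytesIO(b"id,name\n1,ada\n2,grace\n")
rows = read_csv_via_buffer(remote)
```

With pandas, the same in-memory buffer could be passed to `pandas.read_csv`, which accepts file-like objects.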

exists()

Check whether the file exists.

Return type

bool

astro.files.base.resolve_file_path_pattern(path_pattern, conn_id=None, filetype=None, normalize_config=None)

Get file objects by resolving path_pattern from local or object stores. path_pattern can be:
  1. a local location, given as a glob pattern
  2. an S3/GCS location, given as a prefix

Parameters
  • path_pattern (str) – path or pattern of a file in the filesystem or an object store; supports glob patterns for local paths and prefix patterns for object stores

  • conn_id (str | None) – Airflow connection ID

  • filetype (constants.FileType | None) – constant to provide an explicit file type

  • normalize_config (dict | None) – parameters for the pandas json_normalize() function, in dict format

Return type

list[File]
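The two resolution modes (glob for local paths, prefix for object stores) can be sketched as follows. This is a simplified illustration, not the library's implementation: `resolve_local` and `resolve_prefix` are hypothetical helpers, and the object store is modeled as a plain list of keys rather than a real S3/GCS client.

```python
import glob
import os
import tempfile


def resolve_local(path_pattern: str) -> list[str]:
    """Expand a glob pattern against the local filesystem."""
    return sorted(glob.glob(path_pattern))


def resolve_prefix(keys: list[str], prefix: str) -> list[str]:
    """Keep the object keys that start with the given prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


# Local case: create a few empty files and match only the CSVs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    matches = resolve_local(os.path.join(tmp, "*.csv"))
    local_matches = [os.path.basename(p) for p in matches]

# Object-store case: the key list is made up for illustration.
prefix_matches = resolve_prefix(
    ["raw/2023/a.csv", "raw/2024/b.csv", "tmp/c.csv"], "raw/"
)
```

In the real function, each resolved path would presumably be wrapped in a File object, which is what the documented return type list[File] indicates.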