astro.files.base
Module Contents
Classes
File | Handle all file operations, and abstract away the details related to location and file types.
Functions
resolve_file_path_pattern | Get file objects by resolving path_pattern from local/object stores.
- class astro.files.base.File(context=None)
Bases: airflow.utils.log.logging_mixin.LoggingMixin, astro.airflow.datasets.Dataset
Handle all file operations, and abstract away the details related to location and file types. Intended for use within the library.
- Parameters:
path – Path to a file in the filesystem/Object stores
conn_id – Airflow connection ID
filetype – constant to provide an explicit file type
normalize_config – parameters for the pandas json_normalize() function, in dict format.
- property load_options
- Getter for all the load_options. load_options is a container for the custom options passed by the user for third-party integrations such as pandas, Azure, etc.
- property location: astro.files.locations.base.BaseFileLocation
- Return type:
astro.files.locations.base.BaseFileLocation
- property load_options_list
- property type: astro.files.types.FileType
- Return type:
astro.files.types.FileType
- property size: int
Return the size in bytes of the given file.
- Returns:
File size in bytes
- Return type:
int
- property openlineage_dataset_namespace: str
Returns the OpenLineage dataset namespace as per https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
- Return type:
str
- property openlineage_dataset_name: str
Returns the OpenLineage dataset name as per https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
- Return type:
str
- path: str
- conn_id: str | None
- filetype: astro.constants.FileType | None
- normalize_config: dict | None
- is_dataframe: bool = False
- is_bytes: bool = False
- uri: str
- extra: dict | None
- template_fields = ('path', 'conn_id')
- is_binary()
Return whether the file is in a binary format, given the file path. Uses a naive strategy based on the file extension.
- Returns:
True or False
- Return type:
bool
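The extension-based strategy described for is_binary() can be illustrated with a minimal sketch. The function name looks_binary and the TEXT_EXTENSIONS set are hypothetical and chosen for illustration; they are not taken from the library, whose actual list of text formats may differ.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions treated as text; not taken from the library.
TEXT_EXTENSIONS = {".csv", ".json", ".ndjson"}


def looks_binary(path: str) -> bool:
    """Guess from the extension alone whether a file holds binary data."""
    # Only the final suffix is inspected; the file content is never read.
    suffix = PurePosixPath(path).suffix.lower()
    # Anything not known to be text (including no extension) counts as binary.
    return suffix not in TEXT_EXTENSIONS
```

Because only the name is inspected, a mislabeled file would be misclassified, which is the trade-off of a naive strategy.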
- is_local()
Return a boolean showing whether this file is stored locally or in cloud storage.
- Returns:
A boolean for whether the file is local
- Return type:
bool
- is_pattern()
Returns True when the file path is a pattern (e.g. s3://bucket/folder or /folder/sample_*).
- Returns:
True or False
- Return type:
bool
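One plausible way to tell a pattern from a single file path is sketched below. This is an assumed heuristic, not the library's implementation: the helper looks_like_pattern is hypothetical, and it treats wildcards, a trailing slash, or a final segment without an extension as signs of a pattern, which matches the two examples given for is_pattern().

```python
# Characters that mark a glob-style pattern (assumption for this sketch).
GLOB_CHARS = set("*?[")


def looks_like_pattern(path: str) -> bool:
    """Heuristically decide whether a path names many files rather than one."""
    # Explicit wildcards always indicate a pattern.
    if any(ch in GLOB_CHARS for ch in path):
        return True
    # A trailing slash or a last segment without an extension is read as a
    # folder or prefix, which may match several objects.
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return path.endswith("/") or "." not in last_segment
```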
- create_from_dataframe(df, store_as_dataframe=True)
Create a file in the desired location using the values of a dataframe.
- Parameters:
store_as_dataframe (bool) – Whether the data should later be deserialized as a dataframe or as a file containing delimited data (e.g. csv, parquet, etc.).
df (pandas.DataFrame) – pandas dataframe
- Return type:
None
- is_directory()
- Returns:
A boolean representing whether this path is a directory or not.
- Return type:
bool
- export_to_dataframe(**kwargs)
Read files from all supported locations and convert them into dataframes.
- Return type:
pandas.DataFrame
- export_to_dataframe_via_byte_stream(**kwargs)
Read files from all supported locations and convert them into dataframes. Due to noted issues with using smart_open with pandas (like https://github.com/RaRe-Technologies/smart_open/issues/524), we create a BytesIO or StringIO buffer before exporting to a dataframe. We’ve found a sizable speed improvement with this optimization.
- Return type:
pandas.DataFrame
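The buffering idea described for export_to_dataframe_via_byte_stream() can be sketched as follows: the stream is copied into an in-memory buffer once, and the parser then works on that buffer instead of the original stream. To stay dependency-free, this sketch parses with the standard csv module; with pandas, the same buffer could be passed to pandas.read_csv instead. The function read_rows_via_buffer and the sample data are hypothetical.

```python
import csv
import io
from typing import BinaryIO


def read_rows_via_buffer(stream: BinaryIO) -> list[dict[str, str]]:
    """Copy a byte stream into memory, then parse it as CSV."""
    # Read everything in one pass so the parser never touches the slow stream.
    buffer = io.BytesIO(stream.read())
    # Decode the buffered bytes as UTF-8 text for the csv module.
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    return list(csv.DictReader(text))


# An in-memory stream stands in for a remote file handle here.
remote = io.BytesIO(b"id,name\n1,alpha\n2,beta\n")
rows = read_rows_via_buffer(remote)
```

The cost of this approach is memory: the whole file is held in the buffer, so it suits files that fit comfortably in RAM.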
- exists()
Check whether the file exists.
- Return type:
bool
- to_json()
- classmethod from_json(serialized_object)
- Parameters:
serialized_object (dict) –
- astro.files.base.resolve_file_path_pattern(path_pattern, conn_id=None, filetype=None, normalize_config=None, load_options=None)
Get file objects by resolving path_pattern from local/object stores. path_pattern can be:
1. local location - glob pattern
2. s3/gcs location - prefix
- Parameters:
path_pattern (str) – path/pattern to a file in the filesystem/Object stores, supports glob and prefix pattern for object stores
conn_id (str | None) – Airflow connection ID
filetype (astro.constants.FileType | None) – constant to provide an explicit file type
normalize_config (dict | None) – parameters for the pandas json_normalize() function, in dict format
load_options (list[astro.options.LoadOptions] | None) –
- Return type:
list[File]
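The two resolution modes that resolve_file_path_pattern distinguishes, glob patterns for local paths and prefixes for object stores, can be illustrated with a standard-library sketch. The helpers resolve_local and resolve_prefix are hypothetical and return plain strings rather than File objects; a plain list of keys stands in for an object-store listing, so no cloud client is assumed.

```python
import glob
import os
import tempfile


def resolve_local(path_pattern: str) -> list[str]:
    """Expand a glob pattern against the local filesystem."""
    return sorted(glob.glob(path_pattern))


def resolve_prefix(keys: list[str], prefix: str) -> list[str]:
    """Keep the object keys that start with the given prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


# Local case: create a throwaway directory and match files by glob.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_1.csv", "sample_2.csv", "other.csv"):
        open(os.path.join(tmp, name), "w").close()
    pattern = os.path.join(tmp, "sample_*")
    matched = [os.path.basename(p) for p in resolve_local(pattern)]

# Object-store case: filter a listing of keys by prefix.
keys = ["folder/a.csv", "folder/b.csv", "archive/c.csv"]
under_folder = resolve_prefix(keys, "folder/")
```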