Module preprocessor.abstract

The abstract classes in this module define the interfaces used by the concrete classes in this package and by custom preprocessors.

Classes

class IsDict

Abstract base for any object that can be converted into and out of a dict. If schema validation is possible for the derived type, the schema definition should be in a class variable named SCHEMA.

Ancestors

  • abc.ABC

Subclasses

Static methods

def from_dict(input: dict) -> Any

Converts a Python dict into the type.

Args

input : dict
The output of a deserializer such as json.load().

Returns

Any
The specific type which implemented this interface.

Methods

def to_dict(self) -> dict

Converts the type into a Python dict for serialization.

Returns

dict
Python dict which can be passed to serializers like json.dump()
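A minimal sketch of how a derived type might satisfy this interface. The IsDict class is re-declared locally as a stand-in so the example runs on its own; Column, its fields, and the layout of SCHEMA are hypothetical and not part of the package.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class IsDict(ABC):
    """Local stand-in mirroring the documented interface (assumption)."""

    @staticmethod
    @abstractmethod
    def from_dict(input: dict) -> Any: ...

    @abstractmethod
    def to_dict(self) -> dict: ...


class Column(IsDict):
    """Hypothetical derived type used only for illustration."""

    # Hypothetical schema; the format a real validator expects may differ.
    SCHEMA = {"type": "object", "required": ["name", "dtype"]}

    def __init__(self, name: str, dtype: str) -> None:
        self.name = name
        self.dtype = dtype

    @staticmethod
    def from_dict(input: dict) -> "Column":
        # Build the object from plain dict data, e.g. parsed JSON.
        return Column(name=input["name"], dtype=input["dtype"])

    def to_dict(self) -> dict:
        # Only JSON-friendly values, so json.dump()/json.dumps() accept it.
        return {"name": self.name, "dtype": self.dtype}


# Round trip: object -> dict -> JSON text -> dict -> object.
text = json.dumps(Column("age", "int64").to_dict())
restored = Column.from_dict(json.loads(text))
```

Declaring from_dict as a static method means callers can rebuild an object without already holding an instance.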

class NumpyPreprocessor

Abstract base for any preprocessor that can generate a numpy.ndarray.

Ancestors

  • abc.ABC

Subclasses

Methods

def read_asset(self, asset: str | pathlib.Path | Package) -> numpy.ndarray

Reads a preprocessor.package file and coerces it into an ndarray.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.

Returns

np.ndarray
The content of the package file as an ndarray.

def read_asset_chunked(self, asset: str | pathlib.Path | Package, chunksize: int) -> Iterator[numpy.ndarray]

Reads a tabular preprocessor.package file in chunks and coerces each chunk into an ndarray.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.
chunksize : int
The number of tabular rows to include in each chunk.

Yields

Iterator[np.ndarray]
An ndarray with at most the number of rows defined by the chunk size.

def read_bytes(self, data: _io.BytesIO) -> numpy.ndarray

Reads a BytesIO source and coerces it into an ndarray.

Args

data : io.BytesIO
The BytesIO source containing the input data bytes.

Returns

np.ndarray
The byte content as an ndarray.

def read_file(self, path: str | pathlib.Path | Package) -> numpy.ndarray

Reads a file from disk and coerces it into an ndarray.

Args

path : Union[str, Path, Package]
Reference to the file on disk.

Returns

np.ndarray
The content of the file as an ndarray.

def read_folder(self, pattern: str) -> Iterator[numpy.ndarray]

Loads files from a folder based on a glob pattern.

Args

pattern : str
A glob pattern used to select files to read, e.g. "data/sample.*".

Yields

Iterator[np.ndarray]
Data represented as an ndarray.
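The read methods above could be implemented along the following lines. This is a sketch under stated assumptions: CsvNumpyReader is a hypothetical class that mirrors part of the documented interface without inheriting from the package's base, it assumes comma-separated numeric input, and it does not handle Package inputs.

```python
import glob
import io
import tempfile
from pathlib import Path
from typing import Iterator, Union

import numpy as np


class CsvNumpyReader:
    """Hypothetical reader mirroring part of the documented interface."""

    def read_bytes(self, data: io.BytesIO) -> np.ndarray:
        # Decode the bytes, then parse comma-separated numeric rows.
        # ndmin=2 keeps a 2D shape even for a single-row source.
        text = data.read().decode("utf-8")
        return np.loadtxt(io.StringIO(text), delimiter=",", ndmin=2)

    def read_file(self, path: Union[str, Path]) -> np.ndarray:
        with open(path, "rb") as handle:
            return self.read_bytes(io.BytesIO(handle.read()))

    def read_folder(self, pattern: str) -> Iterator[np.ndarray]:
        # sorted() makes the iteration order deterministic.
        for name in sorted(glob.glob(pattern)):
            yield self.read_file(name)

    def read_asset_chunked(
        self, asset: Union[str, Path], chunksize: int
    ) -> Iterator[np.ndarray]:
        # For brevity this loads the whole file and slices it; a real
        # implementation would more likely stream rows from disk.
        table = self.read_file(asset)
        for start in range(0, len(table), chunksize):
            yield table[start:start + chunksize]


# Usage with throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for index in range(2):
        Path(folder, f"sample_{index}.csv").write_text("1,2\n3,4\n5,6\n")
    reader = CsvNumpyReader()
    arrays = list(reader.read_folder(str(Path(folder, "sample_*.csv"))))
    chunks = list(
        reader.read_asset_chunked(Path(folder, "sample_0.csv"), chunksize=2)
    )
```

Because read_folder is a generator, files are only read as the caller iterates, which keeps memory use bounded by the largest single file.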

class NumpyTargetPreprocessor

Abstract base for any preprocessor that can generate a NumPy ndarray together with a target array.

Ancestors

  • abc.ABC

Subclasses

Methods

def read_asset(self, asset: str | pathlib.Path | Package) -> Tuple[numpy.ndarray, numpy.ndarray | None]

Reads a Package file from disk and coerces it into a tuple of ndarrays.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.

Returns

Tuple[np.ndarray, Optional[np.ndarray]]
The content of the file as a tuple, where the first value is a 2D features array and the second is a 1D target array.

def read_asset_chunked(self, asset: str | pathlib.Path | Package, chunksize: int) -> Iterator[Tuple[numpy.ndarray, numpy.ndarray | None]]

Reads a large Package file from disk in chunks and coerces each chunk into a tuple of ndarrays.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.
chunksize : int
The number of tabular rows that will be included in each chunk.

Yields

Iterator[Tuple[np.ndarray, Optional[np.ndarray]]]
A tuple of numpy arrays, each of which has at most the number of rows defined by the chunk size.

def read_bytes(self, data: _io.BytesIO) -> Tuple[numpy.ndarray, numpy.ndarray | None]

Reads from a BytesIO source and coerces it into a tuple of ndarrays.

Args

data : io.BytesIO
The BytesIO source containing the input data bytes.

Returns

Tuple[np.ndarray, Optional[np.ndarray]]
A tuple where the first value is a 2D features array and the second is a 1D target array.

def read_file(self, path: str | pathlib.Path | Package) -> Tuple[numpy.ndarray, numpy.ndarray | None]

Loads a file from disk and coerces it into a tuple of ndarrays.

Args

path : Union[str, Path, Package]
Reference to the file on disk.

Returns

Tuple[np.ndarray, Optional[np.ndarray]]
The content of the file as a tuple, where the first value is a 2D features array and the second is a 1D target array.

def read_folder(self, pattern: str) -> Iterator[Tuple[numpy.ndarray, numpy.ndarray | None]]

Creates a generator that reads each file in a folder matching a glob pattern and coerces it into a tuple of ndarrays.

Args

pattern : str
A glob pattern used to select files to read, e.g. "data/sample.*".

Yields

Tuple[np.ndarray, Optional[np.ndarray]]
Where the first value is a 2D features array, and the second is a 1D target array.
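The (features, target) return shape used throughout this class can be illustrated with a small helper. This is only a sketch: split_features_target and its target_column parameter are hypothetical, and how a concrete preprocessor decides which column is the target is not specified here.

```python
from typing import Optional, Tuple

import numpy as np


def split_features_target(
    table: np.ndarray, target_column: Optional[int] = -1
) -> Tuple[np.ndarray, Optional[np.ndarray]]:
    """Split a 2D table into (features, target).

    Hypothetical helper: target_column is an illustrative parameter, and
    passing None models input that carries no target.
    """
    if target_column is None:
        return table, None
    target = table[:, target_column]                    # 1D target array
    features = np.delete(table, target_column, axis=1)  # 2D features array
    return features, target


table = np.array([[1.0, 2.0, 0.0], [3.0, 4.0, 1.0]])
features, target = split_features_target(table)
unlabeled, missing = split_features_target(table, target_column=None)
```

Callers should check the second element for None before using it, since the return type allows a missing target.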

class OutputNumpy

Abstract base for a preprocessor that can output data as a numpy.ndarray.

Ancestors

Subclasses

Methods

def output_numpy(self) -> NumpyPreprocessor

Completes the builder by returning a constructed NumpyPreprocessor.

Returns

A NumpyPreprocessor
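The builder step described above might look like the following sketch. Both base classes are re-declared locally as stand-ins so the example is self-contained; CsvBuilder, CsvPreprocessor, and the delimiter option are hypothetical, and the chained configuration style is an assumption rather than something this reference confirms.

```python
from abc import ABC, abstractmethod


class NumpyPreprocessor(ABC):
    """Local stand-in for the documented abstract base (assumption)."""


class OutputNumpy(ABC):
    """Local stand-in mirroring the documented builder interface."""

    @abstractmethod
    def output_numpy(self) -> NumpyPreprocessor: ...


class CsvPreprocessor(NumpyPreprocessor):
    """Hypothetical product of the builder."""

    def __init__(self, delimiter: str) -> None:
        self.delimiter = delimiter


class CsvBuilder(OutputNumpy):
    """Hypothetical builder; delimiter is an illustrative option."""

    def __init__(self) -> None:
        self._delimiter = ","

    def delimiter(self, value: str) -> "CsvBuilder":
        # Returning self lets configuration calls be chained.
        self._delimiter = value
        return self

    def output_numpy(self) -> NumpyPreprocessor:
        # Terminal step: freeze the collected options into a preprocessor.
        return CsvPreprocessor(self._delimiter)


preprocessor = CsvBuilder().delimiter(";").output_numpy()
```

Separating the builder from the product means a single builder could also implement the other output interfaces and offer several terminal methods.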

class OutputNumpyTarget

Abstract base for a preprocessor that can output data as a numpy.ndarray with a target.

Ancestors

Subclasses

Methods

def output_numpy_target(self) -> NumpyTargetPreprocessor

Completes the builder by returning a constructed NumpyTargetPreprocessor.

Returns

A NumpyTargetPreprocessor

class OutputPandas

Abstract base for a preprocessor that can output data as a pandas.DataFrame.

Ancestors

  • abc.ABC

Subclasses

Methods

def output_pandas(self) -> PandasPreprocessor

Completes the builder by returning a constructed PandasPreprocessor.

Returns

A PandasPreprocessor

class OutputTorchDataset

Abstract base for a preprocessor that can output data as a torch.utils.data.Dataset.

Ancestors

Subclasses

Methods

def output_torch_dataset(self) -> TorchDatasetPreprocessor

Completes the builder by returning a constructed TorchDatasetPreprocessor.

Returns

A TorchDatasetPreprocessor

class PandasPreprocessor

Abstract base for any preprocessor that results in a pandas.DataFrame.

Ancestors

  • abc.ABC

Subclasses

Methods

def read_asset(self, asset: str | pathlib.Path | Package) -> pandas.core.frame.DataFrame

Reads a preprocessor.package file from disk and coerces it into a pandas.DataFrame.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.

Returns

The content of the package file as a pandas.DataFrame.

def read_asset_chunked(self, asset: str | pathlib.Path | Package, chunksize: int) -> Iterator[pandas.core.frame.DataFrame]

Reads a Package file from disk in chunks and coerces each chunk into a pandas.DataFrame.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.
chunksize : int
The number of tabular rows that will be included in each chunk.

Returns

An iterator of pandas.DataFrames, each of which has at most the number of rows defined by the chunk size.
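Chunked reading of this kind can be sketched with the chunksize option of pandas.read_csv. The helper name read_csv_chunked and the CSV input format are assumptions for illustration; the package's own readers may handle other formats.

```python
import io
from typing import Iterator

import pandas as pd


def read_csv_chunked(source, chunksize: int) -> Iterator[pd.DataFrame]:
    """Yield DataFrames of at most chunksize rows (hypothetical helper)."""
    # With chunksize set, pandas.read_csv returns an iterator of DataFrames
    # instead of loading the whole table at once.
    yield from pd.read_csv(source, chunksize=chunksize)


# Five data rows split into chunks of two: the last chunk is shorter.
source = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")
sizes = [len(chunk) for chunk in read_csv_chunked(source, chunksize=2)]
```

The final chunk holds whatever rows remain, which is why the documented guarantee is "at most" the chunk size.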

def read_bytes(self, data: _io.BytesIO) -> pandas.core.frame.DataFrame

Reads a BytesIO source and coerces it into a pandas.DataFrame.

Args

data : io.BytesIO
The BytesIO source containing the input data bytes.

Returns

The content of the file as a pandas.DataFrame.

def read_file(self, asset: str | pathlib.Path | Package) -> pandas.core.frame.DataFrame

Reads a file from disk and coerces it into a pandas.DataFrame.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.

Returns

The content of the file as a pandas.DataFrame.

def read_folder(self, pattern: str) -> Iterator[pandas.core.frame.DataFrame]

Creates a generator which loads files from a folder based on a glob pattern.

Args

pattern : str
A glob pattern used to select files to read, e.g. "data/sample.*".

Yields

Iterator[pd.DataFrame]
Data represented as a pandas.DataFrame.

class RequirePropertyDtype

Helper class that provides a standard way to create an ABC using inheritance.

Ancestors

  • abc.ABC

Subclasses

Methods

def dtype(self, dtype: str | None)

Casts an output numpy array to a given dtype.

If unset, the Protocol will choose. Ignored for non-numpy outputs.

Args

dtype : Optional[str]
The dtype that a numpy output will be cast into. See NumPy docs for more detail on possible types.
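The effect of the dtype setting can be sketched with numpy.ndarray.astype. The helper apply_dtype is hypothetical, and returning the array unchanged when dtype is None merely stands in for "the Protocol will choose"; the actual default behaviour is not specified here.

```python
from typing import Optional

import numpy as np


def apply_dtype(array: np.ndarray, dtype: Optional[str]) -> np.ndarray:
    """Cast array when a dtype is given; otherwise leave it unchanged.

    Hypothetical helper used only for illustration.
    """
    if dtype is None:
        return array
    return array.astype(dtype)


raw = np.array([[1.7, 2.2], [3.9, 4.0]])
as_float32 = apply_dtype(raw, "float32")
as_int = apply_dtype(raw, "int64")  # float-to-int casts truncate toward zero
unchanged = apply_dtype(raw, None)
```

Casting to an integer dtype discards the fractional part rather than rounding, so the dtype should be chosen with the data's range and precision in mind.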

class TorchDatasetPreprocessor

Abstract base for any preprocessor that results in a PyTorch Dataset.

Ancestors

  • abc.ABC

Subclasses

Methods

def read_file(self, asset: str | pathlib.Path | Package) -> torch.utils.data.dataset.Dataset

Reads a file from disk and coerces it into a torch.utils.data.Dataset.

Args

asset : Union[str, Path, Package]
File (or Package) pointing to the asset on disk.

Returns

torch.utils.data.Dataset
A PyTorch Dataset