Module preprocessor.tabular
Classes
class Column (name: str, aliases: List[str], target: bool)
-
Column(name: str, aliases: List[str], target: bool)
Ancestors
- IsDict
- abc.ABC
Class variables
var SCHEMA
var aliases : List[str]
var name : str
var target : bool
class TabularNumpyPreprocessor (columns: List[Column], all_columns: bool, sql_transform: str | None, python_transform: str | None, dtype: str | numpy.dtype | None, sk_data_transformers: List | None, sk_target_transformers: List | None, expand_input_dims: Tuple[int, int] | None, handle_nan: str | int | float | dict | None, substitutions: list | tuple | dict | None)
-
Preprocessor designed for tabular-style data tasks.
This family of Preprocessors is designed to manage tabular-style data, e.g. rows of data with named columns for fields. The sequence of operations is fixed to this order:
- SQL transform OR Python transform
- Column selection (via all_columns or an explicit column list)
- Value replacements
- Handling of NaN values
- Other transforms (OneHotEncoder, etc)
- Numpy type casting
This order is followed regardless of the order in which operations are specified. Hence, the following behave identically:
    (
        tb.TabularPreprocessor.builder()
        .add_column("target", target=True)
        .all_columns(True)
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
    )

    (
        tb.TabularPreprocessor.builder()
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
        .all_columns(True)
        .add_column("target", target=True)
    )
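The fixed ordering can be pictured with a small stand-alone sketch. SketchBuilder and STAGE_ORDER are hypothetical names used only for illustration; they are not part of this module, and the real builder may well be implemented differently. The sketch only shows why two chains that set the same options in a different order end up with the same plan.

```python
# Illustrative stand-in only: not the library's builder.
STAGE_ORDER = [
    "transform",       # SQL transform OR Python transform
    "select_columns",  # all_columns or an explicit column list
    "replace",         # value replacements
    "handle_nan",      # handling of NaN values
    "sk_transform",    # other transforms (OneHotEncoder, etc)
    "dtype",           # numpy type casting
]


class SketchBuilder:
    def __init__(self):
        self._stages = {}

    def _set(self, stage, value):
        self._stages[stage] = value
        return self  # returning self is what allows chaining

    def sql_transform(self, query):
        return self._set("transform", query)

    def all_columns(self, value=True):
        return self._set("select_columns", value)

    def handle_nan(self, method):
        return self._set("handle_nan", method)

    def dtype(self, dtype):
        return self._set("dtype", dtype)

    def plan(self):
        # Stages are emitted in STAGE_ORDER, whatever order they were set in.
        return [(s, self._stages[s]) for s in STAGE_ORDER if s in self._stages]


a = SketchBuilder().all_columns(True).sql_transform("SELECT * FROM data").dtype("float32")
b = SketchBuilder().dtype("float32").sql_transform("SELECT * FROM data").all_columns(True)
print(a.plan() == b.plan())
```

Storing each option under its stage name, rather than appending calls to a list, is what makes the call order irrelevant in this sketch.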
Ancestors
- TabularPreprocessor
- NumpyPreprocessor
- abc.ABC
Methods
def read_asset_generator(self, asset: str | pathlib.Path | Package, batch_size: int = 32) -> Iterator[numpy.ndarray]
class TabularNumpyTargetPreprocessor (columns: List[Column], all_columns: bool, sql_transform: str | None, python_transform: str | None, dtype: str | numpy.dtype | None, sk_data_transformers: List, sk_target_transformers: List, expand_input_dims: Tuple[int, int] | None, handle_nan: str | int | float | dict | None, substitutions: tuple | dict | None)
-
Preprocessor designed for tabular-style data tasks.
This family of Preprocessors is designed to manage tabular-style data, e.g. rows of data with named columns for fields. The sequence of operations is fixed to this order:
- SQL transform OR Python transform
- Column selection (via all_columns or an explicit column list)
- Value replacements
- Handling of NaN values
- Other transforms (OneHotEncoder, etc)
- Numpy type casting
This order is followed regardless of the order in which operations are specified. Hence, the following behave identically:
    (
        tb.TabularPreprocessor.builder()
        .add_column("target", target=True)
        .all_columns(True)
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
    )

    (
        tb.TabularPreprocessor.builder()
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
        .all_columns(True)
        .add_column("target", target=True)
    )
class TabularPandasPreprocessor (columns: List[Column], all_columns: bool, sql_transform: str | None, python_transform: str | None, sk_data_transformers: List, sk_target_transformers: List, handle_nan: str | int | float | dict | None, substitutions: list | tuple | dict | None)
-
Preprocessor designed for tabular-style data tasks.
This family of Preprocessors is designed to manage tabular-style data, e.g. rows of data with named columns for fields. The sequence of operations is fixed to this order:
- SQL transform OR Python transform
- Column selection (via all_columns or an explicit column list)
- Value replacements
- Handling of NaN values
- Other transforms (OneHotEncoder, etc)
- Numpy type casting
This order is followed regardless of the order in which operations are specified. Hence, the following behave identically:
    (
        tb.TabularPreprocessor.builder()
        .add_column("target", target=True)
        .all_columns(True)
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
    )

    (
        tb.TabularPreprocessor.builder()
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
        .all_columns(True)
        .add_column("target", target=True)
    )
class TabularPreprocessor (columns: List[Column], all_columns: bool, sql_transform: str | None, python_transform: str | None, sk_data_transformers: List, sk_target_transformers: List, handle_nan: str | int | float | dict | None, substitutions: list | tuple | dict | None)
-
Preprocessor designed for tabular-style data tasks.
This family of Preprocessors is designed to manage tabular-style data, e.g. rows of data with named columns for fields. The sequence of operations is fixed to this order:
- SQL transform OR Python transform
- Column selection (via all_columns or an explicit column list)
- Value replacements
- Handling of NaN values
- Other transforms (OneHotEncoder, etc)
- Numpy type casting
This order is followed regardless of the order in which operations are specified. Hence, the following behave identically:
    (
        tb.TabularPreprocessor.builder()
        .add_column("target", target=True)
        .all_columns(True)
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
    )

    (
        tb.TabularPreprocessor.builder()
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
        .all_columns(True)
        .add_column("target", target=True)
    )
Subclasses
- TabularNumpyPreprocessor
- TabularNumpyTargetPreprocessor
- TabularPandasPreprocessor
- TabularTorchPreprocessor
Static methods
def builder() -> TabularPreprocessorBuilder
Instance variables
var optional_target_column_name : str | None
-
Any defined target column name, or None if no target.
Raises
MultipleTargetColumns
- Multiple targets were defined
Returns
Optional[str]
- Any defined target column name, or None if no target
var target_column_name : str
-
The target column name
Raises
MultipleTargetColumns
- Multiple targets were defined
MissingTargetColumn
- No target was defined
Returns
str
- The string name of the target column
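The difference between optional_target_column_name and target_column_name can be sketched as follows. This is a minimal stand-alone illustration, not the library's code: the Column dataclass and the two exception classes below are local stand-ins that borrow the names used above, and the two properties are written as plain functions for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class MultipleTargetColumns(Exception):
    """Local stand-in for the exception named in the documentation."""


class MissingTargetColumn(Exception):
    """Local stand-in for the exception named in the documentation."""


@dataclass
class Column:
    name: str
    aliases: List[str] = field(default_factory=list)
    target: bool = False


def optional_target_column_name(columns: List[Column]) -> Optional[str]:
    # Zero targets is acceptable here; more than one is an error.
    targets = [c.name for c in columns if c.target]
    if len(targets) > 1:
        raise MultipleTargetColumns(targets)
    return targets[0] if targets else None


def target_column_name(columns: List[Column]) -> str:
    # Same lookup, but a missing target is also an error.
    name = optional_target_column_name(columns)
    if name is None:
        raise MissingTargetColumn()
    return name


columns = [Column("amt"), Column("label", target=True)]
print(target_column_name(columns))
```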
Methods
def dealias(self, alias: str) -> str
-
Convert an alias into the actual column name.
Args
alias
- The alias of a column name.
Returns
str
- The actual column name for the alias, or the unchanged string if no alias is found.
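A minimal sketch of the lookup described above, assuming aliases are held as a mapping from actual column name to its list of aliases. COLUMN_ALIASES and its contents are hypothetical sample data, and the function here is a free function rather than the method itself.

```python
from typing import Dict, List

# Hypothetical sample data: actual column name -> aliases for that column.
COLUMN_ALIASES: Dict[str, List[str]] = {
    "salary": ["sal", "pay"],
    "target": ["y"],
}


def dealias(alias: str) -> str:
    """Return the actual name for an alias, or the input unchanged."""
    for name, aliases in COLUMN_ALIASES.items():
        if alias in aliases:
            return name
    return alias


print(dealias("sal"))
print(dealias("amt"))
```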
def pandas_coerce(self, data: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame
-
Apply parts of TabularPreprocessor to the data supplied.
Specifically applies all_columns, add_column, sql_transform, and python_transform.
Args
data
:pd.DataFrame
- Data on which to apply the preprocessor.
script
:str, optional
- Contents of a Python script to apply to the dataframe through the python transform.
Returns
DataFrame
- A dataframe with parts of preprocessor applied.
def set_sk_fitted_data_transform(self, transformers)
-
Set the sk fitted data transform to the provided transformer(s)
Args
transformers
- the provided transformers
def set_sk_fitted_target_transform(self, transformers)
-
Set the sk fitted target transform to the provided transformer(s)
Args
transformers
- the provided transformers
def sk_data_transform(self) -> Tuple[object, bool]
-
Get the sk fitted data transformers and whether to fit the dataset
Returns
Tuple[object, bool]
- The transformers and an indicator for whether or not to fit the dataset
def sk_target_transform(self) -> Tuple[object, bool]
-
Get the sk fitted target transformers and an indicator of whether to fit the target
Returns
Tuple[object, bool]
- The transformers and an indicator for whether or not to fit the target
class TabularPreprocessorBuilder
-
Abstract base for a preprocessor that can output data as a numpy.ndarray
Class variables
var SCHEMA
Methods
def add_column(self, name: str | List[str], target: bool = False) -> TabularPreprocessorBuilder
-
Add column(s) to the list of columns to include in this operation
Args
name
:str, List[str]
- Name or list of column names to include. If a list is passed, additional names are treated as aliases. To include multiple columns, use the method multiple times.
target
:bool
- Is this a target column? Target columns are used in operations such as training a model. Inference operations typically do not need a target and will ignore it if set.
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
def add_data_transformer(self, transform: ForwardRef('OneHotEncoder') | ForwardRef('OrdinalEncoder') | ForwardRef('KBinsDiscretizer') | ForwardRef('MultiLabelBinarizer'), columns: List | str = [], params: Dict[str, object] = {}) -> TabularPreprocessorBuilder
-
Define a data transformer for feature/independent variables.
Args
transform
:str
- The transformation to be applied to the specified column. Currently supported: OneHotEncoder, KBinsDiscretizer, OrdinalEncoder
columns
:str or List
- The column(s) to which the specified transformation will be applied.
params
:dict
- A dictionary of parameters specific to the transformation specified by the transform parameter. Specific parameters can be found in the scikit-learn documentation. For each transform type, all scikit-learn parameters are supported except for 'sparse' in OneHotEncoder. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
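To give a feel for what a data transformer such as OneHotEncoder does to a single column, here is a plain Python sketch of one-hot encoding. It is not the scikit-learn implementation and is not how this module applies transformers; the one_hot helper is a hypothetical name. Categories are sorted, which matches how scikit-learn orders automatically discovered categories.

```python
from typing import List, Sequence, Tuple


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the position of this value's category
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
print(encoded)
```

One input column becomes as many output columns as there are distinct categories, which is worth keeping in mind when choosing which columns to pass.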
def add_target_transformer(self, transform: ForwardRef('OneHotEncoder') | ForwardRef('OrdinalEncoder') | ForwardRef('KBinsDiscretizer') | ForwardRef('MultiLabelBinarizer'), columns: List | str = [], params: Dict[str, object] = {}) -> TabularPreprocessorBuilder
-
Define a transformer for target/dependent variables.
Args
transform
:str
- The transformation to be applied to the specified target column.
columns
:str or List
- The column(s) to which the specified transformation will be applied.
params
:dict
- A dictionary of parameters specific to the transformation specified by the transform parameter. Specific parameters can be found in the scikit-learn documentation. For each transform type, all scikit-learn parameters are supported except for 'sparse' in OneHotEncoder. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
def all_columns(self, value: bool = True) -> TabularPreprocessorBuilder
-
Turn the use of all columns in the dataset on or off.
Args
value
- Indicates whether or not to use all columns in the dataset.
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
def dtype(self, dtype: str | None) -> TabularPreprocessorBuilder
-
Cast an output numpy array to a given dtype.
If not explicitly set, the protocol will choose the dtype. This is ignored for non-Numpy outputs.
Args
dtype
- The dtype that a numpy output will be cast into.
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
def expand_input_dims(self, axis: Tuple[int, int])
def expand_target_dims(self, expand=True)
def handle_nan(self, method: str | int | float | dict) -> TabularPreprocessorBuilder
-
Specify method for handling NaN (not a number) values found in the data
Usage examples:
    # Drop rows with NaNs present in any column:
    tb.TabularPreprocessor.builder().all_columns(True).handle_nan("drop")

    # Set NaNs to the value zero:
    tb.TabularPreprocessor.builder().all_columns(True).handle_nan(0)

    # Specify different methods for different columns. Fill NaNs found
    # in column "A" with the median value of the column, and fill NaNs
    # found in column "B" with the value 42:
    tb.TabularPreprocessor.builder().all_columns(True).handle_nan({"A": "median", "B": 42})
Args
method
:str, int, float or dict
- Method of simple replacement for any NaN found in the data, or a dict specifying the method or replacement for specific fields.
drop - Drop all rows where NaN is present (see df.dropna())
mean / median / min / max - Replace NaN values with the given calculated statistic for the column.
int or float - Replace NaN values with the given value. (see df.fillna())
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
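The three kinds of method can be sketched in plain Python over rows held as dictionaries. This is an illustration under assumptions, not the library's implementation (which the references to df.dropna() and df.fillna() suggest is pandas-based). The helper names fill_column and _is_nan, and the choice to apply a non-dict method to every column, are assumptions made for this sketch.

```python
import math
from statistics import mean, median
from typing import Dict, List, Union

Row = Dict[str, float]
Method = Union[str, int, float]

STATISTICS = {"mean": mean, "median": median, "min": min, "max": max}


def _is_nan(value) -> bool:
    return isinstance(value, float) and math.isnan(value)


def fill_column(rows: List[Row], column: str, method: Method) -> List[Row]:
    """Apply one method ("drop", a statistic name, or a constant) to one column."""
    if method == "drop":
        return [r for r in rows if not _is_nan(r[column])]
    if isinstance(method, str):
        # The statistic is computed from the non-NaN values of the column.
        present = [r[column] for r in rows if not _is_nan(r[column])]
        fill = STATISTICS[method](present)
    else:
        fill = method
    return [{**r, column: fill if _is_nan(r[column]) else r[column]} for r in rows]


def handle_nan(rows: List[Row], method: Union[Method, Dict[str, Method]]) -> List[Row]:
    if isinstance(method, dict):
        per_column = method
    else:
        # Assumption: a single method applies to every column.
        per_column = {c: method for c in (rows[0] if rows else {})}
    for column, column_method in per_column.items():
        rows = fill_column(rows, column, column_method)
    return rows


rows = [
    {"A": 1.0, "B": math.nan},
    {"A": math.nan, "B": 2.0},
    {"A": 3.0, "B": 4.0},
]
filled = handle_nan(rows, {"A": "median", "B": 42})
print(filled)
```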
def python_transform(self, script: str | pathlib.Path) -> TabularPreprocessorBuilder
-
Set a Python script that transforms a data asset, as a dataframe, before use.
The provided script must have the form:
    tb.TabularPreprocessor.builder().all_columns(True).python_transform(
        '''
        import pandas as pd

        def transform(df: pd.DataFrame) -> pd.DataFrame:
            # transform the dataframe as you'd like
            return df
        '''
    )
For security reasons only Pandas and Numpy can be imported.
NOTE: Only one python_transform() or sql_transform() can be used on each dataset.
Args
script
:Path or str
- Path to a Python script, or a multiline string holding the script (leading whitespace will be intelligently trimmed).
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
def replace(self, substitutions: list | tuple | dict) -> TabularPreprocessorBuilder
-
Replace matching values in the whole dataset, or in a specific field, with the given value.
Usage examples:
    # Change -99 to np.nan everywhere:
    tb.TabularPreprocessor.builder().all_columns(True).replace((-99, np.nan))

    # Change -99 in column "A" to NaN, and -1 in column "B" to zero:
    tb.TabularPreprocessor.builder().all_columns(True).replace({"A": (-99, np.nan), "B": (-1, 0)})
Args
substitutions
:list, tuple, dict
- Change a value matching "from" to a new value.
  When tuple, it is treated as: (from, new_value)
  When list, it must contain tuples as: [(from, new_value), …]
  When dict, it is treated as: {"FIELDNAME": (from, new_value), …}
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
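How the three accepted shapes of substitutions might be interpreted is sketched below in plain Python, over rows held as dictionaries. This is an assumption-laden illustration, not the library's code: the stand-alone replace function and the _matches helper are hypothetical, and NaN is matched explicitly because NaN never compares equal to itself.

```python
import math
from typing import Dict, List


def _matches(value, old) -> bool:
    # NaN never equals itself, so compare it explicitly.
    if isinstance(old, float) and math.isnan(old):
        return isinstance(value, float) and math.isnan(value)
    return value == old


def replace(rows: List[Dict[str, object]], substitutions) -> List[Dict[str, object]]:
    if isinstance(substitutions, dict):
        # {"FIELDNAME": (from, new_value), ...} applies per field.
        per_field = {name: [pair] for name, pair in substitutions.items()}
        everywhere = []
    elif isinstance(substitutions, tuple):
        # A single (from, new_value) pair applies to every field.
        per_field, everywhere = {}, [substitutions]
    else:
        # A list of (from, new_value) pairs applies to every field.
        per_field, everywhere = {}, list(substitutions)

    result = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            for old, new in everywhere + per_field.get(name, []):
                if _matches(value, old):
                    value = new
                    break  # first matching substitution wins
            new_row[name] = value
        result.append(new_row)
    return result


rows = [{"A": -99, "B": -1}, {"A": 5, "B": -99}]
cleaned = replace(rows, {"A": (-99, math.nan), "B": (-1, 0)})
print(cleaned)
```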
def sql_transform(self, query: str | pathlib.Path) -> TabularPreprocessorBuilder
-
Set an SQLite query to apply to a data asset to transform it before use.
The query will be executed against a table named "data". Any valid SQLite expression can be used to rename or modify values in this transitory table before the values are used in the operation. For example, this query renames "Y" to "target" and recalculates "svr" from the raw values of svr and base:
    tb.TabularPreprocessor.builder().all_columns(True).sql_transform(
        "SELECT Y as target, (svr * base) / 2 as svr FROM data"
    )
NOTE: Only one python_transform() or sql_transform() can be used on each dataset.
Args
query
:str
- the sqlite query to run.
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
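A query can be tried out locally before it is passed to sql_transform, using Python's built-in sqlite3 module and an in-memory table named "data". The sample rows and column types below are hypothetical, and this sketch says nothing about how the library itself loads the asset into SQLite.

```python
import sqlite3

# Hypothetical sample rows; the table name "data" follows the description above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Y INTEGER, svr REAL, base REAL)")
conn.executemany(
    "INSERT INTO data (Y, svr, base) VALUES (?, ?, ?)",
    [(1, 4.0, 3.0), (0, 10.0, 0.5)],
)

# Rename Y to target and recompute svr from the raw svr and base values.
query = "SELECT Y AS target, (svr * base) / 2 AS svr FROM data"
cursor = conn.execute(query)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(column_names)
print(rows)
```

Checking cursor.description this way confirms the output column names, which matters when a later add_column call refers to a renamed column.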
def target_dtype(self, dtype: str | None) -> TabularPreprocessorBuilder
-
Cast an output target numpy value to a given dtype.
If not set, the protocol will choose the dtype. This is ignored for non-Torch outputs.
Args
dtype
:str
- The dtype into which a numpy target output will be cast.
Returns
TabularPreprocessorBuilder
- This class instance, useful for chaining.
class TabularTorchPreprocessor (columns: List[Column], all_columns: bool, sql_transform: str | None, python_transform: str | None, dtype: str | numpy.dtype | None, expand_target_dims: bool, target_dtype: str | numpy.dtype | None, sk_data_transformers: List, sk_target_transformers: List, expand_input_dims: Tuple[int, int] | None, handle_nan: str | int | float | dict | None, substitutions: list | tuple | dict | None)
-
Preprocessor designed for tabular-style data tasks.
This family of Preprocessors is designed to manage tabular-style data, e.g. rows of data with named columns for fields. The sequence of operations is fixed to this order:
- SQL transform OR Python transform
- Column selection (via all_columns or an explicit column list)
- Value replacements
- Handling of NaN values
- Other transforms (OneHotEncoder, etc)
- Numpy type casting
This order is followed regardless of the order in which operations are specified. Hence, the following behave identically:
    (
        tb.TabularPreprocessor.builder()
        .add_column("target", target=True)
        .all_columns(True)
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
    )

    (
        tb.TabularPreprocessor.builder()
        .sql_transform("SELECT target, salar as sal, amt FROM data")
        .dtype("float32")
        .all_columns(True)
        .add_column("target", target=True)
    )