Module preprocessor.package

A Package is a single file used to hold one or more files of data. The Package is essentially a .zip archive containing the data itself plus several specific files that define metadata about the package, including a file named ".meta.json".

Functions

def connect_sqlalchemy(engine, max_retries=4)

Connect to a SQLAlchemy engine with retries and exponential backoff

Args

engine : sa.engine.Engine
The engine to connect to
max_retries : int
The maximum number of times to retry the connection

Returns

sa.engine.Connection
The connection
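
A minimal usage sketch (the database URL is a placeholder and assumes a reachable server):

    import sqlalchemy as sa

    from preprocessor.package import connect_sqlalchemy

    # Hypothetical connection string -- substitute your own database URL.
    engine = sa.create_engine("postgresql://reporter:secret@db.example.com:5432/analytics")

    # Retries up to 4 times (the default) with exponential backoff before raising.
    conn = connect_sqlalchemy(engine, max_retries=4)
    try:
        rows = conn.execute(sa.text("SELECT 1")).fetchall()
    finally:
        conn.close()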
def generate_mustache_mock_context(template, params: List[ReportParameter])

Generate a mock mustache context from the user's parameter definitions, for validating a user-provided SQL/Elasticsearch template. The context can be used with chevron to attempt to populate the template and warn the user of any errors (this happens outside this function).

def validate_report_template_params(query_template: str, params: List[ReportParameter] | Dict[str, ReportParameter], verbose=False)

Validate a user-provided SQL/Elasticsearch template, using the parameter definitions to generate a mock mustache context and attempting to populate it with chevron.

Returns

The query_template rendered with the mock context.
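
An illustrative validation flow. The ReportParameter constructor and its import path shown here are assumptions, not part of this reference:

    from preprocessor.package import validate_report_template_params

    template = "SELECT * FROM visits WHERE state = '{{state}}' LIMIT {{row_limit}}"

    # Hypothetical parameter definitions -- the real ReportParameter fields may differ.
    params = [
        ReportParameter(name="state", param_type="string"),
        ReportParameter(name="row_limit", param_type="integer"),
    ]

    # Builds a mock mustache context from the definitions and renders the
    # template with chevron; mismatches surface as errors/warnings.
    rendered = validate_report_template_params(template, params, verbose=True)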

Classes

class Package (path: pathlib.Path, meta: Meta, spec: Spec | None)

A collection of data for training or computations, along with descriptions of the contents. A Package is essentially an archive (.zip) of files following a special internal structure:

An example Package with simple tabular data would internally look like:

    filename.zip
        .meta.json             # describes version, creation date, etc.
        some_kind_of_data.csv  # the data in this package

Image Package files also contain an internal "records.csv" which associates information such as training labels with images within the package. An example Image Package file would internally look like:

    filename.zip
        .meta.json       # describes version, creation date, etc.
        records.csv      # index of images and labels (for training)
        images/
            img_001.jpg
            img_002.jpg

Packages can also contain info to authenticate and query a database, etc.
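
Because a Package is an ordinary zip archive, its internal layout can be inspected with the standard library (a sketch; "my_data.zip" is a placeholder):

    import json
    import zipfile

    with zipfile.ZipFile("my_data.zip") as zf:
        print(zf.namelist())                      # e.g. ['.meta.json', 'some_kind_of_data.csv']
        meta = json.loads(zf.read(".meta.json"))  # package metadata (version, creation date, ...)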

Class variables

var MANIFEST_FILE
var META_FILE
var SPEC_FILE

Static methods

def create(filename: str | pathlib.Path, record_data: str | pathlib.Path, root: str | pathlib.Path = None, path_column: str | None = None, label_column: str | None = None, header: List[str] | None = None, spec_override: List[FieldOverride] = [], is_masked: bool = True, unmask_columns: List[str] | None = None, supplemental_named_paths: Dict[str, str] | None = None) -> Package

Create a Package using a simple CSV or a CSV describing a folder layout

For the simple case, just define the record_data (the CSV file) and, optionally, the header list if the first row of the CSV does not hold the column names.

For the more complex case of a folder layout, the CSV is a list of files and must contain several specific columns:

  • path_column (required): Name of the column holding the associated data file path/filenames.
  • root (optional): The root from which the above paths are relative. If None, paths are relative to the CSV itself.
  • label_column (optional): Name of the column holding a label describing each file. If there are multiple labels per file, use JSON format to specify the list.

Args

filename : Union[str, Path]
Filename of the Package to create
record_data : Union[str, Path]
Filename of data used to populate this Package.
root : Union[str, Path], optional
Path to the root of the data folder. Default is None
path_column : str, optional
Name of the column in the record_data file which contains paths to data files. If None, the record_data is treated as a simple tabular data file.
label_column : str, optional
Name of the label column. When a path_column exists, this column holds labels associated with the file in the path_column. Multi-label datasets need to be in JSON format.
header : List[str], optional
A list of column names. If None, the first row of the CSV will be used as a header.
spec_override : List[FieldOverride], optional
Overrides applied to the generated spec. Default is [].
is_masked : bool
Whether or not the data is masked.
unmask_columns : List[str], optional
List of individual fields to unmask.
supplemental_named_paths : Dict[str, str], optional
A dictionary of name:path entries indicating additional files to be included in the package.

Returns

Package
The archive object
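
An illustrative call for the folder-layout case (file, folder, and column names are placeholders):

    from preprocessor.package import Package

    # records.csv holds one row per image, with "file" and "diagnosis" columns.
    pkg = Package.create(
        filename="chest_xrays.zip",
        record_data="records.csv",
        root="/data/xrays",        # paths in the "file" column are relative to this folder
        path_column="file",
        label_column="diagnosis",
    )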
def create_database_report(filename: str | pathlib.Path, query_template: str, params: Dict[str, ReportParameter], connection: str | None = None, connection_opts: dict | None = None, credentials_info: dict | None = None, federation_group: str | uuid.UUID | None = None, aggregation_template: dict | None = None, post_processing_script: str | None = None, json_to_dataframe_script: str | None = None, name: str | None = None, description: str | None = None) -> Package

Validate and create a database report Package.

Args

filename : Union[str, Path]
Filename of the package to be created.
query_template : str
The SQL/Elastic query template containing Mustache template parameters for the report.
params : Dict[str, ReportParameter]
Parameters for this report.
connection : str
The SQLAlchemy-compliant connection string that defines where the database resides, as well as how to authenticate with it. See: https://docs.sqlalchemy.org/core/connections.html
connection_opts : dict, optional
Dictionary of database connection options.
credentials_info : dict, optional
Dictionary of credentials information if not provided in the connection string.
federation_group : Union[str, UUID, None], optional
The federation group to use when running this report. If present, overrides connection.
aggregation_template : dict, optional
The result aggregation to use when running a federated report.
post_processing_script : str, optional
A Python function or the filename containing a function to run after the report has been executed. The function must have the signature: def postprocess(df: pd.DataFrame, ctx: dict). The two arguments are the report output data frame and a dict holding the user-selected report parameters as context.
json_to_dataframe_script : str, optional
A Python function or the filename containing a function to convert the JSON output of individual federation members to a pandas dataframe.
name : str, optional
The name of the report.
description : str, optional
The description of the report.

Raises

Exception
Query template must not be blank.
Exception
Report parameters are required.
Exception
Invalid param_type for ReportParameter …
Exception
PARAM does not appear in the query template.
Exception
PARAM missing from params
Exception
Name must be unique for each parameter, PARAM reused.

Returns

Package
The created report package object
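
A hedged sketch of building a report package. The ReportParameter construction, its import, and the connection URL are assumptions, not confirmed by this reference:

    from preprocessor.package import Package

    template = (
        "SELECT diagnosis, COUNT(*) AS n "
        "FROM visits "
        "WHERE visit_date >= '{{start_date}}' "
        "GROUP BY diagnosis"
    )

    pkg = Package.create_database_report(
        filename="visit_counts_report.zip",
        query_template=template,
        # Keys must match the mustache parameters; the ReportParameter fields are hypothetical.
        params={"start_date": ReportParameter(name="start_date", param_type="date")},
        connection="postgresql://reporter:secret@db.example.com:5432/ehr",
        name="Visit counts",
        description="Counts visits by diagnosis from a user-selected start date.",
    )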
def create_elastic_search_query(filename: str | pathlib.Path, connection: str, api_key: str, index: str, body: dict, return_type: str, store_type: str)
def create_elastic_search_report(filename: str | pathlib.Path, name: str, description: str, post_processing_script: str, body_template: str, params: dict, connection: str, index: str, api_key: str) -> Package
def create_from_database(filename: str | pathlib.Path, query: str, connection: str) -> Package

Define a package extracted from a database-held dataset.

Args

filename : Union[str, Path]
Filename of the package to be created.
query : str
The SQL query used to collect the dataset.
connection : str
The SQLAlchemy-compliant connection string that defines where the database resides, as well as how to authenticate with it. See: https://docs.sqlalchemy.org/core/connections.html

Returns

Package
The archive object
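
For example (the query and connection string are placeholders):

    from preprocessor.package import Package

    pkg = Package.create_from_database(
        filename="lab_results.zip",
        query="SELECT patient_id, test_name, value FROM lab_results",
        connection="postgresql://etl_user:secret@db.example.com:5432/clinical",
    )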
def create_neo4j_query(filename: str | pathlib.Path, connection: str, username: str, password: str, query: str) -> Package
def from_aws_s3_bucket_storage(filename: str | pathlib.Path, bucket_name: str, region: str, object_name: str, aws_access_key_id: str, aws_secret_access_key: str) -> Package

Create a package file referencing an AWS S3 Bucket Storage data file

Args

filename : Union[str, Path]
Filename of the package to be created.
bucket_name : str
Name of the AWS S3 Bucket containing the data file
region : str
The AWS region
object_name : str
The file name, known as the object or key in AWS S3
aws_access_key_id : str
Access key for this account, region, bucket
aws_secret_access_key : str
Secret access key for this account, region, bucket

Returns

Package
The created Package object
def from_azure_blob_storage(filename: str | pathlib.Path, storage_account_name: str, storage_key: str, file_system: str, key: str) -> Package

Create a package file referencing an Azure Blob Storage data file

Args

filename : Union[str, Path]
Filename of the package to be created.
storage_account_name : str
The Azure storage account to reference.
storage_key : str
Access token used when pulling files from the storage account.
file_system : str
File system defined in the Azure control panel for the storage account.
key : str
The full path to the file that will be downloaded.

Returns

Package
The created Package object
def from_azure_data_lake_storage(filename: str | pathlib.Path, storage_account_name: str, storage_key: str, file_system: str, path: str) -> Package

Create a package file referencing an Azure Data Lake Storage data file

Args

filename : Union[str, Path]
Filename of the package to be created.
storage_account_name : str
The Azure storage account to reference.
storage_key : str
Access token used when pulling files from the storage account.
file_system : str
File system defined in the Azure control panel for the storage account.
path : str
The full path to the file that will be downloaded.

Returns

Package
The created Package object
def from_generic_result(package_name: str | pathlib.Path, file_data_dict: dict, job_id: str, job_type: str) -> Package
def from_image_dataset_folder(output_zip: str | pathlib.Path, path: str | pathlib.Path) -> Package

Create a package from a torch-style image dataset folder structure

NOTE: labels must be numeric values

Assumes the structure:

    path/
        <label>/
            imgs
        <label>/
            imgs

Args

output_zip : Union[str, Path]
Path of output zipfile
path : Union[str,Path]
Path to folder structure

Returns

(Package): Package file holding the given input data.
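
For a torch-style folder where each numeric label is a subfolder of images (paths are placeholders):

    from preprocessor.package import Package

    # /data/digits/0/*.png, /data/digits/1/*.png, ... -- one subfolder per numeric label
    pkg = Package.from_image_dataset_folder(
        output_zip="digits.zip",
        path="/data/digits",
    )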

def from_json(filename: str | pathlib.Path, data: Or()) -> Package

Create package from JSON file.

Args

filename : Union[str, Path]
Filename of the package to be created.
data : str
Filename of the data to package

Returns

Package
The created Package object
def from_model(filename: str | pathlib.Path, model_type: str, model_path: Or(), unrestricted_data: List[str] = [], reports: ModelReportRecord | None = None, logs: List[str] = [], data_transformers_path: str | pathlib.Path = '', target_transformers_path: str | pathlib.Path = '', vocab_path: str | pathlib.Path = '', target_map_path: str | pathlib.Path = '', validation_hash: str | None = None) -> Package

Create package from model file.

Args

filename : Union[str, Path]
Filename of the package to create
model_type : str
Model format, i.e. "torch", "keras", etc.
model_path : str
Current location of model to archive

Returns

Package
The created Package object
def from_model_folder(filename: str | pathlib.Path, model_path: str | pathlib.Path, manifest: Dict) -> Package

Create package from model directory.

Args

filename : Union[str, Path]
Filename of the package to be created.
model_path : Union[str, Path]
Path to the model directory
manifest : Dict
The manifest of the model

Returns

Package
The created Package object
def from_numpy(output_zip: str | pathlib.Path, X: numpy.ndarray | str | pathlib.Path, y: List | numpy.ndarray | str | pathlib.Path = None) -> Package

Prepare a single data file from numpy as an appropriately structured Package

Args

output_zip : Union[str, Path]
Path of Package to create
X : np.array
Training data
y : np.array
Training labels

Returns

(Package): Package file holding the given input data.
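
A quick sketch with in-memory arrays (shapes and values are illustrative):

    import numpy as np

    from preprocessor.package import Package

    X = np.random.rand(100, 8)          # 100 samples, 8 features
    y = np.random.randint(0, 2, 100)    # binary labels

    pkg = Package.from_numpy(output_zip="training_data.zip", X=X, y=y)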

def from_report_result(filename: str | pathlib.Path, result: pandas.core.frame.DataFrame, manifest: Dict) -> Package

Create a package file referencing a report result

Args

filename : Union[str, Path]
Filename of the package to be created.
result : pd.DataFrame
The result of the report to be packaged.
manifest : Dict
The manifest of the report result

Returns

Package
The created Package object
def from_single_file(output: str | pathlib.Path, input: str | pathlib.Path, is_masked: bool = True, unmask_columns: List[str] | None = None) -> Tuple[Package, bool]

Prepare a single data file as an appropriately structured Package

Args

output : Union[str, Path]
Path of Package to create
input : Union[str, Path]
Path of data to be placed into the Package
is_masked : bool
Whether or not the data is masked.
unmask_columns : List[str], optional
List of column names that are unmasked. The default is to mask all columns.

Raises

Exception
Unable to ascertain the proper Package to hold the data

Returns

(Package, bool): Package file holding the given input data, plus a boolean indicating if it is a package of images.
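
For example (file names are placeholders):

    from preprocessor.package import Package

    pkg, is_image_package = Package.from_single_file(
        output="claims.zip",
        input="claims.csv",
        is_masked=True,
    )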

def from_word_embedding(filename: str | pathlib.Path, embedding_path: str | pathlib.Path, vocab_path: str | pathlib.Path = '') -> Package

Create word embedding package for training

Args

filename : Union[str, Path]
Filename of the package to be created
embedding_path : Union[str, Path]
Path to the source word embedding
vocab_path : Union[str, Path]
Path to the source vocabulary for embedding

Returns

Package
The created Package object
def load(path: str | pathlib.Path, validation_hash: str | None = None) -> Package

Instantiates a Package object from a file

Args

path : Union[str, Path]
The file; it must already be in package format.
validation_hash : Optional[str], optional
The expected hash of the package contents. The router stores the hash at asset registration. Defaults to None, which bypasses the check.

Returns

Package
A Package instance
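
For example (the path and hash value are placeholders):

    from preprocessor.package import Package

    pkg = Package.load("claims.zip")

    # Optionally verify integrity against the hash recorded at asset registration.
    pkg = Package.load("claims.zip", validation_hash="<expected sha256 hexdigest>")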
def reference_from_database(filename: str | pathlib.Path, query: str, connection: str, options: dict | None = None, credentials_info: dict | None = None, linked_storage_columns: dict | None = None) -> Package

Define a package referring to a database-held dataset.

Args

filename : Union[str, Path]
Filename of the package to be created.
query : str
The SQL query which will be used to collect the dataset.
connection : str
The SQLAlchemy-compliant connection string that defines where the database resides, as well as how to authenticate with it. See: https://docs.sqlalchemy.org/core/connections.html
options : dict, optional
Dictionary of database connection options.
credentials_info : dict, optional
Dictionary of credentials information if not provided in the connection string.
linked_storage_columns : dict, optional
Dictionary of linked storage columns.

Returns

Package
The archive object
def reference_from_mongo_database(filename: str | pathlib.Path, query: str, connection: str, database: str, collection: str, projection: dict = {}, limit: int | None = None, sort: List | None = None) -> Package

Define a package referring to a MongoDB-held dataset.

Args

filename : Union[str, Path]
Filename of the package to be created.
query : str
JSON dictionary which is compatible with pymongo.
connection : str
Mongo connection URI. See: https://docs.mongodb.com/manual/reference/connection-string/

Returns

Package
The archive object

Instance variables

var filename : str
var meta : Meta
var path : pathlib.Path
var spec : Spec

Methods

def create_sqlalchemy_engine(self) -> sqlalchemy.engine.base.Engine

Create a SQLAlchemy engine for the package's database

Returns

sqlalchemy.engine.Engine
The engine
def get_data_transforms(self)
def get_linked_dicom(self, column_name: str, path_value: str, convert_to=None)

Get DICOM image data from linked storage and optionally convert to a specific format.

Args

column_name : str
The name of the column that references cloud storage
path_value : str
The value in the column that refers to the specific item
convert_to : str, optional
Format to convert the image to ('PIL', 'numpy', 'bytes', 'pydicom'). Default is None, which returns bytes.

Returns

Union[bytes, PIL.Image.Image, numpy.ndarray, pydicom.dataset.FileDataset]
The DICOM data in the requested format

Raises

ValueError
If the column is not defined as a linked storage column
ImportError
If required packages for conversion are missing
def get_linked_image(self, column_name: str, path_value: str, convert_to=None)

Get image data from linked storage and optionally convert to a specific format.

Args

column_name : str
The name of the column that references cloud storage
path_value : str
The value in the column that refers to the specific item
convert_to : str, optional
Format to convert the image to ('PIL', 'numpy', 'bytes'). Default is None, which returns bytes.

Returns

Union[bytes, PIL.Image.Image, numpy.ndarray]
The image data in the requested format

Raises

ValueError
If the column is not defined as a linked storage column
ImportError
If required packages for conversion are missing
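
An illustrative call; the column name and path value are placeholders and depend on the package's linked-storage configuration:

    from preprocessor.package import Package

    pkg = Package.load("radiology.zip")

    # Fetch one linked image and decode it as a PIL image instead of raw bytes.
    img = pkg.get_linked_image(
        column_name="image_path",
        path_value="scans/img_001.jpg",
        convert_to="PIL",
    )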
def get_manifest_html(self)
def get_model_misclassifications(self) -> List[pandas.core.frame.DataFrame]

Get information about failed test cases during model training

If the final model failed to produce correct results for any of the labeled test data, a sample of "unrestricted" information about those failures is returned. The unrestricted data is declared by the dataset owner when they mark a data column as Unmasked.

A maximum of 10 records are returned per client.

Returns

List[pd.DataFrame]
Dataframes holding the unmasked data for failures, up to one per client
def get_package_type(self) -> PackageType

Get the category of the packaged data

Returns

PackageType
An indication of the content of the package
def get_target_mapping(self)
def get_target_transforms(self)
def get_vocab(self)
def get_word_embedding(self)
def hash_contents(self)

Hashes the contents of the Package

For all Package types, hash the concatenated CRC-32 values of files in the package, excluding spec files. For database variant Packages, the database query is stored in metadata and also gets hashed.

Note: The hash is not stored in the Package.

Returns

str
Hex digest of the SHA-256 hash.
def iter_records(self) -> PackageIterator
def model(self)

Extract model contained in package to memory

Possible model_types include:

  • keras
  • pytorch
  • sklearn
  • recommender
  • onnx: A ModelProto object
  • xgboost: xgboost.XGBClassifier or xgboost.XGBRegressor
  • pmml_regression: A privophy.RegressionModel or privophy.GeneralRegressionModel object
  • pmml_tree
  • network_builder: JSON describing the TripleBlind model (e.g. split NN, vertical network)

Returns

Or[Pytorch, Keras, SKlearn, XGBoost, Recommender models, PMMLRegression, JSON]
The model
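
A typical flow, loading a trained model package and materializing the model in memory (the file name is a placeholder):

    from preprocessor.package import Package, PackageType

    pkg = Package.load("trained_model.zip")
    assert pkg.get_package_type() == PackageType.MODEL

    model = pkg.model()   # e.g. a torch / keras / sklearn object, depending on model_type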
def model_pointer(self) -> Tuple[str, object]

Return the model type and a file pointer directly to the model file path inside the zip file. model_types include: keras, pytorch, sklearn, recommender, and xgboost.

Returns

Tuple[model_type as string, zip file pointer to model path]

def perform_database_report(self, report_values) -> pandas.core.frame.DataFrame
def perform_elastic_search_report(self, report_values)
def populate_report_template(self, report_values) -> str
def populate_spec(self, force: bool = False)
def record_data(self) -> pandas.core.frame.DataFrame
def record_data_as_file(self)
def records(self) -> pandas.core.frame.DataFrame
def records_chunked(self, chunksize: int) -> Iterator[pandas.core.frame.DataFrame]
def regenerate_spec(self)
def resolve_linked_storage(self, column_name: str, path_value: str) -> bytes

Resolve a linked storage item to its actual data.

Args

column_name : str
The name of the column that references cloud storage
path_value : str
The value in the column that refers to the specific item

Returns

bytes
The content of the cloud storage item

Raises

ValueError
If the column is not defined as a linked storage column or if the storage type is not supported
async def substitute_connection_secrets(self, secret_store)

Use the Access Point-provided secret store to replace handlebars variables in connection strings.

def substitute_connection_secrets_sync(self, secret_store)
def validate_db_connection(self)
def validate_sql(self)

Run an SQL linter on the query to validate syntax.

Raises

ValueError
Failed mustache rendering.
ValueError
Failed SQLFluff linter. Content is a list of error strings in the format: ["Line {line_no}, Position {column_no}: {error message}", …]
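
An illustrative guard before using a database package (the variable pkg is assumed to be a database-variant Package):

    try:
        pkg.validate_sql()
    except ValueError as err:
        # Carries either a mustache-rendering failure or a list of SQLFluff
        # messages such as "Line 3, Position 14: ...".
        print(f"Query failed validation: {err}")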
class PackageIterator (parent: Package, zip: zipfile.ZipFile, df: pandas.core.frame.DataFrame)

Helper for walking through the contents of a Package file

Ancestors

  • collections.abc.Iterator
  • collections.abc.Iterable
  • typing.Generic
class PackageType (value, names=None, *, module=None, qualname=None, type=None, start=1)

An enumeration.

Ancestors

  • enum.Enum

Class variables

var AWS_S3_BUCKET_STORAGE
var AZURE_DATA_LAKE_STORAGE
var CSV
var DATABASE_REPORT
var ELASTIC_SEARCH_REPORT
var GENERIC_RESULT
var JSON
var MODEL
var MODEL_FOLDER
var MONGO
var NEO4J_QUERY
var REPORT_RESULT
var SQL
var WORD_EMBEDDING