Module tripleblind.table_asset

Specialized Asset representing tabular data, like a spreadsheet.

The TableAsset wraps a generic asset, allowing the complexity of creating jobs to be completely hidden. Common operations can happen with just a few lines of code.

For example:

import tripleblind as tb

table1 = tb.TableAsset(company_1_customers)
table2 = tb.TableAsset(company_2_clients)

# Privately determine overlap between above remote tables and a local CSV
overlap = table1.intersect(
    intersect_with=[table2, "broker-licenses.csv"],
    match_column="ssn")
if overlap:
    overlap.print_content()

Classes

class StatFunc (value, names=None, *, module=None, qualname=None, type=None, start=1)

Enumeration of statistical functions supported by get_statistics()

Ancestors

  • enum.Enum

Class variables

var CONFIDENCE_INTERVAL

The 95% confidence interval for the sample mean. Labeled 'ci-lower' and 'ci-upper'.

var COUNT

The number of items in the data. Labeled 'n'.

var KURTOSIS

The kurtosis of the data, which is the combined weight of a distribution's tails relative to the center of the distribution. Labeled 'kurt'.

var MAXIMUM

The maximum value in the data. Labeled 'max'.

var MEAN

The arithmetic mean of the data. Labeled 'mean'.

var MEDIAN

The median value in the data. Labeled 'median'.

var MINIMUM

The minimum value in the data. Labeled 'min'.

var QUARTILES

The q1 and q3 quartiles of the data plus median. Labeled 'q1', 'mean' ,'q3'.

var SKEW

The skew (degree of asymmetry) of the data. Labeled 'skew'.

var STANDARD_DEVIATION

The standard deviation (dispersion relative to the mean) of the data. Labeled 'sd'.

var STANDARD_ERROR

The standard error for the sample mean. Labeled 'se'.

var VARIANCE

The variance (spread of the numbers) of the data. Labeled 'var'.

class TableAsset (uuid: UUID)

A specialized Asset which holds spreadsheet-style data containing rows and columns.

NOTE: Operations may download the data temporarily. This is done in a secure fashion, but local data should be treated with care.

Ancestors

Static methods

def cast(asset: Asset) -> TableAsset

Convert a generic Asset into a TableAsset

This should only be used on data known to be tabular, no validation occurs during the cast.

Args

asset : Asset
A generic Asset

Returns

TableAsset
A TableAsset object
def find(search: Optional[Union[str, re.Pattern]], namespace: Optional[UUID] = None, owned: Optional[bool] = False, owned_by: Optional[int] = None, session: Optional[Session] = None, exact_match: Optional[bool] = True) -> TableAsset

Search the Router index for an asset matching the given search

Args

search : str or re.Pattern, optional
Either an asset ID or a search pattern applied to asset names and descriptions. A simple string will match a substring or the entire string if exact_match is True, or a regular expression can be passed for complex searches.
namespace : UUID, optional
The UUID of the user to which this asset belongs. None indicates any user, NAMESPACE_DEFAULT_USER indicates the current API user.
owned : bool, optional
Only return owned assets (either personally or by the current user's team)
owned_by : int, optional
Only return owned assets owned by the given team ID
session : Session, optional
A connection session. If not specified, the default session is used.
exact_match : bool, optional
When the 'search' is a string, setting this to True will perform an exact match. Ignored for regex patterns, defaults to True.

Raises

TripleblindAssetError
Thrown when multiple assets are found which match the search.

Returns

TableAsset
A single asset, or None if no match found
def find_linked_records(datasets: List[Union[Asset, TableAsset, str]], match_columns: List[Union[str, Tuple(str, str)]], match_threshold: Optional[float] = 0.88, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False) -> TableAsset

Find matching records between datasets

Args

dataset : Union[List[Asset], List[TableAsset]]
The two data assets to examine for linked records.
match_columns (List[Union[str, Tuple(str, str)]]): Names of the
columns to match. List of common column names or tuples
containing the column names to be paired for the linkage.
E.g. if the first dataset has columns named "fname", "lname" and
"age" and the second dataset has a columns named "first", "last"
and "age", then the columns would be specified as:
[('fname', 'first'), ('lname', 'last'), 'age')].
match_threshold : float, optional
The minimum confidence for match values. This is a value between 0.0 and 1.0. Default is 0.88.
job_name : str, optional
Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.
preproc : Union[TabularPreprocessor, List[TabularPreprocessor]], optional
The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all match columns and renames columns for linkage.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.

Raises

TripleblindProcessError
Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).
ValueError
Thrown when a parameter validation fails.

Returns

TableAsset
The output contains rows from the first dataset which had matches in the second dataset. Following those rows, the second dataset matches are reported in the match order. Both include the cosine similarity of the match.
def position(file_handle: Union[str, Path, Package, io.BufferedReader], name: str, desc: str, is_discoverable: Optional[bool] = False, allow_overwrite: Optional[bool] = False, session: Optional[Session] = None, is_dataset: bool = True, custom_protocol: Optional[Any] = None, metadata: dict = {}, k_grouping: int = 5, unmask_columns: Optional[List[str]] = None) -> TableAsset

Place CSV data on your Access Point for use in by yourself or others.

CSV Import Rules:

  • The separator must be a comma, not semicolons or tabs.
  • First row must be a header row containing column names.
    • No padding is allowed on names, extra spaces will be treated as part of the name.
    • Names can optionally be enclosed in double-quotes ('"').
  • Legitimate values:
    • The special numeric values 'inf', '-inf', 'nan' are supported. Capitalization is ignored (e.g. INF and InF are both legal)
    • Boolean values can be 'true' or 'false', case is ignored
    • Numeric values can be either integers (no decimals) or floating point (with a decimal).
    • Scientific notation (e.g. 1.23e04 to represent 12300.0) can be use for floating point values.
    • String values can be enclosed in double-quotes ('"') or not, although string containing commas must be within a quoted string.
  • Column type is inferred from the entire column's content. The following will impact this process:
    • Missing values mixed in with booleans becomes a string column
    • Any floating point value mixed in with integers makes a floating point column
    • Infinite, NaN or missing values in a column with integers makes a floating point column
    • Leading spaces before booleans, or the special values -Inf, Inf, or NaN will result in a string column
    • Missing values are treated as NaN in floating columns (and will also force an integer column to floating point)
  • Column types in dataframes:
    • Floating point columns are reported as 'np.float64'
    • Integer columns are reported as 'np.int64'
    • Boolean columns are reported as 'np.bool'
    • String columns are reported as 'object'
  • Leading spaces get included in string columns, but will be trimmed from simple numbers
  • Quotes are optional for alpha-numerics but required for most strings with special characters (including commas)

Args

file_handle : str, Path, Package or io.BufferedReader
File handle or path to CSV data to place on the API user's associated Access Point.
name : str
Name of the new asset.
desc : str
Description of the new asset (can include markdown)
is_discoverable : bool, optional
Should this asset be listed in the Router index to be found and used by others?
allow_overwrite : bool, optional
If False an exception will be thrown if the asset name already exists. If True, an existing asset will be overwritten.
session : Session, optional
A connection session. If not specified, the default session is used.
is_dataset : bool
Is this a dataset? (False == algorithm)
custom_protocol
Internal use
metadata : dict
Custom metadata to include in the asset
k_grouping : int
The minimum number of distinct values
unmask_columns : [str], optional
List of field names that should be unmasked. Default to to mask all fields.

Returns

TableAsset
New asset on the Router, or None on failure

Instance variables

var dataframe : pd.DataFrame

The content of the asset loaded as a Pandas dataframe

Returns

pd.DataFrame
The asset data
var raw_content : str

The content of the asset loaded as a string

Returns

str
The content

Methods

def blind_join(self, intersect_with: Union[Asset, List[Asset], TableAsset, List[TableAsset]], match_column: Union[str, List[str]], return_columns: Union[str, List[str], List[List[str]]], match_fuzziness: Optional[float] = None, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False, join_type: JoinType = JoinType.INNER) -> TableAsset

The intersection of datasets using exact or fuzzy matching.

The first dataset must be yours, so this method can only be invoked on an asset your team owns.

The 'match_fuzziness' is an inverted Jaro-Winkler distance. In a Jaro-Winkler comparison, a score of 0.0 is totally different and 1.0 is perfectly matched. Therefore 'match_fuzziness' = 0.0 is a perfect match and 'match_fuzziness' = 1.0 is completely different.

Args

intersect_with : Union[Asset, List[Asset]]
A dataset Asset (or a list of Assets) to be included in the Blind Join.
match_column : Union[str, List[str]]
Name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.
match_fuzziness : float, optional
If specified, the fuzziness threshold for match values. This is a value between 0.0 and 1.0. Default is 0.0 (perfect match).
return_columns : Union[str, List[str], List[List[str]]]

Name of the columns to be returned in the resultant dataset. This is a list of lists of column names such that each party has its own list of column names (starting, as with match_column, with the initiator asset).

str: If a single column name is provided, columns matching that name across all datasets will be returned. eg. ['a'] is interpreted as [ ['a'], ['a'], ['a'] ]

List[str]: If a list of names is provided, columns matching these names across all datasets will be returned. eg. ['a', 'b'] is interpreted as [ ['a', 'b'], ['a', 'b'], ['a', 'b'] ]

List[List[str]]: If a list of lists is provided, only columns matching the names from the respective dataset will be returned. eg. [ ['a', 'b'], ['c', 'd'] ] is interpreted as [ ['a', 'b'], ['c', 'd'] ]

NOTE: If a column name is specified for an asset and it does not exist, it is ignored. If multiple datasets have the same name for a column, the output dataset will contain suffixes to differentiate the source of each field. E.g. If "name" is in multiple datasets, the return will contain "name", "name__1" & "name__2".

job_name : str, optional
Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.
preproc : Union[TabularPreprocessor, List[TabularPreprocessor]], optional
The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all columns.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
join_type : JoinType, optional

Describes the format of the resulting dataset table. Default is INNER.

INNER: This is a SQL-like inner join where each party's return columns are included in the result set.

INNER_PARTITIONED: This is an inner join with each party's matching rows appended to the end of the dataset. There is no alignment by matching column. This format provides more privacy-protection and reveals less group-membership information than a standard inner join.

LEFT: This is a SQL-like left join where ALL records are returned from the initiator's table, with matching records from each party's return columns (where available).

Raises

TripleblindProcessError
Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).

Returns

TableAsset
A table of matching rows from all datasets. Rows from your table with multiple matches are duplicated such that all column values from matching rows in the second dataset are returned, one per row. If the resulting table is empty, this means there were no records found in the intersection of the datasets.
def detect_outlier(self, identifier_column: str, columns: Union[str, List[str]], std_dev_limit: int = 3, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> List[str]

Identify rows in a dataset containing unusual values, or outliers

This operation privately analyzes the given dataset values. Each value is then compared to the mean value for the dataset, identifying rows where the value is outside the number of standard deviations given. This is known as the Z Score. For a normal distribution, 99.7% of the values will be within 3 standard deviations of the mean.

Args

identifier_column : str
Name of the column in the table containing the record identifier, e.g. "id"
columns : Union[str, List[str]]
Name of the column(s) to check for outliers.
std_dev_limit : int
The number of standard deviations outside of which a value is considered an outlier. Default is 3, which captures 99.7% of a normally distributed dataset.
job_name : str, optional
Reference name for the job with performs this task.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

List[str]
List of identifiers for rows containing outliers.
def get_column_names(self) -> list

Fetch Column Names for a Given Asset

This should only be used on data known to be tabular. Relies on find to pass along asset.

Args

self : TableAsset
A generic TableAsset - utilize uuid

Returns

List
A list of column names or an empty list
def get_mock_data(self, session: Optional[Session] = None) -> pandas.core.frame.DataFrame

Retrieve 10 rows of sample data typical of the dataset Args:

Returns

pd.DataFrame
A table of the example data
def get_sample(self, num_samples: Optional[int] = 10, silent: Optional[bool] = False, session: Optional[Session] = None) -> pandas.core.frame.DataFrame

Generates synthetic data typical of the dataset

Result is randomly generated data similar to the actual dataset.

Args

num_samples : int, optional
Number of synthetic samples to generate. Default is 10.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

pd.DataFrame
A table of the synthetic data
def get_statistics(self, column: Union[str, List[str]], function: Union[StatFunc, List[StatFunc]] = None, combine_with: Optional[Union[Asset, List[Asset]]] = None, group_by: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None, match_column: Optional[Union[str, list[str]]] = None) -> TableAsset

Calculate one or more statistics on the database.

NOTE: String column types will be coerced to float32 when calculating statistics, if possible. Any non-numeric string in the column will result in an error.

Args

column : str, List[str]
Name of data column(s) upon which to calculate. Column names must be common between all of the data.
function : StatFunc, List[StatFunc], optional
Function(s) to calculate. If not specified, all stats are calculated.
combine_with : Union[Asset, List[Asset]], optional
Other table(s) with the same data/columns to virtually combine for the calculation.
group_by : str, optional
Data column for grouping of data before the calculation.
preproc : TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder], optional
The preprocessor(s) to use against datasets. If a list is given, the order must be the same as the combine_with assuming the first entry is this TableAsset.
job_name : str, optional
Reference name for the job with performs this task.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
session : Session, optional
A connection session. If not specified, the default session is used.
match_column : Union[str, List[str]], optional
If not provided, the data is treated as horizontally partitioned. If provided, the data is treated as vertically partitioned and match_column is the name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.

Returns

TableAsset
A table of the computed statistics.
def intersect(self, intersect_with: Union[Asset, str, Path, Package, List[Union[Asset, str, Path, Package]]], match_column: str, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset

Perform a Private Set Intersection

Private Set Intersection (PSI) allows you to privately compare data
you own with other datasets.  The first dataset must be yours, so
this method can only be invoked on an asset your team owns.

Args

intersect_with : Union[Asset, List[Asset]]
Other table(s) to intersect against. Can be Assets or paths to local files.
match_column : str
The name of the colum to match on each table
job_name : str, optional
Reference name for the job with performs this task.
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

TableAsset
The intersection of the datasets
def load(self, sep: Optional[str] = None, col_names: Optional[List[str]] = None, header: Optional[bool] = True) -> pandas.core.frame.DataFrame

Retrieve and load the asset into a Pandas dataframe

Args

sep : str, optional
Character separating values. Default is comma.
col_names : [str], optional
List of column names. Default is no names.
header : bool, optional
Read first line as a header? Default is True.
def mask_columns(self, col_names: Union[str, List[str]], mask_type: MockType = MockType.random, params: Optional[Dict[str, Any]] = None, session: Optional[Session] = None)

Masks columns identified by the supplied list of MaskColumns.

Args

col_names : List or str
Column name or list of names to mask.
mask_type : MockType
The mock type to use when generating data.
params : Dict[str, Any], optional
Parameters used by the mask_type. For example, for MockType.email it could be 'params={"domain": "gmail.com"}'.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

True if the operation succeeded, otherwise false.

def pretty_print(self, col_names: Optional[List[str]] = None)

Display the table nicely.

Args

col_names : [str], optional
Names to display for each column
def print_content(self, num_cols: Optional[int] = None, indent: Optional[int] = None, show_rows: Optional[bool] = False)

Print the contents of a TableAsset

NOTE: This downloads the data temporarily. This is done in a secure fashion, but local data should be treated with care.

Args

num_cols : int, optional
Number of columns for output. Default is single column.
indent : int, optional
Number of spaces to indent output. Default is zero.
show_rows : bool, optional
Should row numbers be displayed? Default is no row numbers.
def search(self, search_column: str, pattern: List[Union[str, re.Pattern]], match_case: bool = True, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset

Perform string searches against rows of the given column

Args

search_column : str
The name of the column to search.
pattern : Union[str, re.Pattern]
Search pattern(s). These can be simple strings or full regex patterns.
match_case : bool, optional
For a simple string search, should the case be matched? Defaults to True.
job_name : str, optional
A reference name
silent : bool, optional
Suppress status messages during execution? Default is to show messages.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

TableAsset
The search results, each on its own row
def unmask_columns(self, col_names: Union[str, List[str]], session: Optional[Session] = None) -> bool

Unmasks columns identified by the supplied list of col_names.

Args

col_names : List or str
Column name or list of names to unmask.
session : Session, optional
A connection session. If not specified, the default session is used.

Returns

True if the operation succeeded, otherwise false.

Inherited members