TripleBlind User Guide

Examples

The TripleBlind SDK comes with dozens of example scripts that show the variety of operations that can be performed on detailed yet private data. You can find complete example sequences under the examples folder of the SDK. Each example is stand-alone and walks you through the full process.

Examples will typically include a series of numbered scripts to:

  • Download or create the data used and preprocess it as necessary
  • Place the datasets on several example organizations for the example scenario
  • Perform a private operation
    -- or --
  • Define and train a model on private data
  • Perform inference with the trained model locally
  • Publish the trained model and perform inference on private data

Each example is complete and can be used as a foundation for building your own specific solution.

ℹ️ When you upgrade your SDK, a new SDK folder (with new examples or updated capabilities) will replace your current folder. For this reason, it is better to use the SDK for reference only and create your own scripts in a separate folder. See SDK Installation for more detail about configuration.

The following is a list of examples you'll find in the SDK; more are added regularly.


Assets

The sections in this example folder include scripts used for creating the various datasets used in other examples. TripleBlind has created these data assets already and hosts them on their "IniTech", "Globex" and "DemoCo" organizations, so most users have no need to run these.

These scripts serve as a reference for how different types of data can be associated with a TripleBlind "Asset". These include:

  • Comma-Separated Value (.csv) files
  • Collections of images or other binary data
  • Tables in a database, such as MongoDB and BigQuery
  • Tables in a data warehouse, such as Snowflake
  • Complex data in a NumPy binary format

Blind_Join

This example demonstrates the Blind Join operation, a powerful method to perform both exact and fuzzy matching between datasets. In this scenario, one organization runs a transportation system. They want to collaborate with a retailer who has customers that they suspect ride the transportation system. If the overlap in customers can be determined, the retailer can place more advertisements at the most popular stations as targeted marketing to increase sales.
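
To make the difference between exact and fuzzy matching concrete, here is a minimal plain-Python sketch. It does not use the TripleBlind SDK and provides no privacy protection. The function names, the sample names, and the 0.85 similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_matches(left, right):
    """Return left-hand keys that also appear on the right (case-insensitive)."""
    right_keys = {key.strip().lower() for key in right}
    return [key for key in left if key.strip().lower() in right_keys]


def fuzzy_matches(left, right, threshold=0.85):
    """Return (left, right, score) pairs whose similarity meets the threshold."""
    pairs = []
    for left_key in left:
        for right_key in right:
            score = SequenceMatcher(None, left_key.lower(), right_key.lower()).ratio()
            if score >= threshold:
                pairs.append((left_key, right_key, round(score, 2)))
    return pairs


# Hypothetical data: transit riders on one side, retail customers on the other.
riders = ["Ann Lee", "Jon Smith", "Maria Gomez"]
customers = ["ann lee", "John Smith", "Priya Patel"]

print(exact_matches(riders, customers))   # only the exact (case-insensitive) hit
print(fuzzy_matches(riders, customers))   # also pairs "Jon Smith" with "John Smith"
```

Exact matching misses the "Jon"/"John" spelling variant, while the fuzzy pass catches it at the cost of comparing every pair of keys.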

Blind_Report

Blind Report allows you to position a database-backed query with predefined configurable parameters. Users can configure the query using these predefined options, have it run against your database, and receive a report table. In this example, we define a configurable payroll report that allows consumers to supply specific demographics for the report. Only information allowed by the query template and parameter definitions is returned. The query is executed by the organization that owns the report, and only the results of the report are returned to the consumer.
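
The idea of a query template with predefined parameters can be sketched with Python's built-in sqlite3 module. This is not the TripleBlind API: the payroll table, the QUERY_TEMPLATE and ALLOWED_GENDERS names, and the run_report helper are hypothetical and only illustrate how a report owner can restrict consumers to approved parameter values.

```python
import sqlite3

# The report owner fixes the query text; consumers can only fill in the "?".
QUERY_TEMPLATE = (
    "SELECT department, ROUND(AVG(salary), 2) AS avg_salary, COUNT(*) AS headcount "
    "FROM payroll WHERE gender = ? GROUP BY department ORDER BY department"
)
ALLOWED_GENDERS = {"F", "M"}  # assumed set of predefined options


def run_report(conn, gender):
    """Run the fixed template after checking the parameter against the allowed set."""
    if gender not in ALLOWED_GENDERS:
        raise ValueError(f"gender={gender!r} is not an allowed report parameter")
    return conn.execute(QUERY_TEMPLATE, (gender,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (name TEXT, department TEXT, gender TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Eng", "F", 120000),
        ("Bea", "Eng", "F", 100000),
        ("Cal", "Eng", "M", 90000),
        ("Dee", "Ops", "F", 70000),
    ],
)

# Only aggregated rows come back; individual names never leave the database.
print(run_report(conn, "F"))
```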

Blind_Sample

A Blind Sample generates a realistic privacy-preserving sample similar to the real data. Strings are similar lengths, integers are in the same range, and floating point numbers have the same precision. This example demonstrates obtaining a Blind Sample to help the user understand their dataset asset.
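
The properties listed above (similar string lengths, same integer range, same float precision) can be illustrated with a small standard-library sketch. This is not how TripleBlind generates a Blind Sample and it is not privacy-preserving, since it reads the real values directly. The mimic_column and decimal_places helpers and the sample columns are hypothetical.

```python
import random
import string
from decimal import Decimal


def decimal_places(value):
    """Number of digits after the decimal point in a float's shortest repr."""
    exponent = Decimal(repr(value)).as_tuple().exponent
    return max(0, -exponent)


def mimic_column(values, rng):
    """Return fake values with the same shape as `values`, not the same content."""
    if all(isinstance(v, int) for v in values):
        return [rng.randint(min(values), max(values)) for _ in values]
    if all(isinstance(v, float) for v in values):
        places = max(decimal_places(v) for v in values)
        return [round(rng.uniform(min(values), max(values)), places) for _ in values]
    # Anything else is treated as text: random letters, same length per value.
    return ["".join(rng.choice(string.ascii_lowercase) for _ in str(v)) for v in values]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
real = {"name": ["Alice", "Bob"], "age": [34, 51], "weight": [72.5, 80.25]}
sample = {column: mimic_column(values, rng) for column, values in real.items()}
print(sample)
```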

Cifar

Using image datasets spread across three different organizations, this example trains a neural network to classify images into 10 categories using the CIFAR-10 dataset of 60,000 32x32 color images.

Once trained, the model is used for inference locally and remotely using both FED and SMPC security (see Model Security for more information). This also illustrates batching for inference, useful to manage memory and optimize performance.
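
The batching idea can be sketched in plain Python, independent of the SDK. The predict_in_batches helper, the batch size, and the stand-in fake_model function are assumptions for illustration; a real model call would take their place.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict, inputs, batch_size=32):
    """Call `predict` on one batch at a time and concatenate the outputs."""
    outputs = []
    for batch in batches(inputs, batch_size):
        outputs.extend(predict(batch))
    return outputs


def fake_model(batch):
    """Stand-in for a real model: labels each input by parity."""
    return ["even" if x % 2 == 0 else "odd" for x in batch]


print(predict_in_batches(fake_model, list(range(10)), batch_size=4))
```

Only one batch is held by the model call at a time, which caps peak memory at roughly one batch worth of inputs and outputs.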

CMAPSS_CNN & CMAPSS_NN

These examples train a model that will predict the Remaining Useful Life (RUL) of an engine at a given time step using two different approaches. One example uses a feed forward neural network (CMAPSS_NN) and the other reshapes the data during preprocessing so that a convolutional neural network can be used (CMAPSS_CNN). Both demonstrate how to train models that predict a continuous dependent variable.

Data_Connectors

This is a directory of examples showing connectors to various databases and data warehouse sources. These examples will not run out-of-the-box, as they require specific authentication and connection parameters, but they provide an excellent starting point for you to position asset sources you use at your organization within the TripleBlind platform.

This directory currently shows examples for the sources below. See Data Format Support for a full list of data sources.

  • Amazon Redshift
  • Amazon S3
  • Google BigQuery
  • Microsoft Azure Blob Storage
  • Microsoft Azure Data Lake (Gen2)
  • Microsoft SQL Server
  • MongoDB
  • Snowflake

For a tutorial on setting up these connectors, see Database Assets.

Data_Munging

"Data munging" is the process of manipulating raw data into a different form, making it more useful for a downstream purpose. This doesn't change the source data, rather it transforms it for specific uses.

TripleBlind supports two methods of munging: one performed by the data owner and one by the data consumer. One method uses SQL to perform the munging and the other uses Python.
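
As a rough illustration of the two styles, the sketch below applies the same transformation once in SQL (through Python's built-in sqlite3 module) and once in plain Python. It does not use the TripleBlind SDK; the visits table, its columns, and the unit conversion are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (patient_id, visit_date, weight_kg)
rows = [("a1", "2024-01-05", 72.5), ("a2", "2024-02-11", 80.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id TEXT, visit_date TEXT, weight_kg REAL)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)

# SQL-style munging: derive new columns in a SELECT; the table itself is untouched.
sql_view = conn.execute(
    "SELECT patient_id, substr(visit_date, 1, 4) AS year, "
    "round(weight_kg * 2.20462, 1) AS weight_lb "
    "FROM visits ORDER BY patient_id"
).fetchall()

# Python-style munging: the same transformation applied to the in-memory rows.
py_view = [(pid, date[:4], round(kg * 2.20462, 1)) for pid, date, kg in rows]

print(sql_view)
print(py_view)

# The source data is unchanged in both cases.
source_after = conn.execute("SELECT * FROM visits ORDER BY patient_id").fetchall()
print(source_after == rows)
```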

Federated_Learning

This example illustrates using the Federated Learning approach to train a model that recognizes handwritten digits.

HIPAA_Restricted

This example shows how to position sensitive data so that it may only be accessed in a controlled manner using Blind Query.

Image_Data

This example trains a handwritten digit recognizer using Blind Learning.

LSTM

A generative text model is built and trained in this example, using an LSTM network. Training data is split across multiple organizations. The example includes running the model to generate text output sequences.

Multimodal_AI

This example demonstrates training a model using vertical partitioning, where different data on the same patients is distributed across several organizations. In this case, CT scans in DICOM format are at one organization, and tabular data for the same set of patients is held by a different organization. A distributed training is performed on this data, building a model which predicts patient age. This also illustrates performing distributed inference, running data from two different organizations through the trained model for validation.
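
Vertical partitioning means each organization holds different columns for the same patients. The plain-Python sketch below shows only that data layout; in an actual distributed training run the rows are not merged in one place like this. The patient identifiers, feature names, and the align_vertical helper are hypothetical.

```python
# Hypothetical holdings: each organization has different features per patient.
imaging_org = {
    "p1": {"scan_mean_intensity": 0.42},
    "p2": {"scan_mean_intensity": 0.57},
    "p4": {"scan_mean_intensity": 0.33},
}
tabular_org = {
    "p1": {"height_cm": 170, "smoker": 0},
    "p2": {"height_cm": 182, "smoker": 1},
    "p3": {"height_cm": 165, "smoker": 0},
}


def align_vertical(left, right):
    """Combine the feature columns of records that both sides hold."""
    shared_ids = sorted(left.keys() & right.keys())
    return {pid: {**left[pid], **right[pid]} for pid in shared_ids}


combined = align_vertical(imaging_org, tabular_org)
print(combined)  # only p1 and p2 appear on both sides
```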

Object_Detection

This example uses the Single Shot MultiBox Detector (SSD) object detection method to train a model that can identify one or more regions of interest (boxes) around objects in an input image and classify the found objects.

Outlier_Detection

The Outlier Detection operation performs basic statistical analysis on CSV-style tabular data, identifying rows whose values deviate from the rest of the table. This simple operation occurs in a privacy-preserving manner, never exposing the contents of the dataset in the process. The output is simply the record identifiers of the outliers, not raw data.
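
The source does not say which statistic the operation uses. As one possible illustration, the sketch below flags rows by z-score and returns only their identifiers. The z-score rule, the 2.5 threshold, the outlier_ids helper, and the sample records are all assumptions, and this plain-Python version has none of the platform's privacy protections.

```python
from statistics import mean, pstdev


def outlier_ids(records, column, z_threshold=2.5):
    """Return the ids of records whose value is far from the column mean."""
    values = [record[column] for record in records]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing deviates
    return [r["id"] for r in records if abs(r[column] - mu) / sigma > z_threshold]


# Hypothetical table: nine ordinary amounts and one extreme value.
amounts = [50, 51, 52, 53, 54, 55, 56, 57, 58, 500]
records = [{"id": f"r{i}", "amount": amount} for i, amount in enumerate(amounts, start=1)]

print(outlier_ids(records, "amount"))  # identifiers only, never the raw amounts
```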

PMML

Regression

This example demonstrates the use of Predictive Model Markup Language (PMML) to define an algorithm asset. The model used was trained in R and exported using R's built-in PMML export capabilities. The model is placed on the TripleBlind platform as an algorithm asset, which can then be made available to others. We create an Agreement on this asset to allow "organization-three" to run their data against this algorithm without manual intervention. Finally, that organization is able to run its own data against the model to infer loan default rates.

Tree

The Predictive Model Markup Language (PMML) includes the ability to work with tree structures. This example illustrates working with a PMML representation of a Random Forest model trained on the classic Iris dataset. The model was trained in R using the randomForest package, then exported as PMML.

The PMML model is turned into a TripleBlind Asset via the PMMLTree.create() method. Then the model is used to infer against a flower.csv dataset in both Federated and fully secure SMPC modes.

Pretrained_NN_Inference

This demonstrates placing a pre-trained neural network (.onnx, .pth or .h5 model) on the TripleBlind platform. After positioning, another organization securely infers against the model using SMPC, protecting both the data and the intellectual property of the neural network owner.

Private_Query (Blind Query)

This example demonstrates a company providing controlled access to extremely sensitive data via a Blind Query. The company has a database containing names, salary, gender, and ethnicity data. For compliance reasons, a summary report is created to show average salary by ethnic group. A second organization is able to run exactly that query with no ability to extract any further information from the database. Each access is also under full permission control and audited.

PSI (Blind Match)

A stand-alone example of Blind Match, similar to the tutorial. A Private Set Intersection allows organizations to identify whether their records overlap without sharing any information about their own dataset. The output of the operation is simply an asset that lists the key values from the organization's own dataset that are also present in the counterparty dataset.
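
The shape of that output can be shown with an ordinary set intersection. This sketch is not a private set intersection protocol: it compares the keys in the clear and is meant only to show what each party receives. The matching_keys helper and the sample keys are hypothetical.

```python
def matching_keys(own_keys, counterparty_keys):
    """Return the caller's own keys that also appear in the counterparty's list."""
    other = set(counterparty_keys)
    return [key for key in own_keys if key in other]


# Hypothetical key columns held by two organizations.
org_a_keys = ["alice@example.com", "bob@example.com", "carol@example.com"]
org_b_keys = ["carol@example.com", "dave@example.com", "alice@example.com"]

# Each side learns only which of its own keys overlap, in its own order.
print(matching_keys(org_a_keys, org_b_keys))
print(matching_keys(org_b_keys, org_a_keys))
```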

PSI_Vertical_Decision_Tree

This example combines a Blind Match with a Decision Tree operation to create a model across a subset of vertically-partitioned records found in distinct datasets owned by different organizations (three organizations for the classification example and two for the regression example). The Blind Match identifies the overlap of matching records across datasets, and a decision tree classification or regression model is trained on the vertically-partitioned intersection. The last step of the example sequence uses the trained model to perform an inference on test data that is distributed across the participants in the same way as the training data.

PSI_Vertical_KMeans

This example combines a Blind Match with a K-Means Clustering operation to create a model across a subset of vertically-partitioned records found in two distinct datasets owned by different organizations. The Blind Match identifies an overlap of matching records across datasets, and a K-Means Clustering model is trained on the vertically-partitioned intersection. The example sequence then uses the trained model to identify clusters in the data and lets the initiator visualize the results for easier review.

PSI_Vertical_Partition

This example combines a Blind Match with Blind Learning to create a model across a subset of records found in three distinct datasets. The model is then used to infer a result locally.

Regression

Gene_Regression

Using two distributed datasets of gene data from cancer patients, a logistic regression model is trained. After training, both local and remote inference are performed to "diagnose" and validate the training results.

PSI_Vertical_Regression

This example combines a Blind Match with the vertically-partitioned Regression operation to create a regression model across a subset of vertically-partitioned records found in three distinct datasets owned by different organizations. The model is then used to perform an inference on test datasets distributed across the data owners.

R_Support

These examples demonstrate how TripleBlind's ecosystem can be accessed from the R programming language using the Reticulate interoperability package. Reticulate allows R to seamlessly interact with Python.

Random_Forest

This example trains a Scikit-learn style Random Forest classifier on data distributed across multiple organizations to predict applicants who will be accepted to a university based on their age, GPA, GMAT scores, and years of work experience. It also illustrates using this model locally and in the cloud to predict admittance for another batch of applicants.

Recommendation_Model

This example demonstrates Recommendation Model training and inference operations between distributed datasets. In this scenario, we use a movie ratings dataset of 100,000 ratings with the columns userId, itemId, rating, and timestamp.

Statistics

This example applies the Blind Stats operation to calculate descriptive summary statistics across multiple tabular datasets. This is a powerful privacy-preserving operation that allows a dataset user to understand a study population across multiple datasets, even when the data is in different organizations or regions.
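
One way to see how summary statistics can span several datasets without pooling the rows is to combine per-dataset counts, sums, and sums of squares. The sketch below is an illustration in plain Python under that assumption; it is not a description of how Blind Stats is implemented, and the helper names and sample values are hypothetical.

```python
import math


def local_summary(values):
    """Summary computed where the data lives; raw values are not shared."""
    return {"n": len(values), "sum": sum(values), "sum_sq": sum(v * v for v in values)}


def combine(summaries):
    """Merge per-dataset summaries into an overall count, mean, and std."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2  # population variance
    return {"n": n, "mean": mean, "std": math.sqrt(max(variance, 0.0))}


# Hypothetical measurements held by two organizations.
site_a = [1, 2, 3]
site_b = [4, 5]

overall = combine([local_summary(site_a), local_summary(site_b)])
print(overall)
```

The combined result matches what would be computed on the concatenated data, yet each site only contributes three numbers.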

A tabular database is searched using simple and regular expressions without exposing the underlying content of the search fields. Only aggregate counts are returned.
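
A search that returns only aggregate counts can be sketched with Python's re module. This plain-Python version reads the fields directly, so it shows the shape of the result rather than the privacy mechanism. The count_matches helper, the note column, and the sample rows are hypothetical.

```python
import re


def count_matches(rows, column, pattern):
    """Count rows whose field matches the pattern; the rows are never returned."""
    regex = re.compile(pattern)
    return sum(1 for row in rows if regex.search(row[column]))


# Hypothetical free-text column.
rows = [
    {"note": "patient reports chest pain"},
    {"note": "no pain"},
    {"note": "CHEST discomfort"},
]

print(count_matches(rows, "note", r"(?i)chest"))  # case-insensitive search
print(count_matches(rows, "note", r"pain$"))      # regular expression anchored at the end
```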

Tabular_Data

Using financial data split across three organizations, this example builds a model to predict customer transaction behavior. This shows TripleBlind's Blind Learning version of Split Learning, providing high speed training and model privacy during this training.

Transfer_Learning

In Transfer Learning, an existing CNN model is loaded onto the platform and augmented with additional training against data housed at another organization.

XGBoost & XGBoost_Regression

XGBoost models are trained on distributed financial data. An XGBoost Regression model is built against unevenly split distributed datasets.

Wed May 15 2024 04:43:13 GMT-0400 (Eastern Daylight Time)