Model Security

Model Training

Federated Learning

Federated learning is a decentralized machine learning technique that trains a model across multiple data providers, each holding an independent dataset. Typically this involves sending the model to each participant, who trains it independently on local data, and then averaging the resulting model weights at a central point (without having to move or share the datasets). Our implementation of federated learning is seamlessly integrated with our access management, privacy controls, and auditing systems.
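
The central averaging step can be sketched as follows; the function name and the use of NumPy arrays are illustrative assumptions, not our actual API:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Combine locally trained weights into a global model.

    client_weights: list of dicts mapping layer name -> np.ndarray,
                    one dict per participant, trained on local data only.
    client_sizes:   number of local training examples per participant,
                    used to weight the average.
    """
    total = sum(client_sizes)
    global_weights = {}
    for name in client_weights[0]:
        global_weights[name] = sum(
            w[name] * (n / total) for w, n in zip(client_weights, client_sizes)
        )
    return global_weights

# Each round: send the global model out, let every participant train locally,
# then average the returned weights. Raw data never leaves its owner.
```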

Split Learning

Split Learning, developed at MIT Media Lab’s Camera Culture group, is a machine learning approach that allows several entities to train machine learning models without sharing any raw data. Split Learning provides a promising solution to enable industry, research, and academic collaborations across a wide range of domains including healthcare, finance, security, logistics, governance, operations and manufacturing.

The underlying approach of Split Learning is to split a deep learning model (i.e., its architecture) across the participating, distributed entities. Each participating entity trains its part of the model on premises and shares only the output of the last layer of its portion (referred to as smashed data, or the split-layer output). Therefore, unlike other distributed learning approaches that require sharing the entire model and its parameters, Split Learning shares only the output of the split layer, which drastically reduces the amount of shared knowledge (i.e., trained models) and thus reduces information leakage. Additionally, different flavors of Split Learning further limit the amount and type of information that can be shared at the split layer, such as data labels, for added privacy.
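
As an illustration of the split, the sketch below cuts a small network into a data-owner portion and a server portion using PyTorch; the layer sizes and variable names are hypothetical, not our production architecture:

```python
import torch
import torch.nn as nn

# Data-owner side: the layers kept on premises. Only the output of the
# final layer here (the "smashed data") ever leaves the organization.
client_model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 32),   # split layer
)

# Server side: the remaining layers, which never see raw records.
server_model = nn.Sequential(
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

x = torch.randn(8, 64)            # raw data, stays with its owner
smashed = client_model(x)         # split-layer output is all that is shared
logits = server_model(smashed)    # rest of the forward pass runs remotely
```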

Existing research shows that Split Learning (1) is beneficial for training deep learning models where data sharing is prohibited, such as in hospitals; (2) can be implemented to support multiple data modalities and different tasks across participating entities; (3) improves computational and communication efficiency at participating entities compared to other techniques, such as Federated Learning; and (4) promises to reduce data leakage from the split layer.

Blind Learning

Blind Learning is the name of our privacy-preserving algorithm for training neural networks on distributed datasets (both vertically and horizontally partitioned). Blind Learning is built on top of Split Learning.

Blind Learning provides two guarantees:

  • Data never leaves the data-owner's possession
  • The complete model is never revealed to the data owner during training

The training process involves exchanging only the outputs and gradients of one intermediate layer, the split layer. Other approaches, such as Federated Learning, exchange the entire model's parameters, sacrificing protection of the model among the organizations involved in training. Blind Learning provides additional privacy guarantees for the data exchanged at the split layer during training; in particular, parties are prevented from reconstructing records from the split-layer outputs.
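
A minimal sketch of one training step under these constraints, again with hypothetical PyTorch models, showing that only the split-layer output travels forward and only its gradient travels back:

```python
import torch
import torch.nn as nn

# Same hypothetical split as above: the data owner holds the lower layers,
# the server holds the rest. Only split-layer tensors cross the boundary.
client_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
server_model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))

criterion = nn.CrossEntropyLoss()
client_opt = torch.optim.SGD(client_model.parameters(), lr=0.01)
server_opt = torch.optim.SGD(server_model.parameters(), lr=0.01)

def training_step(x, y):
    client_opt.zero_grad()
    server_opt.zero_grad()

    smashed = client_model(x)                      # computed on premises
    shared = smashed.detach().requires_grad_()     # the only tensor sent out
    loss = criterion(server_model(shared), y)      # server finishes the forward pass

    loss.backward()                                # server-side gradients
    smashed.backward(shared.grad)                  # only the split-layer gradient returns
    client_opt.step()
    server_opt.step()
    return loss.item()
```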

Our approach mimics the Model Averaging technique used in Federated Learning to synchronize model parameters across the participating entities. Because model updates are averaged across all participants, we demonstrably reduce the possibility of data leakage. Blind Learning also supports parallel training among the involved parties, which reduces the overall training time.

Blind De-Identification

Blind De-Identification automatically de-identifies data at the byte level. This non-traditional, format-agnostic approach applies to any type of data, including genomic, image, tabular, and voice data.

Blind Decorrelation

Blind Decorrelation is an optional parameter that obscures the relationship between a model's input data and its parameters, reducing the risk of training-set membership attacks on AI models. Without decorrelation, a membership inference attack could allow a malicious user to determine whether particular records were part of the training dataset. If membership of a record in the training set can be inferred from the model, the model could reveal, for example, a patient's identity with fairly high probability.
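
For intuition, the simplest membership inference attacks rely on the observation that models are often more confident on records they were trained on. The toy check below, with an assumed confidence threshold, illustrates the idea; it is not an attack on our system:

```python
import numpy as np

def confidence_membership_guess(model_probs, threshold=0.95):
    """Guess that a record was in the training set if the model's top-class
    probability on it exceeds a threshold. model_probs is an (N, classes)
    array of softmax outputs for the records being tested."""
    top = np.max(model_probs, axis=1)
    return top >= threshold   # True = "probably a training-set member"
```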

Model Inference

Federated Inference

The model is securely transmitted to the data’s location only for the duration of the inference operation. This operation supports only two parties.
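
A rough sketch of this flow, using generic PyTorch serialization in place of our secure transport; the architecture and helper below are assumptions for illustration only:

```python
import io
import torch
import torch.nn as nn

def make_model():
    # Hypothetical architecture agreed on by both parties ahead of time.
    return nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

# --- model owner: serialize trained weights for transit ---
trained = make_model()
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)

# --- data owner: reconstruct the model, score local data, then discard it ---
buffer.seek(0)
local_copy = make_model()
local_copy.load_state_dict(torch.load(buffer))
local_copy.eval()
with torch.no_grad():
    predictions = local_copy(torch.randn(4, 16))   # data never leaves this side
del local_copy                                      # model copy removed after use
```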

Distributed Inference

A more secure form of Federated Inference is used on vertically partitioned models. Each provider receives only the part of the algorithm necessary to run inference on its data, keeping the overall model secure. This can include partitioned data that spans organizations: when attached to live data, a diagnosis could incorporate information from a hospital, an imaging company, and even financial records, all processed in place.
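
A rough illustration of vertically partitioned inference: each provider applies only its own sub-network to its own features, and only those intermediate outputs are combined. The branch names and sizes below are hypothetical:

```python
import torch
import torch.nn as nn

# Each provider holds only the sub-network that matches its own features.
hospital_branch = nn.Sequential(nn.Linear(20, 8), nn.ReLU())   # clinical features
imaging_branch  = nn.Sequential(nn.Linear(50, 8), nn.ReLU())   # imaging features
fusion_head     = nn.Sequential(nn.Linear(16, 2))              # combines branch outputs

hospital_data = torch.randn(4, 20)   # stays at the hospital
imaging_data  = torch.randn(4, 50)   # stays at the imaging company

with torch.no_grad():
    h = hospital_branch(hospital_data)                  # computed locally, in place
    i = imaging_branch(imaging_data)                    # computed locally, in place
    diagnosis = fusion_head(torch.cat([h, i], dim=1))   # only branch outputs are combined
```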

SMPC Inference

Model inference using Secure Multi-Party Computation (SMPC) offers the strongest level of protection, both for the data and for the model. No recoverable version of the data or the model is ever exchanged between the parties. Instead, a one-way transformation is applied to partial shares of the model and the data, which allows computations to be performed in an irreversible SMPC-transformed space. No encryption key exists that can be compromised, and SMPC is mathematically proven to be quantum safe, meaning that a bad actor with unlimited computational resources would be unable to compromise the system.
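
As a toy illustration of the underlying idea, additive secret sharing splits each value into random shares that reveal nothing individually; linear operations can be computed directly on the shares, and only the final result is reconstructed. This simplified example is not our protocol:

```python
import secrets

PRIME = 2**61 - 1   # arithmetic is done modulo a large prime

def share(value):
    """Split a value into two additive shares, one per party."""
    r = secrets.randbelow(PRIME)
    return r, (value - r) % PRIME      # neither share alone reveals the value

def reconstruct(s1, s2):
    return (s1 + s2) % PRIME

# Secret-share a model value and a data value between two parties.
w1, w2 = share(7)        # "model" share held by each party
x1, x2 = share(5)        # "data" share held by each party

# Each party adds its shares locally; multiplication would additionally
# require precomputed triples, which this toy example omits.
sum_share_1 = (w1 + x1) % PRIME
sum_share_2 = (w2 + x2) % PRIME
assert reconstruct(sum_share_1, sum_share_2) == 12
```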
