Blind Report

Blind Report allows you to position a database-backed query with predefined configurable parameters. Users can configure the query using these predefined options, have it run against your database, and receive a report table.

This is a powerful operation that lets the data steward permit only specific, controlled access to the data they want to expose to the consumer. For example, a report could be defined that allows a user to select a year and month along with a company division to generate salary statistics by ethnicity or gender for use in compliance reporting.

Any number of variables and queries of any complexity are supported. See examples/Blind_Report for documentation and more information.

Blind Report is a Safe operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide).

Operation

See the ReportAsset documentation for positioning methods.
When using add_agreement() to create an agreement for a counterparty to use the Blind Report, use the positioned asset’s UUID for the operation parameter.
When using create_job() to run the report in a process, use the positioned asset for the operation parameter.

Parameters

Positioning parameters

Blind Reports are positioned using create methods that accept connection details similar to their DatabaseDataset counterparts. Additional parameters include:

query_template: str

The query template uses {{brackets}} to identify which parameters will be exposed as configurable by a user.

params: List[ReportParameter]

ReportParameter methods (create_string, create_float, & create_int) should be used to generate the acceptable format.
Configurable options should be added using ParameterOption.
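
Before positioning a report, it can help to confirm that every {{placeholder}} in the template has a matching parameter name. The following is a minimal sketch in plain Python, not part of the TripleBlind SDK; the helper names placeholder_names and check_params are hypothetical, and the regular expression deliberately ignores section tags such as {{#DIALECT_X}}.

```python
import re

# Matches {{name}} but not section tags like {{#NAME}}, {{^NAME}} or {{/NAME}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholder_names(query_template: str) -> set:
    """Return the set of {{name}} placeholders found in a template."""
    return set(PLACEHOLDER.findall(query_template))


def check_params(query_template: str, param_names: list) -> dict:
    """List placeholders lacking a parameter and parameters lacking a placeholder."""
    wanted = placeholder_names(query_template)
    declared = set(param_names)
    return {
        "missing_params": sorted(wanted - declared),
        "unused_params": sorted(declared - wanted),
    }


query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) AS average_{{pay}}
FROM tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}};
"""

# Only "demographic" has been declared so far, so "pay" is reported as missing.
result = check_params(query_template, ["demographic"])
print(result)
```

Running a check like this locally catches a misspelled parameter name before the report is positioned.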

Report parameters

operation: ReportAsset

When running the Blind Report, the Asset UUID of the positioned report should be supplied here as the algorithm to be run.

dataset: []

This should be left blank when running a Blind Report, i.e., dataset=[].

params: Dict{"report_values": {"param_1": "value_1"}, …, {}}

This is a JSON string supplying the desired parameters to be run within the Blind Report job.
Use get_report_params() to understand the configurable parameters and their options.
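
As a rough illustration of the shape described above, the sketch below builds the report_values structure as a Python dictionary and serializes it with the standard json module. The parameter names and values are placeholders taken from the signature above, and whether the SDK expects the dictionary or the serialized string in a given call is not specified here, so treat this as an assumption to check against get_report_params().

```python
import json

# Selected values keyed by parameter name (placeholder names for illustration).
report_values = {"param_1": "value_1"}

# The selected values are nested under the "report_values" key.
params = {"report_values": report_values}

# Serialized form, in case a JSON string is required.
params_json = json.dumps(params)
print(params_json)
```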

Limitations

Blind Report is not supported for file-based assets like CSVs or Amazon S3.
Blind Report is not supported for MongoDB assets.
This operation does not permit the use of sql_transform preprocessors by the data user.

k-Grouping

Operations that return data (e.g., Blind Query, Blind Join, and Blind Stats) usually have embedded k-Grouping safeguards that reduce the risk of leaking data when fewer than a threshold number of records make up a group or the total output of an operation. Unlike these operations, Blind Report is not automatically protected by k-Grouping, as it is fully defined by the data owner.

As a best practice, we encourage using a SQL HAVING clause to build a deliberate k-Grouping safeguard into the parameterized query in your Blind Report. For instance, the query in the example script (examples/Blind_Report/1_position_bigquery_report.py) is:

query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}};
"""

This can be modified to respect a k-Grouping safeguard by adding a clause that returns only groups with at least a certain number of records:

query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}}
HAVING COUNT({{demographic}}) >= 5;
"""

With this clause, you ensure that each returned group contains at least 5 members, so the report is less likely to inadvertently let a malicious actor discern potentially personally identifiable information from its contents (e.g., by returning the average salary of a single individual).
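
The effect of the HAVING clause can be tried locally. The sketch below uses Python's built-in sqlite3 module with a small made-up payroll table (the table, columns, and values are invented for illustration and are not the example dataset). It uses COUNT(*) rather than COUNT of a column so that rows with a NULL demographic value are still counted; that choice is an assumption, not a TripleBlind requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (dept_name TEXT, gender TEXT, pay REAL)")

# Six records in one group, a single record in another.
rows = [("Police", "F", 50000 + i * 1000) for i in range(6)]
rows.append(("Library", "M", 40000))
conn.executemany("INSERT INTO payroll VALUES (?, ?, ?)", rows)

K = 5  # minimum group size to report

unprotected = conn.execute(
    "SELECT dept_name, gender, AVG(pay) FROM payroll "
    "GROUP BY dept_name, gender ORDER BY dept_name"
).fetchall()

protected = conn.execute(
    "SELECT dept_name, gender, AVG(pay) FROM payroll "
    "GROUP BY dept_name, gender HAVING COUNT(*) >= ? ORDER BY dept_name",
    (K,),
).fetchall()

print(unprotected)  # includes the single-person Library group
print(protected)    # the Library group is suppressed
```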

Federated Blind Report

A Federated Blind Report allows a report to be created from a group of data providers, not just a single data source. The full details of the data can be utilized to produce aggregate values across all the data. Aggregation is enforced, ensuring that privacy is maintained for even the most sensitive of data.

A Federated Blind Report looks much like a simple Blind Report to the consumer. One party acts as the Aggregator and the others are simple Data Providers. The Aggregator creates and manages the actual reports, and each Data Provider has full control over which reports they wish to participate in. Aggregators can also provide data or can simply act as the manager for all of the Data Providers.

Federated Blind Reports are a novel tool that requires multiple steps to execute correctly, which can be overwhelming. This page gives a broad overview of all the steps. For a detailed walkthrough of the process, see the SDK’s demos/Hospital_Data_Federation folder, where you can follow a working code example while reading this page.

Using Multiple Data Sources

The creator of a Blind Report will write one query which is run against multiple data sources to generate the report. Let’s show a visual example of this first, and then explain the steps a Blind Report creator would take.

Creating source assets

Each source Asset in the diagram is a standard TripleBlind Database Asset, such as an MSSQLDatabase or OracleDatabase.

These requirements are important:

The data asset must be a Database Asset, not a simple data asset like a CSV file or S3 Bucket Asset.
The data within each Data Provider must come from a single database, although it can come from different tables within that database.
The Database Assets created by the Data Providers must adhere to the standard views defined by the Aggregator.

Additionally, each Data Provider will need to create agreements with the Aggregator to ensure smooth performance when using a Federated Blind Report.

Defining the Query Template

The Aggregator will define a standard view for all the data they will need to create their reports. Each Data Provider implements this standard as a translation between their local definitions and the standard. For example, their database might have a field “gender” while the standard calls for a “patient” view with a field called “sex” instead. The Data Provider simply needs to include “gender AS sex” in the SQL of their DatabaseAsset definition to perform this translation.

Notice in the diagram above that Data Provider 1 created two different MSSQLDatabase assets on their Access Point. Data Provider 2 did the same to produce these same standard views for their database using OracleDatabase assets. Now the Aggregator can create the Federated Blind Report’s query_template which will operate against these views. The query_template of the Blind Report will reference these two tables as its only sources. For example, a report could use the following simplified query:

query_template = """
    SELECT p.sex, bv.cost
    FROM Patient p
    JOIN BilledVisit bv ON p.ID = bv.ID
    WHERE bv.encountercode = {{code}}
"""

When the report is run, TripleBlind expands the query template into a full query for each FederatedMember, something like this:

full_query = """
    WITH Patient AS (
        < view definition from the data provider's "Patient" asset >
    ), BilledVisit AS (
        < view definition from the data provider's "Billed Visit" asset >
    )
    SELECT p.sex, bv.cost
    FROM Patient p
    JOIN BilledVisit bv ON p.ID = bv.ID
    WHERE bv.encountercode = {{code}}
"""

Notice in the example above that there are no GROUP BY, COUNT or other aggregation techniques. This is because TripleBlind handles aggregation in a separate step to allow aggregation to capture detail from the entire cohort, as described in detail below.
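
To make the expansion concrete, here is a small local sketch using Python's built-in sqlite3 module. It is not TripleBlind's implementation: the expand_query helper, the local table names, and the sample rows are all hypothetical, and the {{code}} placeholder is replaced by a bound SQL parameter (?) for simplicity. Each provider view aliases local columns to the standard names, and the report query is appended after the generated WITH clause.

```python
import sqlite3


def expand_query(views: dict, report_query: str) -> str:
    """Wrap each provider view in a CTE and append the report query."""
    ctes = ",\n".join(
        f"{name} AS (\n{sql.strip()}\n)" for name, sql in views.items()
    )
    return f"WITH {ctes}\n{report_query.strip()}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE local_patients (patient_no INTEGER, gender TEXT);
CREATE TABLE local_bills (patient_no INTEGER, code TEXT, amount REAL);
INSERT INTO local_patients VALUES (1, 'F'), (2, 'M');
INSERT INTO local_bills VALUES (1, 'A10', 120.0), (2, 'B20', 75.5);
""")

# Each view translates local column names into the standard ones.
provider_views = {
    "Patient": "SELECT patient_no AS ID, gender AS sex FROM local_patients",
    "BilledVisit": "SELECT patient_no AS ID, code AS encountercode, "
                   "amount AS cost FROM local_bills",
}

report_query = """
SELECT p.sex, bv.cost
FROM Patient p
JOIN BilledVisit bv ON p.ID = bv.ID
WHERE bv.encountercode = ?
"""

full_query = expand_query(provider_views, report_query)
rows = conn.execute(full_query, ("A10",)).fetchall()
print(full_query)
print(rows)
```

Because the report query only ever sees the Patient and BilledVisit names, a second provider with entirely different local tables could reuse it unchanged by supplying its own view definitions.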

Accounting for Database Variants

Typically, it is possible to write one common SQL statement that will run against all participant databases. However, there are situations where SQL implementations differ and specific code needs to be used for a particular SQL implementation. For example, the DATEDIFF operation in Microsoft SQL is equivalent to DATE_PART in Postgres and also has a different set of parameters in Oracle. TripleBlind defines a DIALECT_XXX value at execution time (see the list here) to allow you to write queries that account for these differences within your templates. Here is an example query that handles these DATEDIFF/DATE_PART variations:

query_template = """
    SELECT p.sex, bv.cost,
    {{#DIALECT_MSSQL}}
        DATEDIFF(Day, bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
    {{/DIALECT_MSSQL}}
    {{#DIALECT_POSTGRESS}}
        DATE_PART('day', bv.DischargeDate - bv.AdmitDate) AS LengthOfStay
    {{/DIALECT_POSTGRESS}}
    {{#DIALECT_ORACLE}}
        DATEDIFF('DD', bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
    {{/DIALECT_ORACLE}}
    FROM Patient p
    JOIN BilledVisit bv ON p.ID = bv.ID
    WHERE bv.encountercode = {{code}}
"""
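
The conditional sections follow mustache semantics: a {{#FLAG}} section is kept when the flag is set, and an inverted {{^FLAG}} section is kept when it is not. The sketch below is a deliberately tiny, regex-based stand-in for a real mustache renderer, written only to show which parts of a template survive for a given dialect. The render_sections helper is hypothetical, handles only non-nested sections, and leaves ordinary {{name}} placeholders untouched.

```python
import re

# {{#NAME}}...{{/NAME}} (normal) or {{^NAME}}...{{/NAME}} (inverted), non-nested.
SECTION = re.compile(r"\{\{([#^])(\w+)\}\}(.*?)\{\{/\2\}\}", re.DOTALL)


def render_sections(template: str, flags: dict) -> str:
    """Keep or drop each conditional section according to the given flags."""
    def replace(match):
        kind, name, body = match.groups()
        enabled = bool(flags.get(name, False))
        keep = enabled if kind == "#" else not enabled
        return body if keep else ""
    return SECTION.sub(replace, template)


template = """
SELECT p.sex, bv.cost,
{{#DIALECT_MSSQL}}DATEDIFF(Day, bv.AdmitDate, bv.DischargeDate){{/DIALECT_MSSQL}}
{{^DIALECT_MSSQL}}DATE_PART('day', bv.DischargeDate - bv.AdmitDate){{/DIALECT_MSSQL}}
AS LengthOfStay
FROM Patient p
JOIN BilledVisit bv ON p.ID = bv.ID
WHERE bv.encountercode = {{code}}
"""

mssql_sql = render_sections(template, {"DIALECT_MSSQL": True})
other_sql = render_sections(template, {"DIALECT_MSSQL": False})
print(mssql_sql)
print(other_sql)
```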

Elasticsearch in Federated Reports

Federated Reports now allow for Elasticsearch-backed Federation Members. Each Federation Member must have either a set of SQL-type Assets or a single Elasticsearch Asset, but a single report may combine members that use SQL-type Assets with members that use Elasticsearch.

To create such a hybrid report, use the DIALECT_ELASTIC and DIALECT_XXX mustache tags to create two separate parts of the template.

For a SQL-type Federation Member, the final query is composed of the individual Assets wrapped in WITH clauses, followed by the Report Query Template. For an Elasticsearch Federation Member, the entire query body is replaced with the Report Query Template.

A typical Elasticsearch query returns two output parts, hits and aggregations. When creating an ElasticsearchDataset, the Data Owner can control how these outputs will be distributed if the Asset is used in a Federated Report using the ElasticReturnType controls. By default, only aggregations will be returned to the Report Initiator while hits will be returned to the Data Owner. Use these settings:

elastic_asset = tb.asset.ElasticsearchDataset.create(
    ...,
    return_type=ElasticReturnType.AGG_JSON,  # default
    store_type=ElasticReturnType.FULL_JSON,  # default
)

When a Federated Report is run, the Data Owner receives a new Asset on their Access Point containing the manifest and, if configured, hits files. The name of the Asset will contain the job_id of the process that created it. This allows private data to be identified by the report user, but not exposed to anyone but the Data Owner. Owners and report users can later coordinate to further collaborate with these distributed data cohorts.

Aggregation is also more complicated for Elasticsearch-backed Federated Reports, as the results are returned in JSON format. To process them, TripleBlind requires the Report Creator to provide an additional json_to_dataframe_script when setting up the report. A simple example script may look like this:

json_post_processing_script = """
def postprocess(input, ctx=None):
    import pandas as pd
    count = input["total_count"]["value"]
    df = pd.DataFrame([count])
    return df
"""

As with other reports, the results will be concatenated and aggregated from each data source (see details below). Since reports containing only Elasticsearch outputs return only aggregations by default, the aggregation step may be skipped with:

agg_template = tb.report_asset.AggregationRules.create(in_query=True)

Federation Members and Groups

The report creator is also responsible for bringing together the appropriate Assets from each data provider and structuring the Blind Report input. The two tools for this purpose are the FederationMember and FederationGroup collections. They are created like this:

member_1 = tb.FederationMember.create(
    name="Hope Valley Hospital",  # Data Provider 1, org # 123
    assets={
        "Patient": tb.Asset.find("Hospital A - Patient", owned_by=123),
        "BilledVisit": tb.Asset.find("Hospital A - Billed Visit", owned_by=123),
    },
)
member_2 = tb.FederationMember.create(
    name="Black Hill Hospital",  # Data Provider 2, org # 456
    assets={
        "Patient": tb.Asset.find("Hospital B - Patient", owned_by=456),
        "BilledVisit": tb.Asset.find("Hospital B - Billed Visit", owned_by=456),
    },
)

tb.FederationGroup.create(
    "Demo Federation Group",
    members=[member_1, member_2],
)

The dictionary of assets connects the various Data Providers’ unique assets into a cohort which can be referenced when creating a Blind Report via the FederationGroup. This group definition can also be updated later, so new members can be added without altering any reports which have already been created.

Aggregation

TripleBlind does not allow any non-aggregated data as output of a Federated Blind Report. Two pieces of the Blind Report definition let the report creator define the aggregation steps. The first is the type of aggregation to perform, such as reporting simple counts or calculating a mean. The second is an optional grouping for these statistics. You can also specify an ordering direction.

When the various data are brought together, TripleBlind will first concatenate the rows from each source (this is why a common data structure is important), then apply any aggregates rules, then group_by rules and finally enforce the ordering. Because different databases might or might not enforce capitalization of column names, but pandas is case-sensitive, it is highly advised to use lower-case aliases all throughout your query_template definition.

See an example of the AggregationRules definition:

agg_template = tb.report_asset.AggregationRules.create(
   group_by=["VisitType"],
   aggregates={"LengthOfStay": "mean", "ICD10CM": "count"},
   sort_order="asc" # or "desc"
)
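
The concatenate, aggregate, group, and order sequence can be approximated locally with pandas. This is only a sketch of the described behavior, not TripleBlind's internal code: the per-provider rows are made up, the column names are lower-cased as recommended above, and sorting by the group_by column is an assumption about what sort_order applies to.

```python
import pandas as pd

# Rows as each provider's query might return them (invented sample data).
provider_1 = pd.DataFrame({
    "visittype": ["inpatient", "inpatient", "outpatient"],
    "lengthofstay": [4, 6, 1],
    "icd10cm": ["E11", "E11", "E11"],
})
provider_2 = pd.DataFrame({
    "visittype": ["inpatient", "outpatient"],
    "lengthofstay": [2, 1],
    "icd10cm": ["E11", "E11"],
})

# 1. Concatenate rows from all sources (requires a common column structure).
combined = pd.concat([provider_1, provider_2], ignore_index=True)

# 2-4. Group, aggregate, and order, mirroring group_by / aggregates / sort_order.
report = (
    combined.groupby("visittype")
    .agg({"lengthofstay": "mean", "icd10cm": "count"})
    .reset_index()
    .sort_values("visittype", ascending=True)
)
print(report)
```

Because the mean is computed after concatenation, it reflects every record in the cohort rather than an average of per-provider averages.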

Note: Elasticsearch does not return a typical table that could easily be aggregated with SQL-like sources, so the Report Creator must provide an additional conversion mechanism in the form of the json_to_dataframe_script parameter. Read more about this in the Elasticsearch section above.

K-Grouping

K-grouping is used in the Federated Blind Report in a specific way. Namely, it is assumed that if more than one Data Provider is selected, the obscurity of the data source protects small numbers of records that would otherwise be rejected by the k-grouping mechanism. Specifically, k-grouping works as follows:

If more than one data provider is selected by the end user, k-grouping is automatically set to 1.
If a single data provider is selected, the highest k-grouping among the settings of all the Assets of that provider will apply:
- If no group_by columns are provided, the report will return an empty result if the total number of records is lower than k.
- If any group_by columns are provided, the report will filter out all groups with a total number of records lower than k, and return all other groups.
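
The rules above can be sketched in plain Python as follows. This is an illustrative model only: the function names effective_k and apply_k, the input shapes, and the sample values are hypothetical, and the sketch applies k to record counts exactly as the rules are worded.

```python
from collections import Counter


def effective_k(selected: dict) -> int:
    """selected maps each chosen provider to the k values set on its Assets."""
    if len(selected) > 1:
        return 1  # multiple providers: k-grouping is set to 1
    (asset_ks,) = selected.values()  # exactly one provider selected
    return max(asset_ks)  # the highest k among that provider's Assets applies


def apply_k(records: list, group_by: list, k: int) -> list:
    """Drop everything (no grouping) or only the undersized groups."""
    if not group_by:
        return records if len(records) >= k else []

    def key(record):
        return tuple(record[column] for column in group_by)

    sizes = Counter(key(record) for record in records)
    return [record for record in records if sizes[key(record)] >= k]


records = [{"visittype": "inpatient"}] * 5 + [{"visittype": "outpatient"}] * 2

single_k = effective_k({"Hospital A": [3, 5]})
multi_k = effective_k({"Hospital A": [3, 5], "Hospital B": [10]})
kept = apply_k(records, ["visittype"], single_k)
print(single_k, multi_k, len(kept))
```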

Customizing Report Output

After aggregation has been performed, you can optionally define a Python postprocessing function to customize the output. A postprocessing function will receive as input a pandas dataframe containing the calculated aggregate values, and a context dictionary containing information about the options selected when running this report.

def postprocess(input: "pd.DataFrame", ctx: dict):
   # Report the magnitude of the length of stay
   input['LengthOfStay'] = input['LengthOfStay'].abs()
   # Give the aggregate columns reader-friendly names
   input.rename(columns={'LengthOfStay': 'Mean Length Of Stay (Days)'}, inplace=True)
   input.rename(columns={'ICD10CM': 'Number of Patients'}, inplace=True)
   return input

The ctx dictionary carries details about who ran the report and which parameters they selected:

{
    "name": package.meta.record.name,                   # str
    "description": package.meta.record.description,     # str
    "initiator_details": job_params.initiator_details,  # Dict[str, str]
    "attributes": {
        "report_values": display_params,        # Dict[str, str]
        "raw_values": raw_params,               # Dict[str, str]
        "federation_members": fed_members,      # List[str] (only for federated reports)
    },
}

If no postprocessing is provided, the output of the blind report is simply the aggregation dataframe.

Manifest file

So that the report creator does not always have to write a supplemental postprocessing script just to display the ctx information, TripleBlind always provides a manifest.html file as part of the report output zip archive. This should be sufficient for the standard need to provide a report output to an end user alongside basic information about how it was obtained. The manifest cannot currently be further edited by the Report Creator. Here is an example of a manifest.html file:

Creating the Report

The report writer will bring all these parts together when they use DatabaseReport.create(). This looks very much like creating a simple Blind Report, but with the addition of two parameters: federation_group and federation_aggregation.

query_template = """
SELECT
   bv.VisitType,
   bv.BillingCode AS ICD10CM,
   DATEDIFF(Day, bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
FROM
   Patient pv
JOIN
   BilledVisit bv ON pv.PatientID = bv.PatientID
WHERE
   {{ICDcode}}  -- expands to bv.BillingCode=<selected code 1>
"""

agg_template = tb.report_asset.AggregationRules.create(
   group_by=["VisitType"],
   aggregates={"LengthOfStay": "mean", "ICD10CM": "count"},
   sort_order="asc" # or "desc"
)

post_processing_script = """
def postprocess(input, ctx):
   input['LengthOfStay'] = input['LengthOfStay'].abs()
   input.rename(columns={'LengthOfStay': 'Mean Length Of Stay (Days)'}, inplace=True)
   input.rename(columns={'ICD10CM': 'Number of Patients'}, inplace=True)
   return input
"""

# Define a code parameter 
icd_code_param = tb.report_asset.ReportParameter.create_code(
   name="ICDcode",
   display="Filter on ICD code",
   description="Choose an ICD code to report on",
   systems=["icd9", "icd10"],
   comparison_column="bv.BillingCode",
)

# Define the Federated Blind Report
blind_report = tb.report_asset.DatabaseReport.create(
   name="Readmission rates by diagnosis code",
   query_template=query_template,
   federation_group=tb.FederationGroup.find(name="Demo Federation Group"),
   federation_aggregation=agg_template,
   post_processing=post_processing_script,
   params=[icd_code_param],
)

The given query_template will be invoked using the parameters selected by the user (from the form defined by params) for all of the selected members of the federation_group. After the federation_aggregation is applied, the post_processing_script will customize the output to be included in the final report asset. The final zip archive will also include the manifest.html file with information about the run parameters.

Agreements

TripleBlind’s permissioning system continues to enforce strict access controls for all data owners. Both the Aggregator and the Data Providers will need to define agreements to ensure seamless operation for the report users.

Data Providers must create an agreement with the Aggregator to make their standard view assets visible to the Aggregator.
After each report has been created, Data Providers must create an agreement with the Aggregator to allow the report to run against their views.
Aggregators will need to create agreements or make reports public in order for them to be visible to the report runners.

Tips and Gotchas

Some small things can be tricky when creating a Blind Report. Here are things to watch out for:

Database systems aren’t consistent in the capitalization of output column names; they may range from all-lowercase to as-requested. The safest approach is to add AS lowercase_alias to all columns in your final SELECT statement.

To single out one dialect that needs special handling, you can use a mustache conditional section together with its inverted form:

    {{#DIALECT_X}}
        -- some SQL specific to dialect X
    {{/DIALECT_X}}
    {{^DIALECT_X}}
        -- SQL for all dialects except dialect X
    {{/DIALECT_X}}

A Report Creator can, for example, let the user select a GROUP BY column from ["dbo.pv.Sex", "dbo.pv.Income"] as an optional parameter, and then rename it in the final dataframe according to the selection, e.g. "Gender" or "Socioeconomic Background". You can achieve this by using the "display" value on ParameterOption:

demographic_param = tb.report_asset.ReportParameter.create_string(
    name="demographic",
    display="Select a GROUP-BY demographic",
    options=[
        tb.report_asset.ParameterOption("dbo.pv.Sex", "Gender"),
        tb.report_asset.ParameterOption("dbo.pv.Income", "SocioEconomic Background"),
    ],
)

Then collect it from ctx in the postprocessing script:

post_processing_script = """
def postprocess(input, ctx):
    input.rename(columns={
        'demographic': ctx["attributes"]["report_values"]["Select a GROUP-BY demographic"]
    }, inplace=True)
    return input
"""

Mon Oct 14 2024 10:53:10 GMT-0400 (Eastern Daylight Time)