Blind Report
Blind Report allows you to position a database-backed query with predefined configurable parameters. Users can configure the query using these predefined options, have it run against your database, and receive a report table.
This is a powerful operation that allows the data steward to permit only specific, controlled access to the data they desire exposed to the consumer. For example, a report could be defined that allows a user to select a year and month along with a company division for generating salary statistics by ethnicity or gender for usage in compliance reporting.
Any number of variables and any complexity of queries are supported. See the examples/Blind_Report
for documentation and more information.
Blind Report is a Safe operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide).
Operation
- See the
ReportAsset
documentation for positioning methods. - When using
add_agreement()
to forge an agreement for a counterparty to use the Blind Report, use the positioned asset’s UUID for theoperation
parameter. - When using
create_job()
to run the report in a process, use the positioned asset for theoperation
parameter.
Parameters
Positioning parameters
Blind Reports are positioned using create
methods that accept connection details similar to their DatabaseDataset
counterparts. Additional parameters include:
query_template: str
- The query template uses {{brackets}} to identify which parameters will be exposed as configurable by a user.
params: List[ReportParameter]
ReportParameter
methods (create_string
,create_float
, &create_int
) should be used to generate the acceptable format.- Configurable options should be added using
ParameterOption
.
Report parameters
operation: ReportAsset
- When running the Blind Report, the Asset UUID of the positioned report should be supplied here as the algorithm to be run.
dataset: []
- This should be left blank when running a Blind Report, ie.
dataset=[]
.
params: Dict{"report_values": {"param_1": "value_1"}, …, {}}
- This is a JSON string supplying the desired parameters to be run within the Blind Report job.
- Use
get_report_params()
to understand the configurable parameters and their options.
Limitations
- Blind Report is not supported for file-based assets like CSVs or Amazon S3.
- Blind Report is not supported for MongoDB assets.
- This operation does not permit the use of
sql_transform
preprocessors by the data user.
k-Grouping
Operations that return data (eg. Blind Query, Blind Join, & Blind Stats) usually have embedded k-Grouping
safeguards that reduce the risk of leaking data when there are less than a specific threshold of records comprising a group or the total output of an operation. Unlike these operations, Blind Report is not protected by k
-Grouping in an automated way, as it is fully defined by the data owner.
ℹ️As a best practice, we encourage using a SQL 🔗HAVING clause to enact a purposeful k
-Grouping safeguard within the parameterized query in your Blind Report. For instance, the query in the example script (examples/Blind_Report/1_position_bigquery_report.py
) is:
query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}};
"""
This can be modified to respect a k
-Grouping safeguard by introducing a clause to only return groups with more than a certain amount of records:
query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}}
HAVING COUNT({{demographic}}) >= 5;
"""
With this clause, you ensure that each group contains at least 5 members, and the report is less likely to inadvertently provide information for a malicious actor to discern potentially personally-identifiable information from its contents (eg. returning the average salary of only a single individual).
Federated Blind Report
A Federated Blind Report allows a report to be created from a group of data providers, not just a single data source. The full details of the data can be utilized to produce aggregate values across all the data. Aggregation is enforced, ensuring that privacy is maintained for even the most sensitive of data.
A Federated Blind Report looks much like a simple Blind Report to the consumer. One party acts as the Aggregator and the others are simple Data Providers. The Aggregator creates and manages the actual reports, and each Data Provider has full control over which reports they wish to participate in. Aggregators can also provide data or can simply act as the manager for all of the Data Providers.
Federated Blind Reports are a novel tool that requires multiple steps to execute correctly, which can be overwhelming. This page is intended to give a broad overview of all the steps, for a detailed walkthrough of the process see the SDK’s demos/Hospital_Data_Federation
folder, where you can follow a working code example while reading this page.
Using Multiple Data Sources
The creator of a Blind Report will write one query which is run against multiple data sources to generate the report. Let’s show a visual example of this first, and then explain the steps a Blind Report creator would take.
Creating source assets
Each source Asset in the diagram is a standard TripleBlind Database Asset, such as an MSSQLDatabase
or OracleDatabase
.
These requirements are important:
- The data asset must be a Database Asset, not a simple data asset like a CSV file or S3 Bucket Asset.
- The data within each Data Provider must come from a single database, although it can come from different tables within that database.
- The Database Assets created by the Data Providers must adhere to the standard views defined by the Aggregator.
Additionally, each Data Provider will need to create agreements with the Aggregator to ensure smooth performance when using a Federated Blind Report.
Defining the Query Template
The Aggregator will define a standard view for all the data they will need to create their reports. This standard will be implemented by each Data Provider as a translation between their local definitions and the standard. For example, their database might have a field “gender” but the standard calls for a “patient” view which has a field called “sex” instead. The Data Provider will simply need to include “sex as gender” in their DatabaseAsset definition’s SQL in order to perform this translation.
Notice in the diagram above that Data Provider 1 created two different MSSQLDatabase
assets on their Access Point. Data Provider 2 did the same to produce these same standard views for their database using OracleDatabase
assets. Now the Aggregator can create the Federated Blind Report’s query_template which will operate against these views. The query_template of the Blind Report will reference these two tables as its only sources. For example, a report could use the following simplified query:
query_template = """
SELECT p.sex, bv.cost
FROM Patient p
JOIN BilledVisit bv ON p.ID = bv.ID
WHERE bv.encountercode = {{code}}
"""
When the report is run, TripleBlind expands the query template into a full query for each FederatedMember, something like this:
full_query = """
WITH Patient AS (
< view definition from the data provider's "Patient" asset >
), BilledVisit AS (
< view definition from the data provider's "Billed Visit" asset >
)
SELECT p.sex, bv.cost
FROM Patient p
JOIN BilledVisit bv ON p.ID = bv.ID
WHERE bv.encountercode = {{code}}
"""
Notice in the example above that there are no GROUP BY
, COUNT
or other aggregation techniques. This is because TripleBlind handles aggregation in a separate step to allow aggregation to capture detail from the entire cohort, as described in detail below.
Accounting for Database Variants
Typically it is possible to write one common SQL statement that will run against participant databases. However there are situations where SQL implementations differ and specific code needs to be used for a particular SQL implementation. For example the DATEDIFF
operation in Microsoft SQL is equivalent to DATE_PART
in Postgres and also has a different set of parameters in Oracle. TripleBlind defines a DIALECT_XXX
value at execution time (see the list here) to allow you to write queries which account for these differences within your templates. Here is an example query which deals with these DATEDIFF
/DATE_PART
variations:
query_template = """
SELECT p.sex, bv.cost,
{{#DIALECT_MSSQL}}
DATEDIFF(Day, bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
{{/DIALECT_MSSQL}}
{{#DIALECT_POSTGRESS}}
DATE_PART('day', bv.DischargeDate - bv.AdmitDate) AS LengthOfStay
{{/DIALECT_POSTGRESS}}
{{#DIALECT_ORACLE}}
DATEDIFF('DD', bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
{{/DIALECT_ORACLE}}
FROM Patient p
JOIN BilledVisit bv ON p.ID = bv.ID
WHERE bv.encountercode = {{code}}
"""
Elasticsearch in Federated Reports
Federated Reports now allow for Elasticsearch-backed Federation Members. Each Federation Member must either have a set of SQL-type Assets or a single Elasticsearch Asset, but there may be a single report for which some members use SQL-type Assets and others use Elasticsearch.
To create such a hybrid report, use the DIALECT_ELASTIC
and DIALECT_XXX
mustache tags to create two separate parts of the template.
While for an SQL-type Federation Member, the final query is composed of the individual Assets within WITH
s and the Report Query Template appended, for an Elasticsearch Federation Member the entire query body is replaced with the Report Query Template.
A typical Elasticsearch query returns two output parts, hits
and aggregations
. When creating an ElasticsearchDataset
, the Data Owner can control how these outputs will be distributed if the Asset is used in a Federated Report using the ElasticReturnType
controls. By default, only aggregations
will be returned to the Report Initiator while hits
will be returned to the Data Owner. Use these settings:
elastic_asset = tb.asset.ElasticsearchDataset.create( ..., return_type=ElasticReturnType.AGG_JSON, # default store_type=ElasticReturnType.FULL_JSON, # default )
When a Federated Report is run the Data Owner receives a new Asset on their Access Point containing the manifest
and, if configured, hits
files. The name of the Asset will contain the job_id
of the process that created it. This allows private data to be identified by the report user, but not exposed to anyone but the Data Owner. Owners and report users can later coordinate to further collaborate with these distributed data cohorts.
Aggregation is also more complicated for Elasticsearch-backed Federated Reports, as the return is in a json format. To be able to process it, TripleBlind requires the Report Creator to provide an additional json_to_dataframe_script
when setting up the report. A simple example script may look like this:
json_post_processing_script = """ def postprocess(input, ctx=None): import pandas as pd count = input["total_count"]["value"] df = pd.DataFrame([count]) return df """
As with other reports, the results will be concatenated and aggregated from each data source (see details below). Since reports containing only Elasticsearch outputs return only aggregations by default, the aggregation step may be skipped with
agg_template = tb.report_asset.AggregationRules.create(in_query=True)
Federation Members and Groups
The report creator is also responsible for bringing together the appropriate Assets from each data provider and structuring the Blind Report input. The two tools for this purpose are the FederationMember and FederationGroup collections. They are created like this:
member_1 = tb.FederationMember.create(
name="Hope Valley Hospital", # Data Provider 1, org # 123
assets={
"Patient": tb.Asset.find("Hospital A - Patient", owned_by=123),
"BilledVisit": tb.Asset.find("Hospital A - Billed Visit", owned_by=123),
},
)
member_2 = tb.FederationMember.create(
name="Black Hill Hospital", # Data Provider 2, org # 456
assets={
"Patient": tb.Asset.find("Hospital B - Patient", owned_by=456),
"BilledVisit": tb.Asset.find("Hospital B - Billed Visit", owned_by=456),
},
)
tb.FederationGroup.create(
"Demo Federation Group",
members=[member_1, member_2],
)
The dictionary of assets connects the various Data Providers’ unique assets into a cohort which can be referenced when creating a Blind Report via the FederationGroup. This group definition can also be updated later, so new members can be added without altering any reports which have already been created.
Aggregation
TripleBlind does not allow any non-aggregated data as output of a Federated Blind Report. There are two pieces in the Blind Report definition that help the report creator to define the aggregation steps. First is the type of aggregation to perform -- things like reporting simple counts or calculating a mean. Second is applying an optional grouping for these statistics. Finally, you can also specify an ordering direction.
When the various data are brought together, TripleBlind will first concatenate the rows from each source (this is why a common data structure is important), then apply any aggregates
rules, then group_by
rules and finally enforce the ordering. Because different databases might or might not enforce capitalization of column names, but pandas
is case-sensitive, it is highly advised to use lower-case aliases all throughout your query_template
definition.
See an example of the AggregationRules
definition:
agg_template = tb.report_asset.AggregationRules.create(
group_by=["VisitType"],
aggregates={"LengthOfStay": "mean", "ICD10CM": "count"},
sort_order="asc" # or "desc"
)
Note: Elasticsearch does not return a typical table that could easily be aggregated with SQL-like sources, so the Report Creator must provide an additional conversion mechanism in the form of the json_to_dataframe_script parameter. Read more about this in the Elasticsearch section above.
K-Grouping
K-grouping is used in the Federated Blind Report in a specific way. Namely, it is assumed that if more than one Data Provider is selected, the obscurity of the data source protects small numbers of records that would otherwise be rejected by the k-grouping mechanism. Specifically, k-grouping works as follows:
- If more than one data provider is selected by the end user, k-grouping is automatically set to 1.
- If a single data provider is selected, the highest k-grouping among the settings of all the Assets of that provider will apply:
-
- If no group_by columns are provided, the report will return an empty result if the total number of records is lower than k.
- If any group_by columns are provided, the report will filter out all groups with a total number of records lower than k, and return all other groups.
Customizing Report Output
After aggregation has been performed, you can optionally define a Python postprocessing function to customize the output. A postprocessing function will receive as input a pandas dataframe containing the calculated aggregate values, and a context dictionary containing information about the options selected when running this report.
def postprocess(input: "pd.Dataframe", ctx: dict):
input['LengthOfStay'] = input['LengthOfStay'].abs()
input.rename(columns={'LengthOfStay': 'Mean Length Of Stay (Days)'}, inplace=True)
input.rename(columns={'ICD10CM': Number of Patients}, inplace=True)
return input
The ctx
dictionary has available to it a variety of information, namely details about who ran the report and what parameters they selected:
"name": package.meta.record.name,
"description": package.meta.record.description, # str
"initiator_details": job_params.initiator_details, # Dict[str, str]
"attributes": {
"report_values": display_params, # Dict[str, str]
"raw_values": raw_params, # Dict[str, str]
"federation_members": fed_members, # List[str] (only for federated reports)
If no postprocessing is provided, the output of the blind report is simply the aggregation dataframe.
Manifest file
To simplify the work of the report creator having to always write a supplemental postprocessing script just to display the ctx
information, TripleBlind always provides a manifest.html
file as part of the report output zip archive. This should be sufficient for the standard need to provide a report output to an end user alongside basic information about how it was obtained. The manifest cannot currently be further edited by the Report Creator. Here is an example of a manifest.html
file:
Creating the Report
The report writer will bring all these parts together when they use DatabaseReport.create()
. This looks very much like creating a simple Blind Report, but with the addition of two parameters: federation_group
and federation_aggregation
.
query_template = """
SELECT
bv.VisitType,
bv.BillingCode AS ICD10CM,
DATEDIFF(Day, bv.DischargeDate, bv.AdmitDate) AS LengthOfStay
FROM
Patient pv
JOIN
BilledVisit bv ON pv.PatientID = bv.PatientID
WHERE
{{ICDcode}} -- expands to bv.BillingCode=<selected code 1>
"""
agg_template = tb.report_asset.AggregationRules.create(
group_by=["VisitType"],
aggregates={"LengthOfStay": "mean", "ICD10CM": "count"},
sort_order="asc" # or "desc"
)
post_processing_script = """
def postprocess(input,ctx):
input['LengthOfStay'] = input['LengthOfStay'].abs()
input.rename(columns={'LengthOfStay': 'Mean Length Of Stay (Days)'}, inplace=True)
input.rename(columns={'ICD10CM': Number of Patients}, inplace=True)
return input
"""
# Define a code parameter
icd_code_param = tb.report_asset.ReportParameter.create_code(
name="ICDcode",
display="Filter on ICD code",
description="Choose an ICD code to report on",
systems=["icd9", "icd10"],
comparison_column="bv.BillingCode",
)
# Define the Federated Blind Report
blind_report = tb.report_asset.DatabaseReport.create(
name="Readmission rates by diagnosis code",
query_template=query_template,
federation_group=tb.FederationGroup.find(name="Demo Federation Group"),
federation_aggregation=agg_template,
post_processing=post_processing_script,
params=[icd_code_param],
)
The given query_template
will be invoked using the parameters selected by the user (from the form defined by params
) for all of the selected members of the federation_group
. After the federation_aggregation
is applied, the post_processing_script
will customize the output to be included in the final report asset. The final zip archive will also include the manifest.html
file with information about the run parameters.
Agreements
TripleBlind’s permissioning system continues to enforce strict access controls for all data owners. Both the Aggregator and the Data Providers will need to define agreements to create seamless operation for the report users.
- Data Providers must create an agreement with the Aggregator to make their standard view assets visible to the Aggregator.
- After each report has been created, Data Providers must create an agreement with the Aggregator to allow the report to run against their views.
- Aggregators will need to create agreements or make reports public in order for them to be visible to the report runners.
Tips and Gotchas
Some small things can be tricky when creating a Blind Report. Here are things to watch out for:
- Database systems aren’t consistent in the capitalization of output column names, they may range from all-lowercase to as-requested. The safest approach is to add
AS lowercase_alias
to all columns in your final SELECT statement. - To single out one weird dialect, you can use the mustache templating language conditional:
{{#DIALECT_X}}
-- some SQL specific to dialect X
{{/DIALECT_X}}
{{^DIALECT_X}}
-- SQL for all dialects except dialect X
{{/DIALECT_X}}- A Report Creator can, for example, let the user select a GROUP BY column from ["dbo.pv.Sex", "dbo.pv.Income"] as an optional parameter, but then rename it in the final dataframe to conditionally on the selection, e.g. "Gender" or "Socioeconomic Background". The way you can achieve this is by using the "display" value on ParameterOption:
demographic_param = tb.report_asset.ReportParameter.create_string(
name="demographic",
display="Select a GROUP-BY demographic",
options=[
tb.report_asset.ParameterOption("dbo.pv.Sex", "Gender"),
tb.report_asset.ParameterOption("dbo.pv.Income", "SocioEconomic Background"),
],
)And then collecting it from
ctx
in the postprocessing script:post_processing_script = """
def postprocess(input, ctx):
input.rename(columns={
'demographic': ctx["attributes"]["report_values"]["Select a GROUP-BY demographic"]
}, inplace=True)
return input
"""Mon Oct 14 2024 10:53:10 GMT-0400 (Eastern Daylight Time)
- A Report Creator can, for example, let the user select a GROUP BY column from ["dbo.pv.Sex", "dbo.pv.Income"] as an optional parameter, but then rename it in the final dataframe to conditionally on the selection, e.g. "Gender" or "Socioeconomic Background". The way you can achieve this is by using the "display" value on ParameterOption: