Private Set Intersection (PSI) Tasks

What is Private Set Intersection?

Private set intersection (PSI) is a privacy-preserving technique that falls under the umbrella of secure multi-party computation (SMPC) technologies. These technologies use cryptographic techniques to let multiple parties perform operations on disparate datasets without revealing the underlying data to any party.

PSI itself allows you to determine the overlapping records in two disparate datasets without providing access to the underlying raw data of either dataset. The overlapping records form the intersection of the two sets, which gives the technique its name. In the Bitfount context, a Data Scientist can perform a PSI across a local dataset and a specified Pod, and the task returns the matching records from the local dataset.

Why run a PSI?

There are many common use cases for running PSI tasks:

  1. You wish to understand the overlap between your data and a prospective partner's data before partnering without putting either party's data at risk.
  2. Two entities are exploring a merger or acquisition scenario and wish to understand potential data synergies prior to executing the deal.
  3. You wish to evaluate whether a dataset will provide sufficient new records to be additive to your model's power vs. duplicative of existing training data.

And many more!

Running PSI Tasks

Bitfount does not currently enable Data Custodians to restrict usage of a given Pod to PSI only. This means Data Scientists will not encounter the PrivateSetIntersection protocol as a requirement; instead, they may choose to run PSI tasks to understand the overlap of two datasets. To do so, the Data Scientist needs Super Modeller or General Modeller permissions on the Pod. The PrivateSetIntersection protocol is compatible with the ComputeIntersectionRSA algorithm.
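To give a feel for how an RSA-based intersection can work, here is a minimal sketch of the textbook blind-signature approach to PSI. It is an illustration only and is not taken from Bitfount's ComputeIntersectionRSA implementation; every function name below is hypothetical, both parties are simulated in one process, and the RSA key is a toy built from two small known primes that would be insecure in practice.

```python
import hashlib
import math
import secrets

# Toy RSA key from two known Mersenne primes. Far too small for real use.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, known only to the remote side


def hash_to_int(value):
    """Map a record to an integer modulo N."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest, "big") % N


def tag(signature):
    """Hash a signature so the raw signature is never compared directly."""
    return hashlib.sha256(str(signature).encode()).hexdigest()


def remote_publish(remote_records):
    """Remote side: sign each of its own records and publish only the tags."""
    return {tag(pow(hash_to_int(r), D, N)) for r in remote_records}


def local_blind(record):
    """Local side: hide the hashed record behind a random factor r."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    return hash_to_int(record) * pow(r, E, N) % N, r


def remote_sign(blinded):
    """Remote side: sign a blinded value without learning the record."""
    return pow(blinded, D, N)


def local_unblind(signed, r):
    """Local side: strip the random factor to recover the record's signature."""
    return signed * pow(r, -1, N) % N


def toy_psi(local_records, remote_records):
    """Return the local records that also appear in the remote set."""
    remote_tags = remote_publish(remote_records)
    matches = []
    for record in local_records:
        blinded, r = local_blind(record)
        signature = local_unblind(remote_sign(blinded), r)
        if tag(signature) in remote_tags:
            matches.append(record)
    return matches


print(toy_psi([1, 3, 5, 7, 9, 11], [3, 4, 5, 6, 7]))  # [3, 5, 7]
```

In this sketch the local side only ever sees hashed signatures of the remote records, and the remote side only ever sees blinded values, yet the local side still learns which of its own records match. That mirrors the behaviour described above, where the matching records are returned from the local dataset.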

Bitfount's current implementation of PSI enables you to compute the overlap of records between one Pod and your local dataset and returns the matching records from the local dataset for comparison. This means that before running a PSI task, you need to:

  1. Check that you have access to the Pod you wish to include in the overlap calculation.
  2. Check that you can specify a local datasource for comparison.
  3. Determine which set of columns needs to be compared across datasources. Note that your local datasource must match the column names and column order of the Pod you are comparing it to for the PSI to compute properly.
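The column check in step 3 can be done locally before any task is submitted. The helper below is a small sketch, not part of the Bitfount API: column_mismatches is a hypothetical name, and it assumes you already know the names and order of the Pod columns you intend to compare against.

```python
def column_mismatches(local_columns, pod_columns):
    """Return a list of problems; an empty list means names and order match."""
    local_columns, pod_columns = list(local_columns), list(pod_columns)
    problems = []
    # Columns the Pod comparison expects but the local data lacks
    missing = [c for c in pod_columns if c not in local_columns]
    # Columns present locally that are not part of the comparison
    extra = [c for c in local_columns if c not in pod_columns]
    if missing:
        problems.append(f"missing locally: {missing}")
    if extra:
        problems.append(f"not in Pod selection: {extra}")
    # Same names on both sides, but in a different order
    if not missing and not extra and local_columns != pod_columns:
        problems.append(f"order differs: expected {pod_columns}, got {local_columns}")
    return problems


print(column_mismatches(["col2", "col1"], ["col1", "col2"]))
```

With a pandas DataFrame named df, you could pass list(df.columns) as local_columns and, if only the order differs, reorder with df[pod_columns] before building the datasource.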

Once you are ready to perform the PSI task, you can run it!

Example configuration for a local dataframe source:

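# Import paths below are assumed; check them against your installed bitfount version.
import time

import pandas as pd
from bitfount import ComputeIntersectionRSA, DataFrameSource

# Local records to compare against the Pod
d = {'col1': [1, 3, 5, 7, 9, 11]}

pod_identifier = "psi-demo"
algorithm = ComputeIntersectionRSA()

# Time the PSI run
start = time.time()
res = algorithm.execute(
    datasource=DataFrameSource(pd.DataFrame(d)),
    pod_identifiers=[pod_identifier],
    identity_verification_method="oidc-auth-code",
)
end = time.time()
# Number of local records and elapsed time in seconds
print(len(d['col1']), end - start)
# Matching records from the local dataset
print(res)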

Running time of PSI tasks

Private Set Intersection requires a significant amount of computation. The running time scales linearly with both the size of the query (your local dataset) and the size of the database being queried (the Pod's dataset). When running PSI, we recommend starting with a small number of entries to understand how the running time scales before executing larger queries.
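One way to act on this advice is to time a few small runs and extrapolate. The sketch below assumes the linear scaling described above holds and that the Pod's dataset stays fixed while you vary only the number of local records. The helper names are hypothetical, and run stands for any callable you write that executes a PSI task on n local records.

```python
import time


def time_runs(run, sizes):
    """Call run(n) for each n in sizes and return the elapsed seconds per call."""
    timings = []
    for n in sizes:
        start = time.perf_counter()
        run(n)
        timings.append(time.perf_counter() - start)
    return timings


def fit_line(sizes, seconds):
    """Ordinary least-squares fit: seconds ~ slope * size + intercept."""
    k = len(sizes)
    mean_x = sum(sizes) / k
    mean_y = sum(seconds) / k
    var_x = sum((x - mean_x) ** 2 for x in sizes)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, seconds))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def extrapolate(sizes, seconds, target_size):
    """Predict the running time at target_size from the fitted line."""
    slope, intercept = fit_line(sizes, seconds)
    return slope * target_size + intercept
```

For example, after collecting seconds = time_runs(run, [100, 200, 400]), calling extrapolate([100, 200, 400], seconds, 10_000) gives a rough estimate for a 10,000-record query. Treat the result as indicative only: network latency and fixed setup costs end up in the intercept, and the estimate does not account for a change in the size of the Pod's dataset.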

Next Steps

You did it! For more detailed illustrations of the Bitfount product suite, feel free to check out our tutorials.