Abstract: Achieving Both Privacy Protection and High Quality Record Linkage By Controlling Disclosure (Society for Prevention Research 26th Annual Meeting)

357 Achieving Both Privacy Protection and High Quality Record Linkage By Controlling Disclosure

Schedule:
Thursday, May 31, 2018
Columbia A/B (Hyatt Regency Washington, Washington, DC)
* noted as presenting author
Hye-Chung Kum, PhD, Associate Professor, Texas A&M University, College Station, TX
Introduction: Leveraging the power of big data in prevention science to promote health and well-being often requires integrating heterogeneous person-level data from different but relevant domains for a given problem. Exact-match solutions (e.g., matching on SSN) are inadequate because real data contain variations in key fields due to error (e.g., typos), missing values (e.g., missing SSN), and legitimate changes in identifiers (e.g., name changes). In practice, approximate methods for record linkage are commonly used. These typically require humans to interact with the data, both to iteratively standardize and clean it (e.g., ensuring the format of a field is consistent across databases) and to ensure the quality of the matches (e.g., twins in the database should not be confused with each other). Thus, manual review of the uncertain linkages produced by automatic methods is critical for fine-tuning the results and managing errors.
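To illustrate why exact matching falls short and why some pairs end up in manual review, here is a minimal Python sketch. It is not the method used in this work: the similarity measure (the standard library's `difflib.SequenceMatcher`), the function names, and the two thresholds are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact comparison: any typo or name change breaks the link."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(a: str, b: str, match_at: float = 0.9, review_at: float = 0.7) -> str:
    """Sort a candidate pair into match / review / non-match.

    The thresholds are hypothetical; in practice they would be tuned on
    the data, and the "review" band is what a human would inspect.
    """
    score = similarity(a, b)
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "non-match"


if __name__ == "__main__":
    pair = ("Jonathan Smith", "Jonathon Smith")
    print(exact_match(*pair))           # the one-letter typo defeats exact matching
    print(round(similarity(*pair), 3))  # the approximate score stays high
    print(classify(*pair))
```

Pairs whose score falls between the two thresholds are the uncertain linkages that would be routed to a human reviewer.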

Methods: We have developed the privacy preserving interactive record linkage (PPIRL) framework, which uses strategies to weigh tradeoffs between privacy and utility of data. The privacy objective of the PPIRL framework is to guarantee against sensitive attribute disclosure (e.g., cancer status) while minimizing identity disclosure (e.g., patient name). Meanwhile, the utility objective of the PPIRL framework is to generate the optimal matching function by allowing humans to manually inspect the results of automatic linkage algorithms and to clean and standardize messy data. Together, these objectives promote both privacy and high quality linkages. We compared the quality of human decision-making in record linkage using a visual interface that controls the amount of personal information available and uses visual markup to highlight data discrepancies.
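The idea of disclosing only what a reviewer needs can be sketched as follows. This is an illustrative assumption, not the PPIRL interface itself: the hypothetical `mask_pair` helper hides every position where two field values agree, reveals only the positions where they differ, and reports the fraction of characters disclosed.

```python
def mask_pair(a: str, b: str, mask: str = "*") -> tuple[str, str, float]:
    """Mask agreeing characters and reveal only the discrepancies.

    Returns the two masked strings and the fraction of characters
    disclosed. Position-by-position comparison is a simplification;
    a real interface would likely align the strings first.
    """
    width = max(len(a), len(b))
    if width == 0:
        return "", "", 0.0
    # Pad the shorter value so both can be compared position by position.
    a_pad, b_pad = a.ljust(width), b.ljust(width)
    out_a, out_b = [], []
    disclosed = 0
    for ca, cb in zip(a_pad, b_pad):
        if ca == cb:
            # Agreement carries no information the reviewer needs: hide it.
            out_a.append(mask)
            out_b.append(mask)
        else:
            # Disagreement is what the reviewer must judge: show it.
            out_a.append(ca)
            out_b.append(cb)
            disclosed += 2
    return "".join(out_a), "".join(out_b), disclosed / (2 * width)


if __name__ == "__main__":
    masked_a, masked_b, fraction = mask_pair("Jonathan", "Jonathon")
    print(masked_a)  # ******a*
    print(masked_b)  # ******o*
    print(fraction)  # 0.125
```

In this toy case the reviewer sees that the two names differ by a single vowel without seeing the names themselves, which is the kind of tradeoff between disclosure and decision quality that the study measures.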

Results: Our study compared the quality of record linkage decisions by the proportion of characters disclosed. Results indicated that, with good interface design, we could obtain comparable linkage decisions between the full mode, in which all information is disclosed, and the moderate mode, which had only 30% disclosure. As we masked more values for privacy, the quality of results started to suffer (p<0.001). However, we also found that even legally de-identified data can, with proper masks, be linked correctly in most situations: 0% disclosure still had 75% accuracy.

Conclusion: The results demonstrate that it is possible to greatly limit the amount of personal information available to human decision makers without negatively affecting utility or human effectiveness. Thus, incremental disclosure can significantly improve privacy protection with negligible impact on the quality of linkage.