Abstract: Building a (Synthetic) Army: Differential Privacy, Imputation, and the Creation of Synthetic Data for the US Military (Society for Prevention Research 27th Annual Meeting)

400 Building a (Synthetic) Army: Differential Privacy, Imputation, and the Creation of Synthetic Data for the US Military

Thursday, May 30, 2019
Seacliff B (Hyatt Regency San Francisco)
* noted as presenting author
Adam Lathrop, MLIS, Data Scientist, Army Analytics Group/Research Facilitation Lab, Monterey, CA
Synthetic data has found an important place in the public health community as a means of sharing data with researchers outside the organization responsible for it while protecting the privacy of individuals and their highly sensitive health information. Using a variety of methodologies, data synthesis protects the individuals captured in a dataset by generalizing or obfuscating values that could compromise their identity, while collectively retaining the original's distribution and variance. The result is “silver standard” data on which researchers and analysts can conduct preliminary analysis and assess its relevance to their research. If they judge the data worth pursuing, they can then seek the compliance and security determinations needed to access the original, “gold standard” data.

The challenge faced by the Army Analytics Group / Research Facilitation Laboratory’s Synthdata project was to generate synthetic datasets from data on over 400,000 Active Duty soldiers that, from a machine learning and statistical perspective, resembled the real United States Army but, at the individual level, could not be mapped to a single person. To do so, a mixed-methods approach to data synthesis was developed iteratively, combining the injection of randomized error into the original data with the imputation of new values based on the output of a k-nearest neighbor model. The resulting data were then evaluated against a variety of criteria to ensure they met the requirements for privacy and accuracy.
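The two-step idea described above (perturb the original values, then impute new values from a k-nearest neighbor model) can be illustrated with a minimal sketch. This is not the Synthdata project's implementation: the function names, the Gaussian noise model, the majority-vote imputation rule, the toy columns (age, years of service, rank), and the values of `k` and `scale` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def add_noise(rows, scale, rng):
    """Return a copy of rows with Gaussian error added to every numeric value."""
    return [[v + rng.gauss(0.0, scale) for v in row] for row in rows]

def knn_impute(features, labels, k):
    """Replace each label with the majority label among the k nearest other rows."""
    imputed = []
    for i, row in enumerate(features):
        # Distance from row i to every other row; the row itself is excluded
        # so that a record never simply votes for its own original value.
        dists = sorted(
            (math.dist(row, other), j)
            for j, other in enumerate(features)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        imputed.append(Counter(neighbor_labels).most_common(1)[0][0])
    return imputed

def synthesize(features, labels, k=3, scale=0.5, seed=0):
    """Hypothetical pipeline: inject random error, then impute labels with k-NN."""
    rng = random.Random(seed)
    noisy = add_noise(features, scale, rng)
    return noisy, knn_impute(noisy, labels, k)

# Toy records (assumed columns: age, years of service) with an assumed rank label.
features = [[20, 1], [21, 1], [22, 2], [23, 2],
            [40, 10], [41, 11], [42, 12], [43, 12]]
ranks = ["E1", "E1", "E1", "E2", "E7", "E7", "E7", "E7"]

synthetic_features, synthetic_ranks = synthesize(features, ranks)
print(synthetic_ranks)
```

In this toy example the single "E2" record is surrounded by "E1" neighbors, so its imputed rank becomes "E1". That illustrates how neighbor-based imputation tends to smooth away rare values that could otherwise single out an individual.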

Although the machine learning model employed during full imputation made it extremely unlikely that a synthesized individual could be perfectly mapped to any of the individuals used to inform the model, it was also confirmed that rare classifiers that could be used to identify individuals in the original data were not present in the synthesized data. To assess accuracy, counts of the data were compared to ensure that the multinomial distributions of the variables stayed generally consistent between the sets. Furthermore, to assess the covariance between variables, the difference in eigenvalues between the covariance matrices of the original and synthetic datasets was computed to ensure that the synthesis process changed the correlation structure between variables as little as possible. After these methods were used to create three synthetic datasets, the “silver standards” were made available for request by prospective PDE researchers.
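The three checks described above (absence of rare identifying values, consistency of category counts, and similarity of covariance eigenvalues) could be sketched as follows. This is a hedged illustration using NumPy, not the project's evaluation code: the function names, the rarity threshold, and the toy data are assumptions, and the abstract does not state which thresholds or acceptance criteria were actually applied.

```python
from collections import Counter

import numpy as np

def rare_values_leaked(original, synthetic, threshold):
    """Values seen fewer than `threshold` times in original that still appear in synthetic."""
    counts = Counter(original)
    rare = {value for value, n in counts.items() if n < threshold}
    return rare & set(synthetic)

def max_proportion_gap(original, synthetic):
    """Largest absolute difference in category proportions between the two sets."""
    p, q = Counter(original), Counter(synthetic)
    return max(
        abs(p[v] / len(original) - q[v] / len(synthetic))
        for v in set(p) | set(q)
    )

def eigenvalue_gap(original, synthetic):
    """Absolute differences between sorted eigenvalues of the two covariance matrices."""
    # rowvar=False treats columns as variables; eigvalsh returns ascending eigenvalues.
    ev_original = np.linalg.eigvalsh(np.cov(original, rowvar=False))
    ev_synthetic = np.linalg.eigvalsh(np.cov(synthetic, rowvar=False))
    return np.abs(ev_original - ev_synthetic)

# Toy categorical variable: "E9" is a rare value that should not survive synthesis.
original_ranks = ["E1"] * 50 + ["E2"] * 49 + ["E9"] * 1
synthetic_ranks = ["E1"] * 51 + ["E2"] * 49

# Toy numeric variables: the "synthetic" set is the original plus small random error.
rng = np.random.default_rng(0)
original_numeric = rng.normal(size=(500, 3))
synthetic_numeric = original_numeric + rng.normal(scale=0.05, size=(500, 3))

leaked = rare_values_leaked(original_ranks, synthetic_ranks, threshold=5)
count_gap = max_proportion_gap(original_ranks, synthetic_ranks)
eig_gap = eigenvalue_gap(original_numeric, synthetic_numeric)
print(leaked, count_gap, eig_gap)
```

Small eigenvalue gaps suggest that the overall covariance structure was preserved, though matching eigenvalues alone do not guarantee that individual pairwise correlations are unchanged, so a check like this is best read alongside the count comparison.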