The challenge faced by the Army Analytics Group / Research Facilitation Laboratory’s Synthdata project was to generate synthetic datasets from data on over 400,000 Active Duty soldiers that, from a machine learning and statistical perspective, resembled the real United States Army but could not, at the individual level, be mapped to a single person. To do so, a mixed-methods approach to synthesizing data was developed iteratively, combining the injection of randomized error into the original data with the imputation of new values based on the output of a k-nearest neighbor model. The resulting data were then evaluated against a variety of criteria to ensure they met the requirements for privacy and accuracy.
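The two-step idea described above (perturb the original values, then impute new values from a k-nearest neighbor model) can be illustrated with a minimal sketch. This is not the project's implementation: the Gaussian form of the error, the majority-vote imputation of a single categorical field, the function names (`add_noise`, `knn_impute`, `synthesize`), the parameter values, and the toy records are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def add_noise(rows, scale, rng):
    """Return copies of the numeric rows with zero-mean Gaussian error added.

    Gaussian error is an assumption; the source only says "randomized error".
    """
    return [[x + rng.gauss(0.0, scale) for x in row] for row in rows]


def knn_impute(query, features, labels, k, skip=None):
    """Majority label among the k original rows closest to `query` (Euclidean).

    `skip` excludes one original row (the query's own source record) so that
    a record never simply receives its own label back.
    """
    order = sorted(
        (i for i in range(len(features)) if i != skip),
        key=lambda i: math.dist(query, features[i]),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def synthesize(features, labels, k=3, scale=0.5, seed=0):
    """Perturb numeric features, then impute each label from its neighbors."""
    rng = random.Random(seed)
    noisy = add_noise(features, scale, rng)
    new_labels = [
        knn_impute(q, features, labels, k, skip=i) for i, q in enumerate(noisy)
    ]
    return noisy, new_labels


# Hypothetical toy records: two numeric fields and one categorical field.
toy_features = [[20.0, 1.0], [21.0, 1.0], [22.0, 2.0],
                [40.0, 6.0], [41.0, 7.0], [42.0, 7.0]]
toy_labels = ["group_a", "group_a", "group_a",
              "group_b", "group_b", "group_b"]

synthetic_features, synthetic_labels = synthesize(toy_features, toy_labels)
```

In this sketch the synthetic numeric values never equal the originals exactly, while the imputed categorical values reflect the neighborhood of each record rather than the record itself.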
Although the machine learning model employed during full imputation made it extremely unlikely that a synthesized individual could be perfectly mapped to any of the individuals used to inform the model, it was also confirmed that rare classifiers that could be used to identify individuals in the original data were not present in the synthesized data. To assess accuracy, counts of the data were compared to ensure that the multinomial distributions of variables stayed generally consistent between the original and synthetic sets. Furthermore, to assess the covariance between variables, the difference in eigenvalues between the covariance matrices of the original and synthetic datasets was computed to ensure that the synthesis process changed the correlation structure between variables as little as possible. After these methods had been used to create three synthetic datasets, the “silver standards” were made available for request by prospective PDE researchers.
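The three checks described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the project's evaluation code: the function names, the rarity threshold, the choice of the largest absolute gap as the summary statistic, and the toy inputs are all hypothetical. The eigenvalues are computed with a simple Jacobi rotation routine so that the example has no external dependencies.

```python
import math
from collections import Counter


def surviving_rare_combinations(original_records, synthetic_records, threshold=1):
    """Attribute combinations seen at most `threshold` times in the original
    data (an assumed definition of "rare") that still occur in the synthetic data."""
    counts = Counter(original_records)
    rare = {combo for combo, n in counts.items() if n <= threshold}
    return rare & set(synthetic_records)


def max_proportion_gap(original_values, synthetic_values):
    """Largest absolute difference in category proportions for one variable."""
    co, cs = Counter(original_values), Counter(synthetic_values)
    no, ns = len(original_values), len(synthetic_values)
    return max(abs(co[v] / no - cs[v] / ns) for v in set(co) | set(cs))


def covariance_matrix(rows):
    """Sample covariance matrix of numeric rows (one row per record)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations,
    returned in descending order."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def eigenvalue_gap(original_rows, synthetic_rows):
    """Largest absolute difference between the sorted covariance eigenvalues."""
    ev_o = symmetric_eigenvalues(covariance_matrix(original_rows))
    ev_s = symmetric_eigenvalues(covariance_matrix(synthetic_rows))
    return max(abs(x - y) for x, y in zip(ev_o, ev_s))
```

A small gap from `max_proportion_gap` and `eigenvalue_gap`, together with an empty result from `surviving_rare_combinations`, would correspond to the accuracy and privacy criteria described above. What counts as "small" is a threshold the source does not state.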