Abstract: Beyond Duct Tape and Baling Wire: Realistic Strategies for Integrating Large Datasets (Society for Prevention Research 27th Annual Meeting)

401 Beyond Duct Tape and Baling Wire: Realistic Strategies for Integrating Large Datasets

Schedule:
Thursday, May 30, 2019
Seacliff B (Hyatt Regency San Francisco)
* noted as presenting author
Alycia Perez, PhD, Psychological Research Scientist, Army Analytics Group/Research Facilitation Lab, Monterey, CA
While it’s true that having large quantities and multiple sources of data can mitigate many traditional problems, using big data also introduces new problems and demands different considerations. Researchers must make difficult decisions regarding what data to incorporate and how to handle inconsistency or complexity. The Army Analytics Group / Research Facilitation Laboratory has pioneered and refined the process of fusing disparate datasets to inform large-scale behavioral health evaluations and studies from an immense catalog of data. The US Army collects and securely stores considerable amounts of data to serve a multitude of purposes. Often we sew together medical, demographic, organizational, self-report, geographic, and contextual data. We must also contend with sparsely populated data, changes to forms and surveys, bounded distributions, constantly changing hierarchical structure, and inconsistent values over time and across sources. In this paper, we share some of our best practices for handling real-world big data, illustrated by our actual challenges and solutions.

Integrating Within Sources: Many big data sources have more than one data point per individual (e.g., annual surveys). Careful consideration must be given to the (ir)regularity with which data exist and to whether all data sources cover the same time range. The presentation will discuss the following strategies: (a) set ranges around relevant events, (b) model the length of time between events, (c) collapse time into flags or counts, (d) average over multiple time points, and (e) sample random time points.
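
As an illustration only, strategies (a) through (e) might be sketched as follows for repeated survey records anchored to one index event per person. The records, the 180-day window, and all names (surveys, events, in_window, summarize) are hypothetical assumptions chosen for the example, not the authors' implementation.

```python
from datetime import date, timedelta
from statistics import mean
import random

# Hypothetical repeated measures: (person_id, survey_date, score)
surveys = [
    ("A", date(2018, 1, 15), 10),
    ("A", date(2018, 7, 1), 14),
    ("A", date(2019, 2, 1), 12),
    ("B", date(2018, 3, 10), 20),
]
# Hypothetical index event (one per person) that anchors the time window
events = {"A": date(2018, 6, 1), "B": date(2018, 6, 1)}


def in_window(person_id, when, days_before=180, days_after=180):
    """(a) Keep only records inside a set range around the relevant event."""
    event = events[person_id]
    start = event - timedelta(days=days_before)
    end = event + timedelta(days=days_after)
    return start <= when <= end


def summarize(person_id, rng):
    """Collapse one person's in-window records into a single analytic row."""
    rows = sorted(
        (d, s) for pid, d, s in surveys
        if pid == person_id and in_window(pid, d)
    )
    if not rows:
        return None  # no usable data in the window
    event = events[person_id]
    dates = [d for d, _ in rows]
    scores = [s for _, s in rows]
    return {
        # (b) elapsed time, usable as a covariate in a later model
        "days_to_nearest": min(abs((d - event).days) for d in dates),
        # (c) collapse time into a count and a flag
        "n_surveys": len(rows),
        "any_pre_event": any(d < event for d in dates),
        # (d) average over multiple time points
        "mean_score": mean(scores),
        # (e) sample one random time point
        "random_score": rng.choice(scores),
    }


rng = random.Random(0)  # seeded so the random draw is reproducible
for pid in events:
    print(pid, summarize(pid, rng))
```

Which of these summaries is appropriate depends on the research question. The sketch only shows that each strategy reduces an irregular series to one row per person.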

Integrating Between Sources: We recommend a strategy of thoughtful model planning before data integration. Though big data techniques do reduce the need to rely on theory, careful planning is still essential to choosing the right data. We provide examples and techniques to (a) find the authoritative data source, (b) evaluate objective vs. self-report data, (c) understand bounded and biased distributions, (d) assess the consistency and practicality of data sources, and (e) distill multiple elements or sources.
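
One possible way to combine (a), (d), and (e) is a precedence rule: take each data element from the most authoritative source that has a value, and record whether the sources agree. The sketch below assumes a fixed ranking of three hypothetical source names. It is not drawn from the authors' systems, and a real ranking would likely differ by data element.

```python
# Hypothetical source names, ordered from most to least authoritative.
SOURCE_PRIORITY = ["personnel_system", "medical_record", "self_report"]


def resolve(values_by_source):
    """Return (value, source, consistent) for one data element.

    values_by_source maps a source name to its reported value, or to None
    when that source has nothing recorded. Unranked sources are ignored.
    """
    present = {
        source: values_by_source[source]
        for source in SOURCE_PRIORITY
        if values_by_source.get(source) is not None
    }
    if not present:
        return None, None, True  # nothing to reconcile
    # Dict order follows SOURCE_PRIORITY, so the first key ranks highest.
    chosen = next(iter(present))
    # Flag disagreement so inconsistent cases can be reviewed, not hidden.
    consistent = len(set(present.values())) == 1
    return present[chosen], chosen, consistent


birth_year = resolve({
    "personnel_system": None,
    "medical_record": 1986,
    "self_report": 1985,
})
print(birth_year)
```

Keeping the consistency flag alongside the chosen value preserves information about source disagreement that a silent overwrite would discard.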

Conclusions: The decisions that data integration requires can be daunting. Unfortunately, there is no simple rule for determining the right way to combine data, because there are many right ways. We have learned which traditional and big data practices hold up to the realities of data that were collected without research in mind. The US Army is dedicated to using its large data stores wisely and consciously in the service of Soldiers. The lessons learned in these pursuits are relevant to military and civilian researchers alike.