Integrating Within Sources: Many big data sources contain more than one data point per individual (e.g., annual surveys). Researchers must carefully consider how regularly the data were recorded and whether all data sources cover the same time range. The presentation will discuss five approaches to this problem: (a) set ranges around relevant events, (b) model the length of time between events, (c) collapse time into flags or counts, (d) average over multiple time points, and (e) sample random time points.
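As a rough illustration of approaches (a), (c), (d), and (e) above, the following minimal Python sketch reduces repeated observations to one row per individual. The records, the event dates, the 730-day window, and every function name are hypothetical and chosen only for the example; approach (b) is not shown.

```python
from datetime import date, timedelta
from statistics import mean
import random

# Hypothetical long-format records: (person_id, observation_date, score).
records = [
    ("A", date(2010, 3, 1), 70),
    ("A", date(2011, 3, 1), 80),
    ("A", date(2012, 3, 1), 90),
    ("B", date(2011, 6, 1), 60),
]

# Hypothetical reference event per person (e.g., a training start date).
events = {"A": date(2011, 6, 1), "B": date(2011, 1, 1)}


def in_window(records, events, days_before, days_after=0):
    """(a) Keep only observations inside a range around each person's event."""
    kept = []
    for pid, obs_date, score in records:
        start = events[pid] - timedelta(days=days_before)
        end = events[pid] + timedelta(days=days_after)
        if start <= obs_date <= end:
            kept.append((pid, obs_date, score))
    return kept


def collapse(records, people):
    """(c) and (d) One row per person: a flag, a count, and a mean score."""
    grouped = {}
    for pid, _, score in records:
        grouped.setdefault(pid, []).append(score)
    rows = {}
    for pid in people:
        scores = grouped.get(pid, [])
        rows[pid] = {
            "any_obs": bool(scores),
            "n_obs": len(scores),
            "mean_score": mean(scores) if scores else None,
        }
    return rows


def sample_one(records, seed=0):
    """(e) Pick one random observation per person (seeded for repeatability)."""
    rng = random.Random(seed)
    by_person = {}
    for rec in records:
        by_person.setdefault(rec[0], []).append(rec)
    return {pid: rng.choice(recs) for pid, recs in by_person.items()}


windowed = in_window(records, events, days_before=730)
summary = collapse(windowed, people=events)
picked = sample_one(records)
```

In this sketch, people with no observations inside the window still receive a row (with a false flag and a zero count), so that missingness stays visible after the data are collapsed.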
Integrating Between Sources: We recommend a strategy of thoughtful model planning before data integration. Although big data techniques reduce the need to rely on theory, careful planning is still essential to choosing the right data. We provide examples and techniques to (a) find the authoritative data source, (b) evaluate objective vs. self-report data, (c) understand bounded and biased distributions, (d) assess the consistency and practicality of data sources, and (e) distill multiple elements or sources.
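One possible way to make techniques (a) and (b) above concrete is a priority-ordered lookup across sources, paired with a simple discrepancy check. The sketch below is an assumption-laden illustration: the source names, their ordering, and the example values are hypothetical, and in practice the ordering would come from the model planning described above.

```python
# Hypothetical priority order: earlier sources are treated as more authoritative.
SOURCE_PRIORITY = ["personnel_system", "medical_record", "self_report"]


def authoritative_value(values_by_source, priority=SOURCE_PRIORITY):
    """(a) Return (source, value) from the highest-priority source that has a value."""
    for source in priority:
        value = values_by_source.get(source)
        if value is not None:
            return source, value
    return None, None


def discrepancy(values_by_source, objective, self_report):
    """(b) Self-report minus objective value, or None if either is missing."""
    objective_value = values_by_source.get(objective)
    reported_value = values_by_source.get(self_report)
    if objective_value is None or reported_value is None:
        return None
    return reported_value - objective_value


# Hypothetical example: one person's height in inches from two sources.
height = {"personnel_system": None, "medical_record": 70, "self_report": 72}

chosen = authoritative_value(height)
gap = discrepancy(height, objective="medical_record", self_report="self_report")
```

Returning the source name alongside the value keeps a record of where each integrated data point came from, which helps when later assessing consistency across sources.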
Conclusions: The decisions that data integration requires can be daunting. Unfortunately, there is no simple way to determine the right way to combine data, because there are many right ways. We have learned which traditional and big data practices hold up to the realities of data that were collected without research in mind. The US Army is dedicated to using its large data stores wisely and conscientiously in the service of Soldiers. The lessons learned in these pursuits are relevant to military and civilian researchers alike.