Abstract: A Methodological Framework to Link Datasets from Multiple Population-Level Data Sources (Society for Prevention Research 24th Annual Meeting)

607 A Methodological Framework to Link Datasets from Multiple Population-Level Data Sources

Friday, June 3, 2016
Regency B (Hyatt Regency San Francisco)
Roland Estrella, MS, Data Custodian, University of Florida, Gainesville, FL
Jeffrey Roth, PhD, Research Professor, University of Florida, Gainesville, FL
Mildred Maldonado-Molina, PhD, Associate Professor, University of Florida, Gainesville, FL
Melissa Naidu, MAE, Research Coordinator, University of Florida, Gainesville, FL
Introduction: This paper presents a methodological framework for linking datasets across multiple state-level, population-based data sources. The resulting longitudinal, multi-source datasets can be used to better understand the combination of antecedents, within structural, social, and cultural contexts, that are associated with health outcomes. Examples of studies that can be conducted with datasets constructed using this record linkage framework include the developmental outcomes of babies born to mothers who are addicted to prescription drugs; the behavioral and health outcomes of children in the juvenile justice system; strategies to reduce teen pregnancy and risky sexual behaviors; and the health and behavioral outcomes of children in foster care. The development of a 15-year linked data repository on all Florida women and children (more than 12 million individuals) is used to demonstrate an application of the framework.

Methods: The multi-step linkage methodology sequentially improves the accuracy of the matches that make up the longitudinal, multi-source datasets. The methodology consists of: 1) development of self-correcting, patient-level custom linkage profiles across databases, 2) deterministic (rule-based) record linkage using exact and fuzzy text matching techniques, 3) probabilistic linkage using data mining algorithms, and 4) clerical-review record linkage.
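
To make the sequence of steps concrete, the following is a minimal Python sketch of how a deterministic rule, a probabilistic score, and a clerical-review band can be chained for a single pair of records. It is an illustration under stated assumptions, not the implementation used for the Florida repository. The abstract does not specify which data mining algorithms drive the probabilistic step, so a Fellegi-Sunter style log-weight score is substituted here. The field names, the m/u probabilities, the similarity cutoff, and the thresholds are all hypothetical, and the linkage-profile step (step 1) is not modeled.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-field agreement probabilities: m = P(agree | true match),
# u = P(agree | non-match). Illustrative values, not estimates from any repository.
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "dob": (0.98, 0.003),
    "zip": (0.85, 0.05),
}
FUZZY_FIELDS = {"last_name", "first_name"}  # compared by text similarity
UPPER, LOWER = 8.0, 0.0  # hypothetical score thresholds


def similar(a, b, cutoff=0.85):
    """Fuzzy text agreement: case-insensitive similarity ratio >= cutoff."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def deterministic_match(r1, r2):
    """Rule: exact date of birth plus exact-or-fuzzy agreement on both names."""
    return (r1["dob"] == r2["dob"]
            and similar(r1["last_name"], r2["last_name"])
            and similar(r1["first_name"], r2["first_name"]))


def match_score(r1, r2):
    """Sum of log2 agreement/disagreement weights across fields."""
    score = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if field in FUZZY_FIELDS:
            agree = similar(r1[field], r2[field])
        else:
            agree = r1[field] == r2[field]
        score += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return score


def classify(r1, r2):
    """Apply the steps in order: rule first, then score, then review band."""
    if deterministic_match(r1, r2):
        return "link (deterministic)"
    score = match_score(r1, r2)
    if score >= UPPER:
        return "link (probabilistic)"
    if score > LOWER:
        return "clerical review"
    return "non-link"


# Fictional records for illustration only.
birth = {"last_name": "Johnson", "first_name": "Maria",
         "dob": "2001-04-12", "zip": "32601"}
claim = {"last_name": "Jonson", "first_name": "Maria",
         "dob": "2001-04-12", "zip": "32601"}
print(classify(birth, claim))
```

In this sketch, pairs that pass the rule never reach the scoring step, and only pairs whose score falls between the two thresholds are routed to manual review, which mirrors the idea that each step handles the cases the previous one could not resolve.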

Results: The linked, 15-year repository developed using the multi-step linkage methodology hosts multiple linked statewide data sources, including birth, death, and fetal death certificates; Medicaid eligibility, encounters, and claims; hospital discharge, ambulatory, and emergency records; Healthy Start prenatal screens; Perinatal Intensive Care records; Early Intervention Program records; birth anomalies; academic performance; juvenile delinquency; and child maltreatment and foster care placement records. De-identified or limited datasets with linked records can be constructed to answer interdisciplinary research questions involving predictors and outcomes across multiple generations. These datasets are unique in that individuals are linked to their immediate family in multiple databases across years. This linkage allows researchers to examine the onset of adverse conditions and then chart the trajectory of multiple risk and protective factors across the life span for individuals and their immediate family. In addition, researchers are able to identify how these factors are associated with the development of other conditions (e.g., asthma, diabetes, obesity).

Conclusion: We present a novel method for linking and cataloging data across multiple, disparate data sources. The linked data can be used to better understand how individuals and their immediate family interact within a set of structured, interconnected systems, including family, peers, school, and community, as well as macro-level influences such as the health care system and the social welfare system. The linkage methodology can help advance knowledge on reducing disparities in long-term outcomes and promoting health equity among individuals exposed to high-risk settings.