Method. The project merges three population-level administrative data sources for Baltimore City since 2000: birth certificate data, individual-level lead measurements, and public school records. Extensive care has been taken to harmonize the data appropriately. The matching process uses a probabilistic ("fuzzy") matching algorithm on the child’s first name, last name, and date of birth (Christen, 2012; Harron, Goldstein, & Dibben, 2016; Wasi & Flaaen, 2015). The harmonized data include birth records from the Maryland Department of Health and Mental Hygiene/Baltimore City Health Department. These vital statistics include date of birth, birth weight and length, child gender, clinical estimate of gestation, race and ethnicity, maternal age, maternal education level, Medicaid eligibility, parental marital status, prenatal care, and census tract. Lead registry data from the Maryland Department of the Environment are available beginning in 1992 and span twenty-four years. Public school records from 2000 to 2018 include grade promotion, achievement test scores, disciplinary records, and other outcomes.
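To make the idea of fuzzy matching on name and date of birth concrete, the following is a minimal sketch in Python. It is not the study's algorithm: the `Person` record, the field weights, the 0.85 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the string comparator are all assumptions chosen for illustration. A full probabilistic linkage would typically estimate field weights from the data rather than fix them by hand.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    """Hypothetical record holding the three matching fields."""
    first_name: str
    last_name: str
    dob: date


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(a: Person, b: Person) -> float:
    """Weighted agreement score across first name, last name, and DOB.

    The weights are illustrative assumptions, not estimated values.
    """
    dob_agreement = 1.0 if a.dob == b.dob else 0.0
    return (
        0.3 * name_similarity(a.first_name, b.first_name)
        + 0.3 * name_similarity(a.last_name, b.last_name)
        + 0.4 * dob_agreement
    )


def is_match(a: Person, b: Person, threshold: float = 0.85) -> bool:
    """Declare a link when the score reaches the (assumed) threshold."""
    return match_score(a, b) >= threshold
```

With these assumed weights, a spelling variant such as "Jon" versus "John" can still link when the last name and date of birth agree, whereas two records with identical names but different dates of birth cannot reach the threshold.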
Results. The matching occurred in stages, with careful attention to the protection of personally identifiable information. Prior to the matching process, the study team met with stakeholders, including individuals at the city school system and the state lead commission, community leaders, and parents. After the matching was completed, an anonymized file stripped of all personally identifiable information was produced for analysis.
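The final step described above, producing an analysis file with identifiers removed, can be sketched as follows. This is an illustration under stated assumptions rather than the study's procedure: the field names in `PII_FIELDS`, the `study_id` key, and the choice of a random token are hypothetical, and removing direct identifiers alone may not be sufficient for de-identification in practice.

```python
import secrets

# Assumed names of the identifying columns; the real files may differ.
PII_FIELDS = ("first_name", "last_name", "dob")


def deidentify(records, pii_fields=PII_FIELDS):
    """Return copies of the linked records without the identifying fields.

    Each output record receives a random study ID that carries no
    information about the person it replaces.
    """
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in pii_fields}
        row["study_id"] = secrets.token_hex(8)  # 16 hex characters
        cleaned.append(row)
    return cleaned
```

The input records are left unmodified, so the identified file can be kept under separate access controls while only the stripped copy is released for analysis.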
Discussion. This big data project has the potential to inform local and national policy as we work to better understand the long-term impacts of early-life lead exposure. It also brings to light a number of challenges associated with using educational and health data within a big data framework. We will discuss both the challenges and potential solutions to the ethical, data security, and analytic concerns of harmonizing health and educational data.