Cluster randomized trials (CRTs) are common in prevention research, where the treatment variable is defined at the cluster (e.g., school) level. Missing data are common in CRTs. Commonly recommended strategies, such as normal-model (NM) multiple imputation (MI), perform well in many applications, but we now know that they are not appropriate for CRTs. Ignoring the cluster structure with NM-MI underestimates the intraclass correlation (ICC), which increases type I errors. Including dummy variables in NM-MI to represent cluster membership was thought to be a good solution. Recent work, however, shows that the dummy coding scheme treats cluster membership as a fixed effect (Andridge, 2011) and overestimates the ICC, which increases type II errors. What is needed is an approach that treats cluster membership as a random effect. At present the only software that does this is PAN (Schafer, 2001), which is currently available only as an R application. Because most prevention researchers are not familiar with R, we provide a step-by-step demonstration of performing imputation with PAN through SAS.
Methods:
We demonstrate the detailed procedure for MI with PAN through SAS step by step. Researchers can perform their own MI with PAN through SAS by changing the bold, underlined text to suit their own data. We also include an empirical data example in which we compared parameter estimates, ICC estimates, and statistical conclusions across data imputed with four types of MI: (a) NM-MI ignoring cluster structure; (b) NM-MI with dummy-coded cluster variables (fixed cluster effect); (c) a hybrid NM-MI, which imputes half the time ignoring cluster structure and half the time with the dummy variables; and (d) PAN (random cluster effect). We also present results based on complete-case analysis.
Results:
The empirical analysis showed that PAN and the other types of MI produced comparable parameter estimates. However, the dummy-coding MI overestimated the ICC, whereas MI ignoring the cluster effect and the hybrid MI underestimated it. Compared with PAN (p = .0088), the p-value for the treatment effect was higher with dummy-coding MI (p = .0091) and lower with MI ignoring cluster structure (p = .0014) and with the hybrid MI approach (p = .0055). Complete-case analysis produced biased estimates of both the parameters and the ICC.
Conclusions:
NM-MI is not appropriate for handling missing data in CRTs. This approach leads to biased ICC estimates and faulty statistical conclusions. Instead, imputation in CRTs should be performed with PAN. We have demonstrated an easy way to use PAN through SAS.
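
The ICC that each imputation method is judged against can be illustrated with a short sketch. The Python function below uses the standard one-way ANOVA estimator of the ICC; it is not the PAN or SAS procedure demonstrated in the paper, and the function name `anova_icc` and the toy data are hypothetical, chosen only for illustration.

```python
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC.

    `clusters` is a list of lists; each inner list holds the observed
    outcomes for one cluster (e.g., one school). Cluster sizes may differ.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)

    grand_mean = sum(sum(c) for c in clusters) / n_total
    cluster_means = [mean(c) for c in clusters]

    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(
        n * (m - grand_mean) ** 2 for n, m in zip(sizes, cluster_means)
    )
    ss_within = sum(
        (y - m) ** 2 for c, m in zip(clusters, cluster_means) for y in c
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average cluster size; equals the common size when balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Toy data (hypothetical): three clusters of three observations each.
example = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(round(anova_icc(example), 3))  # 26/29, about 0.897
```

Applying an estimator like this to each imputed data set and comparing the results is one way to see whether an imputation method has pulled the ICC up or down, as in the comparison reported above.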