Introduction: The modern age has provided huge quantities of data. Several “big data” sources (e.g., healthcare databases) can enhance our understanding of human behavior and development. Yet much of this data goes untapped, in large part because the statistical methods commonly used in prevention research are not built for data with large numbers of variables and observations. In these situations, a researcher must often either rely on theory to limit the number of tests or risk inflating the Type-I error rate (i.e., the probability of rejecting a null hypothesis that is true) by testing many associations. This problem is particularly discouraging when theory is not clear, the project is examining undocumented relationships, or the form of the relationship between variables is unknown. Spurious results stemming from over-testing can waste researchers’ time and financial capital on findings that are unlikely to replicate. Because prevention research informs policy and intervention, false results can have lasting effects on individual lives. Fortunately, methods exist that can, without inflating error, explore large numbers of variables, model linear and nonlinear effects, and inherently account for interactions among the variables. We demonstrated three of these methods that are of particular utility to prevention research by exploring important predictors of adolescents’ and young adults’ daily time spent physically active.
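To make the error-inflation concern concrete, the following minimal sketch computes the chance of at least one false positive when many associations are each tested at the same significance level. It assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function names are illustrative and do not come from the study.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n_tests tests.

    Assumes independent tests with every null hypothesis true (a simplification).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # 300 mirrors the "nearly 300" candidate variables described in the Methods.
    for m in (1, 10, 300):
        print(m, familywise_error_rate(m), bonferroni_threshold(m))
```

Under these assumptions, testing each of roughly 300 candidate predictors at the 0.05 level makes at least one spurious finding nearly certain, while the corrected per-test threshold becomes very strict. This trade-off is what motivates methods that build error control into the exploration itself.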
Methods: Conditional Inference Trees, Conditional Inference Forests, and Random Forests were used to identify predictors of adolescent/young adult physical activity among nearly 300 demographic and health variables from a large national survey conducted from 2013 to 2014. We compared the methods in terms of the resulting important predictors and model accuracy, and presented their utility in exploring large data sets.
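As a rough illustration of how a conditional inference tree can choose a splitting variable without inflating error, the sketch below tests each candidate predictor against the outcome with a permutation test, adjusts the p-values for the number of candidates, and selects a variable only if the smallest adjusted p-value is significant. This is a simplified stand-in and not the implementation used in the study: the absolute-covariance statistic, the Bonferroni adjustment, the synthetic data, and all function names are assumptions made for illustration.

```python
import random


def abs_covariance(x, y):
    """Absolute sample covariance, used here as a simple association statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n)


def permutation_p_value(x, y, n_perm, rng):
    """Share of outcome permutations with association at least as strong as observed."""
    observed = abs_covariance(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs_covariance(x, y_perm) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement, so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


def select_split_variable(predictors, y, alpha=0.05, n_perm=500, seed=0):
    """Return (chosen variable or None, Bonferroni-adjusted p-values)."""
    rng = random.Random(seed)
    m = len(predictors)
    adjusted = {
        name: min(1.0, m * permutation_p_value(x, y, n_perm, rng))
        for name, x in predictors.items()
    }
    best = min(adjusted, key=adjusted.get)
    # No split is made when nothing is significant after adjustment.
    return (best if adjusted[best] <= alpha else None), adjusted


def make_demo_data(n=200, seed=42):
    """Synthetic data: the outcome depends on 'signal' only (hypothetical example)."""
    rng = random.Random(seed)
    predictors = {
        "signal": [rng.gauss(0, 1) for _ in range(n)],
        "noise_a": [rng.gauss(0, 1) for _ in range(n)],
        "noise_b": [rng.gauss(0, 1) for _ in range(n)],
    }
    y = [2.0 * s + rng.gauss(0, 1) for s in predictors["signal"]]
    return predictors, y


if __name__ == "__main__":
    predictors, y = make_demo_data()
    chosen, adjusted = select_split_variable(predictors, y)
    print("chosen:", chosen)
    print("adjusted p-values:", adjusted)
```

A full tree would then search for the best cut point on the chosen variable and repeat the procedure within each resulting subgroup, stopping wherever no predictor remains significant. Forest methods, by contrast, aggregate many trees grown on resampled data and random subsets of predictors.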
Results: Several variables were important predictors of physical activity: perceived level of activity necessary for health, the number of times eating out, waist circumference, drug usage, and eating behaviors. The nature of each relationship with physical activity was presented graphically to convey the direction and magnitude of the effect. Many of these relationships were nonlinear and would be difficult to discover with more traditional methods.
Conclusions: Conditional Inference Trees/Forests and Random Forests can enhance researchers’ understanding of their data without a detrimental increase in error. By using these methods to study adolescent/young adult physical activity, we were able to support and extend previous research in the area, potentially leading to further analysis of these relationships and their integration into more theoretical frameworks.