Methods: We use data from the Pittsburgh Youth Study to explore the potential contribution of machine learning methods to screening efforts in pediatric or school settings. Specifically, we use teacher reports of child behavior during fourth or fifth grade (item parcels from the Teacher Report Form) to predict official records of arrests for violent crimes later in life. First, we train nine predictive algorithms to predict which participants will be arrested for a violent crime ([1] logistic regression, [2] lasso regression, [3] classification tree, [4] bagged classification trees, [5] boosted classification trees, [6] random forest, [7] k-nearest neighbors, [8] support vector machines, and [9] neural networks). Second, we evaluate the absolute and relative performance of these algorithms on holdout (i.e., new) data from the same sample.
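The two-step design described above (train several algorithms, then compare them on holdout data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic rather than from the Pittsburgh Youth Study, only two of the nine algorithms are shown in simplified form, and accuracy stands in for whatever performance metric the study actually used.

```python
import math
import random

def make_synthetic_data(n, seed=0):
    """Generate hypothetical (features, label) pairs; not the study's data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        # Two made-up behavior scores, shifted upward for the positive class.
        x = [rng.gauss(1.0 * label, 1.0), rng.gauss(0.8 * label, 1.0)]
        rows.append((x, label))
    return rows

def train_logistic(train, lr=0.1, epochs=200):
    """Fit a two-feature logistic regression by batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(train)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in train:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            err = p - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    # Classify as positive when the linear score is above zero (p > 0.5).
    return lambda x: 1 if (w[0] * x[0] + w[1] * x[1] + b) > 0 else 0

def train_knn(train, k=5):
    """Return a k-nearest-neighbors classifier using squared Euclidean distance."""
    def predict(x):
        nearest = sorted(
            train,
            key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
        )[:k]
        votes = sum(y for _, y in nearest)
        return 1 if votes * 2 > k else 0
    return predict

def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Step 1: train each algorithm on the training split only.
data = make_synthetic_data(300)
train, holdout = data[:200], data[200:]
models = {"logistic": train_logistic(train), "knn": train_knn(train)}

# Step 2: compare the algorithms on holdout data they never saw.
holdout_scores = {name: accuracy(m, holdout) for name, m in models.items()}
```

Printing `holdout_scores` gives one holdout accuracy per algorithm, which is the kind of side-by-side comparison the second step calls for.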
Results: More sophisticated machine learning algorithms, such as random forest and boosting, outperformed traditional logistic regression in predicting which children would later be arrested for a violent crime. Collapsing across all algorithms, the predictor variables identified as most important were the child’s aggression, oppositionality/defiance, lack of guilt, academic achievement, and inattention. There was a significant drop-off in performance from the training data to the testing data, underscoring the importance of evaluating a screener’s predictions in holdout data.
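The drop-off from training to testing performance can be reproduced in miniature. The sketch below is not the study's analysis: it uses synthetic data and a 1-nearest-neighbor classifier, chosen as an assumption because it memorizes its training set and therefore makes the gap easy to see.

```python
import random

def make_rows(n, rng):
    """Hypothetical noisy data: one score weakly related to a binary outcome."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append((rng.gauss(0.5 * y, 1.0), y))
    return rows

def predict_1nn(train, x):
    """Return the label of the single closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, rows):
    """Fraction of rows that the 1-NN classifier labels correctly."""
    return sum(predict_1nn(train, x) == y for x, y in rows) / len(rows)

rng = random.Random(42)
train = make_rows(200, rng)
holdout = make_rows(200, rng)

# Each training point is its own nearest neighbor, so this score is inflated.
train_acc = accuracy(train, train)
# New cases from the same distribution give a more honest estimate.
holdout_acc = accuracy(train, holdout)
```

Here `train_acc` is perfect because the model has effectively memorized the training cases, while `holdout_acc` is lower, which is why only the holdout figure is informative about how a screener would perform on new children.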
Conclusion: The present study suggests that machine learning methods can contribute to the identification of individuals who will later be arrested for a violent crime. Future directions include the incorporation of a larger and more varied set of predictors, the integration of multiple longitudinal datasets to increase training capacity, and consideration of how these methods might be implemented