Enhancing Crash Injury Severity Prediction on Imbalanced Crash Data by Sampling Technique with Variable Selection

Southwest Jiaotong University

Details

12:00 - 12:15 | Mon 28 Oct | Crystal Room I | MoD-T5.1

Session: Special Session on Big Data and Emerging Technologies for Traffic Safety Improvement (II)

Abstract

Road crash data analysis has long informed road and vehicle design and guided the implementation of policies aimed at enhancing road safety. However, crash data are typically class-imbalanced and high-dimensional, which may severely degrade the predictions of the analysis model. This study proposes a framework that combines variable selection (VS) with the synthetic minority over-sampling technique (SMOTE) for data balancing. We explored three VS techniques: two filter-based methods, Chi-square (CS) and correlation-based feature selection (CFS), and one embedded method, random forest (RF). To study the imbalanced data problem and its implications for VS, two training data scenarios were considered: (1) VS based on the original data and modelling based on the original data; and (2) VS based on sampled data and modelling based on the original data. The impact of varying the class distribution was also examined. Naïve Bayes (NB) classifiers were trained on the selected variable subsets, and their predictions were evaluated using two types of metrics. Overall, 14 models were developed and analysed. The empirical results demonstrate that using balanced data can help identify the most important predictors of crash injury severity. The filter-based ranking methods were more robust against data imbalance than the RF-based method. The NB classifier also produced better predictions on the optimal subsets identified by the filter-based methods than on the subset chosen by the RF-based method.
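To illustrate the data balancing step, the following is a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the sketch handles numeric features only, whereas real crash records are largely categorical and would need a suitable variant.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (invented for this sketch): two numeric features,
# 3 severe-injury crashes versus 12 non-severe crashes.
severe = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
non_severe = [[float(i), float(i % 4)] for i in range(5, 17)]

# Over-sample the minority class until both classes have the same size.
new_points = smote(severe, n_synthetic=len(non_severe) - len(severe), k=2)
balanced_severe = severe + new_points
print(len(balanced_severe), len(non_severe))
```

Under scenario (2), a balanced set such as this would be used only to rank or select variables, while the classifier itself is still trained on the original, imbalanced data. Changing `n_synthetic` is one simple way to vary the class distribution.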