Real-time crash likelihood prediction has been an important research topic. Various classifiers, such as support vector machines (SVM) and tree-based boosting algorithms, have been proposed in traffic safety studies. However, few studies focus on missing data imputation in real-time crash likelihood prediction, although missing values are commonly observed due to sensor breakdowns or external interference. In addition, classifying imbalanced data is a difficult problem in real-time crash likelihood prediction, since it is hard to distinguish crash-prone cases from non-crash cases, which compose the majority of the observed samples. In this paper, principal component analysis (PCA) based approaches, including LS-PCA, PPCA, and VBPCA, are employed to impute missing values, while two kinds of solutions are developed to address the imbalanced data problem. The results show that PPCA and VBPCA not only outperform LS-PCA and other imputation methods (including mean imputation and k-means clustering imputation) in terms of root mean square error (RMSE), but also help the classifiers achieve better predictive performance. The two solutions, i.e., cost-sensitive learning and the synthetic minority oversampling technique (SMOTE), help improve sensitivity by adjusting the classifiers to pay more attention to the minority class.

Keywords: Real-time crash likelihood prediction, PCA-based missing data imputation, cost-sensitive learning, SMOTE, support vector machine, AdaBoost

1. Introduction

Prediction of traffic crashes has been a major research topic in transportation safety studies. Crashes, especially on urban expressways, can trigger heavy traffic congestion, impose huge external costs, and reduce the level of service of transportation infrastructure. Therefore, accurate and reliable prediction of crash risk is critical to the success of proactive safety management strategies on urban expressways.
There have been many studies on real-time crash likelihood estimation (Abdel-Aty and Pemmanaboina, 2006; Abdel-Aty et al., 2007, 2008; Ahmed and Abdel-Aty, 2012). It has been reported that crash occurrence is affected by four major factors: real-time traffic state, driver behavior, environmental factors, and road geometry (Ahmed and Abdel-Aty, 2013b). Traditional devices used to detect real-time traffic states are mainly intrusive, e.g., loop detectors. Recently, more non-intrusive traffic detection devices have come into use due to their ease of installation and maintenance, accuracy, and affordable cost. For example, Remote Traffic Microwave Sensors (RTMS) and Automatic Vehicle Identification (AVI) devices provide access to real-time traffic data from multiple sources. In field applications, RTMS simultaneously provide real-time data on flow, time occupancy, and speed. Although RTMS and other detectors (e.g., AVI devices and loop detectors) have been widely used and successfully applied in traffic operations including the real-time crash likelihood estimat...