“…External estimates of the probability that a compound will undergo active efflux mediated by P-glycoprotein (P-gp) can also be included [21,24,37]. In other approaches, large pools of 1D, 2D (including molecular fingerprints), and 3D molecular descriptors calculated by different methods [26,38–52] are analyzed by statistical learning techniques, e.g., multiple linear regression, linear discriminant analysis, partial least squares regression, support vector machines, artificial neural networks, and random forests, often in combination with descriptor selection protocols [23,24,26,43–52]. However, the limited size of the training sets, the use of unverified data, and modeling errors that are too small for such an inherently noisy endpoint often raise concerns about possible model overfitting [16].…”