Type IV secretion systems exist in a number of bacterial pathogens and are used to secrete effector proteins directly into host cells, altering the host environment to make it hospitable for the bacteria. In recent years, several machine learning algorithms have been developed to predict effector proteins, potentially facilitating experimental verification. However, their results are inconsistent with one another. Previously, we analysed the disparate sets of predictive features used in these algorithms to determine an optimal set of 370 features for effector prediction. This work focuses on the best way to use these optimal features by designing three machine learning classifiers, comparing our results with those of others, and obtaining de novo results. We chose the pathogen Legionella pneumophila strain Philadelphia-1, a cause of Legionnaires' disease, because it has many validated effector proteins and because others have developed machine learning prediction tools for it. While all of our models give good results, indicating that our optimal features are quite robust, Model 1, which uses all 370 features with a support vector machine, has slightly better accuracy. Moreover, Model 1 predicted 760 effector proteins, more than any other study, 315 of which have been validated. Although the results of our three models agree well with those of other researchers, their models predicted only 126 and 311 candidate effectors.

Introduction

Bacterial pathogens can use secretion systems to deliver proteins to the host cell. There are nine known secretion systems, but the focus of this work is on the type IV secretion system (T4SS).
The T4SS is composed of multiple proteins [...]

[...] features were divided among our three classifiers as follows: i) features related to PSSM composition, ii) features related to the auto-covariance correlation of PSSM, and iii) chemical, structural, and compositional features [S1 Table] (e.g., amino acid composition, dipeptide composition, average hydropathy, total hydropathy, hydropathy of C terminal, hydropathy of N terminal, number of coiled coil regions, signal peptide probability, polarity, molecular mass, length, and homology to known effectors). For our second ensemble classifier, Model 3, the three groups of features divided among our classifiers were as follows: i) PSSM-related features (PSSM composition and auto-covariance correlation of PSSM), ii) features related to the composition of amino acids in protein sequences (amino acid composition and dipeptide composition), and iii) chemical and structural features (average hydropathy, total hydropathy, hydropathy of C terminal, hydropathy of N terminal, number of coiled coil regions, signal peptide probability, polarity, molecular mass, length, and homology to known effectors).
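The feature grouping described above can be illustrated with a minimal sketch, in which each of three classifiers sees only its own group of feature columns and the ensemble combines their outputs. Several details here are assumptions rather than statements of our method: the column indices in FEATURE_GROUPS are hypothetical placeholders (the real groups together span 370 features), the nearest-centroid base learner is a simple stand-in, and the majority-vote combination rule is only one plausible way to merge the three predictions.

```python
from collections import Counter

# Hypothetical column layout for the first grouping; the real per-group
# feature counts are not reproduced here.
FEATURE_GROUPS = {
    "pssm_composition": [0, 1],
    "pssm_auto_covariance": [2, 3],
    "chemical_structural_compositional": [4, 5],
}


class NearestCentroid:
    """Stand-in base learner (an assumption, chosen to keep the sketch dependency-free)."""

    def fit(self, rows, labels):
        sums, counts = {}, Counter(labels)
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        # Per-class mean of every feature in this group.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict_one(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        # Label of the closest class centroid.
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def select(row, columns):
    """Keep only the columns that belong to one feature group."""
    return [row[c] for c in columns]


def fit_ensemble(rows, labels, groups=FEATURE_GROUPS):
    """Train one classifier per feature group."""
    return {
        name: NearestCentroid().fit([select(r, cols) for r in rows], labels)
        for name, cols in groups.items()
    }


def predict_ensemble(models, row, groups=FEATURE_GROUPS):
    """Majority vote over the three group-specific classifiers (assumed rule)."""
    votes = [models[name].predict_one(select(row, cols)) for name, cols in groups.items()]
    return Counter(votes).most_common(1)[0][0]


# Toy data: label 1 = effector, label 0 = non-effector (values are made up).
toy_rows = [
    [0.9, 0.8, 0.7, 0.9, 0.8, 0.9],
    [0.8, 0.9, 0.9, 0.8, 0.9, 0.7],
    [0.1, 0.2, 0.2, 0.1, 0.3, 0.1],
    [0.2, 0.1, 0.1, 0.3, 0.1, 0.2],
]
toy_labels = [1, 1, 0, 0]
toy_models = fit_ensemble(toy_rows, toy_labels)
```

With three voters and two classes a tie cannot occur, so the vote always yields a definite label. Switching to the Model 3 grouping would only require changing the keys and column lists in FEATURE_GROUPS.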
After building our dataset and designing our machine learning classifiers, we used 10-fold cross-validation to validate our models and to test for overfitting in the results. The...