Direct measurement of soil temperature (ST) is time-consuming and costly, so simple and cost-effective machine learning (ML) tools are a useful alternative. In this study, three ML approaches, KStar, the instance-based K-nearest learner (IBK) and the locally weighted learner (LWL), were coupled with the resampling algorithms bagging (BA) and dagging (DA), and were developed and tested for multi-step-ahead (3, 6 and 9 days ahead) ST forecasting. A linear regression (LR) model was used as a benchmark. The dataset comprised daily ST time series (the model output) and meteorological data (the model inputs): mean (TMean), minimum (TMin) and maximum (TMax) air temperature, evaporation (Eva), sunshine hours (SSH) and solar radiation (SR). The data were collected in a farmland at the Isfahan synoptic station (Iran) over 13 years (1992–2005) at soil depths of 5 and 50 cm. Six input combinations were constructed based on Pearson’s correlation coefficients between the inputs and the output. We used 70% of the data for model building and the remaining 30% for model evaluation with visual and quantitative metrics. Our findings showed that TMean was the most effective input variable for ST forecasting in most of the developed algorithms, while in some cases the combination of TMean and TMax, or the combination of TMean, TMax, TMin, Eva and SSH, proved to be the best input set. Among the evaluated models, KStar was generally more compatible with the BA algorithm, whereas IBK and LWL, in most cases and depending on soil depth, were more accurate when hybridized with DA. For the soil depth of 5 cm, BA-KStar performed best (Nash-Sutcliffe Efficiency (NSE) = 0.90, 0.87 and 0.85 for 3, 6 and 9 days ahead forecasting, respectively), while for the soil depth of 50 cm, DA-KStar outperformed the other algorithms (NSE = 0.88, 0.89 and 0.89 for 3, 6 and 9 days ahead forecasting, respectively). The results also confirmed that all hybrid models had higher prediction capability than the LR model.
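The evaluation setup summarized above (multi-step-ahead targets, a 70/30 split and the NSE metric) can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names `nash_sutcliffe`, `horizon_pairs` and `chronological_split` are hypothetical, and the chronological (unshuffled) split is an assumption, since the abstract does not say how the 70% and 30% portions were drawn.

```python
from typing import List, Sequence, Tuple


def nash_sutcliffe(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Nash-Sutcliffe Efficiency: 1 - sum((o - p)^2) / sum((o - mean(o))^2)."""
    if len(observed) == 0 or len(observed) != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1.0 - residual / variance


def horizon_pairs(inputs: Sequence, target: Sequence[float],
                  horizon: int) -> List[Tuple[object, float]]:
    """Pair the inputs on day t with the target on day t + horizon."""
    if len(inputs) != len(target):
        raise ValueError("inputs and target must have the same length")
    if horizon < 1 or horizon >= len(target):
        raise ValueError("horizon must be between 1 and len(target) - 1")
    return [(inputs[t], target[t + horizon]) for t in range(len(target) - horizon)]


def chronological_split(pairs: Sequence, train_fraction: float = 0.7):
    """Assumed split: the first 70% of samples train, the last 30% test."""
    cut = int(len(pairs) * train_fraction)
    return list(pairs[:cut]), list(pairs[cut:])
```

For a 3-day horizon, for example, `horizon_pairs(tmean, st_5cm, 3)` would pair each day's TMean with the 5 cm ST three days later (both series names are placeholders), and `nash_sutcliffe` would then be applied to the held-out 30%. An NSE of 1 indicates a perfect match, and 0 means the forecast is no better than the mean of the observations.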