We develop a novel end-to-end trainable feature selection-forecasting (FSF) architecture for predictive networks targeted at the Internet of Things (IoT). In contrast with existing filter-based, wrapper-based, and embedded feature selection methods, our architecture selects features automatically and dynamically by means of feature importance score calculation and gamma-gated feature selection units, which are trained jointly and end-to-end with the forecaster. On the problem of forecasting IoT device traffic, we compare the performance of our FSF architecture against existing (feature selection, forecasting) technique pairs built from the following methods: Autocorrelation Function (ACF), Analysis of Variance (ANOVA), Recursive Feature Elimination (RFE), and Ridge Regression for feature selection; and Linear Regression, Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), 1-Dimensional Convolutional Neural Network (1D CNN), Autoregressive Integrated Moving Average (ARIMA), and Logistic Regression for forecasting. We show that our FSF architecture achieves either the best or close to the best performance among all of the competing techniques by virtue of its dynamic, automatic feature selection capability. In addition, we demonstrate that both the training time and the execution time of FSF are reasonable for IoT applications. This work represents a milestone in the development of predictive networks for IoT in the smart cities of the near future.
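To make the idea of gating features by importance score concrete, the following is a minimal sketch and not the architecture evaluated in this work. It assumes a soft gate of the form sigmoid(sharpness * (score - gamma)), a per-feature score function, and a linear forecaster; the names `gamma_gate`, `fsf_forward`, `score_weights`, `forecaster_weights`, and `sharpness` are all hypothetical and introduced only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def gamma_gate(features, scores, gamma, sharpness=10.0):
    """Soft gate (assumed form): a feature whose importance score is well
    above gamma passes almost unchanged; one well below gamma is
    attenuated toward zero. The gate is differentiable in gamma."""
    gate = sigmoid(sharpness * (scores - gamma))
    return features * gate


def fsf_forward(features, score_weights, gamma, forecaster_weights):
    """Forward pass of the sketch: score, gate, then forecast."""
    # Hypothetical importance scores in (0, 1), one per feature.
    scores = sigmoid(score_weights * features)
    selected = gamma_gate(features, scores, gamma)
    # Stand-in linear forecaster; any differentiable model could follow.
    return float(selected @ forecaster_weights)


# Four candidate input features with hand-picked importance scores.
x = np.array([0.9, 0.1, 0.5, 0.7])
scores = np.array([0.95, 0.05, 0.50, 0.80])
gated = gamma_gate(x, scores, gamma=0.5)

# Full forward pass with illustrative, untrained parameters.
y_hat = fsf_forward(x, np.ones(4), 0.5, np.full(4, 0.25))
```

Because every step in this sketch is differentiable, the threshold `gamma` and the scoring parameters could in principle be updated by the same gradient signal as the forecaster, which is the sense in which selection and forecasting are trained jointly.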