Accurate subnational population forecasts are essential for sustainable development, including future planning, resource allocation, and the provision of health services. Two approaches are used to forecast subnational populations: local forecasting, where a separate model is trained for each area, and global forecasting, where a single model is trained on all areas. Local forecasting (e.g., with statistical models) can capture the population growth patterns of only a single area. Machine learning models, such as the light gradient boosting model (LGBM), are considered more suitable for global forecasting, but they are limited to one-step predictions, which leads to error accumulation over longer horizons. Combining several models into an ensemble has also been used to reduce forecasting errors. However, population growth is nonlinear, and error accumulation still needs to be reduced. To address these issues, this study proposes the population fusion transformer (PFT), a global population forecasting model that outputs multi-step predictions. The PFT builds on the temporal fusion transformer (TFT) and introduces a novel deep gated residual network (DGRN) block to capture nonlinearity in the data. This study also incorporates the proposed PFT into various ensemble models, using different prediction and learning approaches, to reduce forecasting errors further. The proposed models are applied to four subnational population datasets from several countries. The PFT achieves lower forecasting errors than the LGBM and the TFT on three and two of the datasets, respectively. More importantly, combining the PFT with other models in ensembles reduces errors further.
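The difference between rolling a one-step model forward and predicting all steps at once can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the growth factors, the function names, and both stand-in "models" are hypothetical and are not the LGBM, TFT, or PFT used in the study. It only shows how a small per-step bias compounds when predictions are fed back as inputs.

```python
# Toy illustration only: the growth factors and both "models" are hypothetical
# stand-ins, not the LGBM, TFT, or PFT from the study.

TRUE_GROWTH = 1.02   # assumed true growth factor per time step
HORIZON = 5          # number of future steps to forecast


def recursive_forecast(history, one_step_model, horizon):
    """Roll a one-step model forward, feeding each prediction back as input."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        predictions.append(next_value)
        window.append(next_value)  # the prediction becomes the next input
    return predictions


def one_step_model(window):
    """Hypothetical one-step model that slightly overestimates growth."""
    return window[-1] * 1.025


def multi_step_model(window, horizon):
    """Hypothetical multi-step model: predicts every step directly from the
    last observed value, with an assumed constant 0.5% bias at each step."""
    return [window[-1] * TRUE_GROWTH ** h * 1.005 for h in range(1, horizon + 1)]


def relative_errors(predictions, truth):
    """Absolute relative error of each prediction against the true value."""
    return [abs(p - t) / t for p, t in zip(predictions, truth)]


# Synthetic population series growing at the assumed true rate.
history = [1000.0 * TRUE_GROWTH ** t for t in range(5)]
truth = [history[-1] * TRUE_GROWTH ** h for h in range(1, HORIZON + 1)]

recursive_err = relative_errors(
    recursive_forecast(history, one_step_model, HORIZON), truth
)
direct_err = relative_errors(multi_step_model(history, HORIZON), truth)

for h, (r, d) in enumerate(zip(recursive_err, direct_err), start=1):
    print(f"h={h}: recursive {r:.4%}, direct {d:.4%}")
```

In this sketch the recursive error grows with every step because each biased prediction is reused as input, while the direct error stays flat by construction. Real multi-step models usually also degrade at longer horizons; the example isolates only the compounding effect.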