Energy load forecasting is a critical component of energy system scheduling and optimization. This method, which is classified as a time-series forecasting method, uses prior features as inputs to forecast future energy loads. Unlike a traditional single-target scenario, an integrated energy system has a hierarchy of many correlated energy consumption entities as prediction targets. Existing data-driven approaches typically interpret entity indexes as suggestive features, which fail to adequately represent interrelationships among entities. This paper, therefore, proposes a neural network model named Cross-entity Temporal Fusion Transformer (CETFT) that leverages a cross-entity attention mechanism to model inter-entity correlations. The enhanced attention module is capable of mapping the relationships among entities within a time window and informing the decoder about which entity in the encoder to concentrate on. In order to reduce the computational complexity, shared variable selection networks are adapted to extract features from different entities. A data set obtained from 13 buildings on a university campus is used as a case study to verify the performance of the proposed approach. Compared to the comparative methods, the proposed model achieves the smallest error on most horizons and buildings. Furthermore, variable importance, temporal correlations, building relationships, and time-series patterns in data are analyzed with the attention mechanism and variable selection networks, therefore the rich interpretability of the model is verified.