Because photovoltaic power generation is volatile and random, traditional models struggle to predict it accurately. To address this problem, we built a model based on the self-attention mechanism and multi-task learning to predict ultra-short-term photovoltaic power generation. First, we selected data with the optimal time-series length and fed it into an Encoder-Decoder network based on self-attention, in which the decoder checks the validity of the features extracted by the encoder. Then, we added a constraint to the middle layer of the Encoder-Decoder network to prevent the autoencoder from mechanically copying its input to its output. Because this constraint is used to predict the photovoltaic power generation, the result is a multi-task learning model. Finally, to take full advantage of the efficiently expressed features while allowing the main task, the prediction task, to learn some unique features autonomously, we proposed a step-by-step training method and validated its effectiveness experimentally. In comparative experiments, the proposed method improves performance over the Encoder-Decoder networks based on CNN and LSTM by 14.82% and 8.09%, respectively. The RMSE and MAE of the Encoder-Decoder model based on the self-attention mechanism with step-by-step training are 0.071 and 0.040, respectively.

INDEX TERMS Energy, PV power prediction, self-attention mechanism, multi-task learning, autoencoder.
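The two error metrics reported above can be illustrated with a minimal sketch, assuming the standard definitions of root mean squared error (RMSE) and mean absolute error (MAE). The function names and the sample values are hypothetical and chosen only for illustration; they are not data from the experiments, and the abstract does not state the units or normalization of the power values.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error of two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical power values (illustration only, not experimental data).
actual = [0.0, 0.5, 1.0, 0.5]
predicted = [0.0, 0.25, 1.0, 1.0]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

RMSE squares each error before averaging, so it penalizes large deviations more heavily than MAE does. This is consistent with the reported RMSE (0.071) being larger than the reported MAE (0.040).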