A brain-computer interface (BCI) lets people communicate with or control external devices without relying on peripheral nerves and muscles. However, the performance of event-related potential (ERP)-based BCIs declines when they are applied in real environments, especially in cross-state and cross-subject conditions. Here we employ temporal modeling and adversarial training to improve the visual ERP-based BCI under different mental workload states and to alleviate the problems above. The rationale for our method is that the ERP-based BCI relies on electroencephalography (EEG) signals recorded from the scalp surface, which change continuously over time and are somewhat stochastic. We propose a hierarchical recurrent network that encodes all ERP signals in each repetition simultaneously and models them temporally to predict which visual event elicited an ERP. The hierarchical architecture is a simple yet effective way of organizing recurrent layers in a deep structure to model long sequence signals. Taking a cue from recent advances in adversarial training, we further apply dynamic adversarial perturbations to create adversarial examples that enhance model performance. We conduct our experiments on one published visual ERP-based BCI task with 15 subjects and 3 different auditory workload states. The results indicate that our hierarchical method can effectively model long sequences of raw EEG data and outperforms the baselines in most conditions, including cross-state and cross-subject conditions. Finally, we show how deep learning-based methods trained on limited EEG data can improve ERP-based BCI when combined with adversarial training. Our code will be released at https://github.com/aispeech-lab/VisBCI.
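
The idea of creating adversarial examples by perturbing the input can be illustrated with a minimal sketch. This is not the implementation from the paper: it assumes a toy logistic-regression classifier in place of the hierarchical recurrent network, a random vector in place of real EEG features, and an L2-normalized gradient step of hypothetical size `epsilon`. All function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. the input x."""
    p = sigmoid(x @ w + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * np.log(p + tiny) + (1 - y) * np.log(1 - p + tiny))
    grad_x = (p - y) * w  # analytic d(loss)/dx for logistic regression
    return loss, grad_x

def adversarial_example(w, b, x, y, epsilon=0.1):
    """Move x a distance epsilon along the L2-normalized loss gradient."""
    _, g = loss_and_input_grad(w, b, x, y)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return x.copy()  # no gradient signal, leave the input unchanged
    return x + epsilon * g / norm

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # toy classifier weights (assumption)
b = 0.0
x = rng.normal(size=8)   # stand-in for a flattened EEG feature vector
y = 1.0                  # 1 = target stimulus (ERP present)

clean_loss, _ = loss_and_input_grad(w, b, x, y)
x_adv = adversarial_example(w, b, x, y, epsilon=0.5)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

In adversarial training, the loss on such perturbed inputs is typically added to the loss on the clean inputs, so the model is encouraged to give stable predictions under small input changes. That is one plausible reason the approach helps when EEG data are limited.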