Recently, research on practical brain-computer interfaces has been actively conducted, especially in ambulatory environments. However, electroencephalography (EEG) signals are distorted by movement artifacts and electromyography (EMG) signals under ambulatory conditions, which makes it difficult to recognize human intention. In addition, because hardware issues are also challenging, ear-EEG has been developed and is widely used for practical brain-computer interfaces. However, ear-EEG signals still contain contamination. In this paper, we propose robust two-stream deep neural networks for walking conditions and analyze visual-response EEG signals from the scalp and the ear in terms of statistical analysis and brain-computer interface performance. We validated the signals with a visual-response paradigm, the steady-state visual evoked potential. Brain-computer interface performance deteriorated by 3% to 14% when walking fast at 1.6 m/s. When applying the proposed method, accuracy increased by 15% for cap-EEG and 7% for ear-EEG. The proposed method is robust to ambulatory conditions in both session-dependent and session-to-session experiments.