Conventional methods that rely on the early fusion of multi-modal features cannot recognize the modality that is relevant to each user's demand in sequential recommendation. In this paper, we propose the adaptive multi-modal bidirectional long short-term memory network (AM-Bi-LSTM) to recognize the relevant modality for sequential recommendation. Specifically, we construct a new recurrent neural network model that is based on the bidirectional long short-term memory network and processes the multi-modal features of each user's sequential actions. Our new modality attention module calculates the importance of each multi-modal feature for the user's sequential actions via a late-fusion approach, which enables the method to recognize the relevant modality. In experiments on a multi-modal sequential dataset of 14,941 clicks constructed from the largest Web service for teachers in Japan, we demonstrate that AM-Bi-LSTM outperforms existing methods in terms of the diversity, explainability, and accuracy of recommendation. Specifically, we obtain a Recall@10 that is 0.1005 higher than that of existing early-fusion methods. Moreover, we obtain a catalog coverage@10 (a measure of diversity) that is 0.1710 higher than that of existing methods.
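To make the late-fusion modality attention described above concrete, the following is a minimal PyTorch-style sketch. It assumes per-modality Bi-LSTM encoders whose outputs are weighted by learned modality-importance scores; all names (ModalityAttentionLateFusion, the feature dimensions, the two example modalities) are illustrative assumptions and do not reproduce the authors' exact AM-Bi-LSTM implementation.

```python
# Minimal sketch of late-fusion modality attention over Bi-LSTM outputs.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModalityAttentionLateFusion(nn.Module):
    def __init__(self, modal_dims, hidden_dim):
        super().__init__()
        # One Bi-LSTM per modality: each user's action sequence is encoded
        # separately per modality (late fusion), rather than concatenating
        # raw features before a single encoder (early fusion).
        self.encoders = nn.ModuleList(
            [nn.LSTM(d, hidden_dim, batch_first=True, bidirectional=True)
             for d in modal_dims]
        )
        # Scores the importance of each modality's sequence representation.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, modal_seqs):
        # modal_seqs: list of tensors, one per modality,
        # each of shape (batch, seq_len, modal_dim).
        reps = []
        for seq, enc in zip(modal_seqs, self.encoders):
            out, _ = enc(seq)              # (batch, seq_len, 2 * hidden_dim)
            reps.append(out[:, -1, :])     # last-step summary per modality
        reps = torch.stack(reps, dim=1)    # (batch, n_modalities, 2 * hidden_dim)

        # Modality attention: importance weights over modalities, which can
        # also serve as an explanation of which modality drove the result.
        weights = torch.softmax(self.attn(reps).squeeze(-1), dim=1)
        fused = (weights.unsqueeze(-1) * reps).sum(dim=1)
        return fused, weights


# Example: hypothetical text and image features of a 5-step click sequence.
model = ModalityAttentionLateFusion(modal_dims=[64, 128], hidden_dim=32)
text = torch.randn(4, 5, 64)
image = torch.randn(4, 5, 128)
user_rep, modality_weights = model([text, image])
print(user_rep.shape, modality_weights.shape)  # (4, 64) and (4, 2)
```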