In this paper, we propose a novel control scheme for a vision-based prosthetic hand. To realize complex, flexible, human-like hand movements, the proposed method fuses bimodal information: by combining surface electromyography (sEMG) signals with object information from a vision sensor, the system selects an appropriate hand motion. Training and recognition with both sEMG signals and object images are performed by a single deep neural network in an end-to-end manner. The bimodal sensor information enables the system to recognize the operator's intended motion with higher accuracy than a conventional method that uses only sEMG signals. In addition, the generalization ability of the network is improved, making motion recognition more robust to abnormal data containing partly noisy or missing samples. To verify the validity of the proposed approach, we prepared a dataset containing sEMG signals and object images for 10 types of grasping motions. Three experiments were conducted: a comparison of the proposed method with the conventional method, an examination of recognition robustness against partly noisy or missing samples, and an attempt to recognize hand motions from raw sEMG signals. The results show that the proposed bimodal network achieves high recognition performance.
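To make the bimodal, end-to-end design concrete, the sketch below shows one way such a network could be structured in PyTorch: an sEMG branch, an image branch, and a fusion head trained with a single loss. The class name `BimodalNet`, the layer sizes, the branch depths, and the input shapes are illustrative assumptions, not the architecture described in the paper.

```python
# A minimal sketch of a bimodal (sEMG + image) classification network.
# All layer sizes and input shapes are hypothetical.
import torch
import torch.nn as nn

class BimodalNet(nn.Module):
    def __init__(self, n_channels=8, n_classes=10):
        super().__init__()
        # sEMG branch: 1-D convolution over the time axis of a
        # multi-channel sEMG window, shaped (batch, channels, time).
        self.emg_branch = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),  # -> (batch, 32)
        )
        # Image branch: 2-D convolution over the object image,
        # shaped (batch, 3, H, W).
        self.img_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),  # -> (batch, 16)
        )
        # Fusion head: concatenated features from both modalities are
        # mapped to class scores for the grasping motions, so the whole
        # network can be trained end-to-end with one loss.
        self.head = nn.Sequential(
            nn.Linear(32 + 16, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, emg, img):
        fused = torch.cat([self.emg_branch(emg), self.img_branch(img)], dim=1)
        return self.head(fused)

# Usage with dummy inputs: an 8-channel, 200-sample sEMG window and a
# 64x64 RGB object image, batched by 4.
model = BimodalNet()
logits = model(torch.randn(4, 8, 200), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```

Concatenating branch features before a shared classifier is one common fusion choice; it lets gradients from the single classification loss shape both modality encoders jointly, which is the sense in which training is end-to-end.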