“…Regarding feature alignment, there are a lot of approaches [ 5 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 ]. The most popular architecture [ 5 , 35 , 48 ] is a double-stream deep network, where shallow layers are independent for learning modal-specific features and deep layers are shared for learning modal-common features.…”