“…In this section, we analyze and evaluate VTNs on two tasks: fine-grained image recognition and instance-level image retrieval. First, we analyze the contribution of each VTN component, compare VTNs with existing spatial deformation modeling methods [28,10,48,73], and examine the effect of combining VTNs with different backbone networks [55,21] and second-order pooling strategies [38,17,9,36,35]. Second, we compare VTNs with state-of-the-art methods on fine-grained image recognition benchmarks [64,32,40,22].…”