Vegetable and fruit recognition can be regarded as a fine-grained visual categorization (FGVC) task, which is challenging due to large intraclass variances and small interclass variances. A mainstream direction for addressing this challenge is to exploit fine-grained local/global features to enhance feature extraction and representation in the learning pipeline. However, unlike the human visual system, most existing FGVC methods extract features only from individual images during training, whereas human beings learn discriminative features by comparing different images. Inspired by this intuition, a recent FGVC method, named Attentive Pairwise Interaction Network (API-Net), takes an image pair as input for pairwise feature interaction and demonstrates superior performance on several open FGVC data sets. However, the accuracy of API-Net on VegFru, a domain-specific FGVC data set, is lower than expected, potentially due to the lack of spatial-wise attention. Following this direction, we propose an FGVC framework named Attention-aware Interactive Features Network (AIF-Net), which refines API-Net by integrating an attentive feature extractor into the backbone network. Specifically, we employ a region proposal network (RPN) to generate a collection of informative regions and apply a biattention module to learn global and local attentive feature maps, which are fused and fed into an interactive feature learning subnetwork. The proposed neural structure is verified through extensive experiments and shows consistent performance improvement over the state of the art (SOTA) on the VegFru data set, demonstrating its superiority in fine-grained vegetable and fruit recognition. We also find that a concatenation fusion operation in the feature extractor, together with the three top-scoring regions suggested by the RPN, effectively boosts performance.
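To make the architecture concrete, the PyTorch sketch below illustrates one plausible reading of the feature extractor described above: a biattention module (modeled here as CBAM-style channel-plus-spatial attention, which is an assumption, since the exact design is not specified in this summary) refines the global feature map and the feature maps of the three top-scoring RPN regions, and the pooled results are fused by concatenation before being passed to an API-Net-style interaction subnetwork. All module and parameter names (`BiAttention`, `AIFExtractor`, `out_dim`, etc.) are hypothetical, and RPN proposal generation is omitted; the cropped regions are taken as given inputs.

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Channel- and spatial-wise attention over a feature map.
    (CBAM-style stand-in; the paper's exact biattention design is assumed.)"""
    def __init__(self, channels):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 16), nn.ReLU(),
            nn.Linear(channels // 16, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention from global average pooling.
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.channel_fc(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)))
        return x * sa

class AIFExtractor(nn.Module):
    """Global branch plus k local region branches, each refined by
    biattention and fused by concatenation (k=3 follows the abstract)."""
    def __init__(self, backbone, channels, k=3, out_dim=512):
        super().__init__()
        self.backbone = backbone          # any CNN producing (B, C, H, W)
        self.attend = BiAttention(channels)
        self.fuse = nn.Linear(channels * (k + 1), out_dim)

    def forward(self, images, regions):
        # regions: list of k cropped image tensors, e.g. the RPN's
        # top-scoring proposals (proposal generation omitted here).
        feats = [self.attend(self.backbone(images)).mean(dim=(2, 3))]
        for r in regions:
            feats.append(self.attend(self.backbone(r)).mean(dim=(2, 3)))
        # Concatenation fusion; the output would feed the API-Net-style
        # pairwise interaction subnetwork.
        return self.fuse(torch.cat(feats, dim=1))

# Usage with a toy stand-in backbone (a real model would use e.g. a ResNet):
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(7))
model = AIFExtractor(backbone, channels=64)
imgs = torch.randn(2, 3, 224, 224)
regs = [torch.randn(2, 3, 112, 112) for _ in range(3)]
print(model(imgs, regs).shape)  # torch.Size([2, 512])
```

In this reading, concatenating the pooled global and local vectors, rather than averaging them, preserves each region's attentive features separately, which is consistent with the reported finding that concatenation fusion with three top-scoring regions boosts performance.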