2D image-based 3D shape retrieval (2D-to-3D) addresses the problem of retrieving relevant 3D shapes from a gallery dataset given a query image. Recently, adversarial training and environmental style transfer learning have been successfully applied to this task, achieving state-of-the-art performance. However, two problems remain. First, previous works concentrate only on the connection between the label and the representation, paying little attention to the unique visual characteristics of each instance. Second, the confused features or the transformed images can only fool the discriminator but cannot guarantee semantic consistency; in other words, the features of a 2D desk image may be mapped near those of a 3D chair. In this paper, we propose a novel semantic consistency guided instance feature alignment network (SC-IFA) to address these limitations. SC-IFA consists of two main parts: instance visual feature extraction and cross-domain instance feature adaptation. For the first module, unlike previous methods that merely employ a 2D CNN to extract features, we additionally maximize the mutual information between the input and its feature to enhance the representation capability for each instance. For the second module, we first introduce the margin disparity discrepancy model to mix up the cross-domain features in an adversarial training manner. Then, we design two feature translators to transform features from one domain to the other, and impose a translation loss and a correlation loss on the transformed features to preserve semantic consistency. Extensive experimental results on two benchmarks, MI3DOR and MI3DOR-2, verify that SC-IFA is superior to state-of-the-art methods.