The automatic diagnosis of various retinal diseases from fundus images is important for supporting clinical decision-making. However, developing such automatic solutions is challenging due to the requirement of a large amount of human-annotated data. Recently, unsupervised/self-supervised feature learning techniques have received much attention, as they do not need massive annotations. Most current self-supervised methods are designed for a single imaging modality, and no existing method utilizes multi-modal images for better results. Considering that the diagnosis of various vitreoretinal diseases can greatly benefit from another imaging modality, e.g., fundus fluorescein angiography (FFA), this paper presents a novel self-supervised feature learning method that effectively exploits multi-modal data for retinal disease diagnosis. To achieve this, we first synthesize the corresponding FFA modality and then formulate a patient feature-based softmax embedding objective. Our objective learns both modality-invariant features and patient-similarity features. Through this mechanism, the neural network captures the semantically shared information across different modalities and the apparent visual similarity between patients. We evaluate our method on two public benchmark datasets for retinal disease diagnosis. The experimental results demonstrate that our method clearly outperforms other self-supervised feature learning methods and is comparable to the supervised baseline. Our code is available on GitHub.

Index Terms—Retinal disease diagnosis, self-supervised learning, multi-modal data

I. INTRODUCTION

Color fundus photography has been widely used in clinical practice to evaluate various conventional ophthalmic diseases, e.g., age-related macular degeneration (AMD) [1], pathologic myopia (PM) [2], and diabetic retinopathy [3, 4]. Recently, deep learning has shown strong performance on a variety of automatic ophthalmic disease detection problems from fundus images [5-7], and these techniques can help ophthalmologists in decision-making. This success is attributed to the representative features learned from fundus images, which requires a large amount of training data with massive human annotations. However, annotating fundus images is tedious and expensive, since experts are needed to provide reliable labels. Hence, in this paper, our goal is to learn representative features from the data itself, without any human annotation.
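To make the softmax embedding objective more concrete, below is a minimal PyTorch sketch of its modality-invariant part: for each patient in a batch, the embedding of the fundus image and that of its synthesized FFA counterpart form a positive pair, while all other patients in the batch act as negatives. All names here (modality_invariant_softmax_loss, z_fundus, z_ffa, temperature) are illustrative assumptions rather than the paper's actual implementation, and the paper's full objective additionally includes a patient-similarity term not shown here.

```python
import torch
import torch.nn.functional as F

def modality_invariant_softmax_loss(z_fundus: torch.Tensor,
                                    z_ffa: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Softmax (InfoNCE-style) embedding loss over a batch of patients.

    z_fundus, z_ffa: (B, D) embeddings of each patient's fundus image and
    its (synthesized) FFA counterpart. For patient i, the FFA embedding of
    the same patient is the positive; all other patients are negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_fundus = F.normalize(z_fundus, dim=1)
    z_ffa = F.normalize(z_ffa, dim=1)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = z_fundus @ z_ffa.t() / temperature
    targets = torch.arange(z_fundus.size(0), device=z_fundus.device)

    # Symmetric cross-entropy: fundus->FFA and FFA->fundus retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

With a batch of B patients, minimizing this loss pushes each fundus embedding toward its own patient's FFA embedding and away from those of other patients, so the network is encouraged to keep the information shared across the two modalities, i.e., modality-invariant features.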