predictions of distant cancer metastasis based on gene signatures are studied intensively to realise precise diagnosis and treatments. Gene selection i.e. feature selection is a cornerstone to both establish accurate predictions and understand underlying pathologies. Here, we developed a simple but robust feature selection method using a correlation-centred approach to select minimal gene sets that have both high predictive and generalisation abilities. A multiple logistic regression model was used to predict 5-year metastases of patients with breast cancer. Gene expression data obtained from tumour samples of lymph node-negative breast cancer patients were randomly split into training and validation data. Our method selected 12 genes using training data and this showed a higher area under the receiver operating characteristic curve of 0.730 compared with 0.579 yielded by previously reported 76 genes. The signature with the predictive model was validated in an independent dataset, and its higher generalization ability was observed. Gene ontology analyses revealed that our method consistently selected genes with identical functions which frequently selected by the 76 genes. Taken together, our method identifies fewer gene sets bearing high predictive abilities, which would be versatile and applicable to predict other factors such as the outcomes of medical treatments and prognoses of other cancer types. Breast cancer is the most common cancer in women worldwide 1. It is a heterogeneous disease that causes difficulties for its diagnosis and treatment. Thus, multiple clinicopathological factors, such as the tumour size and lymph node metastasis, and established biomarkers, such as the estrogen receptor (ER), human epidermal growth factor receptor 2 (HER2), and progesterone receptor, have been used to characterise specific features of breast cancer, including its subtypes. To realise more accurate characterisation of breast cancer, gene expression has been introduced to understand molecular-level pathology and prediction of treatment outcomes of breast cancer, which would contribute to the design of treatments 2,3. Currently, MammaPrint 4 , Oncotype DX 5 , and MapQuant Dx 6 are commercially available diagnostic tests to accurately estimate the recurrence of breast cancer. These tests employ relatively large numbers of genes, that is, 70, 21, and 97, respectively. Gene expression is also used for metastasis prediction and subtype identification of breast cancers 7-9. The use of gene expression in combination with known biomarkers and clinicopathological factors is considered as a critical issue for precision medicine. There are three processes in the establishment of a prediction model using microarray data: (1) feature selection, (2) development of a mathematical model, and (3) validation of the developed model. The purpose of feature selection is to identify a minimal predictive gene set by removing redundant and irrelevant features 10-12. This process is important because the subsequent analyses depend on the selec...