Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomic and pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of specialized metabolites and the challenge of differentiating SM genes from general metabolism (GM) genes. We employed transfer learning, a machine learning strategy in which information from one species with substantially more experimentally derived function data (Arabidopsis thaliana) is used to build a model that predicts gene functions in another species (Solanum lycopersicum). Using machine learning to integrate heterogeneous gene features, we built models distinguishing tomato SM and GM genes. Although SM/GM genes can be predicted based on tomato data alone (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), using information from Arabidopsis to filter likely misannotated genes significantly improves prediction (F-measure = 0.92). This study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information.
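To make the reported scale concrete (0.5 for random, 1.0 for perfect predictions), the following is a minimal sketch of the F-measure, assuming the standard F1 definition as the harmonic mean of precision and recall for the SM class. The function name `f_measure` and the toy labels are hypothetical illustrations, not data or code from the study, and the study may have computed or averaged the score differently.

```python
def f_measure(true_labels, predicted_labels, positive="SM"):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are 0 (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, balanced toy annotations (assumption: equal numbers of SM and GM genes).
truth = ["SM", "SM", "SM", "SM", "GM", "GM", "GM", "GM"]

# A perfect predictor reproduces every annotation.
perfect = list(truth)

# A chance-level predictor: correct on exactly half of each class.
chance = ["SM", "SM", "GM", "GM", "SM", "SM", "GM", "GM"]

perfect_score = f_measure(truth, perfect)
chance_score = f_measure(truth, chance)
print(perfect_score, chance_score)
```

With balanced classes, the chance-level predictor has precision and recall of 0.5 each, which is why 0.5 serves as the random baseline against which 0.74 and 0.92 are read.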
Significance
With the growing number of sequenced non-model species, a major challenge in plant biology is to ascertain gene function. Model species such as Arabidopsis thaliana have large amounts of experimentally backed annotations that non-model species lack. We show how to use a model species to better annotate gene function in a non-model species using a technique called transfer learning. In particular, we focus on genes involved in specialized metabolism (SM), or metabolism specific to a certain plant lineage, which are not well known because of the sizeable diversity of specialized metabolites (SMs) among plant species. We use Arabidopsis to predict SM genes in tomato, a species with many SMs of interest but with poorer annotation than Arabidopsis.