In metabolic engineering and synthetic biology applications, promoters with appropriate strengths are critical. Promoter traits, including excessive sequence length and restricted vocabulary size, are considered to limit the effectiveness of natural language models on tasks involving genetic sequences. We propose EVMP (Extended Vision Mutant Priority framework), which enhances various machine learning models regardless of model structure. The synthetic promoter input to EVMP is split into a base promoter and k-mer mutations, which are encoded by BaseEncoder and VarEncoder, respectively. We applied EVMP to various machine learning models for promoter strength prediction on the Trc synthetic promoter library. The MAE (mean absolute error) of the LSTM was reduced from 0.50 to 0.18, the MAE of the Transformer was reduced from 0.27 to 0.177, and the MAE of other models was also slightly reduced by EVMP. By virtually expanding the dataset using multiple different base promoters, EVMP exploits ensemble learning and lowers the MAE further. Further investigation demonstrated that EVMP can improve the performance of machine learning models by reducing the over-smoothing phenomenon caused by the similarity of synthetic promoters. Additionally, when EVMP is applied, only 56% of the original synthetic promoter dataset is needed to match the previous result, which reduces the cost of creating the synthetic promoter library. Our research demonstrates that EVMP is versatile and robust for building machine learning models on synthetic sequences. The source code is available at https://github.com/Tiny-Snow/EVMP.
Introduction: In metabolic engineering and synthetic biology applications, promoters with appropriate strengths are critical. However, annotating promoter strength experimentally is time-consuming and laborious. Constructing mutation-based synthetic promoter libraries that span multiple orders of magnitude of promoter strength is therefore receiving increasing attention. A number of machine learning (ML) methods have been applied to synthetic promoter strength prediction, but existing models are limited by the excessive similarity between synthetic promoters.

Methods: To enhance ML models so that they better predict synthetic promoter strength, we propose EVMP (Extended Vision Mutant Priority), a universal framework that utilizes mutation information more effectively. In EVMP, synthetic promoters are equivalently transformed into a base promoter and the corresponding k-mer mutations, which are input into BaseEncoder and VarEncoder, respectively. EVMP also provides optional data augmentation, which generates multiple copies of the data by selecting different base promoters for the same synthetic promoter.

Results: On the Trc synthetic promoter library, EVMP was applied to multiple ML models and improved their performance to varying extents, by up to 61.30% (MAE), while the SOTA (state-of-the-art) record was improved by 15.25% (MAE) and 4.03% (R²). Data augmentation based on multiple base promoters further improved model performance, by 17.95% (MAE) and 7.25% (R²) compared with the non-EVMP SOTA record.

Discussion: Further study shows that extended vision (or k-mer) is essential for EVMP. We also found that EVMP can alleviate the over-smoothing phenomenon, which may contribute to its effectiveness. Our work suggests that EVMP can highlight the mutation information of synthetic promoters and significantly improve the accuracy of strength prediction. The source code is publicly available on GitHub: https://github.com/Tiny-Snow/EVMP.
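The decomposition of a synthetic promoter into a base promoter plus k-mer mutations can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). It assumes the two sequences have equal length and differ only by substitutions, and it takes a "k-mer mutation" to be the window of k bases centred on each substituted position. The function names `kmer_mutations` and `augment`, the default `k`, and the toy sequences are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def kmer_mutations(base: str, synthetic: str, k: int = 5) -> List[Tuple[int, str]]:
    """Return (position, k-mer) pairs, one per substituted position.

    Assumption: sequences are the same length and differ only by
    substitutions; insertions and deletions are not handled here.
    """
    if len(base) != len(synthetic):
        raise ValueError("this sketch only handles equal-length sequences")
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    half = k // 2
    mutations = []
    for i, (b, s) in enumerate(zip(base, synthetic)):
        if b != s:
            # Window is clipped at the sequence ends, so edge k-mers are shorter.
            start = max(0, i - half)
            end = min(len(synthetic), i + half + 1)
            mutations.append((i, synthetic[start:end]))
    return mutations


def augment(bases: List[str], synthetic: str, k: int = 5) -> List[Dict[str, object]]:
    """Build one (base, mutations) view of the same promoter per base promoter."""
    return [
        {"base": base, "mutations": kmer_mutations(base, synthetic, k)}
        for base in bases
    ]


# Toy sequences, not real promoters.
print(kmer_mutations("AAAAAAAA", "AAACAAAA", k=3))  # [(3, 'ACA')]
```

In this sketch the base sequence would feed something like BaseEncoder and the list of k-mers something like VarEncoder, while `augment` mirrors the optional data augmentation by pairing one synthetic promoter with several candidate base promoters.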