Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data, for efficient usage in subsequent prediction tasks. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far use the SMILES notation for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
The Transmembrane and Coiled-Coil Domains 1 (TMCO1) protein is encoded by the TMCO1 gene, which consists of 7 exons. Previous studies have identified multiple TMCO1 variants in patients with cerebro-facio-thoracic dysplasia (CFTD), and the TMCO1 locus was also shown to be associated with primary open angle glaucoma (POAG). However, only a limited number of studies have reported associations of TMCO1 gene sequence variants, and the majority of the findings affirm the pathogenicity of the nonsense and frameshift TMCO1 variants and their associations with clinical phenotypes. Thus, the functional properties of the single nucleotide variants causing amino acid changes in TMCO1 are yet to be comprehensively elucidated. In this study, we evaluated the effects of amino acid substitutions on protein structure and identified their putative roles in post-translational modifications (PTMs) and in regulatory mechanisms of the TMCO1 protein. We classified 41 missense variants as pathogenic based on combined scores of common in silico tools (SIFT, MutationTaster2, Polyphen2). Of these 41 variants, four (p.K211Q, p.K105E, p.S235F, p.K237R) were located in PTM sites and regulatory protein binding sites; thus, they were proposed to be putative functional variants. Moreover, rs1387528611 (p.Lys128Gln) also had strong evidence (RegulomeDB score=2b) for a possible regulatory function. The results of our in silico analyses highlight the functional importance of the missense TMCO1 variants that may contribute to the TMCO1-associated disease phenotypes, and further in vivo evaluation is needed to uncover their role in human diseases.
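The abstract above does not say how the scores of the three tools were combined, so the following is only a minimal sketch of one plausible consensus rule: a variant is kept when at least a chosen number of tools flag it as damaging. The function name `consensus_pathogenic`, the `min_agree` parameter, and the variant labels are hypothetical; the per-tool calls are placeholder values, not results from the study.

```python
def consensus_pathogenic(calls, min_agree=3):
    """Return the variants that at least `min_agree` tools flag as damaging.

    `calls` maps a variant label to a dict of {tool name: bool}, where True
    means the tool predicted a damaging effect. This unanimous-by-default
    rule is an assumption, not the combination scheme used in the study.
    """
    return sorted(
        variant
        for variant, tool_calls in calls.items()
        if sum(tool_calls.values()) >= min_agree
    )


# Placeholder input: labels and calls are invented for illustration only.
example_calls = {
    "variant_1": {"SIFT": True, "MutationTaster2": True, "Polyphen2": True},
    "variant_2": {"SIFT": True, "MutationTaster2": False, "Polyphen2": True},
    "variant_3": {"SIFT": False, "MutationTaster2": False, "Polyphen2": False},
}

print(consensus_pathogenic(example_calls))               # strict: all three tools agree
print(consensus_pathogenic(example_calls, min_agree=2))  # relaxed: any two tools agree
```

Lowering `min_agree` trades specificity for sensitivity, which is the main design choice in any such consensus filter.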
Motivation: Identifying unknown functional properties of proteins is an important task for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting biomolecular functions. Biological ontologies, such as the Gene Ontology (GO), which provide a standardized vocabulary of information about biological entities, are frequently employed in protein function prediction. Results: In this study, we proposed a new method called Domain2GO that predicts associations between protein domains and GO terms, thus redefining the problem as domain function prediction, using documented protein-level GO annotations together with proteins' domain content. To obtain reliable associations, co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling. An ablation study was conducted to compare the predictive performance of various implementations of Domain2GO, differing from each other by the utilized statistical measure (e.g., information theory inspired similarity measures and the ones calculated by the expectation-maximization algorithm). As a use-case study, examples selected from the finalized domain-GO term mappings were evaluated for their biological relevance via a literature review. Then, we applied the proposed method to predict presently unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For protein function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having exceptionally low computational costs. The approach presented here can be extended to other ontologies and biological entities in order to investigate unknown relationships in complex and large-scale biological data. Availability and implementation: The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO.
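The co-annotation idea described in the abstract above can be illustrated with a small sketch. It is not the Domain2GO implementation: it swaps in a plain conditional frequency (the share of proteins carrying a domain that also carry a GO term) in place of the resampling-based and expectation-maximization measures the abstract mentions, and it omits the statistical resampling step entirely. The function names, the `threshold` parameter, and the toy identifiers (`DOM_A`, `GO_X`, and so on) are all hypothetical.

```python
from collections import Counter
from itertools import product


def domain_go_scores(protein_domains, protein_go_terms):
    """Score each (domain, GO term) pair by co-annotation frequency.

    The score is the fraction of proteins containing the domain that are
    also annotated with the GO term. Both arguments map a protein ID to a
    set of labels.
    """
    domain_counts = Counter()
    pair_counts = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go_terms.get(protein, set())
        domain_counts.update(domains)
        pair_counts.update(product(domains, go_terms))
    return {
        (domain, term): count / domain_counts[domain]
        for (domain, term), count in pair_counts.items()
    }


def propagate(protein_domains, scores, threshold=0.75):
    """Transfer domain-associated GO terms to proteins holding those domains.

    Each predicted term keeps the best score among the protein's domains;
    pairs scoring below `threshold` are ignored.
    """
    predictions = {}
    for protein, domains in protein_domains.items():
        best = {}
        for (domain, term), score in scores.items():
            if domain in domains and score >= threshold:
                best[term] = max(best.get(term, 0.0), score)
        predictions[protein] = best
    return predictions


# Toy training data: invented proteins, domains, and GO labels.
annotated_domains = {"P1": {"DOM_A"}, "P2": {"DOM_A", "DOM_B"}, "P3": {"DOM_B"}}
annotated_go = {"P1": {"GO_X"}, "P2": {"GO_X", "GO_Y"}, "P3": {"GO_Y"}}

scores = domain_go_scores(annotated_domains, annotated_go)
predictions = propagate({"Q1": {"DOM_B"}}, scores)
print(predictions)
```

In this toy set, `DOM_B` co-occurs with `GO_Y` in every protein that carries it, so only `GO_Y` is propagated to the unannotated protein `Q1`. A real pipeline would also need to account for the GO hierarchy and for chance co-occurrence, which this sketch does not attempt.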