We investigate whether a neural, encoder-decoder translation system learns syntactic information on the source side as a by-product of training. We propose two methods to detect whether the encoder has learned local and global source syntax. A fine-grained analysis of the syntactic structure learned by the encoder reveals which kinds of syntax are learned and which are missing.
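The abstract does not spell out the two detection methods; one standard way to probe an encoder for syntax is to train a lightweight classifier on its frozen hidden states and compare against a trivial baseline. The sketch below is illustrative only: the random `encoder_states` and `pos_tags` arrays stand in for real encoder outputs and gold part-of-speech labels, and the probe is not the authors' setup.

```python
# Minimal syntactic probing sketch (illustrative; not the paper's exact method).
# Assumption: `encoder_states` holds frozen NMT encoder vectors per token,
# and `pos_tags` holds gold part-of-speech labels for the same tokens.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5000, 512))  # placeholder for real encoder outputs
pos_tags = rng.integers(0, 17, size=5000)      # placeholder for real POS labels

X_tr, X_te, y_tr, y_te = train_test_split(encoder_states, pos_tags, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# If the probe beats a majority-class baseline, the frozen encoder states
# carry (at least locally decodable) syntactic information.
print("probe accuracy:", probe.score(X_te, y_te))
```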
The de novo design of antimicrobial therapeutics involves the exploration of a vast chemical repertoire to find compounds with broad-spectrum potency and low toxicity. Here, we report an efficient computational method for the generation of antimicrobials with desired attributes. The method leverages guidance from classifiers trained on an informative latent space of molecules modelled using a deep generative autoencoder, and screens the generated molecules using deep-learning classifiers as well as physicochemical features derived from high-throughput molecular dynamics simulations. Within 48 days, we identified, synthesized and experimentally tested 20 candidate antimicrobial peptides, of which two displayed high potency against diverse Gram-positive and Gram-negative pathogens (including multidrug-resistant Klebsiella pneumoniae) and a low propensity to induce drug resistance in Escherichia coli. Both peptides have low toxicity, as validated in vitro and in mice. We also show using live-cell confocal imaging that the bactericidal mode of action of the peptides involves the formation of membrane pores. The combination of deep learning and molecular dynamics may accelerate the discovery of potent and selective broad-spectrum antimicrobials.
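The generate-then-screen loop described above can be pictured as sampling from the autoencoder's latent prior and keeping only candidates that the attribute classifiers accept. The sketch below is a schematic under assumed interfaces: `decode`, `amp_score`, and `tox_score` are toy stand-ins for the trained decoder and classifiers, not the authors' implementation.

```python
# Schematic of classifier-guided latent-space generation and screening.
# All three helper functions are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
AA = "ACDEFGHIKLMNPQRSTVWY"

def decode(z):
    # Stand-in for the trained decoder: maps a latent vector to a peptide
    # string (here, a deterministic toy mapping for illustration).
    idx = (np.abs(z[:20]) * 1000).astype(int) % len(AA)
    return "".join(AA[i] for i in idx)

def amp_score(seq):
    return rng.random()  # stand-in antimicrobial classifier score in [0, 1]

def tox_score(seq):
    return rng.random()  # stand-in toxicity classifier score in [0, 1]

def sample_candidates(n=5, latent_dim=100, amp_thresh=0.8, tox_thresh=0.2):
    kept = []
    while len(kept) < n:
        z = rng.normal(size=latent_dim)  # sample from the latent prior
        seq = decode(z)
        # Keep only sequences the classifiers judge potent and non-toxic;
        # a full pipeline would add the physicochemical screens derived
        # from molecular dynamics simulations.
        if amp_score(seq) >= amp_thresh and tox_score(seq) <= tox_thresh:
            kept.append(seq)
    return kept

print(sample_candidates())
```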
We introduce a new approach to tackle the problem of offensive language in online social media. Our approach uses unsupervised text style transfer to translate offensive sentences into non-offensive ones. We propose a new method for training encoder-decoders on non-parallel data that combines a collaborative classifier, attention, and a cycle consistency loss. Experimental results on data from Twitter and Reddit show that our method outperforms a state-of-the-art text style transfer system on two out of three quantitative metrics and produces reliable non-offensive transferred sentences.
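The cycle consistency idea is that a sentence transferred to the non-offensive style and then transferred back should reconstruct the original. The sketch below illustrates that loss with toy linear modules; the real models would be attention-based encoder-decoders trained jointly with the collaborative classifier, which are not shown here.

```python
# Sketch of a cycle-consistency loss for unsupervised style transfer
# (toy linear stand-ins for the two transfer directions; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim = 32
G_off2non = nn.Linear(emb_dim, emb_dim)  # offensive -> non-offensive
G_non2off = nn.Linear(emb_dim, emb_dim)  # non-offensive -> offensive

def cycle_consistency_loss(x_off):
    # Round trip: transfer to the non-offensive style and back; the
    # reconstruction should match the original sentence representation.
    x_non = G_off2non(x_off)
    x_rec = G_non2off(x_non)
    return F.mse_loss(x_rec, x_off)

x = torch.randn(8, emb_dim)   # batch of sentence embeddings
loss = cycle_consistency_loss(x)
loss.backward()               # trains both transfer directions jointly
print(loss.item())
```

With non-parallel data there is no ground-truth translation to supervise against, so the round-trip reconstruction is what keeps the transfer content-preserving while the classifier pushes the style.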
Predicting the properties of a chemical molecule is of great importance in many applications, including drug discovery and material design. Machine learning-based models promise to enable more accurate and faster molecular property predictions than current state-of-the-art techniques, such as Density Functional Theory calculations or wet-lab experiments. Various supervised machine learning models, including graph neural nets, have demonstrated promising performance on molecular property prediction tasks. However, the vast chemical space and the limited availability of property labels make supervised learning challenging, calling for a general-purpose molecular representation. Recently, unsupervised transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. The model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural nets and language models, on several classification and regression tasks from ten benchmark datasets, while performing competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.

Main

Machine Learning (ML) has emerged as an appealing, computationally efficient approach for predicting molecular properties, with implications in drug discovery and material engineering. ML models for molecules can be trained directly on pre-defined chemical descriptors, such as unsupervised molecular fingerprints [1], or on hand-derived geometric features such as the Coulomb Matrix (CM) [2]. More recent ML models, however, have focused on automatically learning features, either from the natural graphs that encode connectivity information or from line annotations of molecular structures, such as the popular SMILES [3] (Simplified Molecular-Input Line Entry System) representation. SMILES defines a character-string representation of a molecule by performing a depth-first pre-order spanning-tree traversal of the molecular graph, generating symbols for each atom, bond, tree-traversal decision, and broken cycle. The resulting character string therefore corresponds to a flattening of a spanning tree of the molecular graph. Learning on SMIL...
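The SMILES flattening described above can be reproduced with a standard cheminformatics toolkit. The sketch below uses RDKit for illustration; the abstract does not specify MoLFormer's own preprocessing pipeline, so the choice of library is an assumption.

```python
# SMILES as a depth-first flattening of the molecular graph, illustrated
# with RDKit (an illustrative choice; not necessarily MoLFormer's toolkit).
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, parsed from a SMILES string
canonical = Chem.MolToSmiles(mol)      # canonical depth-first traversal
print(canonical)                       # e.g. "Oc1ccccc1"

# The ring-closure digits ("1" above) mark the broken cycles mentioned in
# the text: the aromatic ring is opened at one bond so that the graph
# flattens into a linear string of atom and bond symbols.
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), atom.GetIsAromatic())
```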
Given the emerging global threat of antimicrobial resistance, new methods for next-generation antimicrobial design are urgently needed. We report a peptide generation framework PepCVAE, based on a semi-supervised variational autoencoder (VAE) model, for designing novel antimicrobial peptide (AMP) sequences. Our model learns a rich latent space of the biological peptide context by taking advantage of abundant, unlabeled peptide sequences. The model further learns a disentangled antimicrobial attribute space by using the feedback from a jointly trained AMP classifier that uses limited labeled instances. The disentangled representation allows for controllable generation of AMPs. Extensive analysis of the PepCVAE-generated sequences reveals superior performance of our model in comparison to a plain VAE, as PepCVAE generates novel AMP sequences with higher long-range diversity, while being closer to the training distribution of biological peptides.
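The joint training described above can be summarized as a VAE objective (reconstruction plus KL regularization) augmented with a classifier term on an attribute slice of the latent code. The sketch below uses placeholder linear networks and toy features; it is a schematic of the semi-supervised idea, not the PepCVAE architecture.

```python
# Schematic semi-supervised VAE objective with an attribute classifier
# (placeholder linear networks; not the actual PepCVAE model).
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, attr_dim, feat_dim = 64, 8, 128
encoder = nn.Linear(feat_dim, 2 * (latent_dim + attr_dim))  # outputs mu, logvar
decoder = nn.Linear(latent_dim + attr_dim, feat_dim)
amp_clf = nn.Linear(attr_dim, 1)  # AMP classifier reads only the attribute code

def semi_supervised_vae_loss(x, y):
    mu, logvar = encoder(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
    recon = F.mse_loss(decoder(z), x)                      # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Classifier feedback on the attribute slice of z encourages a
    # disentangled antimicrobial dimension; unlabeled batches (the
    # abundant peptide data) would simply skip this term.
    attr = z[:, -attr_dim:]
    clf = F.binary_cross_entropy_with_logits(amp_clf(attr).squeeze(-1), y)
    return recon + kl + clf

x = torch.randn(16, feat_dim)           # toy peptide features
y = torch.randint(0, 2, (16,)).float()  # AMP / non-AMP labels
print(semi_supervised_vae_loss(x, y).item())
```

Because the classifier only sees the attribute slice of the latent code, sweeping that slice at generation time gives the controllable AMP generation the abstract describes.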