Drug discovery for a protein target is a laborious, long, and costly process. Machine learning approaches, and deep generative networks in particular, can substantially reduce development time and costs. However, the majority of methods require prior knowledge of protein binders, their physicochemical characteristics, or the three-dimensional structure of the protein. The method proposed in this work generates novel molecules with a predicted ability to bind a target protein by relying on its amino acid sequence only. We consider target-specific de novo drug design as a translation problem between the amino acid “language” and the simplified molecular-input line-entry system (SMILES) representation of the molecule. To tackle this problem, we apply the Transformer neural network architecture, a state-of-the-art approach in sequence transduction tasks. The Transformer is based on a self-attention technique, which allows it to capture long-range dependencies between items in a sequence. The model generates realistic, diverse compounds with structural novelty. The computed physicochemical properties and common metrics used in drug discovery fall within the plausible drug-like range of values.
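Since the abstract hinges on self-attention capturing long-range dependencies, a minimal NumPy sketch of single-head scaled dot-product self-attention may help. The shapes and random weights are illustrative assumptions, not the paper's model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) token embeddings; w_q/w_k/w_v project to d_k.
    Every position attends to every other position in a single step,
    which is how long-range dependencies are captured.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                     # (seq_len, seq_len) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v                                  # weighted mix of value vectors

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (6, 4)
```

In the full Transformer this operation is repeated over multiple heads and layers, but the quadratic all-pairs score matrix above is the core mechanism.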
Most deep learning models for molecule generation are based on recurrent neural networks (RNNs), which are commonly used to model sequence data. The key feature that lets an RNN work with sequential data is its ability to use information from preceding steps, so it can reveal links between distant elements of a sequence [8]. Unfortunately, RNNs suffer from the vanishing-gradient problem, which significantly limits their ability to handle long sequences.
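The vanishing-gradient limitation can be illustrated numerically: backpropagation through a vanilla RNN multiplies the gradient by the recurrent Jacobian once per time step, so its norm typically decays geometrically over long sequences. The weight scale and the representative tanh derivative below are illustrative assumptions, not measurements from any particular model.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 32
# Recurrent weights scaled so the matrix is contractive (a common regime).
w_hh = rng.normal(size=(hidden, hidden)) * 0.3 / np.sqrt(hidden)

grad = rng.normal(size=hidden)
norms = []
for step in range(50):
    # Each backward step applies W_hh^T and the activation derivative;
    # tanh'(h) <= 1, and we use a representative value of 0.7 here.
    grad = (w_hh.T @ grad) * 0.7
    norms.append(np.linalg.norm(grad))

print(norms[0] > norms[-1])  # True: the gradient signal has shrunk
```

Self-attention sidesteps this because any two positions interact directly rather than through a chain of recurrent multiplications.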
Background: In the process of retrotransposition, LINEs use their own machinery for copying and inserting themselves into new genomic locations, while SINEs are parasitic and require the machinery of LINEs. The exact mechanism by which a LINE-encoded reverse transcriptase (RT) recognizes its own and SINE RNA remains unclear. However, it was shown for the stringent-type LINEs that recognition of a stem-loop at the 3′UTR by RT is essential for retrotransposition. For the relaxed-type LINEs, the poly-A tail is believed to be a common recognition element between LINE and SINE RNA. However, polyadenylation is a property of any messenger RNA, and how the LINE RT distinguishes transposon from non-transposon RNAs remains an open question. It is likely that RNA secondary structures play an important role in RNA recognition by LINE-encoded proteins.

Results: Here we selected a set of L1 and Alu elements from the human genome and investigated their sequences for the presence of position-specific stem-loop structures. We found highly conserved stem-loop positions at the 3′UTR. Comparative structural analyses of a human L1 3′UTR stem-loop showed a similarity to the 3′UTR stem-loops of the stringent-type LINEs, which were experimentally shown to be recognized by LINE RT. The consensus stem-loop structure consists of a 5–7 bp loop and an 8–10 bp stem with a bulge at a distance of 4–6 bp from the loop. The results show that a stem-loop with a bulge also exists at the 3′ end of Alu. We also found conserved stem-loop positions at the 5′UTR and at the end of ORF2 and discuss their possible roles.

Conclusions: Here we present evidence for a highly conserved 3′UTR stem-loop structure in L1 and Alu retrotransposons in the human genome. Both stem-loops show structural similarity to the stem-loops of the stringent-type LINEs experimentally confirmed as essential for retrotransposition. We hypothesize that both L1 and Alu RNA are recognized by L1 RT via the 3′-end RNA stem-loop structure.
Other conserved stem-loop positions in L1 suggest possible functions in protein-RNA interactions, but to date no experimental evidence has been reported.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3344-4) contains supplementary material, which is available to authorized users.
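The consensus structure described in the Results (an 8–10 bp stem closed by a 5–7 residue loop) can in principle be searched for with a brute-force scan over candidate stem and loop lengths. The sketch below is a hypothetical illustration, not the authors' pipeline: it requires perfect Watson-Crick pairing and ignores bulges and G·U wobble pairs.

```python
# Watson-Crick complements for RNA.
COMP = {"A": "U", "U": "A", "G": "C", "C": "G"}

def pairs(a, b):
    return COMP.get(a) == b

def find_hairpins(rna, stem_range=(8, 10), loop_range=(5, 7)):
    """Return (start, stem_len, loop_len) for each perfect hairpin found."""
    hits = []
    n = len(rna)
    for stem in range(stem_range[0], stem_range[1] + 1):
        for loop in range(loop_range[0], loop_range[1] + 1):
            total = 2 * stem + loop
            for i in range(n - total + 1):
                left = rna[i:i + stem]
                right = rna[i + stem + loop:i + total]
                # The 3' arm must pair with the 5' arm read backwards.
                if all(pairs(l, r) for l, r in zip(left, reversed(right))):
                    hits.append((i, stem, loop))
    return hits

# A constructed example: 8 bp stem, 5 nt loop, flanked by unpaired bases.
hairpin = "GGCUAGGC" + "AAUAA" + "GCCUAGCC"
print(find_hairpins("AU" + hairpin + "CG"))  # [(2, 8, 5)]
```

A realistic analysis would instead use a thermodynamic folding tool (e.g. RNA secondary-structure prediction software) to score candidate stem-loops, since biological stems tolerate mismatches, bulges, and wobble pairing.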
This study examined the potential of machine learning models for water level forecasting in mountain river reaches in Krasnodar Krai, based on water level observations at automated hydrological complexes of the Automated System of Flood Conditions Monitoring in Krasnodar Krai. The study objects were two mountain rivers of Krasnodar Krai, the Pshish and the Mzymta, which flow in different natural conditions and differ in their water regimes and in the character of lateral inflow within the reaches under consideration. The study focused on three widely used machine learning architectures: the M5P regression-tree model, gradient boosting of decision trees (XGBoost), and an artificial neural network based on a multilayer perceptron (MLP). Forecast quality was evaluated for lead times from 1 to 20 h; variations for rivers with different water regimes and the potential of the examined models for operational forecasting are discussed. The optimal lead time for the Pshish River was found to be 15–18 h (with S/σΔ varying within 0.38–0.39 for the XGBoost model); the simulation quality for the Mzymta River is evaluated as good, although the required forecast efficiency is not attained (at a lead time of 5 h, S/σΔ = 0.87 for the MLP model). The obtained results allow the machine learning models to be regarded as acceptable for short-term hydrological forecasting based on high-frequency water level observations.
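The S/σΔ criterion quoted above can be sketched as follows: S is the RMSE of the forecast errors, and σΔ is the standard deviation of the water-level change over the lead time. In Russian hydrological practice a ratio well below 1 (commonly S/σΔ ≤ 0.674) marks an effective forecast; treat that exact threshold, and the synthetic data below, as assumptions for illustration.

```python
import numpy as np

def s_over_sigma(observed, predicted, lead_steps):
    """Forecast-quality ratio: RMSE of errors over std of level change
    across the lead time. Smaller is better; values near 1 mean the
    forecast is no better than assuming the level does not change."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    s = np.sqrt(np.mean((predicted - observed) ** 2))        # forecast RMSE
    delta = observed[lead_steps:] - observed[:-lead_steps]   # change over lead time
    return s / np.std(delta)

# Synthetic random-walk "water level" and a hypothetical noisy forecast.
rng = np.random.default_rng(42)
levels = np.cumsum(rng.normal(0.0, 1.0, 500)) + 100.0
forecast = levels + rng.normal(0.0, 0.5, 500)
ratio = s_over_sigma(levels, forecast, lead_steps=5)
print(ratio < 1.0)
```

With real data, `observed` and `predicted` would be aligned so that each prediction was issued `lead_steps` observations earlier.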