Cocrystals are of great interest in both industrial applications and academic research, and screening suitable coformers for active pharmaceutical ingredients is the most crucial and challenging step in cocrystal development. Machine learning techniques have recently attracted researchers in many fields, including pharmaceutical research on quantitative structure-activity/property relationships. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from the simplified molecular-input line-entry system (SMILES) representations of compounds and compare the models in experiments on our collected dataset of 1476 instances. We found that the artificial neural network shows great potential, as it achieved the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors, chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.
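The three metrics named in this abstract (accuracy, sensitivity, and F1 score) can be computed from a binary confusion matrix. The following is a minimal sketch, assuming labels are encoded as 1 (cocrystal forms) and 0 (no cocrystal); the function name and the toy labels are hypothetical and are not taken from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity (recall), and F1 for binary labels.

    Assumption: 1 = positive class (cocrystal forms), 0 = negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


# Toy labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity is reported alongside accuracy because it measures how many true cocrystal-forming pairs a model recovers, which matters when missed coformers are costly in screening.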
Malaria caused by the Plasmodium falciparum parasite remains one of the most threatening and dangerous illnesses. Chloroquine (CQ) and first-line artemisinin-based combination treatment (ACT) have long been the drugs of choice for treating and controlling malaria; however, CQ-resistant and artemisinin-resistant parasites have now emerged in most areas where malaria is endemic. In this work, we developed five machine learning models to predict the antimalarial bioactivity of a drug against Plasmodium falciparum from features (i.e., molecular descriptor values) computed by the PaDEL software from the SMILES of compounds, and compared the models in experiments on our collected dataset of 4794 instances. We found that three of the five models, namely artificial neural network (ANN), extreme gradient boost (XGB), and random forest (RF), outperform the others in terms of accuracy. We also observed that, using roughly a quarter of the descriptors picked by the feature selection algorithm, the five models achieved comparable performance. In addition, we investigated the contribution of all molecular descriptors to the models by comparing the rank values assigned by the feature selection algorithm, and found that descriptors from the ‘Autocorrelation’ module contributed the most, while those from the ‘Atom type electrotopological state’ module contributed the least.
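The idea of ranking descriptors and keeping roughly the top quarter can be illustrated as follows. The abstract does not name the feature selection algorithm, so this sketch swaps in a simple stand-in: ranking by absolute Pearson correlation with the label. The function names, descriptor names, and toy values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_fraction(X, y, names, fraction=0.25):
    """Rank descriptors by |Pearson r| with the label and keep the top fraction.

    Assumption: this ranking criterion is a stand-in, not the paper's algorithm.
    """
    scored = []
    for j, name in enumerate(names):
        column = [row[j] for row in X]
        scored.append((abs(pearson(column, y)), name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    keep = max(1, int(len(names) * fraction))
    return [name for _, name in scored[:keep]]


# Toy descriptor matrix: 4 compounds x 4 hypothetical descriptors.
names = ["desc_a", "desc_b", "desc_c", "desc_d"]
X = [
    [0.0, 5.0, 1.0, 0.2],
    [0.1, 5.0, -1.0, 0.9],
    [0.9, 5.0, 1.0, 0.1],
    [1.0, 5.0, -1.0, 0.8],
]
y = [0, 0, 1, 1]  # 1 = active, 0 = inactive (illustrative labels)
print(select_top_fraction(X, y, names, fraction=0.25))
```

A filter of this kind scores each descriptor independently of the model; wrapper or embedded methods would instead score descriptors using the trained model itself.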
Pharmaceutical cocrystals of pelubiprofen (PF) were discovered for the first time. Sixteen candidate coformers for PF were selected via the ANN model and the pKa rule.
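The pKa rule is commonly stated in terms of ΔpKa = pKa(conjugate acid of the base) − pKa(acid). The sketch below assumes the widely used rule-of-thumb thresholds (ΔpKa above about 3 suggests a salt, below 0 suggests a cocrystal, and the range in between is ambiguous); the abstract does not state which thresholds were applied, and the function name and pKa values are hypothetical.

```python
def delta_pka_outcome(pka_base, pka_acid, salt_above=3.0, cocrystal_below=0.0):
    """Classify an acid-base pair by the delta-pKa rule of thumb.

    pka_base: pKa of the conjugate acid of the basic component.
    pka_acid: pKa of the acidic component.
    Assumption: the default thresholds are the common rule of thumb,
    not values confirmed by the source.
    """
    delta = pka_base - pka_acid
    if delta > salt_above:
        return delta, "salt likely"
    if delta < cocrystal_below:
        return delta, "cocrystal likely"
    return delta, "ambiguous"


# Hypothetical pKa values for illustration only.
print(delta_pka_outcome(2.0, 4.5))  # weak base with a carboxylic acid
print(delta_pka_outcome(9.0, 4.5))  # strong base with a carboxylic acid
print(delta_pka_outcome(5.5, 4.5))  # intermediate case
```

Used as a pre-filter, a rule like this removes coformers expected to form salts before a learned model ranks the remaining candidates.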
Vonoprazan (VPZ) is the first-in-class potassium-competitive acid blocker (P-CAB) and has many advantages over proton pump inhibitors (PPIs). It is administered as a fumarate salt for the treatment of acid-related diseases, including reflux esophagitis, gastric ulcer, and duodenal ulcer, and for eradication of Helicobacter pylori. To discover novel cocrystals of VPZ, we adopted an artificial neural network (ANN)-based machine learning model as a virtual screening tool to guide selection of the most promising coformers for VPZ cocrystals. Experimental screening by liquid-assisted grinding (LAG) confirmed that 8 of 19 coformers selected by the ANN model were likely to create new solid forms with VPZ. Structurally similar benzenediols and benzenetriols, i.e., catechol (CAT), resorcinol (RES), hydroquinone (HYQ), and pyrogallol (GAL), were used as coformers to obtain phase-pure cocrystals with VPZ by reaction crystallization. We successfully prepared and characterized three novel cocrystals: VPZ–RES, VPZ–CAT, and VPZ–GAL. VPZ–RES had the highest solubility among the novel cocrystals studied here, and was even more soluble than the commercially available fumarate salt of VPZ in solution at pH 6.8. In addition, the novel VPZ cocrystals had superior stability in aqueous media compared with VPZ fumarates, demonstrating their potential for improved pharmaceutical performance.
The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has driven the growth of Web data. Social media and Internet forums are valuable sources of citizens' opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) has become a barrier to the development of natural language processing applications in low-resource languages. Taking advantage of the recent growth of online forums and news platforms in Swahili, we introduce two Swahili datasets in this paper: a pre-training dataset of approximately 105 MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it on four downstream tasks, including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, in almost all downstream tasks.