After surveying existing methods for document clustering, we propose a new dynamic method based on a genetic algorithm (GA). K-means is a greedy algorithm that is sensitive to the choice of initial cluster centers and easily becomes trapped in a local optimum. A genetic algorithm, by contrast, is a globally convergent search method that can locate the best cluster centers more readily. In traditional document clustering methods, the document similarity matrix is sparse. In this paper, we propose new formulas that improve on the traditional method. We also make several improvements to the genetic algorithm: all individuals are encoded as floating-point numbers, and the sum of the mean square deviations of the intra-class distances is adopted as the objective function. The steps of the algorithm are given in detail. The experimental results show that the accuracy of the GA can exceed 98 percent and that it produces better clustering results than K-means.
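The abstract does not give the algorithm's steps, so the following is only a minimal sketch of the general idea: a GA whose individuals are floating-point vectors of cluster centers and whose objective is the within-cluster sum of squared distances. The operators (tournament selection, arithmetic crossover, Gaussian mutation, two-individual elitism), all parameter values, and the names `sse`, `decode` and `ga_cluster` are assumptions chosen for illustration, not the authors' method.

```python
import random

def sse(centers, points):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

def decode(chrom, k, dim):
    """Split a flat list of floats into k centers of length dim."""
    return [chrom[i * dim:(i + 1) * dim] for i in range(k)]

def ga_cluster(points, k, pop_size=30, generations=60, mut_rate=0.1, seed=0):
    """Hypothetical GA: returns (centers, objective value) of the best individual."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each individual is seeded from k randomly chosen data points (assumption).
    pop = [[x for p in rng.sample(points, k) for x in p] for _ in range(pop_size)]

    def cost(chrom):
        return sse(decode(chrom, k, dim), points)

    for _ in range(generations):
        pop.sort(key=cost)
        nxt = pop[:2]  # elitism: carry over the two best individuals unchanged
        while len(nxt) < pop_size:
            # Tournament selection of two parents (lower cost wins).
            a = min(rng.sample(pop, 3), key=cost)
            b = min(rng.sample(pop, 3), key=cost)
            # Arithmetic crossover on the floating-point genes.
            w = rng.random()
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]
            # Gaussian mutation applied gene by gene.
            child = [g + rng.gauss(0, 0.1) if rng.random() < mut_rate else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    best = min(pop, key=cost)
    return decode(best, k, dim), cost(best)
```

Calling `ga_cluster(points, k=2)` on a list of equal-length float lists returns the decoded centers and their objective value. Because the best individuals are carried over each generation, the returned objective is never worse than the best one in the initial population.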
Background: Traditional Chinese Medicine (TCM) is a style of traditional medicine informed by modern medicine but built on a foundation of more than 2500 years of Chinese medical practice. According to statistics, TCM accounts for approximately 14% of total adverse drug reaction (ADR) spontaneous reporting data in China. Because the complexity of the components in TCM formulas makes TCM essentially different from Western medicine, it is critical to determine whether ADR reports of TCM should be analyzed independently. Methods: Reports in the Chinese spontaneous reporting database between 2010 and 2011 were selected. The dataset was processed and divided into the total sample (all data) and the subsample (TCM data only). Four ADR signal detection methods currently in wide use in China (PRR, ROR, MHRA and IC) were applied to the two samples. After comparing the experimental results, three of them (PRR, MHRA and IC) were chosen for the experiment. We designed several indicators for performance evaluation based on the reference database, namely R (recall ratio), P (precision ratio) and D (discrepancy ratio), and then constructed a decision tree for data classification based on these indicators. Results: For PRR: R1-R2 = 0.72%, P1-P2 = 0.16% and D = 0.92%; for MHRA: R1-R2 = 0.97%, P1-P2 = 0.20% and D = 1.18%; for IC: R1-R2 = 1.44%, P2-P1 = 4.06% and D = 4.72%. The thresholds of R, P and D were set at 2%, 2% and 3%, respectively. Based on the decision tree, the results are “separation” for PRR, MHRA and IC. Conclusions: To improve the efficiency and accuracy of signal detection, we suggest that TCM data be separated from the total sample when conducting analyses.
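For readers unfamiliar with the disproportionality measures named in this abstract, the sketch below computes PRR, ROR and a simple point estimate of IC from a 2x2 report table. The cell names `a`, `b`, `c`, `d`, the function names and the example counts are assumptions for illustration only. The abstract does not state the signal thresholds, the MHRA criteria or the decision tree rules, so none of those are reproduced here, and the IC shown is the plain log-ratio form without the Bayesian shrinkage used in practice.

```python
import math

# 2x2 table of spontaneous reports (hypothetical cell names):
#   a: target drug,  target event     b: target drug,  other events
#   c: other drugs,  target event     d: other drugs,  other events

def prr(a, b, c, d):
    """Proportional reporting ratio: event share for the drug vs. all other drugs."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: odds of the event for the drug vs. all other drugs."""
    return (a * d) / (b * c)

def ic_point(a, b, c, d):
    """Simple information component: log2 of observed over expected report count."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return math.log2(a / expected)

# Made-up counts, not taken from the paper.
counts = (20, 80, 100, 9800)
print(prr(*counts), ror(*counts), ic_point(*counts))
```

All three functions assume non-zero cells; a real pipeline would need to handle empty cells and add confidence intervals before declaring a signal.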
Adverse drug reactions (ADRs) are a major source of morbidity and mortality, yet few studies have predicted drug risk level from ADRs. Our study aims to predict drug risk level from ADRs using machine learning approaches. A total of 985,960 ADR reports from 2011 to 2018 were obtained from the Chinese spontaneous reporting database (CSRD) in Jiangsu Province. The reports involved 887 Prescription (Rx) Drugs (84.72%), 113 Over-the-Counter-A (OTC-A) Drugs (10.79%) and 47 OTC-B Drugs (4.49%). Because the classes are imbalanced, an over-sampling method, the Synthetic Minority Oversampling Technique (SMOTE), was applied. First, we proposed a multi-classification framework based on SMOTE and classifiers. Second, drugs in the CSRD were taken as the samples, and ADR signal values calculated by the proportional reporting ratio (PRR) or the information component (IC) were taken as the features. We then applied four classifiers to the labeled data: Random Forest (RF), Gradient Boost (GB), Logistic Regression (LR) and AdaBoost (ADA). After evaluating the classification results with specific metrics, we identified the optimal combination for our framework, PRR-SMOTE-RF, with an accuracy of 0.95. We anticipate that this study can serve as a strong auxiliary basis for experts judging whether to change the status of Rx Drugs to OTC Drugs. INDEX TERMS Adverse drug reaction, drug risk level, imbalanced dataset, multi-classification, machine learning, SMOTE.
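The over-sampling step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified pure-Python version written for illustration; the function name `smote`, the parameter values and the toy feature vectors are assumptions, and the abstract does not say which implementation or settings the authors used.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from a list of minority feature vectors.

    Assumes at least two minority samples, all of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position between x and the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class with two features per drug (made-up values).
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]
new_samples = smote(minority, n_new=6)
```

In a framework like the one described, synthetic samples of this kind would be added to the smaller classes of the training set only, before fitting a classifier, so that evaluation data stays free of synthetic points.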