Coronary artery disease (CAD) is the category of cardiovascular disease with the highest mortality rate worldwide. CAD occurs when plaque builds up on the walls of the arteries that supply blood to the heart and other organs. To help reduce this mortality rate, a practical model capable of predicting CAD is needed. Machine learning approaches have been applied to problems in many domains, including biomedicine. However, real-world data often have an imbalanced class distribution that can degrade classifier performance, and they frequently contain many features to process. This study focuses on building an effective model for predicting CAD, using feature selection to handle high-dimensional data and resampling to handle imbalanced data. Feature selection is effective because it eliminates irrelevant features from the training data. Hyperparameter tuning is also performed to find the best parameter combination for the support vector machine (SVM). Our results show that SVM with ten-fold cross-validation produces a more accurate training model; furthermore, grid search combined with ten-fold cross-validation yields the most accurate training model and achieves 88% accuracy on the test data.
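The pipeline this abstract describes (feature selection, then grid search over SVM hyperparameters with ten-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stand-in dataset, the `SelectKBest` selector, and the parameter grid are all assumptions, since the abstract does not specify them.

```python
# Sketch: feature selection + grid-searched SVM with 10-fold CV.
# Dataset and grid are illustrative stand-ins, not the paper's own.
from sklearn.datasets import load_breast_cancer  # stand-in for a CAD dataset
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Feature selection drops irrelevant features before the classifier sees them.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("svm", SVC()),
])

# Illustrative hyperparameter grid; the abstract does not list the real one.
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipe, param_grid, cv=10)  # ten-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 3))
```

Putting the selector inside the `Pipeline` matters: it ensures feature selection is re-fitted on each cross-validation fold, avoiding leakage from the held-out fold into the selection step.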
One of the public e-government services is a web-based online complaint portal. Complaint texts need to be classified so that they can be forwarded to the responsible office quickly and accurately. The classification approaches commonly used, the Naive Bayes Classifier (NBC) and k-Nearest Neighbor (k-NN), still assign only a single label and need to be optimized. This research aims to classify complaint texts into more than one label at a time using NBC optimized with Particle Swarm Optimization (PSO). The data come from the Sambat Online portal and are split into 70% training data and 30% testing data, to be classified into seven labels. The NBC and k-NN algorithms are used as comparison methods to evaluate the performance of the PSO optimization. Ten-fold cross-validation shows that NBC optimized with PSO achieves an accuracy of 87.44%, better than k-NN at 75% and NBC at 64.38%. The optimized model can be used to increase the effectiveness of e-government services to society.
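At its core, PSO is a population-based search over a parameter space, which is how it can be bolted onto a classifier such as NBC. The sketch below shows only that generic PSO loop; the one-dimensional toy objective stands in for a classifier's validation error, since the abstract does not specify which NBC quantities (e.g. feature weights or a smoothing parameter) the PSO actually tunes.

```python
# Minimal Particle Swarm Optimization loop (toy stand-in objective).
# In the paper's setting, `objective` would be the NBC's validation error.
import random

def pso(objective, lo, hi, n_particles=20, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the interval [lo, hi] with basic PSO."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                              # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]   # swarm-wide best so far
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Velocity update: inertia + pull toward personal and global bests.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # clamp to bounds
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val

# Toy objective: pretend classification error is minimized at 0.3.
best_param, best_err = pso(lambda a: (a - 0.3) ** 2, 0.0, 1.0)
```

Because PSO only evaluates the objective as a black box, wrapping the NBC's cross-validated error in `objective` requires no gradients, which is why it pairs naturally with discrete text classifiers.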
The rise of big data analytics on top of NLP increases the computational burden of text processing at scale. NLP problems involve very high-dimensional text, which demands substantial computational resources. MapReduce allows large computations to be parallelized and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP tasks based on a deep learning approach. We classify a large corpus of news topics by fine-tuning BERT using pre-trained models. Five pre-trained models with different numbers of parameters were used in this study. To measure the efficiency of this method, we compared the performance of BERT alone with pipelines from Spark NLP. The results show that BERT without Spark NLP gives higher accuracy than BERT with Spark NLP: averaged over all models, accuracy and training time are 0.9187 and 35 minutes for BERT alone, versus 0.8444 and 9 minutes for BERT with the Spark NLP pipeline. Bigger models take more computational resources and need longer to complete the tasks. However, the accuracy of BERT with Spark NLP decreased by only 5.7% on average, while training time was reduced significantly, by 62.9%, compared to BERT without Spark NLP.

CCS CONCEPTS • Computing methodologies → Artificial intelligence → Natural language processing • Computing methodologies → Parallel computing methodologies → Parallel algorithms.
Background: Term weighting plays a key role in detecting emotion in texts. Studies of term-weighting schemes aim to improve short-text classification by distinguishing terms accurately. Objective: This study aims to identify the best term-weighting schemes and to discover the relationship between n-gram combinations and different classification algorithms in detecting emotion in Twitter texts. Methods: The data used were the Indonesian Twitter Emotion Dataset, with features generated through different n-gram combinations. Two approaches were used to assign weights to the features. Tests were carried out using ten-fold cross-validation on three classification algorithms, and model performance was measured using accuracy and F1 score. Results: The term-weighting schemes with the highest performance were Term Frequency-Inverse Category Frequency (TF-ICF) and Term Frequency-Relevance Frequency (TF-RF). Schemes with a supervised approach performed better than unsupervised ones; however, we did not find a consistent advantage, as some experiments showed that Term Frequency-Inverse Document Frequency (TF-IDF) also performed exceptionally well. The traditional TF-IDF method therefore remains worth considering as a term-weighting scheme. Conclusion: This study provides recommendations for emotion detection in texts. Future studies can benefit from addressing imbalance in the dataset to achieve better performance. Keywords: Emotion Detection, Feature Engineering, Term-Weighting, Text Mining
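The contrast between an unsupervised scheme (TF-IDF, which ignores labels) and a supervised one (TF-ICF, which uses the emotion categories) can be made concrete on a toy corpus. The formulas below follow common textbook definitions and may differ in detail from the exact variants the study evaluated; the three-document corpus is invented for illustration.

```python
# Toy comparison of unsupervised TF-IDF vs supervised TF-ICF weighting.
# Formulas are common definitions, not necessarily the study's variants.
import math

docs = [("i feel happy today", "joy"),
        ("so angry right now", "anger"),
        ("happy happy joy", "joy")]
n_docs = len(docs)
categories = {label for _, label in docs}

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)
    # Document frequency: how many documents contain the term.
    df = sum(1 for text, _ in docs if term in text.split())
    return tf * math.log(n_docs / df)

def tf_icf(term, doc_tokens):
    tf = doc_tokens.count(term)
    # Category frequency: how many emotion classes contain the term.
    cf = sum(1 for c in categories
             if any(term in t.split() for t, lab in docs if lab == c))
    return tf * math.log(len(categories) / cf)

tokens = docs[0][0].split()
# "happy" occurs in 2 of 3 documents but in only 1 of 2 categories,
# so the supervised scheme weights it more heavily here.
print(tf_idf("happy", tokens), tf_icf("happy", tokens))
```

The example shows why supervised schemes can help: "happy" is not rare across documents, but it is concentrated in a single emotion class, and only TF-ICF can see that distinction.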