Search and classify topics in a corpus of text using the latent dirichlet allocation model

Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Salazar, José Luis Herrera; Beltozar-Clemente, Saul; Pucuhuayla-Revatta, Félix; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael

doi:10.11591/ijeecs.v30.i1.pp246-256

Cited by 5 publications

(5 citation statements)

References 24 publications

“…severe side effect), from the patient's perspective on the drug under consideration. For a pair of two linked nodes representing two patients, the thickness of the link between them indicates how comparable their semantic content is, which is determined by adding up all the words (terms) in both reviews that appear to be about the same topic and have non-zero weights for TF-IDF (see [55]− [58]). Green links are those that provide support (are in favor) and red links are those that provide opposition (are against).…”

Section: PPIs Graph Generation (mentioning)

confidence: 99%

Patient-patient interactions visualization for drug side effects in patients’ reviews

Salah, Elsoud, Salah et al. 2024. IJEECS

This paper describes a framework for extracting patient-patient interaction (PPI) graphs from patients’ review transcripts. The concept is to visualise patients as nodes and their interactions as links. Links are made based on review text similarity. Nodes are categorized as positive or negative according to the patient’s attitude toward a given drug. Attitudes are then used to categorize the links as supporting or opposing the use of a certain drug. If both patients share the same attitude, negative (severe side effect) or positive (moderate side effect), the link is considered supportive; if not, it is considered opposing. The resulting graph represents a drug as a dispute between two factions arguing about that drug. The framework is explained and evaluated using a dataset that includes 3,763 patients’ reviews linked to 255 different drugs, -predictive-value (0.37). We argue that this is caused by derogatory jargon, which is an expected feature of patients’ reviews. The true-negative-recognition-rate is 0.70 and the true-positive-recognition-rate is 0.54. Total-average-accuracy, which is independent of class priors, is 0.66. The results show that it is possible to use text proximity measures and sentiment analysis to capture PPI structure.

“…searching and classifying topics in a text corpus [30], improving document classification using domain-specific vocabulary [31], and customer opinion mining using Twitter topic modeling and logistic regression [32]. While applicable to a large corpus of documents, LDA makes some rigid assumptions regarding a corpus, suggesting areas for improvisation.…”

Section: Generate LDA Models (mentioning)

confidence: 99%

A data-driven analysis to determine the optimal number of topics 'K' for latent Dirichlet allocation model

Goyal, Kashyap 2024. IJEECS

Topic modeling is an unsupervised machine learning technique successfully used to classify and retrieve textual data. However, the performance of topic models is sensitive to selecting optimal hyperparameters, the number of topics 'K' and Dirichlet priors 'α' and 'β.' This data-driven analysis aims to determine the optimum number of topics, 'K,' within the latent Dirichlet allocation (LDA) model. This work utilizes three datasets, namely 20-Newsgroups news articles, Wikipedia articles, and Web of Science containing science articles, to assess and compare various 'K' values through the grid search approach. The grid search approach finds the best combination of hyperparameter values by trying all possible combinations to see which performs best. This research seeks to identify the 'K' that optimizes topic relevance, coherence, and model performance by leveraging statistical metrics, such as coherence scores, perplexity, and topic distribution quality. Through empirical analysis and rigorous evaluation, this work provides valuable insights for determining the ideal 'K' for LDA models.

“…Finally, the MLP model is characterized as one of the best predictors, this predictor learns a feature from a set of inputs and combines the different features in a set of outputs, the performance rate of this model has been 99%, and it is a result with a high pre-accuracy rate, which allows it to be a reliable option for the prediction of breast cancer. Also, [20], [21] used this model with three clinical factors: age, cancer cell type, and cell surface receptors, obtaining satisfactory results, with a performance rate of 98%. The summary of the analysis of the 6 models used in this work to predict breast cancer is presented in Table V.…”

Section: J Model Training and Testing (mentioning)

confidence: 99%

“…Using features associated with cancer cell imaging, breast cancer can be predicted using ML models. This field of action is in constant development from two deans to after [19], [20].…”

Section: Introduction (mentioning)

confidence: 99%

Breast Cancer Prediction using Machine Learning Models

Iparraguirre-Villanueva, Epifanía-Huerta, Torres-Ceclén et al. 2023. IJACSA

Breast cancer is a type of cancer that develops in the cells of the breast. Treatment for breast cancer usually involves X-ray, chemotherapy, or a combination of both treatments. Detecting cancer at an early stage can save a person's life. Artificial intelligence (AI) plays a very important role in this area. Therefore, predicting breast cancer remains a very challenging issue for clinicians and researchers. This work aims to predict the probability of breast cancer in patients using machine learning (ML) models such as Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), AdaBoost (AB), Bagging, Gradient Boosting (GB), and Random Forest (RF). The breast cancer diagnostic medical dataset from the Wisconsin repository has been used. The dataset includes 569 observations and 32 features. Following the data analysis methodology, data cleaning, exploratory analysis, training, testing, and validation were performed. The performance of the models was evaluated with the parameters: classification accuracy, specificity, sensitivity, F1 score, and precision. The training and results indicate that the six trained models can provide optimal classification and prediction results. The RF, GB, and AB models achieved 100% accuracy, outperforming the other models. Therefore, the suggested models for breast cancer identification, classification, and prediction are RF, GB, and AB. Likewise, the Bagging, KNN, and MLP models achieved a performance of 99.56%, 95.82%, and 96.92%, respectively. Similarly, the last three models achieved an optimal yield close to 100%. Finally, the results show a clear advantage of the RF, GB, and AB models, as they achieve more accurate results in breast cancer prediction.
