2023
DOI: 10.1145/3582000

A Comparative Survey of Instance Selection Methods applied to Non-Neural and Transformer-Based Text Classification

Abstract: Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Deep Learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate this problem is through data engineering (DE), rather than through algorithmic or hardware improvements. Our focus here is an under-investigated DE technique…
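To make the abstract's core idea concrete: instance selection (IS) shrinks the training set before model fitting. The sketch below is a minimal illustration of one classic IS heuristic, an edited-nearest-neighbours (ENN) style filter implemented with scikit-learn, and is not any specific method surveyed in the paper; the dataset and parameter choices are ours.

```python
# Minimal sketch of instance selection (an ENN-style filter): drop training
# instances whose label disagrees with the vote of their k nearest
# neighbours. Illustrative only; not the paper's surveyed methods.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

def enn_select(X, y, k=5):
    """Boolean mask keeping instances whose label matches the neighbourhood
    vote. The point itself counts as one neighbour here, which makes the
    filter slightly conservative."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    return knn.predict(X) == y

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = np.array(data.target)

mask = enn_select(X, y, k=5)
print(f"kept {mask.sum()} of {len(y)} instances "
      f"(reduction = {1 - mask.mean():.1%})")
```

A model fine-tuned on `X[mask], y[mask]` then trains on fewer instances, which is exactly the cost saving the abstract targets.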

Cited by 21 publications (7 citation statements)
References 61 publications
“…Four representative models are selected for evaluation: TextRNN, Transformer (Cunha et al., 2023), BERT (base size) (Pérez Pozo et al., 2022; Wang et al., 2022; Wang et al., 2024), and LLaMA 2 (7B size) (Touvron et al., 2023). These models, widely acclaimed and adopted, collectively embody distinct stages in the progression of deep learning and present rich diversity.…”
Section: Experiments Setting
confidence: 99%
“…The following studies evaluate various IS methods for various domains and contexts. Cunha et al. [26] is one such study; the key evaluation metrics in their analysis are the mean reduction (R), macro-averaged F1, and the speedup of training times. This comprehensive approach not only highlights the significant potential of IS in modern text classification tasks but also provides empirical evidence that specific IS methods can streamline training without compromising the effectiveness of complex machine learning models.…”
Section: Related Work
confidence: 99%
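The three metrics named in this snippet are simple to compute. The helpers below are our own illustration (the function names are hypothetical, and the paper's exact formulations may differ); macro-averaged F1 comes straight from scikit-learn.

```python
# Hypothetical helpers illustrating the metrics mentioned above: reduction,
# speedup, and macro-averaged F1. Names and formulations are ours.
from sklearn.metrics import f1_score

def reduction(n_original: int, n_selected: int) -> float:
    """Fraction of the training set removed by instance selection."""
    return 1.0 - n_selected / n_original

def speedup(t_full_s: float, t_reduced_s: float) -> float:
    """How many times faster training runs on the reduced set."""
    return t_full_s / t_reduced_s

# Macro-averaged F1 weights every class equally, which matters for the
# skewed label distributions common in text classification.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]
print(reduction(10_000, 6_500))                    # 0.35
print(speedup(120.0, 45.0))                        # ~2.67
print(f1_score(y_true, y_pred, average="macro"))
```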
“…Self-attention allows Transformers to easily transmit information across the input sequences. Inspired by [17, 18], we have implemented scikit-learn's transformer architecture, sklearn.preprocessing.FunctionTransformer.…”
Section: Transformers
confidence: 99%
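For context on the API this snippet names: sklearn.preprocessing.FunctionTransformer wraps a plain Python callable as a stateless pipeline step; it is unrelated to the self-attention Transformer architecture. A minimal usage sketch (the pipeline and data here are our own illustration):

```python
# Minimal use of sklearn.preprocessing.FunctionTransformer: it lifts an
# arbitrary callable into a (stateless) pipeline step. It does not
# implement the self-attention Transformer architecture.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

log_scale = FunctionTransformer(np.log1p)  # applies log(1 + x) elementwise

X = np.abs(np.random.RandomState(0).randn(100, 4))
y = (X.sum(axis=1) > 4).astype(int)

pipe = make_pipeline(log_scale, LogisticRegression())
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```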