Class imbalance learning (CIL) has become one of the most challenging research topics in machine learning. In this article, we propose a Boosted co-training method that modifies the class distribution so that traditional classifiers can be readily adapted to imbalanced datasets. This article is among the first to utilize the pseudo-labelled data produced by co-training to enlarge the training set of minority classes. Compared with existing oversampling methods, which generate synthetic minority samples from labelled data only, the proposed method can learn from unlabelled data and thereby reduces the risk of overfitting. Furthermore, we propose a boosting-style technique that implicitly modifies the class distribution, and we combine it with co-training to alleviate the bias towards majority classes. Finally, we collect the two series of classifiers generated during Boosted co-training to build an ensemble for classification, which further improves CIL performance by leveraging the strength of ensemble learning. By taking advantage of the diversity of co-training, we also contribute a new approach to generating base classifiers for ensemble learning. The proposed method is compared with eight state-of-the-art CIL methods on a variety of benchmark datasets. Measured by G-Mean, F-Measure, and AUC, Boosted co-training achieves the best performance and average ranks on 18 benchmark datasets. The experimental results demonstrate the significant superiority of Boosted co-training over other CIL methods.

KEYWORDS: boosting, class-imbalanced learning, co-training, over-sampling, pseudo-labelled data
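The abstract describes three ingredients: pseudo-labelling unlabelled minority candidates via co-training, boosting-style reweighting that implicitly rebalances the class distribution, and an ensemble built from the two series of classifiers. The following is a minimal Python sketch of such a loop; the two-view split, decision-tree base learners, confidence threshold, and the specific reweighting rule are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def class_balance_weights(y, minority_label=1):
    """Boosting-style sample weights that implicitly rebalance the class distribution."""
    n_min = np.sum(y == minority_label)
    n_maj = len(y) - n_min
    w = np.where(y == minority_label, n_maj, n_min).astype(float)
    return w / w.sum()

def boosted_co_training(X1, X2, y, U1, U2, rounds=10, conf=0.9, minority_label=1):
    """X1/X2: two feature views of the labelled data; U1/U2: views of the unlabelled data.
    Assumes binary labels {0, 1} with the minority class encoded as 1."""
    view1, view2 = [], []
    for _ in range(rounds):
        # Train one classifier per view on the (re)weighted labelled data.
        w = class_balance_weights(y, minority_label)
        h1 = DecisionTreeClassifier(max_depth=3).fit(X1, y, sample_weight=w)
        h2 = DecisionTreeClassifier(max_depth=3).fit(X2, y, sample_weight=w)
        view1.append(h1)
        view2.append(h2)

        if len(U1) == 0:
            continue
        # Pseudo-label unlabelled samples that both views confidently predict as
        # minority; adding them enlarges the minority training set with real
        # (not synthetic) samples, unlike oversampling from labelled data alone.
        p1 = h1.predict_proba(U1)[:, minority_label]
        p2 = h2.predict_proba(U2)[:, minority_label]
        keep = (p1 >= conf) & (p2 >= conf)
        if keep.any():
            X1 = np.vstack([X1, U1[keep]])
            X2 = np.vstack([X2, U2[keep]])
            y = np.concatenate([y, np.full(keep.sum(), minority_label)])
            U1, U2 = U1[~keep], U2[~keep]
    return view1, view2

def predict(view1, view2, X1, X2):
    """Majority vote over both series of classifiers (the final ensemble)."""
    votes = np.array([h.predict(X1) for h in view1] + [h.predict(X2) for h in view2])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In this sketch, the diversity needed for the ensemble comes from the two feature views and from the changing training set across rounds, echoing the abstract's point that co-training itself supplies a source of base-classifier diversity.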
1 | INTRODUCTION

Many real-world classification tasks suffer from the class imbalance problem, where minority classes are heavily under-represented compared with majority classes. Traditional classifiers are designed to output the hypothesis that minimizes the overall prediction error. As a result, they tend to be biased towards majority classes and therefore perform poorly on minority classes (Kaur et al., 2019). However, minority classes are usually more valuable in real applications, such as fraud detection, medical diagnosis, spam classification, and many others. For example, in rare-disease diagnosis, a classifier that identifies all patients as normal cases is useless even if it achieves 99% accuracy. Therefore, learning from class-imbalanced data has become one of the most challenging topics in machine learning. Numerous CIL techniques have been proposed over the past decades; they can be roughly grouped into the following two categories:

i. Data-level methods preprocess a dataset (e.g., by oversampling (Chawla et al., 2002) or undersampling (Kubat & Matwin, 1997)) to make it suitable for standard classification algorithms. This approach is classifier-independent; however, generating a perfectly balanced distribution does not always yield an optimal result for classification tasks (Wu & Chang, 2003).