2021
DOI: 10.1038/s41598-021-01460-7

Multi-label classification of research articles using Word2Vec and identification of similarity threshold

Abstract: Every year, around 28,100 journals publish 2.5 million research publications. Search engines, digital libraries, and citation indexes are used extensively to search these publications. When a user submits a query, it returns a large number of documents, among which only a few are relevant. Due to inadequate indexing, the resulting documents are largely unstructured. Publicly known systems mostly index research papers using keywords rather than a subject hierarchy. Numerous methods reported for perform…
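The abstract describes assigning multiple subject labels to an article by comparing document vectors (here, Word2Vec embeddings) against a similarity threshold. A minimal sketch of that idea, where the toy 3-dimensional vectors, the label centroids, and the 0.5 threshold are illustrative assumptions rather than the paper's actual values:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_labels(doc_vec, label_vecs, threshold=0.5):
    # Multi-label assignment: keep every label whose vector is at least
    # `threshold` cosine-similar to the document vector (zero or more labels).
    return [label for label, vec in label_vecs.items()
            if cosine_similarity(doc_vec, vec) >= threshold]

# Toy "embeddings" for illustration only.
labels = {"ML": [0.9, 0.1, 0.0], "IR": [0.1, 0.9, 0.0]}
print(assign_labels([0.8, 0.3, 0.1], labels, threshold=0.5))  # → ['ML']
```

In a real pipeline the document vector would come from averaging Word2Vec word vectors, and the threshold would be tuned on held-out data, as the title's "identification of similarity threshold" suggests.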

Cited by 25 publications (14 citation statements)
References 31 publications
“…The smaller the Hamming loss, the better the model performance. We also used average precision and average recall, where the partially correct concept is considered to calculate the average over all the samples. We used the exact match accuracy, where a result is considered correct only when the predicted set of labels exactly matches the true label set for each sample. We also calculated the AUROC score for each PFAS and calculated the average AUROC for each multilabel model (equation in SI Table S6). The development and evaluation of ML models were coded with the sklearn, xgboost, CatBoost, lightgbm, and PyTorch (TabNet) packages in Python.…”
Section: Methods
confidence: 99%
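The evaluation metrics named in this citation statement have simple definitions that can be computed directly. A minimal sketch, assuming each sample's labels are encoded as a binary indicator vector; the toy data below is illustrative, not from the cited work:

```python
def hamming_loss(y_true, y_pred):
    # Fraction of individual label slots predicted incorrectly,
    # pooled across all samples. Lower is better.
    total = sum(len(t) for t in y_true)
    wrong = sum(a != b
                for t, p in zip(y_true, y_pred)
                for a, b in zip(t, p))
    return wrong / total

def exact_match_accuracy(y_true, y_pred):
    # A sample counts as correct only if its full label set matches exactly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1]]   # one wrong slot out of six
print(hamming_loss(y_true, y_pred))
print(exact_match_accuracy(y_true, y_pred))
```

Exact match is the strictest multi-label metric (one wrong label fails the whole sample), while Hamming loss gives partial credit per label slot, which is why the statement reports both.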
“…42 We also used average precision and average recall, where the partially correct concept is considered to calculate the average over all the samples. 43,44 We used the exact match accuracy, where a result is considered correct only when the predicted set of labels exactly matches the true label set for each sample. 45,42 We also calculated the AUROC score for each PFAS and calculated the average AUROC for each multilabel model (equation in SI Table S6).…”
Section: Data Preprocessing to Train a Machine Learning (ML) Model
confidence: 99%
“…Once documents are tokenized, a text feature extraction method is applied to obtain the most distinguishing features of the text, reducing dimensionality [35]-[38]. Some of the broadly used feature extraction techniques in research article classification are: 1) One-Hot Encoding, 2) Bag of Words (BOW) or Term Frequency (TF), and 3) Term Frequency-Inverse Document Frequency (TF-IDF); semantic-based approaches include: 1) GloVe, 2) FastText, and 3) Word2Vec [24].…”
Section: Related Work
confidence: 99%
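The frequency-based techniques listed here (BOW/TF and TF-IDF) can be illustrated without any library. A minimal sketch, assuming whitespace tokenization and the natural-log IDF variant idf = ln(N/df), one of several common formulations:

```python
import math
from collections import Counter

def bow(doc, vocab):
    # Bag of Words / term frequency: raw count of each vocabulary word.
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def tfidf(docs, vocab):
    # TF-IDF: term frequency down-weighted by how many documents
    # contain the term (words in every document get weight 0 here).
    n = len(docs)
    tfs = [bow(d, vocab) for d in docs]
    df = [sum(1 for tf in tfs if tf[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n / d) if d else 0.0 for d in df]
    return [[tf[i] * idf[i] for i in range(len(vocab))] for tf in tfs]

docs = ["word2vec embeds words", "tfidf weights words"]
vocab = sorted({w for d in docs for w in d.split()})
print(tfidf(docs, vocab))
```

Note how "words", which appears in both documents, gets weight 0: that discounting of uninformative common terms is exactly what separates TF-IDF from plain BOW, while the semantic approaches (GloVe, FastText, Word2Vec) instead learn dense vectors that capture word meaning.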
“…In multi-label text classification, the goal is to associate one or more labels with the input text. It is an important task with applications such as research article classification and metadata generation from documents [Mustafa et al. 2021; Sajid et al. 2011], which can be used to optimize search engine indexing.…”
Section: Introduction
confidence: 99%
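Multi-label classification is commonly reduced to one independent yes/no decision per label (the binary relevance strategy). A minimal sketch of that reduction, where the keyword-rule "classifiers" are hypothetical stand-ins for trained per-label models:

```python
def binary_relevance(classifiers, text):
    # Binary relevance: run one independent binary classifier per label;
    # the predicted label set is every label whose classifier fires.
    return sorted(label for label, clf in classifiers.items() if clf(text))

# Hypothetical keyword rules standing in for trained binary models.
classifiers = {
    "machine-learning": lambda t: "classification" in t or "word2vec" in t,
    "information-retrieval": lambda t: "indexing" in t or "search" in t,
}
print(binary_relevance(classifiers, "word2vec for search engine indexing"))
# → ['information-retrieval', 'machine-learning']
```

Because each label is decided independently, a document can receive zero, one, or several labels, which matches the "one or more labels" framing in this citation statement.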