Identifying constitutive articles of cumulative dissertation theses by bilingual text similarity. Evaluation of similarity methods on a new short text task

Donner, Paul

doi:10.1162/qss_a_00152

Cited by 4 publications

(4 citation statements)

References 38 publications


“…FastText: is a free public-source library created by the Facebook AI Research team for learning WE and classification [44,45]. For each word, FastText generates a word vector that contains both the term's meaning and its context in the document.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Word Embedding for High Performance Cross-Language Plagiarism Detection Techniques

Bouaine, Faouzia Benabbou, Imane Sadgali

2023

Int. J. Interact. Mob. Technol.


Academic plagiarism has become a serious concern as it leads to the retardation of scientific progress and violation of intellectual property. In this context, we present a study aiming at the detection of cross-linguistic plagiarism based on Natural Language Processing (NLP), Embedding Techniques, and Deep Learning. Many systems have been developed to tackle this problem, and many rely on machine learning and deep learning methods. In this paper, we propose a Cross-language Plagiarism Detection (CL-PD) method based on Doc2Vec embedding techniques and a Siamese Long Short-Term Memory (SLSTM) model. Embedding techniques help capture the text's contextual meaning and improve the CL-PD system's performance. To show the effectiveness of our method, we conducted a comparative study with other techniques such as GloVe, FastText, BERT, and Sen2Vec on a dataset combining PAN11, JRC-Acquis, Europarl, and Wikipedia. The experiments for the Spanish-English language pair show that Doc2Vec+SLSTM achieves the best results compared to other relevant models, with an accuracy of 99.81%, a precision of 99.75%, a recall of 99.88%, an f-score of 99.70%, and a very small loss in the test phase.


“…For the first stage, the goal is to rule out as many unlikely matches as possible while discarding as few actual matches as possible with relatively simple rules. We only briefly summarize this stage, as a detailed description is already available in Donner (2021b). Scopus records are filtered (a) by author name similarity for all authors affiliated with Germany and (b) publication year range.…”

Section: Filtering Of Candidate Matches (mentioning)

confidence: 99%
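The first-stage filter described in the statement above can be sketched roughly as follows. This is only an illustrative sketch, not the procedure from Donner (2021b): the `Record` type, the `filter_candidates` function, the use of `difflib.SequenceMatcher` as the name similarity measure, the 0.85 threshold, and the year window are all assumptions made for this example, and the restriction to authors affiliated with Germany is not modeled.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    """Hypothetical stand-in for a bibliographic record (not a Scopus schema)."""
    title: str
    authors: list[str]
    year: int


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_candidates(
    thesis_author: str,
    thesis_year: int,
    records: list[Record],
    min_name_sim: float = 0.85,  # assumed threshold
    years_before: int = 5,       # assumed window before the thesis year
    years_after: int = 2,        # assumed window after the thesis year
) -> list[Record]:
    """Keep records that pass both simple rules: name match and year range."""
    kept = []
    for rec in records:
        in_range = thesis_year - years_before <= rec.year <= thesis_year + years_after
        name_match = any(
            name_similarity(thesis_author, author) >= min_name_sim
            for author in rec.authors
        )
        if in_range and name_match:
            kept.append(rec)
    return kept


records = [
    Record("A", ["Schmidt, Anna", "Weber, Karl"], 2018),  # name and year match
    Record("B", ["Weber, Karl"], 2018),                   # no similar author name
    Record("C", ["Schmidt, Anna"], 2001),                 # outside the year window
]
candidates = filter_candidates("Schmidt, Anna", 2019, records)
```

Loose thresholds fit the stated goal of this stage, which is to discard unlikely matches cheaply while losing as few true matches as possible; finer decisions are left to later criteria.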

“…The second criterion is textual similarity. As this is a considerably complex task in its own right, a separate study for the selection of suitable methods was carried out in Donner (2021b) and the reader is referred to this study for a detailed description of the methods. To summarize, the difficulties relate to the sparsity of the textual information, the bilingual nature of the texts, and domain specificity of the texts.…”

Section: Content Similarity Criteria (mentioning)

confidence: 99%

Algorithmic identification of Ph.D. thesis-related publications: a proof-of-concept study

Donner

2022

Scientometrics

Self Cite


In this study we propose and evaluate a method to automatically identify the journal publications that are related to a Ph.D. thesis using bibliographical data of both items. We build a manually curated ground truth dataset from German cumulative doctoral theses that explicitly list the included publications, which we match with records in the Scopus database. We then test supervised classification methods on the task of identifying the correct associated publications among high numbers of potential candidates using features of the thesis and publication records. The results indicate that this approach results in good match quality in general and with the best results attained by the “random forest” classification algorithm.


“…In a jobmatching model, understanding context, relationships between words, and deeper meanings are required [19][20] [21]. The SI often has to limit the number of dimensions (semantic concepts) used to represent documents (difficult to interpret) [22][23] [24][25], Sensitivity to changes in documents [14] [26], and cognitive abilities [27] [16], or aspects of the job candidate's personality.…”

Section: Introduction (mentioning)

confidence: 99%

Word2vec-based Latent Semantic Indexing (Word2Vec-LSI) for Contextual Analysis in Job-Matching Application

Sukri, Samsudin, Fadzrin et al.

2024

IJACSA


Job-matching applications have become a technology that provides solutions for making decisions about accepting and looking for work. The contextual analysis of documents or data from job matching is needed to make decisions. Some existing studies on the analysis of job-matching applications can use the Latent Semantic Indexing (LSI) method, which is based on word-to-word comparisons in the text. LSI has the advantage of contextual analysis. It can analyze amounts of data above 10,000 words. However, the conventional LSI method has limitations in contextual analysis because it uses the exact words but different meanings. Therefore, this paper proposes a new technique called word2vec-based latent semantic indexing (Word2vec-LSI) for contextual analysis, which is based on gensim as a multi-language word library. Then, modeling in text and wordnet and stopword as basic text modeling. We then used word2vec-LSI to perform contextual analysis based on the Irish (IE), Swedish (SE), and United Kingdom (UK) languages in the dataset (Jobs on CareerBuilder UK). The results of applying conventional LSI have an accuracy level of 79%, recall has a value of 79%, precision has a value of 62%, and F1-score has a value of 70% with a similarity level of up to 50%. After implementing word2vec-LSI, it can increase accuracy, recall, and precision, and F1-score both have 84% in contextual analysis, and the similarity level reaches up to 95%. Experiments confirm the usefulness of word2vec-LSI in increasing accuracy for contextual analysis applicable in natural language text mining.

