Specialized document embeddings for aspect-based similarity of research papers

Ostendorff, Malte; Blume, Till; Ruas, Terry; Gipp, Béla; Rehm, Georg

doi:10.1145/3529372.3530912

Cited by 10 publications

(10 citation statements)

References 69 publications

“…The shallow notion of what composes paraphrases used by these systems limits their understanding of the task and makes it challenging to interpret detection decisions in practice. For example, although high structural and grammatical similarities can indicate plagiarism, detection systems are often not concerned with the aspects that make two text segments alike (Ostendorff et al., 2022; Wahle et al., 2022a).…”

Section: Introduction (mentioning)

confidence: 99%

Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection

Wahle, Ruas, Meuschke, et al. 2023

Preprint

Neural language models such as BERT allow for human-like text paraphrasing. This ability threatens academic integrity, as it aggravates identifying machine-obfuscated plagiarism. We make two contributions to foster the research on detecting these novel machine-paraphrases. First, we provide the first large-scale dataset of documents paraphrased using the Transformer-based models BERT, RoBERTa, and Longformer. The dataset includes paragraphs from scientific papers on arXiv, theses, and Wikipedia articles and their paraphrased counterparts (1.5M paragraphs in total). We show the paraphrased text maintains the semantics of the original source. Second, we benchmark how well neural classification models can distinguish the original and paraphrased text. The dataset and source code of our study are publicly available.

“…For example, Konaray SK et al used some experiments to analyze file extension features and used machine learning algorithms to match file suffixes and three-byte magic header information of files to achieve accurate classification of file types [1]. Josh Angichiodo et al proposed the use of generative adversarial networks (SGAN) to identify file contents by means of the problem of classifying hidden files and files after their extensions or headers have been obfuscated, this adversarial network performs supervised learning based on the contents to achieve the file classification work [2]. The second one is the clustering of document content and its similarity. Malte Ostendorff, Till Blume et al have proposed Specialized Document Embeddings for Aspect-based Similarity of Research Papers [3], which represents the content of a document as a generic embedding. It treats the similarity of documents in a given aspect as a vector similarity problem in a specific embedding space.…”

Section: Introduction (mentioning)

confidence: 99%

“…Content-based document classification is implemented using a specialized document embedding design. In addition to these classification studies for physical documents, there are many classification methods that focus only on data content, for example, Ren F et al vectorized text content using Word2Vec word embedding method or Bert model method, and then classified text content using a bidirectional long short-term memory recurrent neural network (BiLSTM) [3]. Kang Chen and Huazheng Fu analyzed the lexical features of short paths and transformed the path word embedding features into feature images using recurrent neural networks (RNN) on the dataset [4].…”

Section: Introduction (mentioning)

confidence: 99%

Application of a multimodal model optimized by multi-head-attention mechanism in the classification of fishery resource remote sensing data files

Zhang, Tao 2023

Third International Conference on Advanced Algorithms and Neural Networks (AANN 2023)

Remote sensing data is a complex transformation process from data to information, from acquisition to application. These data come from a variety of sources and have a complex structure. It is extremely difficult for different researchers to access each other's data, which seriously affects the efficiency of scientific research. How to use artificial intelligence technology to manage these scattered files in a systematic multi-level classification is the key to remote sensing data management. In this paper, we propose a specially optimized file classification method based on deep learning technology, which aims to classify the random heterogeneous data generated by satellite remote sensing and expeditions in the field of fisheries expertise by means of artificial intelligence. The main contributions of this paper are threefold: (1) The binary data itself and its summary information in remote sensing data are input as different modalities for classification. (2) The BiLSTM model and the CNN model are improved for remote sensing data classification scenarios based on the multi-headed attention mechanism. (3) Feature extraction is performed based on the improved model and compared with the model of traditional classification methods. The experimental results show that the improved file classification method has higher accuracy than the traditional machine learning classification methods in classifying fisheries remote sensing data files.

“…Prior work addresses a similar challenge in the context of document similarity and learns multiple representations associated with different aspects of a paper (e.g. task, method, results) (Mysore et al, 2022; Ostendorff et al, 2022a). In contrast, we aim to learn effective representations for multiple downstream task formats.…”

Section: Introduction (mentioning)

confidence: 99%

“…We attempt to learn task-specific embeddings of documents by pre-fine-tuning on multiple objectives simultaneously. Ostendorff et al (2022a) and Mysore et al (2022) study the orthogonal task of generating multiple embeddings per paper for different "facets," while we aim to learn general embeddings for multiple task formats.…”

Section: Introduction (mentioning)

confidence: 99%

SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

Singh, D’Arcy, Cohan, et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters and find they outperform the existing single-embedding state-of-the-art by over 2 points absolute. We release the resulting family of multi-format models, called SPECTER2, for the community to use and build on.
