Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries 2022
DOI: 10.1145/3529372.3530912
Specialized document embeddings for aspect-based similarity of research papers

Abstract: Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on a single vector representation provides only one perspective on document similarity and ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the…
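The abstract contrasts a single generic embedding per document with aspect-specific embeddings, where similarity is computed separately in each aspect's embedding space. The following is a minimal sketch of that idea; the aspect names and the random vectors are hypothetical stand-ins for embeddings a trained aspect-specific model would produce, not the paper's actual method or data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical aspect-specific embeddings for two papers. In the
# aspect-based setting, each aspect (e.g. task, method, dataset)
# has its own embedding space; here random vectors stand in for
# the output of a trained specialized encoder.
rng = np.random.default_rng(0)
aspects = ["task", "method", "dataset"]
paper_a = {a: rng.normal(size=8) for a in aspects}
paper_b = {a: rng.normal(size=8) for a in aspects}

# One similarity score per aspect, rather than a single generic
# score, so it is visible *which* aspect makes two papers alike.
per_aspect = {a: cosine(paper_a[a], paper_b[a]) for a in aspects}
for a, s in per_aspect.items():
    print(f"{a}: {s:.3f}")
```

With a single generic embedding, these three scores would collapse into one number and the per-aspect information would be lost.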

Cited by 10 publications (10 citation statements)
References 69 publications
“…The shallow notion of what composes paraphrases used by these systems limits their understanding of the task and makes it challenging to interpret detection decisions in practice. For example, although high structural and grammatical similarities can indicate plagiarism, detection systems are often not concerned with the aspects that make two text segments alike (Ostendorff et al, 2022; Wahle et al, 2022a).…”
Section: Introduction
confidence: 99%
“…For example, Konaray SK et al. ran experiments to analyze file-extension features and used machine learning algorithms to match file suffixes and the three-byte magic header information of files to achieve accurate classification of file types [1]. Josh Angichiodo et al. proposed using generative adversarial networks (SGAN) to identify file contents when extensions or headers have been obfuscated; this adversarial network performs supervised learning on the contents to achieve the file classification work [2]. The second approach is the clustering of document content by its similarity. Malte Ostendorff, Till Blume et al. proposed Specialized Document Embeddings for Aspect-based Similarity of Research Papers [3], which represents the content of a document as a generic embedding. It treats the similarity of documents in a given aspect as a vector similarity problem in a specific embedding space.…”
Section: Introduction
confidence: 99%
“…Content-based document classification is implemented using a specialized document embedding design. In addition to these classification studies for physical documents, there are many classification methods that focus only on data content, for example, Ren F et al vectorized text content using Word2Vec word embedding method or Bert model method, and then classified text content using a bidirectional long short-term memory recurrent neural network (BiLSTM) [3]. Kang Chen and Huazheng Fu analyzed the lexical features of short paths and transformed the path word embedding features into feature images using recurrent neural networks (RNN) on the dataset [4].…”
Section: Introduction
confidence: 99%
“…Prior work addresses a similar challenge in the context of document similarity and learns multiple representations associated with different aspects of a paper (e.g. task, method, results) (Mysore et al, 2022;Ostendorff et al, 2022a). In contrast, we aim to learn effective representations for multiple downstream task formats.…”
Section: Introduction
confidence: 99%
“…We attempt to learn task-specific embeddings of documents by pre-fine-tuning on multiple objectives simultaneously. Ostendorff et al. (2022a) and Mysore et al. (2022) study the orthogonal task of generating multiple embeddings per paper for different "facets," while we aim to learn general embeddings for multiple task formats.…”
Section: Introduction
confidence: 99%