Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2022)
DOI: 10.18653/v1/2022.naacl-main.331

Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity

Abstract: We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. This novel form of textual supervision is used for learning to match aspects across papers. We develop multi-vector…
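As a rough illustration of the multi-vector idea in the abstract, the sketch below represents each paper as one vector per sentence and scores a pair by late interaction: every query sentence is matched against its most similar candidate sentence. This is a minimal sketch, not the authors' released implementation; the encoder checkpoint, mean pooling, and max-similarity aggregation are assumptions for illustration (the paper learns its aspect matching from co-citation sentences).

```python
# Hedged sketch of multi-vector (per-sentence) document similarity.
# Assumption: any Hugging Face sentence encoder works; "allenai/specter"
# is one choice, not necessarily the paper's exact model.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL = "allenai/specter"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def sentence_vectors(sentences: list[str]) -> torch.Tensor:
    """Encode each sentence to one vector via mean pooling over tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (n_sent, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (n_sent, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (n_sent, dim)

def aspect_similarity(query_sents: list[str], cand_sents: list[str]) -> float:
    """Late-interaction score: for each query sentence, take the best-matching
    candidate sentence (cosine similarity), then average over query sentences."""
    q = F.normalize(sentence_vectors(query_sents), dim=-1)
    c = F.normalize(sentence_vectors(cand_sents), dim=-1)
    sim = q @ c.T                                            # (n_q, n_c) cosines
    return sim.max(dim=1).values.mean().item()
```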

Cited by 9 publications (6 citation statements) · References 46 publications
“…The most basic idea behind calculating REDi is the fact that papers cite papers strongly related to their own research. There is a study by Mysore et al. (2022) that is based on the same idea. In this study, the degree of relatedness strength is replaced by the distance between the fields.…”
Section: Methods
Confidence: 99%
“…In the scientific domain, contrastive learning of cross-document links (e.g. citations) has led to improved document-level representations (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022). These representations can be indexed and consumed later by lightweight downstream models without additional fine-tuning.…”
Section: Introduction
Confidence: 99%
“…Further, we use this benchmark to investigate and improve the generalization ability of document representation models. Following recent work (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022), we pre-fine-tune a transformer model originally trained on citation triplets to produce high-quality representations for downstream tasks. We hypothesize that condensing all relevant information of a document into a single vector might not be expressive enough for generalizing across a wide range of tasks.…”
Section: Introduction
Confidence: 99%
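The citation-triplet objective referenced in the last two statements can be pictured as a standard triplet margin loss over document embeddings: an anchor paper is pulled toward a paper it cites and pushed away from an uncited one. The sketch below is a hedged reading of that setup, not the cited papers' training code; the margin value and distance function are illustrative choices.

```python
# Hedged sketch of contrastive pre-fine-tuning on citation triplets.
# Inputs: one embedding vector per paper, shape (batch, dim); names illustrative.
import torch
import torch.nn.functional as F

def citation_triplet_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,   # papers cited by the anchor
                          negative: torch.Tensor,   # papers not cited by the anchor
                          margin: float = 1.0) -> torch.Tensor:
    d_pos = F.pairwise_distance(anchor, positive)   # anchor vs. cited paper
    d_neg = F.pairwise_distance(anchor, negative)   # anchor vs. uncited paper
    return F.relu(d_pos - d_neg + margin).mean()    # hinge loss over the margin
```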