Proceedings of the Second Workshop on Scholarly Document Processing 2021
DOI: 10.18653/v1/2021.sdp-1.2

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Abstract: One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Gene…
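As a concrete illustration of the vocabulary mismatch problem the abstract describes, the sketch below scores a query against a document with Okapi BM25 before and after appending expansion terms. The corpus, query, and "generated" terms are invented for this example; the appended terms stand in for whatever an expansion model would actually produce, and this is not the paper's method, only the retrieval-side effect it targets.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against a query,
    with document frequencies taken over `corpus`."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# A query and a document that are semantically related but share no terms.
corpus = [["automobile", "maintenance", "guide"], ["cooking", "recipes"]]
query = ["car", "repair"]
print(bm25_score(query, corpus[0], corpus))  # 0.0 — vocabulary mismatch

# Append terms a (hypothetical) expansion model might generate for this doc.
expanded = corpus[0] + ["car", "repair"]
expanded_corpus = [expanded, corpus[1]]
print(bm25_score(query, expanded, expanded_corpus))  # > 0: query now matches
```

The point is that a purely lexical scorer like BM25 gives the unexpanded document a score of exactly zero here, which is why expansion (rather than retraining the retriever) can close the gap.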


Cited by 4 publications (4 citation statements)
References 29 publications
“…We compare LoGE with: The Base contains only the retrieval by BM25 scoring on documents preprocessed with the basic filters; The Pegasus4IR model, for which we have adapted Pegasus (Zhang et al., 2020), is for IR. We generate a rewritten text by applying the most widely used pretrained Pegasus model in document summarization; and UDEG (Jeong et al., 2021) is the state-of-the-art model for abstractive generation of document extensions for ad hoc search. It is the main competitive model with a similar approach to unsupervised document extension. …”
Section: Methods (mentioning)
confidence: 99%
“…UDEG (Jeong et al., 2021) is the state-of-the-art model for abstractive generation of document extensions for ad hoc search. It is the main competitive model with a similar approach to unsupervised document extension.…”
Section: Methods (mentioning)
confidence: 99%
“…Besides interpolation, Wei and Zou (2019) and Ma (2019) proposed perturbation over words, and Lee et al. (2021b) proposed perturbation over word embeddings. Jeong et al. (2021) and Gao et al. (2021) perturbed text embeddings to generate diverse sentences and to augment positive sentence pairs in unsupervised learning. In contrast, we address dense retrieval, perturbing document representations with dropout (Srivastava et al., 2014) in a supervised setting with labeled documents.…”
Section: Related Work (mentioning)
confidence: 99%
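The dropout-style perturbation of representations mentioned in the statement above can be sketched minimally as follows. The embedding values and dropout rate are arbitrary, and this stand-in operates on a plain Python list rather than any cited model's learned representation; it only shows how dropout yields multiple stochastic "views" of one vector.

```python
import random

def dropout_perturb(embedding, p=0.1, seed=None):
    """Inverted dropout over an embedding: zero each dimension with
    probability p and rescale survivors by 1/(1-p), producing a
    stochastic view of the same representation."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1 - p) for x in embedding]

embedding = [0.2, -0.5, 0.7, 0.1, -0.3]
# Three stochastic views of one representation (fixed seeds for
# reproducibility); each differs in which dimensions were dropped.
views = [dropout_perturb(embedding, p=0.4, seed=s) for s in range(3)]
```

Each surviving coordinate is rescaled by 1/(1-p) so the expected value of every dimension matches the original, which is the standard inverted-dropout convention.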