Synthetic Target Domain Supervision for Open Retrieval QA

Reddy, Revanth Gangi; Iyer, Bhavani; Sultan, Arafat; Zhang, Rong; Sil, Avirup; Castelli, Vittorio; Florian, Radu; Roukos, Salim

doi:10.1145/3404835.3463085

Cited by 4 publications

(4 citation statements)

References 27 publications

“…We call this the unconditioned question generator, since the questions are not conditioned to be about any specific entities. This serves as a baseline question generation approach and is comparable with prior work [5,14,17] in synthetic data generation for IR, which do not enforce such specific conditioning into the question generation process.…”

Section: Baselines (mentioning)

confidence: 82%

“…To compare our approach with a generation strategy that does not use any conditioning, we also train an unconditioned generation system, similar to Reddy et al [17], that generates question-answer pairs using just the passage as input. We call this the unconditioned question generator, since the questions are not conditioned to be about any specific entities.…”

Section: Baselines (mentioning)

confidence: 99%

“…Our approach follows this line of work by leveraging the attentions of an IR model over given passages as a signal for better synthetic data augmentation. Prior work has also explored synthetic question generation for both question answering [1,18,19] and neural information retrieval [5,14,17]; different approaches to generating questions from passages include: (a) unconditioned generation [5,17], (b) generation conditioned on the candidate answer phrases within the passage [1,18], and (c) conditioned on the summary of the passage [13,24]. In contrast, our approach generates questions that are targeted towards the deficiencies of a given neural IR model, by conditioning the generation on sparsely attended entities in the passage.…”

Section: Introduction (mentioning)

confidence: 99%

Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

Reddy, Sultan, Franz et al. 2022

Preprint

Self Cite

We show that supervised neural information retrieval (IR) models are prone to learning sparse attention patterns over passage tokens, which can result in key phrases including named entities receiving low attention weights, eventually leading to model underperformance. Using a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those, we teach neural IR to attend more uniformly and robustly to all entities in a given passage. On two public IR benchmarks, we empirically show that the proposed method helps improve both the model's attention patterns and retrieval performance, including in zero-shot settings. CCS CONCEPTS: • Information systems → Language models; • Computing methodologies → Natural language generation.
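
The entity-selection step described in this abstract can be illustrated with a minimal sketch. It assumes that per-token attention weights and entity token spans are already available; the function name `least_attended_entities`, the span format, and the use of mean attention as the ranking score are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def least_attended_entities(
    attention: List[float],
    entity_spans: List[Tuple[str, int, int]],
    k: int = 1,
) -> List[str]:
    """Return the k entities whose tokens receive the lowest mean attention.

    attention: one weight per passage token (assumed to come from an IR model).
    entity_spans: (entity_text, start_token, end_token) with an exclusive end.
    """
    scored = []
    for name, start, end in entity_spans:
        weights = attention[start:end]
        if not weights:
            # Skip empty or out-of-range spans instead of dividing by zero.
            continue
        scored.append((sum(weights) / len(weights), name))
    # Sort ascending so the most weakly attended entities come first.
    scored.sort()
    return [name for _, name in scored[:k]]


# Hypothetical attention weights over a five-token passage.
attention = [0.4, 0.3, 0.05, 0.05, 0.2]
spans = [("Paris", 0, 1), ("Marie Curie", 2, 4), ("1903", 4, 5)]
targets = least_attended_entities(attention, spans, k=2)
```

The selected entities would then be passed to a question generator as conditioning input, so that the synthetic questions focus on the parts of the passage the retriever attends to least.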

“…In addition to prompt-based generation of training data, there are multiple proposals for self-supervised adaptation of out-of-domain models using generative pseudo-labeling [22,38,51]. To this end, questions or queries are generated using a pretrained seq2seq model (though an LLM can be used as well) and negative examples are mined using either BM25 or an out-of-domain retriever or ranker.…”

Section: Related Work (mentioning)

confidence: 99%
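
The negative-mining step mentioned in this statement can be sketched as follows. This is a minimal pure-Python illustration that assumes whitespace-tokenized text and a standard Okapi BM25 scoring formula. The function names `bm25_scores` and `mine_negatives`, the parameter defaults, and the toy corpus are hypothetical and do not come from any of the cited systems. The query stands in for a synthetic question that a seq2seq model would have generated from the positive passage.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_negatives(
    query: List[str], docs: List[List[str]], positive_idx: int, num_negatives: int = 2
) -> List[int]:
    """Return indices of the top-scoring documents other than the positive."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != positive_idx][:num_negatives]


# Toy corpus; document 0 is the passage the synthetic query was generated from.
docs = [
    "the cat sat on the mat".split(),
    "a cat chased the mouse".split(),
    "stock prices fell sharply".split(),
]
query = "cat mat".split()
hard_negatives = mine_negatives(query, docs, positive_idx=0, num_negatives=1)
```

Taking the highest-scoring non-positive passages yields lexically similar "hard" negatives. An out-of-domain retriever or ranker, as the statement notes, could replace `bm25_scores` without changing the rest of the sketch.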

InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers

Boytsov, Patel, Sourabh et al. 2023

Preprint

We carried out a reproducibility study of the InPars recipe for unsupervised training of neural rankers [4]. As a by-product of this study, we developed a simple-yet-effective modification of InPars, which we called InPars-light. Unlike InPars, InPars-light uses only a freely available language model, BLOOM, and 7x-100x smaller ranking models. On all five English retrieval collections (used in the original InPars study) we obtained substantial (7-30%) and statistically significant improvements over BM25 in nDCG or MRR using only a 30M parameter six-layer MiniLM ranker. In contrast, in the InPars study only a 100x larger MonoT5-3B model consistently outperformed BM25, whereas their smaller MonoT5-220M model (which is still 7x larger than our MiniLM ranker) outperformed BM25 only on MS MARCO and TREC DL 2020. In a purely unsupervised setting, our 435M parameter DeBERTa v3 ranker was roughly on par with the 7x larger MonoT5-3B: in fact, on three out of five datasets, it slightly outperformed MonoT5-3B. Finally, these good results were achieved by re-ranking only 100 candidate documents compared to 1000 used in InPars. We believe that InPars-light is the first truly cost-effective prompt-based unsupervised recipe to train and deploy neural ranking models that outperform BM25. CCS CONCEPTS: • Information systems → Retrieval models and ranking.
