BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Lewis, Michael A.; Liu, Yinhan; Goyal, Naman; Ghazvininejad, Marjan; Mohamed, Abdelrahman; Levy, Omer; Stoyanov, Ves; Zettlemoyer, Luke

doi:10.48550/arxiv.1910.13461

Cited by 651 publications

(972 citation statements)

References 0 publications

Supporting

Mentioning: 963

Contrasting

Unclassified

“…Unlike BERT that is only applicable to language understanding via one encoder, MASS [37] pre-trains an encoder-decoder model for language generation via masked sequence to sequence learning proxy tasks. Mostly recently, BART [17] generalizes BERT for both language understanding and generation by combining bidirectional and auto-regressive transformers for pre-training. Taking the inspiration from MASS and BART, our work pursuits their vision-language counterpart by pre-training a universal encoder-decoder structure and fine-tuning it to both vision-language perception and generation tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Yehao, Fan, Pan, et al. 2022

ACM Trans. Multimedia Comput. Commun. Appl.

Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure, consisting of three modules: object and sentence encoders that separately learns the representations of each modality and sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy including, from simple to comprehensive, individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.

“…By learning on this multi-class classification problem (the auxiliary task), the model can learn general features from these images that can be used for classification (the main task) later. There are self-supervised techniques across computer vision [31], [32], natural language processing [33], and speech recognition tasks [34]. In anomaly detection, [35], [36] use self-supervised visual representation learning to learn the features of in-distribution (normal) samples.…”

Section: Related Work (mentioning)

confidence: 99%

Adaptive Memory Networks with Self-supervised Learning for Unsupervised Anomaly Detection

Zhang, Wang, Chen, et al. 2022

Preprint

Unsupervised anomaly detection aims to build models to effectively detect unseen anomalies by only training on the normal data. Although previous reconstruction-based methods have made fruitful progress, their generalization ability is limited due to two critical challenges. First, the training dataset only contains normal patterns, which limits the model generalization ability. Second, the feature representations learned by existing models often lack representativeness which hampers the ability to preserve the diversity of normal patterns. In this paper, we propose a novel approach called Adaptive Memory Network with Self-supervised Learning (AMSL) to address these challenges and enhance the generalization ability in unsupervised anomaly detection. Based on the convolutional autoencoder structure, AMSL incorporates a self-supervised learning module to learn general normal patterns and an adaptive memory fusion module to learn rich feature representations. Experiments on four public multivariate time series datasets demonstrate that AMSL significantly improves the performance compared to other state-of-the-art methods. Specifically, on the largest CAP sleep stage detection dataset with 900 million samples, AMSL outperforms the second-best baseline by 4%+ in both accuracy and F1 score. Apart from the enhanced generalization ability, AMSL is also more robust against input noise.

“…Formally, given a training dataset $D = \{(v_i, q_i, a_i)\}_{i=1}^{s}$, where $v_i$ denotes the $i$-th training image, $s$ is the total number of training images, and $q_i$ and $a_i$ represent the question and its corresponding answer, respectively. We use a sequence-to-sequence model that is composed of an encoder and decoder, such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2019). Let $\theta$ be the parameters of the model $p$ that needs to be trained.…”

Section: Overview (mentioning)

confidence: 99%

KAT: A Knowledge Augmented Transformer for Vision-and-Language

Gui, Wang, Huang, et al. 2021

Preprint

The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model, Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
