Text generation tasks, including translation, summarization, and language modeling, have seen rapid growth in recent years. Despite these remarkable achievements, the repetition problem has been observed in nearly all text generation models, substantially undermining generation quality. Many methods have been proposed to address it, but no existing theoretical analysis explains why the problem arises or how it can be resolved. In this paper, we propose a new framework for the theoretical analysis of the repetition problem. We first define the Average Repetition Probability (ARP) to characterize the repetition problem quantitatively. We then conduct an extensive analysis of the Markov generation model and derive several upper bounds on the ARP, each with an intuitive interpretation. We show that most existing methods essentially minimize these upper bounds, explicitly or implicitly. Grounded in our theory, we show that the repetition problem is, unfortunately, caused by traits of language itself: one major reason is that too many words predict the same word as their subsequent word with high probability, so it is easy to return to that word and form repetitions. We dub this the high-inflow problem. Furthermore, we extend our analysis to broader generation models by deriving a concentration bound on the ARP for a general generation model. Finally, based on the theoretical upper bounds, we propose a novel rebalanced encoding approach that alleviates the high-inflow problem and thus reduces the upper bound. Experimental results show that our theoretical framework applies to general generation models and that the proposed rebalanced encoding approach significantly alleviates repetition in both the translation task and the language modeling task. The source code of this paper can be obtained from https://github.com/fuzihaofzh/repetition-problem-nlg.
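To make the ARP idea concrete, the following is a minimal, illustrative sketch (our own simplification, not the paper's exact formulation). Given a row-stochastic transition matrix P for a first-order Markov language model, it Monte Carlo-estimates how often generation returns to its starting word within a fixed number of steps, averaged over the vocabulary, and flags "high-inflow" words that many words predict with high probability:

import numpy as np

def estimate_repetition_probability(P, max_steps=10, n_samples=200, seed=0):
    """Monte Carlo proxy for ARP: fraction of sampled continuations that
    return to their start word within max_steps, averaged over all words.
    P is a (V, V) row-stochastic matrix with P[i, j] = Pr(next=j | current=i)."""
    rng = np.random.default_rng(seed)
    V = P.shape[0]
    returns = 0
    for start in range(V):
        for _ in range(n_samples):
            state = start
            for _ in range(max_steps):
                state = rng.choice(V, p=P[state])
                if state == start:  # the chain looped back: a repetition
                    returns += 1
                    break
    return returns / (V * n_samples)

def high_inflow_words(P, threshold=0.5):
    """Count, for each word, how many words predict it with probability above
    `threshold` -- the 'high inflow' pattern identified as a cause of repetition."""
    inflow = (P > threshold).sum(axis=0)
    return np.argsort(inflow)[::-1], inflow

# Toy 4-word vocabulary where every word flows into word 0 with high probability.
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.6, 0.2, 0.1, 0.1],
              [0.6, 0.1, 0.2, 0.1],
              [0.6, 0.1, 0.1, 0.2]])
print(estimate_repetition_probability(P))  # high: the chain keeps revisiting words
print(high_inflow_words(P))                # word 0 has inflow 4

In this toy example the high inflow into word 0 makes loops through it very likely, which is the intuition behind rebalancing the encoding to spread that inflow out.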
Text generation tasks aim to generate human-readable text from various kinds of data. Normally, the generated text contains only the information included in the input data, which restricts its application to limited scenarios. In this paper, we extend the task to an open-domain event text generation scenario with an entity chain as its skeleton. Specifically, given an entity chain containing several related event entities, the model should retrieve detailed information about these entities from a trustworthy repository (e.g., Wikipedia) and generate a description based on the retrieved sentences. We build a new dataset called WikiEvent that provides 34K pairs of entity chains and their corresponding description sentences. To solve the problem, we propose a wiki-augmented generator framework that contains an encoder, a retriever, and a decoder. The encoder encodes the entity chain into a hidden space, while the decoder decodes from the hidden space and generates the description text. The retriever retrieves relevant text from the trustworthy repository, providing additional information for generation. To alleviate overfitting, we propose a novel random drop component that randomly deletes words from the retrieved sentences, making the model more robust to long input sentences. We apply the proposed model to the WikiEvent dataset and compare it with several baselines. Experimental results show that our carefully designed architecture does help generate better event text, and extensive analysis further uncovers the characteristics of the proposed task.
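As a rough illustration of the random drop component, here is a hypothetical sketch; the drop rate, whitespace tokenization, and the rule of never dropping every token are our assumptions, not the paper's exact settings:

import random

def random_drop(tokens, drop_prob=0.1, rng=None):
    """Randomly delete each token with probability drop_prob; keep at least one,
    so the decoder always receives some retrieved evidence."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else tokens

retrieved = "Wikipedia describes the background of this event in considerable detail".split()
print(random_drop(retrieved, drop_prob=0.2))

Dropping tokens at training time forces the decoder not to rely on any single retrieved word, which is why it helps on long, noisy retrieved inputs.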
The GloVe word embedding model relies on solving a global optimization problem, which can be reformulated as a maximum likelihood estimation problem. In this paper, we generalize this approach by considering parametrized variants of the GloVe model and incorporating priors on these parameters. To demonstrate the usefulness of this approach, we consider a word embedding model in which each context word is associated with a variance that intuitively encodes how informative it is. Within our framework, these variances can then be learned jointly with the resulting word vectors in a unified way. We experimentally show that the resulting word embedding models outperform GloVe as well as many popular alternatives.
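The following is a hedged sketch of what such a parametrized variant might look like: GloVe's weighted least-squares objective recast as a Gaussian negative log-likelihood with a learnable variance per context word, so residuals on informative (low-variance) context words are penalized more. The exact objective and weighting here are illustrative assumptions, not the authors' formulation:

import numpy as np

def glove_variance_loss(W, C, bw, bc, log_sigma2, X, x_max=100.0, alpha=0.75):
    """W: (V, d) target vectors; C: (V, d) context vectors; bw, bc: (V,) biases;
    log_sigma2: (V,) learnable per-context-word log-variances;
    X: (V, V) co-occurrence count matrix."""
    i, j = np.nonzero(X)                            # observed co-occurrence pairs
    counts = X[i, j]
    f = np.minimum((counts / x_max) ** alpha, 1.0)  # standard GloVe weighting
    resid = (W[i] * C[j]).sum(axis=1) + bw[i] + bc[j] - np.log(counts)
    # Gaussian NLL per pair: resid^2 / sigma_j^2 + log sigma_j^2 (constants dropped);
    # a small sigma_j^2 marks context word j as informative and weights its residuals up.
    nll = resid ** 2 / np.exp(log_sigma2[j]) + log_sigma2[j]
    return (f * nll).sum()

V, d = 100, 16
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(V, V)).astype(float)
params = [rng.normal(scale=0.1, size=s) for s in [(V, d), (V, d), (V,), (V,), (V,)]]
print(glove_variance_loss(*params, X))

Setting all log-variances to zero recovers the standard GloVe loss up to a constant, which is what makes this a strict generalization.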
Word embeddings have been widely used in sentiment classification because of their efficacy in representing word semantics. Given reviews from different domains, some existing word embedding methods exploit sentiment information but cannot produce domain-sensitive embeddings, while others can generate domain-sensitive embeddings but cannot distinguish words with similar contexts and opposite sentiment polarity. We propose a new method for learning domain-sensitive and sentiment-aware embeddings that simultaneously capture sentiment semantics and the domain sensitivity of individual words. Our method automatically determines and produces domain-common embeddings and domain-specific embeddings. Differentiating domain-common from domain-specific words makes it possible to augment the common semantics with data from multiple domains while capturing the varied semantics of specific words in each domain. Experimental results show that our model provides an effective way to learn domain-sensitive and sentiment-aware word embeddings that benefit sentiment classification at both the sentence level and the lexicon term level.
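One plausible way to realize the domain-common/domain-specific split described above is a learned per-word gate that mixes a shared vector with a domain-specific one; the sketch below is our own illustrative formulation, not the paper's model:

import numpy as np

def domain_sensitive_embedding(word_id, domain_id, E_common, E_specific, gate_logits):
    """Mix a shared (domain-common) vector with a domain-specific one via a
    learned per-word sigmoid gate; a gate near 1 means the word behaves the
    same across domains, so its statistics can be pooled (data augmentation)."""
    z = 1.0 / (1.0 + np.exp(-gate_logits[word_id]))
    return z * E_common[word_id] + (1.0 - z) * E_specific[domain_id, word_id]

V, D, d = 5000, 3, 100
rng = np.random.default_rng(0)
E_common = rng.normal(size=(V, d))
E_specific = rng.normal(size=(D, V, d))
gate_logits = np.zeros(V)  # start undecided (gate = 0.5), learned during training
vec = domain_sensitive_embedding(word_id=42, domain_id=1,
                                 E_common=E_common, E_specific=E_specific,
                                 gate_logits=gate_logits)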