Syntactic SMT Using a Discriminative Text Generation Model

Zhang, Yue; Song, Kai; Song, Linfeng; Zhu, Jun; Li, Qun

doi:10.3115/v1/d14-1021

Cited by 10 publications

(9 citation statements)

References 22 publications

“…Input Figure 1: An illustration of our proposed benchmark, which includes diverse CTG instructions, can be used to evaluate whether large language models can properly respond to the control constraints specified in the instructions. eration (CTG) (Zhang et al 2022). While traditional CTG has been extensively studied (Dathathri et al 2019; Zhang and Song 2022), the formulation of control conditions is discrete variables, thus not directly applicable under the new instruction-following paradigm, as the latter entails natural language instructions instead.…”

Section: Llm2 Diversify Instructions (mentioning)

confidence: 99%

“…eration (CTG) (Zhang et al 2022). While traditional CTG has been extensively studied (Dathathri et al 2019; Zhang and Song 2022), the formulation of control conditions is discrete variables, thus not directly applicable under the new instruction-following paradigm, as the latter entails natural language instructions instead. Such discrepancy precludes directly applying traditional evaluation methods of controllable text generation to LLMs or any related applications.…”

Section: Llm2 Diversify Instructions (mentioning)

confidence: 99%

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Chen, Xu, Wang et al. 2024, AAAI

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To address this vacancy, we propose a new benchmark CoDI-Eval to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.

“…The Critic engages in a debate with the Scorer and offers constructive criticism, playing the role of a Devil's Advocate. Eskenazi, 2020) is a knowledge-grounded human-to-human conversation dataset, and we refer Zhong et al (2022) to evaluate four dimensions: naturalness, coherence, engagingness, and groundedness.…”

Section: Multi-agent Scoring Framework (mentioning)

confidence: 99%

“…We extensively evaluate the performance of DEBATE with eight baselines, including a traditional evaluator, ROUGE-L (Lin, 2004); the pretrained language model-based evaluators, BERTScore, MoverScore (Zhao et al, 2019), BARTScore (Yuan et al, 2021), and UniEval (Zhong et al, 2022); the recent LLM-based evaluators, GPTScore, G-Eval, and ChatEval (Chan et al, 2023). We also include MultiAgent, a framework similar to DEBATE but with the Critic assigned a neutral debating role, denoted as Plain.…”

Section: Baselines (mentioning)

confidence: 99%

Evaluation of coal-based dimethyl ether production system using life cycle assessment in South Korea

Kim, Yoon 2012, Computer Aided Chemical Engineering

Investigating the Role and Impact of Disfluency on Summarization

Nathan, Kumar, Vepa 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Contact centers handle both chat and voice calls for the same domain. As part of their workflow, it is a standard practice to summarize the conversations once they conclude. A significant distinction between chat and voice communication lies in the presence of disfluencies in voice calls, such as repetitions, restarts, and replacements. These disfluencies are generally considered noise for downstream natural language understanding (NLU) tasks. While a separate summarization model for voice calls can be trained in addition to chat specific model for the same domain, it requires manual annotations for both the channels and adds complexity arising due to maintaining two models. Therefore, it's crucial to investigate if a model trained on fluent data can handle disfluent data effectively. While previous research explored impact of disfluency on question-answering and intent detection, its influence on summarization is inadequately studied. Our experiments reveal up to 6.99-point degradation in Rouge-L score, along with reduced fluency, consistency, and relevance when a fluent-trained model handles disfluent data. Replacement disfluencies have the highest negative impact. To mitigate this, we examine Fused-Fine Tuning by training the model with a combination of fluent and disfluent data, resulting in improved performance on both public and real-life datasets. Our work highlights the significance of incorporating disfluency in training summarization models and its advantages in an industrial setting.
