“…For instance, in computer vision, a tremendous amount of recent work has focused on image captioning [68,30,11,16,75,45,77,31,69,4,15,10], visual question generation [36,48,47,28], visual question answering [5,19,59,54,44,73,74,76,57,58,49,50], and very recently visual dialog [13,14,27,46]. While those meticulously engineered algorithms have shown promising results in their specific domain, little is known about the end-to-end performance of an entire system.…”