arXiv:2002.10309v1 [cs.CV] 23 Jan 2020
“…The task of visual question answering [7], [8], [9], [10], [11] is well studied in the vision and language community, but providing explanations [3] for answer prediction has been explored relatively little. Recently, many works have focused on explanation models; one such approach is image captioning, which gives a basic explanation of an image [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22].…”