“…To construct such systems, both low-level features such as object shape, region intensity, color, texture, motion descriptors, audio measurements, and high-level techniques such as human face detection, speaker identification, and character recognition have been studied for indexing and retrieving image and video information in recent years [3], [4], [10], [11], [13], [19], [21], [24], [27]-[29], [32], [36]. Among these techniques, video caption based methods have attracted particular attention due to the rich content information contained in caption text [1], [2], [6], [9], [11]-[13], [15], [16], [19], [20], [27], [33], [36]. Caption text routinely provides such valuable indexing information as scene locations, speaker names, program introductions, sports scores, special announcements, dates and time.…”