In this paper, we present an efficient framework for cognitively detecting and tracking salient objects in videos. In general, visible color images in red-green-blue (RGB) offer better distinguishability for human visual perception, yet they suffer from illumination noise and shadows. In contrast, thermal images are less sensitive to these noise effects, though their distinguishability varies with environmental settings. Cognitive fusion of the two modalities therefore provides an effective way to tackle this problem. First, a background model is extracted, followed by a two-stage background subtraction for foreground detection in the visible and thermal images. To deal with occlusion or overlap, knowledge-based forward tracking and backward tracking are employed to identify separate objects even when foreground detection fails. To evaluate the proposed method, we use the publicly available color-thermal benchmark dataset Object Tracking and Classification in and Beyond the Visible Spectrum. For foreground detection, objective and subjective comparisons against several state-of-the-art methods are carried out on our manually segmented ground truth. For object tracking, comprehensive qualitative experiments are conducted on all video sequences. The results show that the proposed fusion-based approach can successfully detect and track multiple human objects in most scenes despite lighting changes and occlusion.
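
As a rough illustration of the general idea of background subtraction on two modalities followed by fusion, the following minimal sketch may help. It is not the two-stage procedure or the cognitive fusion proposed here: the per-pixel temporal median background, the fixed threshold, the union-based fusion rule, the use of single-channel intensity frames for the visible modality, and all function names (`median_background`, `foreground_mask`, `fuse_masks`) are assumptions made purely for illustration, and the data are synthetic.

```python
import numpy as np

def median_background(frames):
    """Assumed background model: per-pixel temporal median of a (T, H, W) stack."""
    return np.median(np.asarray(frames, dtype=np.float32), axis=0)

def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    diff = np.abs(np.asarray(frame, dtype=np.float32) - background)
    return diff > threshold

def fuse_masks(visible_mask, thermal_mask):
    """Hypothetical fusion rule: a pixel is foreground if either modality flags it."""
    return np.logical_or(visible_mask, thermal_mask)

# Synthetic stand-ins for registered visible (intensity) and thermal sequences.
rng = np.random.default_rng(0)
T, H, W = 20, 8, 8
visible_frames = rng.normal(100.0, 1.0, size=(T, H, W))
thermal_frames = rng.normal(50.0, 1.0, size=(T, H, W))

bg_visible = median_background(visible_frames)
bg_thermal = median_background(thermal_frames)

# A test frame pair: one object is visible only in the color image,
# another only in the thermal image.
test_visible = np.full((H, W), 100.0)
test_thermal = np.full((H, W), 50.0)
test_visible[1:3, 1:3] = 160.0
test_thermal[5:7, 5:7] = 90.0

fused = fuse_masks(
    foreground_mask(test_visible, bg_visible, threshold=25.0),
    foreground_mask(test_thermal, bg_thermal, threshold=25.0),
)
print(fused.astype(int))
```

In this toy setting each modality detects an object the other misses, and the fused mask contains both, which is the intuition behind combining the two image types.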