Over the last decade, with the popularization of camera-equipped devices, there has been an explosive growth of video data. Despite their diverse visual content, these videos usually contain some thematic objects. As the key objects to be presented, thematic objects appear frequently and occupy prominent positions in the video scenes, and thus stay in our memory after we watch the videos. Examples include the bride and the groom in wedding ceremony videos, the birthday girl in birthday party videos, and the product logo in commercial videos. Automatically discovering and localizing these thematic objects can benefit many real-world applications, such as video summarization, search, and labeling. However, this task is challenging because no prior information or initialization about the thematic objects is available. Moreover, the targets are usually accompanied by background clutter, occlusions, or camera motion. In this thesis, a systematic study is conducted on the automatic discovery and localization of thematic objects in videos. We study this problem under various settings, including automatic discovery and localization of the thematic object in single videos, automatic discovery and segmentation of the thematic object in single videos, and automatic thematic action discovery and localization in collections of videos. In the absence of category-specific supervision and manual initialization, various category-independent cues are explored to discover and localize the thematic objects. These include the spatiotemporal saliency to highlight