We address the problem of predicting the leading political ideology, i.e., left-center-right bias, for YouTube channels of news media. Previous work on the problem has focused exclusively on text and on analysis of the language used, topics discussed, sentiment, and the like. In contrast, here we study videos, which yields an interesting multimodal setup. Starting with gold annotations about the leading political ideology of major world news media from Media Bias/Fact Check, we searched on YouTube to find their corresponding channels, and we downloaded a recent sample of videos from each channel. We crawled more than 1,000 YouTube hours along with the corresponding subtitles and metadata, thus producing a new multimodal dataset. We further developed a multimodal deeplearning architecture for the task. Our analysis shows that the use of acoustic signal helped to improve bias detection by more than 6% absolute over using text and metadata only. We release the dataset to the research community, hoping to help advance the field of multi-modal political bias detection.