Videos contain data in multiple modalities, e.g., audio, visual frames, and text (captions). Understanding and modeling the interaction between these modalities is key for video analysis tasks like categorization, object detection, activity recognition, etc. However, the modalities are not always correlated; learning when they are correlated, and using that knowledge to guide the influence of one modality on another, is therefore crucial. Another salient feature of videos is the coherence between successive frames that arises from the continuity of the video and audio streams, a property we refer to as temporal coherence. We show that using non-linear guided cross-modal signals together with temporal coherence improves the performance of multi-modal machine learning (ML) models for video analysis tasks like categorization. Our experiments on the large-scale YouTube-8M dataset show that our approach significantly outperforms state-of-the-art multi-modal ML models for video categorization. The model trained on YouTube-8M also performed well, without retraining or fine-tuning, on an internal dataset of video segments from actual Samsung TV Plus channels, demonstrating its generalization capabilities.
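To make the two ideas in the abstract concrete, the sketch below shows one plausible way to combine them: an audio-driven, non-linear gate that modulates frame-level video features (cross-modal guidance), followed by a recurrent layer over frames (temporal coherence). This is a minimal illustration under our own assumptions, not the paper's actual architecture; the module name `GatedCrossModalBlock` and the 1024-d video / 128-d audio feature sizes (matching the standard YouTube-8M frame-level features) are assumptions for the example.

```python
# Illustrative sketch only: module name and dimensions are assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn


class GatedCrossModalBlock(nn.Module):
    """Modulates video features with a non-linear gate computed from audio,
    so audio influences video only where the two modalities are correlated."""

    def __init__(self, video_dim=1024, audio_dim=128, hidden_dim=256):
        super().__init__()
        # Non-linear gating network: audio -> per-dimension gate for video.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, video_dim),
            nn.Sigmoid(),
        )
        # Recurrent layer over frame-level features to exploit the coherence
        # between successive frames (temporal coherence).
        self.temporal = nn.GRU(video_dim, video_dim, batch_first=True)

    def forward(self, video, audio):
        # video: (batch, time, video_dim), audio: (batch, time, audio_dim)
        gated = video * self.gate(audio)   # guided cross-modal signal
        out, _ = self.temporal(gated)      # temporal modeling
        return out


if __name__ == "__main__":
    v = torch.randn(2, 30, 1024)   # 30 frame-level video features
    a = torch.randn(2, 30, 128)    # matching frame-level audio features
    print(GatedCrossModalBlock()(v, a).shape)  # torch.Size([2, 30, 1024])
```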