We present a novel method for fusing the results of multiple semantic video indexing algorithms that use different types of feature descriptors and different classification methods. This method, called Context-Dependent Fusion (CDF), is motivated by the fact that the relative performance of different semantic indexing methods can vary significantly with the video type, the context information, and the high-level concept of the video segment to be labeled. The training part of CDF has two main components: context extraction and algorithm fusion. In context extraction, the low-level multimodal descriptors used by the different indexing algorithms are combined and used to partition the feature space into clusters of similar video shots, or contexts. The algorithm fusion component assigns aggregation weights to the individual classifiers within each context based on their relative performance in that context. Results on the TRECVID-2002 data collection show that the proposed method can identify meaningful and coherent clusters, and that the performance of the different labeling algorithms can vary significantly across clusters. Our initial experiments indicate that context-dependent fusion outperforms both the individual algorithms and their global fusion. We also show that, using standard multimodal descriptors and a simple k-NN classifier, the CDF approach yields results comparable to state-of-the-art methods in semantic indexing.
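The abstract describes CDF only at a high level. The sketch below illustrates one plausible reading of its two training components in Python: k-means clustering for context extraction and accuracy-based per-context weights for algorithm fusion. All function names, the choice of k-means, the 0.5 score threshold, and the normalized-accuracy weighting are illustrative assumptions, not the authors' actual formulation.

```python
import numpy as np
from sklearn.cluster import KMeans


def train_cdf(features, classifier_scores, labels, n_contexts=8):
    """Hypothetical sketch of CDF training.

    features:          (n_shots, d) combined low-level multimodal descriptors
    classifier_scores: (n_shots, n_classifiers) per-classifier confidences in [0, 1]
    labels:            (n_shots,) ground-truth binary concept labels
    """
    # Context extraction: partition the feature space into clusters
    # of similar video shots ("contexts").
    contexts = KMeans(n_clusters=n_contexts, n_init=10).fit(features)
    assignments = contexts.predict(features)

    # Algorithm fusion: weight each classifier by its accuracy within
    # each context, normalized so the weights sum to one per context.
    weights = np.zeros((n_contexts, classifier_scores.shape[1]))
    for c in range(n_contexts):
        mask = assignments == c
        # Accuracy of thresholded scores inside this context
        # (0.5 threshold is an assumption for this sketch).
        correct = ((classifier_scores[mask] > 0.5) == labels[mask, None]).mean(axis=0)
        weights[c] = correct / (correct.sum() + 1e-12)
    return contexts, weights


def fuse(contexts, weights, features, classifier_scores):
    """Fuse test-time scores using the weights of each shot's context."""
    assignments = contexts.predict(features)
    # Per-shot weighted sum of the individual classifier scores.
    return np.einsum("ij,ij->i", classifier_scores, weights[assignments])
```

The key design idea this sketch tries to capture is that the fusion weights are local rather than global: each test shot is first assigned to its nearest context, and only the classifiers that performed well on training shots in that context contribute strongly to its fused score.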