Abstract: An overwhelming variety of multimedia ontologies is used to narrow the semantic gap, but many of them overlap, are not richly axiomatized, lack a proper taxonomical structure, and do not define complex correlations between concepts and roles. Moreover, not all ontologies used for image annotation are suitable for video scene representation, because they lack rich high-level semantics and spatiotemporal formalisms. This paper presents an approach for combining multimedia ontologies for video scene representation that takes into account the specificity of the scenes to be described, minimizes the number of ontologies, complies with standards, minimizes reasoning complexity, and, whenever possible, maintains decidability.