Real-world heterogeneous scenes contain objects with a large variety of forms, surfaces, colors and textures, so multi-modal approaches are needed to deal with the challenges they pose. A promising way of combining various sources of information is the use of ensemble methods, which allow on-the-fly integration of classification modules, each specific to a single sensor modality, into a classification process. These modular and extensible approaches have the advantage that they do not require a single method to cope with every eventuality; instead, they combine existing specialized methods to overcome their individual weaknesses. In addition, the rapid growth of the perception field means that comparing, evaluating, sharing and combining the available approaches becomes increasingly relevant. In this article we describe a novel training strategy for ensembles of strong learners that outperform not only the best member but also the best classifier trained on the concatenation of features. The method was evaluated using a large RGBD dataset containing Kinect scans of 300 objects, and special use-cases are presented that highlight how ensemble learning can be used to improve classification results.