Emotion recognition from facial expressions is a highly demanding task, especially in everyday life scenarios. Several sources of artifacts have to be handled in order to successfully extract the intended emotional nuances of the face. Accurate and robust face detection and orientation estimation under occlusions, inhomogeneous lighting, and fast movements is only one difficulty. Another is the selection of suitable features for the application at hand. In the literature, a vast body of visual features, grouped into dynamic, spatial, and textural families, has been proposed. Owing to their inherent structure, these features exhibit different strengths and weaknesses relative to each other and thus capture complementary information, which is a promising vantage point for fusion architectures. To combine different feature sets and exploit their respective advantages, an adaptive multilevel fusion architecture is proposed. The cascaded approach integrates information on different levels and time scales, using artificial neural networks to adaptively weight the propagated intermediate results. The performance of the proposed architecture is analysed on the GEMEP-FERA corpus as well as on a novel dataset obtained from an unconstrained, spontaneous human-computer interaction scenario. The obtained performance is superior to that of single channels and basic fusion techniques.
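To make the idea of adaptive weighting concrete, the following minimal sketch shows one possible fusion level: a small neural network produces per-channel weights for combining the intermediate class scores of several feature channels (e.g. dynamic, spatial, textural). All names, dimensions, and the gating design are illustrative assumptions, not the implementation described above.

```python
# Hypothetical sketch of one fusion level: an artificial neural network
# learns to weight the intermediate class-score vectors produced by
# several feature channels. Names and sizes are illustrative only.
import torch
import torch.nn as nn


class AdaptiveFusionLevel(nn.Module):
    def __init__(self, num_channels: int, num_classes: int, hidden: int = 32):
        super().__init__()
        # Gating network: maps the concatenated channel scores to one
        # weight per channel, normalised with a softmax.
        self.gate = nn.Sequential(
            nn.Linear(num_channels * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_channels),
            nn.Softmax(dim=-1),
        )

    def forward(self, channel_scores: torch.Tensor) -> torch.Tensor:
        # channel_scores: (batch, num_channels, num_classes)
        batch = channel_scores.size(0)
        weights = self.gate(channel_scores.reshape(batch, -1))
        # Weighted sum of the per-channel score vectors.
        return (weights.unsqueeze(-1) * channel_scores).sum(dim=1)


# Usage: fuse three channels, each predicting five emotion classes.
fusion = AdaptiveFusionLevel(num_channels=3, num_classes=5)
scores = torch.rand(8, 3, 5)   # intermediate results of the channels
fused = fusion(scores)         # (8, 5) fused class scores
```

Because the gate conditions its weights on the actual channel outputs, the contribution of each channel can vary per sample; stacking such levels over different time scales would yield a cascaded architecture in the spirit of the one proposed here.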