This work investigates multi-modal 3D human action recognition holistically under the Federated Learning (FL) paradigm, a recent and highly promising trend in Deep Learning (DL). Its novel contributions are: a) a methodology for incorporating depth and 3D flow information into DL action recognition schemes, b) multiple modality fusion schemes that operate at different levels of granularity (early, slow, late), and c) federated aggregation mechanisms that adaptively guide the action recognition learning process by realizing cross-domain knowledge transfer in a distributed manner. A new large-scale, multi-modal, multi-view 3D action recognition dataset involving a total of 132 human subjects is also introduced. Extensive experiments provide a detailed analysis of the problem and demonstrate the particular characteristics of the uni-modal and multi-modal representation schemes involved, in both centralized and distributed scenarios. The proposed FL multi-modal schemes are observed to achieve acceptable recognition performance on the introduced dataset under two challenging data distribution scenarios.