Editorial on the Research Topic: Multimodal communication and multimodal computing

After a successful, text-centered period, AI, computational linguistics, and natural language engineering need to face the "ecological niche" (Holler and Levinson, 2019) of natural language use: face-to-face interaction. A particular challenge of human processing in face-to-face interaction is that it is fed by information from the various sense modalities: it is multimodal. When talking to each other, we constantly observe and produce information on several channels, such as speech, facial expressions, hand-and-arm gestures, and head movements. Consider learning to drive: we first learn the theory of traffic rules in driving school. After passing the examinations, we practice on the streets, accompanied by an expert sitting beside us. We ask questions and follow this expert's immediate instructions. These symbolic traffic rules and immediate instructions must be grounded quickly and precisely in the perceived scenes, so that the learner can update and predict the behavior of other cars and determine her or his own driving actions to avoid potential dangers. As a consequence, multimodal communication needs to be integrated (in perception) or distributed (in production). This, however, characterizes multimodal computing in general (but see also Parcalabescu et al., 2021). Hence, AI, computational linguistics, and natural language engineering that address multimodal communication in face-to-face interaction have to involve multimodal computing, giving rise to the next grand research challenge of these and related fields.
This challenge applies to all computational areas that look beyond sentences and texts, ranging from interaction with virtual agents to the creation and exploitation of multimodal datasets for machine learning, as exemplified by the contributions in this Research Topic.

From this perspective, we face several interwoven challenges. On the one hand, AI approaches need to be informed by the principles of multimodal computing, to avoid simply transferring the principles of Large Language Models to multimodal computing. On the other hand, it is important that more linguistically motivated approaches do not underestimate the computational reconstructability of multimodal representations. Otherwise, they might share the experience of parts of computational linguistics, which, given the success of models such as OpenAI's ChatGPT (cf. Wolfram, 2023), were confronted with the realization that even higher-order linguistic annotations could be taken over by digital assistants, rendering the corresponding linguistic modeling work obsolete. Again, the scientific focus on face-to-face communication seems to point to a