Visual perception, speech perception and the understanding of perceived information are linked through complex mental processes. Gestures, as part of visual perception and synchronized with verbal information, are a key concept of human social interaction. Even when there is no physical contact (e.g., a phone conversation), humans still tend to express meaning through movement. Embodied conversational agents (ECAs), as well as humanoid robots, are visual recreations of humans and are thus expected to be able to perform similar behaviour in communication. The behaviour generation system proposed in this paper is able to specify expressive behaviour strongly resembling natural movement performed within social interaction. The system is TTS-driven and fused with the time-and-space efficient TTS-engine, called ‘PLATTOS’. Visual content and content presentation is formulated based on several linguistic features that are extrapolated from arbitrary input text sequences and prosodic features (e.g., pitch, intonation, stress, emphasis, etc.), as predicted by several verbal modules in the system. According to the evaluation results, when using the proposed system the synchronized co-verbal behaviour can be recreated with a very high-degree of naturalness, either by ECAs or humanoid robots alike.