“…extremely noisy environments and/or military situations). This automatic conversion task typically relies on electromagnetic articulography (EMA, [3,19,20]), ultrasound tongue imaging (UTI, [4,14,18,28]), permanent magnet articulography (PMA, [10]), surface electromyography (sEMG, [6,16,22]), lip video [1,7], and multimodal approaches [5]. Current SSI systems mostly apply the "direct synthesis" principle, in which speech is generated directly from the articulatory data, without an intermediate step.…”