the various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variations is crucial for understanding speech production, diagnosing speech disorders and planning therapy. identifying key anatomical landmarks of these structures on medical images is a pre-requisite for any quantitative analysis and the rising amount of data generated in the field calls for an automatic solution. The challenge lies in the high inter-and intra-speaker variability, the mutual interaction between the articulators and the moderate quality of the images. This study addresses this issue for the first time and tackles it by means of Deep Learning. it proposes a dedicated network architecture named Flat-net and its performance are evaluated and compared with eleven state-of-the-art methods from the literature. the dataset contains midsagittal anatomical Magnetic Resonance Images for 9 speakers sustaining 62 articulations with 21 annotated anatomical landmarks per image. Results show that the Flat-net approach outperforms the former methods, leading to an overall Root Mean Square Error of 3.6 pixels/0.36 cm obtained in a leave-oneout procedure over the speakers. the implementation codes are also shared publicly on GitHub.www.nature.com/scientificreports www.nature.com/scientificreports/ animals 22 . This short review emphasizes the importance of anatomical landmark localization from images for a large variety of applications and this study lengthens this non-exhaustive list with speech production.Within the existing literature, two fields of application appear more particularly active and connected to our problem. The first is the localization of landmarks for the face, rich of an abundant research from landmark identification on two-dimensional photographs 23,24 to landmark identification on three-dimensional Ultrasounds for fetuses for instance 25,26 . A recent and comprehensive review is provided by Wu et al. 27 . It represents a challenging issue due to the high variability of the shapes, poses, occlusions, and lighting conditions. Similarly, localizing the position of the joints of the body on images to estimate the human pose is also a long-standing problem 28 . It is also a challenging issue in computer vision due to the high variability of the postures, body shapes, actions, clothes and scenes.The goal and contribution of this study is to propose a fully-automated end-to-end image analysis methods for localizing key anatomical landmarks in the vocal tract area from midsagittal MRI data. As emphasized earlier, the need for such a method in the field is required and has never been attempted so far. It is aimed to be used in the future for new speakers for which no prior data are available. Considering the existing methods of the literature and the recent rise of data science to solve such problems, Deep Learning (DL) appears as the inescapable approach 29 . Indeed, DL approaches in image processing tasks appear to outperform most of traditional t...