Artificial intelligence (AI) researchers claim that they have made great 'achievements' in clinical realms. However, clinicians point out that these so-called 'achievements' cannot be implemented in natural clinical settings. The root cause of this huge gap is that AI system developers without a medical background overlook many essential features of natural clinical tasks. In this paper, we propose that the clinical benchmark suite is a novel and promising direction for capturing the essential features of real-world clinical tasks. It is therefore qualified to guide the development of AI systems and to promote the implementation of AI in real-world clinical practice.

AI researchers claim that they have obtained many significant 'achievements' in various realms of clinical medicine, e.g., cancer diagnosis 1, 2. However, in practice, most AI products fail to obtain approval from the Food and Drug Administration (FDA). Moreover, the few approved AI products are limited to class II or I 1, which means that even the approved AI devices are not qualified to handle high-risk tasks such as clinical diagnosis 2. Question marks hang over AI systems for real-world clinical tasks. Why is there such a huge gap between AI research and AI implementation in natural clinical settings? How can AI implementation in natural clinical settings be promoted to bridge the gap?

The common interpretation given by clinicians for this huge gap is that many technical issues in clinical settings remain unsolved, which leaves AI systems unable to function in natural clinical settings, and that AI researchers overestimate the ability of AI systems validated in artificially designed experiments 4,5. To uncover the essential reasons for this gap, our team, consisting of AI researchers and clinicians, analyzed the development process of AI systems and offers the following explanation.
The features of clinical tasks in natural settings are ignored throughout the entire lifecycle (design, implementation, and evaluation) of AI systems. As a result, the resulting AI systems cannot be implemented in natural clinical settings.