Since the pioneering “resilience triangle” paradigm was proposed, various performance-based metrics built on time-series data have been devised to quantify resilience. These numerous options diversify the toolbox for measuring this compound system concept; however, the multiplicity raises difficult questions for practitioners, including “Do these metrics measure the same resilience?” and “Which metric should be chosen under which circumstances?” In this study, we address these two fundamental questions through a comprehensive comparative investigation. Using a combined quantitative and qualitative approach, we compare 12 popular performance-based resilience metrics on empirical data from China’s aviation system under the disturbance of COVID-19. Quantitative results indicate that only 12 of the 66 metric pairs are strongly positively correlated and show no significant difference in quantification outcomes; qualitative results indicate that most of the metrics rest on different interpretations of the definition, different basic components, and different expression forms, and thus essentially do not measure the same resilience. The advantages and disadvantages of each metric are discussed comparatively, and a “how to choose” guideline for metric users is proposed. This study is an introspective investigation of resilience quantification research that aims to offer a new perspective for scrutinizing these benchmark metrics.