Although Reynolds-Averaged Navier-Stokes (RANS) equations are still the dominant tool for engineering design and analysis applications involving turbulent flows, standard RANS models are known to be unreliable in many flows of engineering relevance, including flows with separation, strong pressure gradients, or mean flow curvature. With increasing amounts of three-dimensional experimental data and high-fidelity simulation data from Large Eddy Simulation (LES) and Direct Numerical Simulation (DNS), data-driven turbulence modeling has become a promising approach to increase the predictive capability of RANS simulations. However, the prediction performance of data-driven models inevitably depends on the choice of training flows. This work aims to identify a quantitative measure for a priori estimation of prediction confidence in data-driven turbulence modeling. This measure represents the distance in feature space between the training flows and the flow to be predicted. Specifically, the Mahalanobis distance and the kernel density estimation (KDE) technique are used as metrics to quantify the distance between flow data sets in feature space. To examine the relationship between these two extrapolation metrics and the machine learning model prediction performance, the flow over periodic hills at Re = 10595 is used as the test set and seven flows with different configurations are individually used as training sets. The results show that the prediction error of the Reynolds stress anisotropy is positively correlated with the Mahalanobis distance and the KDE distance, demonstrating that both extrapolation metrics can be used to estimate the prediction confidence a priori. A quantitative comparison using correlation coefficients shows that the Mahalanobis distance is less accurate in estimating the prediction confidence than the KDE distance.
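To make the two extrapolation metrics concrete, the following is a minimal sketch of how a Mahalanobis distance and a Gaussian kernel density estimate could be computed between a training set and query points in feature space. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the synthetic two-dimensional features, the isotropic kernel, and the fixed `bandwidth` value are all hypothetical, and any normalization applied to the raw quantities is not specified in this passage.

```python
import numpy as np


def mahalanobis_distance(train, query):
    """Mahalanobis distance from each query point to the training-set mean.

    train: (n_train, n_features) array; query: (n_query, n_features) array.
    """
    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)
    # The pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    diff = query - mean
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def kde_density(train, query, bandwidth=0.5):
    """Gaussian KDE of the training set, evaluated at each query point.

    Assumes an isotropic kernel with a fixed, hand-picked bandwidth.
    """
    n, d = train.shape
    diff = query[:, None, :] - train[None, :, :]
    sq = np.sum(diff**2, axis=-1) / bandwidth**2
    norm = n * (2.0 * np.pi) ** (d / 2.0) * bandwidth**d
    return np.exp(-0.5 * sq).sum(axis=1) / norm


# Synthetic stand-ins for flow features (hypothetical data).
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
near = np.array([[0.0, 0.0]])  # inside the training cloud
far = np.array([[6.0, 6.0]])   # well outside the training cloud

d_near = mahalanobis_distance(train, near)[0]
d_far = mahalanobis_distance(train, far)[0]
p_near = kde_density(train, near)[0]
p_far = kde_density(train, far)[0]

print(f"Mahalanobis: near={d_near:.3f}, far={d_far:.3f}")
print(f"KDE density: near={p_near:.3e}, far={p_far:.3e}")
```

In this sketch a larger Mahalanobis distance, or a lower estimated density, indicates that a query point lies farther from the training data. The Mahalanobis distance summarizes the training set by its mean and covariance only, whereas the kernel density estimate follows the actual distribution of training points.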
The extrapolation metrics introduced in this work and the corresponding analysis provide an approach to aid in the choice of data source and to assess the prediction performance for data-driven turbulence modeling.

Even with the rapid growth of available computational resources, numerical models based on Reynolds-Averaged Navier-Stokes (RANS) equations are still the dominant tool for engineering design and analysis applications involving turbulent flows. However, the development of turbulence models has stagnated: the most widely used general-purpose turbulence models (e.g., k-ε models, k-ω models, and the S-A model) were all developed decades ago. These models are known to be unreliable in many flows of engineering relevance, including flows with three-dimensional structures, swirl, pressure gradients, or curvature [1]. This lack of accuracy in complex flows has diminished the utility of RANS as a predictive simulation tool for use in engineering design, analysis, optimization, and reliability assessments.

Recently, data-driven turbulence modeling has emerged as a promising alternative to traditional modeling approaches. While data-driven methods come in many formulations and with different assumptions, t...