Abstract. Percentile flows are statistics derived from the flow duration curve (FDC) that describe the flow equaled or exceeded for a given percentage of time. These statistics provide important information for managing rivers, but are often unavailable since most basins are ungauged. A common approach for predicting percentile flows is to deploy regional regression models based on gauged percentile flows and related independent variables derived from physical and climatic data. The first step of this process identifies groups of basins through a cluster analysis of the independent variables, followed by the development of a regression model for each group. This entire process hinges on the independent variables selected to summarize the physical and climatic state of basins. Distributed physical and climatic datasets now exist for the contiguous United States (US). However, it remains unclear how best to represent these data for the development of regional regression models. The study presented here developed regional regression models for the contiguous US and evaluated the effect of different approaches for selecting the initial set of independent variables on the predictive performance of the regional regression models. An expert assessment of the dominant controls on the FDC was used to identify a small set of independent variables likely related to percentile flows. A data-driven approach was also applied to evaluate two larger sets of variables that consist of either (1) the averages of data for each basin or (2) both the averages and statistical distribution of basin data distributed in space and time. The small set of variables from the expert assessment of the FDC and the two larger sets of variables for the data-driven approach were each used in a regional regression procedure. Differences in predictive performance were evaluated using 184 validation basins withheld from regression model development.
The small set of independent variables selected through expert assessment produced similar, if not better, performance than the two larger sets of variables. This parsimonious set consisted only of mean annual precipitation, potential evapotranspiration, and baseflow index. Additional variables in the two larger sets of variables added little to no predictive information. Regional regression models based on the parsimonious set of variables were developed using 734 calibration basins, and were converted into a tool for predicting 13 percentile flows in the contiguous US. Supplementary Material for this paper includes an R graphical user interface for predicting the percentile flows of basins within the range of conditions used to calibrate the regression models. The equations and performance statistics of the models are also supplied in tabular form.
Hydrol. Earth Syst. Sci. Discuss.,
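To make the definition of a percentile flow concrete, the following minimal Python sketch reads a flow off an empirical flow duration curve for a gauged record. It is an illustration only and not the procedure used in the study: the function name `percentile_flow`, the linear interpolation between ranked flows, and the sample values are all assumptions made here.

```python
def percentile_flow(flows, exceedance_pct):
    """Return the flow equaled or exceeded `exceedance_pct` percent of the time.

    Illustrative assumption: the empirical flow duration curve is built by
    ranking flows from largest to smallest and interpolating linearly
    between neighbouring ranks.
    """
    if not flows:
        raise ValueError("flows must contain at least one value")
    if not 0 <= exceedance_pct <= 100:
        raise ValueError("exceedance_pct must lie between 0 and 100")

    ranked = sorted(flows, reverse=True)  # largest flow first
    n = len(ranked)

    # Fractional rank on the 0 .. n-1 scale (0 % -> largest, 100 % -> smallest).
    position = exceedance_pct * (n - 1) / 100
    lower = int(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower

    return ranked[lower] + fraction * (ranked[upper] - ranked[lower])


# Hypothetical daily flows (arbitrary units), for illustration only.
sample_flows = [float(q) for q in range(1, 12)]  # 1.0, 2.0, ..., 11.0

median_flow = percentile_flow(sample_flows, 50)  # exceeded half of the time
low_flow = percentile_flow(sample_flows, 90)     # exceeded 90 % of the time
```

In this convention a high exceedance percentage corresponds to a low flow, so `low_flow` is smaller than `median_flow`. Other plotting-position formulas would give slightly different values for short records.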