The difference between age predicted using anatomical brain scans and chronological age, i.e., the brain-age delta, provides a proxy for atypical aging. Various data representations and machine learning (ML) algorithms have been used for brain-age estimation. However, how these choices compare on performance criteria important for real-world applications, such as (1) within-site accuracy, (2) cross-site generalization, (3) test-retest reliability, and (4) longitudinal consistency, remains uncharacterized. We evaluated 128 workflows consisting of 16 feature representations derived from gray matter (GM) images and eight ML algorithms with diverse inductive biases. Using four large neuroimaging databases covering the adult lifespan (total N = 2953, 18-88 years), we followed a systematic model selection procedure by sequentially applying stringent criteria. The 128 workflows showed a within-site mean absolute error (MAE) between 4.73 and 8.38 years, of which 32 broadly sampled workflows showed a cross-site MAE between 5.23 and 8.98 years. The test-retest reliability and longitudinal consistency of the top 10 workflows were comparable. Both the choice of feature representation and the choice of ML algorithm affected performance. Specifically, voxel-wise feature spaces (smoothed and resampled), with and without principal components analysis, combined with non-linear and kernel-based ML algorithms performed well. Strikingly, the correlation of brain-age delta with behavioral measures disagreed between within-site and cross-site predictions. Application of the best-performing workflow to the ADNI sample showed a significantly higher brain-age delta in patients with Alzheimer's disease and mild cognitive impairment. However, in the presence of age bias, the delta estimates in the diseased population varied depending on the sample used for bias correction. Taken together, brain-age shows promise, but further evaluation and improvements are needed for its real-world application.
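The quantities named above (brain-age delta, MAE, and age-bias correction) can be illustrated with a minimal sketch. It assumes a simple linear bias correction in which the delta is regressed on chronological age in a reference sample and the fitted trend is subtracted. This is one common approach and is not necessarily the procedure used in this study. All function names and the toy numbers are hypothetical.

```python
import numpy as np


def brain_age_delta(predicted_age, chronological_age):
    """Brain-age delta: predicted age minus chronological age."""
    return np.asarray(predicted_age, dtype=float) - np.asarray(chronological_age, dtype=float)


def mean_absolute_error(predicted_age, chronological_age):
    """Mean absolute error (MAE) of the age predictions, in years."""
    return float(np.mean(np.abs(brain_age_delta(predicted_age, chronological_age))))


def fit_bias_correction(predicted_ref, age_ref):
    """Fit delta = slope * age + intercept on a reference sample.

    Assumption: a linear age bias; other correction schemes exist.
    """
    delta = brain_age_delta(predicted_ref, age_ref)
    slope, intercept = np.polyfit(np.asarray(age_ref, dtype=float), delta, deg=1)
    return float(slope), float(intercept)


def corrected_delta(predicted_age, chronological_age, slope, intercept):
    """Subtract the fitted age trend from the raw delta."""
    age = np.asarray(chronological_age, dtype=float)
    return brain_age_delta(predicted_age, age) - (slope * age + intercept)


# Toy data (hypothetical): young subjects over-predicted, old subjects under-predicted.
age = np.array([20.0, 40.0, 60.0, 80.0])
predicted = np.array([30.0, 45.0, 60.0, 75.0])

mae = mean_absolute_error(predicted, age)

# Correction parameters estimated on the same sample.
slope_a, intercept_a = fit_bias_correction(predicted, age)
delta_a = corrected_delta(predicted, age, slope_a, intercept_a)

# Correction parameters estimated on a different (hypothetical) reference
# sample that has a constant offset of +2 years and no age trend.
ref_age = np.array([25.0, 45.0, 65.0, 85.0])
ref_predicted = ref_age + 2.0
slope_b, intercept_b = fit_bias_correction(ref_predicted, ref_age)
delta_b = corrected_delta(predicted, age, slope_b, intercept_b)

print("MAE:", mae)
print("corrected delta (same-sample correction):", delta_a)
print("corrected delta (other-sample correction):", delta_b)
```

In this toy case the two corrected delta vectors differ because the correction parameters come from different reference samples, which illustrates why delta estimates can depend on the sample used for bias correction.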