The impact of choice architecture interventions (‘nudges’) is considerably heterogeneous, and their average effectiveness in changing behavior is modest. Although they can be highly effective under some conditions, they may be ineffective under others and counterproductive under yet others. Because little is known about the generalizability of prior results, one cannot reliably predict which of these outcomes will occur. The limits to generalizability stem from several sources. Multiple moderators, operating through different mechanisms and interacting in complex ways, can affect intervention effectiveness, and hidden moderators can emerge unexpectedly. In this review, we discuss the effectiveness of choice architecture interventions and highlight the obstacles to generalizability. We then review strategies (exploring and measuring potential moderators, designing for generalizability, optimizing sampling for generalizability, and enhancing reporting techniques) that could help the field of applied behavioral science accumulate evidence about the generalizability of nudges more efficiently. We conclude that adopting these practices, along with leveraging large-scale collaborations and artificial intelligence, is essential for accurately predicting the effectiveness of choice architecture interventions across diverse contexts.