The concept of test-retest reliability (TRR) indexes the repeatability or consistency of a measurement across time. High TRR of measures is critical for any scientific study, especially for the study of individual differences. Evidence of poor TRR of commonly used behavioral and functional neuroimaging tasks is mounting (e.g., Hedge et al., 2018; Elliott et al., 2020). These reports have called into question the adequacy of using even the most common, well-characterized cognitive tasks with robust population-level task effects to measure individual differences. Here, we demonstrate that the intraclass correlation coefficient (ICC), the classical metric that quantifies TRR as a proportional variance ratio, is limited when characterizing the TRR of cognitive tasks that rely on many individual trials to repeatedly evoke a psychological state or behavior. We first examine when and why conventional ICCs underestimate TRR. Further, building on recent foundational work (Rouder and Haaf, 2019; Haines et al., 2020), we lay out a hierarchical framework that accounts for the data structure down to the trial level and estimates TRR as a correlation divorced from trial-level variability. As part of this process, we examine several modeling issues associated with the conventional ICC formulation and assess how different factors (e.g., trial and subject sample sizes, relative magnitude of cross-trial variability) impact TRR estimates. We make the tools TRR and 3dLMEr available for the community to apply these models to behavioral and neuroimaging data.
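The attenuation mechanism described above can be illustrated with a short simulation. This is a minimal sketch, not code from the accompanying tools: each subject's true effect is held identical across two sessions (so the latent test-retest correlation is exactly 1), session means are formed from noisy trials, and the correlation of those means is computed as a stand-in for the ICC. The variable names and parameter values (`sigma_subj`, `sigma_trial`, trial counts) are illustrative assumptions, chosen so that trial noise dwarfs cross-subject variability, as is common in reaction-time contrasts.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subj = 300
sigma_subj = 0.3   # cross-subject SD of the true effect (assumed)
sigma_trial = 2.0  # trial-level noise SD; often much larger than sigma_subj

# Each subject's true effect theta is identical in both sessions,
# so the latent test-retest correlation is exactly 1.
theta = rng.normal(0.0, sigma_subj, n_subj)

def session_means(n_trials):
    """Observed per-subject means for two sessions of n_trials each."""
    s1 = theta + rng.normal(0, sigma_trial, (n_trials, n_subj)).mean(axis=0)
    s2 = theta + rng.normal(0, sigma_trial, (n_trials, n_subj)).mean(axis=0)
    return s1, s2

def reliability(s1, s2):
    """Pearson correlation of session means (a stand-in for the ICC)."""
    return np.corrcoef(s1, s2)[0, 1]

# Analytic attenuation: sigma_subj^2 / (sigma_subj^2 + sigma_trial^2 / n_trials)
icc_few = reliability(*session_means(20))     # few trials: heavy attenuation
icc_many = reliability(*session_means(2000))  # many trials: attenuation shrinks

print(f"estimated TRR with   20 trials/session: {icc_few:.2f}")
print(f"estimated TRR with 2000 trials/session: {icc_many:.2f}")
```

With 20 trials per session the estimate sits far below the true value of 1, because the variance ratio charges the trial noise (`sigma_trial**2 / n_trials`) against the subject-level signal; with many trials the estimate approaches 1. A hierarchical model instead separates these variance sources and estimates the subject-level correlation directly.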