Two decades after calls for stream restoration projects to be scientifically assessed, most projects remain unevaluated, and the evaluations that have been conducted yield ambiguous results. Even after these decades of investigation, do we know how to define and measure success? We systematically reviewed 26 studies that used macroinvertebrate indicators to assess the success of stream habitat heterogeneity restoration projects. All 26 studies were previously included in two meta-analyses that sought to assess whether restoration programs were succeeding. By contrast, our review focuses on the evaluations themselves and asks what exactly we are measuring and learning from them. All 26 studies used taxonomic diversity, richness, or abundance of invertebrates as biological measures of success, but none presented explicit arguments for why those metrics were relevant measures of success for the restoration projects. Although changes in biodiversity may reflect overall ecological condition at the regional or global scale, in the context of reach-scale habitat restoration, greater abundance and diversity are not necessarily better. While all 26 studies sought to evaluate the biotic response to habitat heterogeneity enhancement projects, only about half (46%) explicitly measured habitat alteration, and 31% used visual estimates of grain size or subjectively judged 'habitat quality' from protocols ill-suited for the purpose. Although the goal of all 26 projects was to increase habitat heterogeneity, 31% of the studies either sampled only riffles or did not specify the habitats sampled. About one-third of the studies (35%) used reference ecosystems to define target conditions. After 20 years of stream restoration evaluation, more work remains for the restoration community to identify appropriate measures of success and to coordinate monitoring so that evaluations are at a scale capable of detecting ecosystem change.