River metabolism modeled from diurnal dissolved oxygen (DO) has become a widely used metric of ecosystem function, yet many papers provide insufficient methodological detail for replication. Of 43 sampled papers published from 2015 to 2019, only 79% mention calibration, 44% describe sensor placement, and 34% do not describe estimation approaches in enough detail for the study to be replicated. Because spatial heterogeneity in rivers influences metabolism and measurement sensitivity varies with sensor model, reported methods need appropriately detailed information, along with a fundamental understanding of how river heterogeneity might influence metabolism. We deployed 2-8 sensors at 92 steppe river reaches to characterize site heterogeneity and evaluated how sensor placement and type, deployment length, drift correction, data source (local vs. remotely sensed data), and calibration can affect metabolism estimates. Estimates of gross primary production (GPP) and ecosystem respiration (ER) varied inconsistently and unpredictably with deployment location within a river reach; GPP and ER rates varied by up to 131% and 69%, respectively, across a river width and by up to two orders of magnitude within a reach. DO sensor brands vary in precision and accuracy; we found that, even when sensors were operated within their stated performance range, estimates of GPP and ER could vary by 82% and 198%, respectively, if sensors were not calibrated beyond factory settings, as determined using field data from a sample site. Inaccuracies from sensor drift over weeklong deployments led to an average overestimation of ER by 48% and of GPP by 2% when uncorrected field data were compared with corrected field data. We suggest best practices for more comparable, precise, representative, and accurate methods.
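The abstract does not state how drift was corrected, so the following is only a minimal sketch of one common approach, assuming the sensor offset grows linearly between a pre-deployment and a post-deployment check against a reference. The function name `correct_linear_drift`, its parameters, and the example values are hypothetical and are not taken from the study.

```python
def correct_linear_drift(times, readings, offset_start, offset_end):
    """Subtract a linearly interpolated sensor offset from each DO reading.

    Assumption (not from the source): the offset (sensor minus reference)
    changes linearly from offset_start at the first timestamp to
    offset_end at the last timestamp.
    """
    if len(times) != len(readings):
        raise ValueError("times and readings must have the same length")
    if len(times) < 2 or times[-1] == times[0]:
        raise ValueError("need at least two readings spanning a nonzero interval")

    t0 = times[0]
    span = times[-1] - t0
    corrected = []
    for t, reading in zip(times, readings):
        frac = (t - t0) / span  # fraction of the deployment elapsed
        offset = offset_start + frac * (offset_end - offset_start)
        corrected.append(reading - offset)
    return corrected


# Hypothetical weeklong deployment: hours since start and raw DO in mg/L.
hours = [0.0, 42.0, 84.0, 126.0, 168.0]
raw_do = [8.00, 8.10, 8.20, 8.30, 8.40]

# Hypothetical checks: no offset at deployment, +0.40 mg/L at retrieval.
corrected_do = correct_linear_drift(hours, raw_do, 0.0, 0.40)
```

In this made-up series the apparent upward trend in the raw readings is entirely drift, so the corrected values are flat. Real drift need not be linear, and a correction of this kind would normally be applied before metabolism is estimated.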