In depression research, symptoms are routinely assessed via rating scales and added to construct sum-scores. These scores are used as a proxy for depression severity in cross-sectional research, and differences in sum-scores over time are taken to reflect changes in an underlying depression construct. To allow for such interpretations, rating scales must (a) measure a single construct, and (b) measure that construct in the same way across time. These requirements are referred to as unidimensionality and measurement invariance. We investigated these 2 requirements in 2 large prospective studies (combined n = 3,509) in which overall depression levels decrease, examining 4 common depression rating scales (1 self-report, 3 clinician-report) with different time intervals between assessments (between 6 weeks and 2 years). A consistent pattern of results emerged. For all instruments, neither unidimensionality nor measurement invariance appeared remotely tenable. At least 3 factors were required to describe each scale, and the factor structure changed over time. Typically, the structure became less multifactorial as depression severity decreased (without however reaching unidimensionality). The decrease in the sum-scores was accompanied by an increase in the variances of the sum-scores, and increases in internal consistency. These findings challenge the common interpretation of sum-scores and their changes as reflecting 1 underlying construct. The violations of common measurement requirements are sufficiently severe to suggest alternative interpretations of depression sum-scores as formative instead of reflective measures. We discuss the possible causes of these violations such as response shift bias, restriction of range, and regression to the mean.