This letter investigates reconfigurable intelligent surface (RIS)-aided massive multiple-input multiple-output (MIMO) systems with a two-timescale design. First, a zero-forcing (ZF) detector is applied at the base station (BS) based on the instantaneous aggregated channel state information (CSI), where the aggregated channel is the superposition of the direct channel and the cascaded user-RIS-BS channel. Then, by leveraging the statistical properties of the channel, we derive a closed-form expression for the ergodic achievable rate. Using a gradient ascent method, we design the RIS passive beamforming relying only on long-term statistical CSI. We prove that the ergodic rate scales on the order of $\mathcal{O}(\log_2(MN))$, where $M$ and $N$ denote the number of BS antennas and RIS elements, respectively. We also prove the striking superiority of the considered RIS-aided system with ZF detectors over RIS-free systems and over RIS-aided systems with maximum-ratio combining (MRC).
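As a rough illustration of ZF detection on the aggregated channel, the following NumPy sketch forms the aggregated channel as the direct channel plus the cascaded user-RIS-BS channel and applies a ZF detector to it. The symbol names (`D`, `H1`, `H2`, `theta`, `Q`), the dimensions, the i.i.d. Rayleigh channel draws, and the unit transmit power and noise variance are all assumptions made for this example; they are not taken from the letter, whose channel model and closed-form rate expression are not reproduced here.

```python
import numpy as np

def aggregated_channel(D, H2, H1, theta):
    """Direct channel plus cascaded user-RIS-BS channel (assumed notation)."""
    Phi = np.diag(np.exp(1j * theta))  # unit-modulus RIS phase shifts
    return D + H2 @ Phi @ H1           # shape (M, K)

def zf_rates(Q, p=1.0, sigma2=1.0):
    """Per-user instantaneous rate under ZF detection on channel Q."""
    G = np.linalg.inv(Q.conj().T @ Q)  # (Q^H Q)^{-1}, shape (K, K)
    noise_gain = np.real(np.diag(G))   # noise enhancement for each user
    return np.log2(1.0 + p / (sigma2 * noise_gain))

rng = np.random.default_rng(0)
M, N, K = 64, 16, 4  # hypothetical BS antennas, RIS elements, users

def crandn(*shape):
    """i.i.d. CN(0, 1) entries; a placeholder for the actual channel model."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

D = crandn(M, K)    # direct user-BS channel
H1 = crandn(N, K)   # user-RIS channel
H2 = crandn(M, N)   # RIS-BS channel
theta = rng.uniform(0.0, 2.0 * np.pi, N)  # arbitrary RIS phases, not optimized

Q = aggregated_channel(D, H2, H1, theta)
W = np.linalg.pinv(Q)  # ZF detector, equals (Q^H Q)^{-1} Q^H for full column rank
rates = zf_rates(Q)
```

Because `W @ Q` equals the identity when `Q` has full column rank, the detector removes multi-user interference entirely, and each user's rate is limited only by the noise enhancement term on the diagonal of $(Q^H Q)^{-1}$. Averaging `rates` over many channel draws would give a Monte Carlo estimate of the ergodic rate under these assumed channels.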