HMM-based speech synthesis generally suffers from a typical buzziness caused by over-simplified excitation modeling of voiced speech. To alleviate this effect, several studies have proposed new excitation models. However, no consensus has been reached on the perceptual importance of accurately modeling the periodic and aperiodic components of voiced speech, or on the extent to which each separately contributes to improving naturalness. This paper considers a generalized mixed excitation model, common to various existing approaches, in which periodic and aperiodic components coexist. At least three main factors may affect the quality of synthesis: the periodic waveform, the noise spectral weighting, and the noise time envelope. Based on a large subjective evaluation, the goal of this paper is threefold: i) to evaluate the relative perceptual importance of each factor, ii) to investigate the most appropriate method for modeling the periodic and aperiodic components, and iii) to provide clues for future work in excitation modeling.
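To make the three factors concrete, the following minimal sketch builds one voiced excitation segment as the sum of a periodic and an aperiodic component. It is an illustration under stated assumptions, not the method evaluated in this paper: the function name `mixed_excitation`, the harmonic-sum periodic waveform, the first-order high-pass used as noise spectral weighting, the pitch-synchronous raised-cosine time envelope, and all default parameter values are hypothetical choices made only for this example.

```python
import math
import random


def mixed_excitation(f0, fs, n_samples, max_voiced_freq=4000.0,
                     noise_gain=0.3, seed=0):
    """Return (periodic, aperiodic, excitation) for one voiced segment.

    All modeling choices below are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Factor 1: periodic waveform. Here, a sum of harmonics of f0 kept
    # below an assumed maximum voiced frequency, normalized to peak at 1.
    n_harm = int(max_voiced_freq // f0)
    norm = max(n_harm, 1)
    periodic = [
        sum(math.cos(2 * math.pi * k * f0 * n / fs)
            for k in range(1, n_harm + 1)) / norm
        for n in range(n_samples)
    ]

    # Factor 2: noise spectral weighting. Here, white Gaussian noise passed
    # through a crude first-order high-pass filter y[n] = x[n] - 0.9 x[n-1].
    white = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    shaped = [
        white[n] - 0.9 * (white[n - 1] if n > 0 else 0.0)
        for n in range(n_samples)
    ]

    # Factor 3: noise time envelope. Here, a raised cosine that peaks once
    # per pitch period and falls to zero halfway between peaks.
    envelope = [
        0.5 * (1.0 + math.cos(2 * math.pi * f0 * n / fs))
        for n in range(n_samples)
    ]
    aperiodic = [noise_gain * e * s for e, s in zip(envelope, shaped)]

    # Mixed excitation: both components coexist in the voiced segment.
    excitation = [p + a for p, a in zip(periodic, aperiodic)]
    return periodic, aperiodic, excitation


if __name__ == "__main__":
    # Two pitch periods at f0 = 100 Hz and fs = 16 kHz (160 samples each).
    periodic, aperiodic, excitation = mixed_excitation(100.0, 16000, 320)
    print(len(excitation), round(periodic[0], 3))
```

Changing only one of the three blocks at a time (for example, replacing the raised-cosine envelope with a constant one) gives stimuli that differ in a single factor, which is the kind of controlled comparison a subjective evaluation of relative importance would require.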