We tested the processing capacity of ensemble representation for multiple facial expressions using the simultaneous-sequential paradigm. Each set consisted of 16 faces conveying a variable number of happy and angry expressions. In Experiment 1, participants judged the perceived average emotion of each face set on a continuous scale. In the simultaneous condition, the 16 faces were presented concurrently; in the sequential condition, two sets of eight faces each were presented successively. Judgments varied with the number of happy versus angry faces in the sets and were sensitive, at the single-trial level, to the perceived mean emotion intensity (based on postexperiment ratings), providing evidence of a genuine mean representation rather than the mere use of a single face or enumeration. Experiments 2 and 3 replicated Experiment 1 but used a different response format (binary choices) and added masks following each display, respectively. Importantly, in all three experiments, performance was consistently better in the sequential than in the simultaneous condition, revealing a limited-capacity process. A set of control analyses ruled out enumeration or mere subsampling as the strategy participants used to perform the task. Collectively, these results indicate that participants could readily extract the mean emotion from multiple faces shown concurrently in a set, but that this process is best conceived as capacity limited.