A standard view of human language processing is that comprehenders build richly structured mental representations of natural language utterances, word by word, using computationally costly memory operations supported by domain-general working memory resources. However, three core claims of this view have been questioned, with some prior work arguing that (1) rich word-by-word structure building is not a core function of the language comprehension system, (2) apparent working memory costs are underlyingly driven by word predictability (surprisal), and/or (3) language comprehension relies primarily on domain-general rather than domain-specific working memory resources. In this work, we simultaneously evaluate all three of these claims using naturalistic comprehension in fMRI. In each participant, we functionally localize (a) a language-selective network and (b) a 'multiple-demand' network that supports working memory across domains, and we analyze responses in these two networks during naturalistic story listening with respect to a range of theory-driven predictors of working memory demand under rigorous surprisal controls. Results show robust surprisal-independent effects of word-by-word memory demand in the language network and no effect of working memory demand in the multiple-demand network. Our findings thus support the view that language comprehension (1) entails word-by-word structure building using (2) computationally intensive memory operations that are not explained by surprisal. However, these results challenge (3) the domain-generality of the resources that support these operations, instead indicating that working memory operations for language comprehension are carried out by the same neural resources that store linguistic knowledge.