Recent deep-learning approaches to semantic segmentation have made notable advances. Despite their high accuracy, however, they often require heavy computation, which makes them impractical for real-world applications. Two factors account for the prohibitive computational cost: 1) a heavy backbone CNN that produces high-resolution contextual information and 2) complex modules that aggregate multi-level features. To address these issues, we propose a computationally efficient architecture called "Sketch-and-Fill Network (SFNet)" with a three-stage Coarse-to-Fine Aggregation (CFA) module for semantic segmentation. In the proposed network, contextual information is first produced at a lower resolution, which largely reduces the overall computation in the backbone CNN. Then, to compensate for the detail lost at this lower resolution, the CFA module forms global structures and fills in fine details in a coarse-to-fine manner. To preserve global structures, the contextual information is passed to the CFA module without any reduction. Experimental results show that the proposed SFNet achieves a significantly lower computational load while delivering segmentation performance comparable to or better than that of state-of-the-art methods. Qualitative results on the Cityscapes, ADE20K and RUGD benchmarks show that our method is superior to state-of-the-art methods in capturing fine details while preserving global structures.
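As a rough illustration of the coarse-to-fine idea described above, the toy sketch below starts from a low-resolution context map and refines it stage by stage with progressively finer detail maps. It is not the CFA module itself: the abstract does not specify the fusion operator, the feature shapes, or the upsampling factor. The additive fusion, the 2x nearest-neighbour upsampling, the single-channel 2D lists, and the names `upsample_nearest`, `fuse`, and `coarse_to_fine` are all assumptions made for illustration.

```python
# Toy coarse-to-fine aggregation on single-channel 2D maps (lists of lists).
# Assumptions (not from the source): 2x nearest-neighbour upsampling,
# element-wise addition as the fusion step, and all function names.

def upsample_nearest(grid, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse(coarse, detail):
    """Element-wise sum of two maps that have the same shape."""
    if len(coarse) != len(detail) or len(coarse[0]) != len(detail[0]):
        raise ValueError("maps must have the same shape to be fused")
    return [[c + d for c, d in zip(crow, drow)]
            for crow, drow in zip(coarse, detail)]


def coarse_to_fine(context, details):
    """Refine `context` with each map in `details`, ordered coarse to fine.

    Each detail map is expected to have twice the height and width of the
    map produced by the previous stage.
    """
    out = context
    for detail in details:
        out = upsample_nearest(out, 2)  # carry the global structure upward
        out = fuse(out, detail)         # fill in finer detail at this scale
    return out
```

Passing a 1x1 context map together with three detail maps of sizes 2x2, 4x4, and 8x8 runs three refinement stages and returns an 8x8 map. A real network would operate on multi-channel feature tensors and would most likely use learned layers in place of the plain sum.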