Building extraction and change monitoring in remote sensing (RS) imagery play pivotal roles in applications such as urban planning, disaster management, and infrastructure monitoring. While significant progress has been made on single-image and bi-temporal RS analysis, effectively exploiting the rich temporal information in time series RS images remains a challenge. Time series RS images offer an extended temporal span for monitoring dynamic changes in building instances. However, they often exhibit noticeable appearance discrepancies and feature variations, which present substantial obstacles to effective multitemporal information aggregation. To address these challenges, we introduce a Building Extraction and Change Monitoring Network (BuildMon), which jointly models the segmentation masks, location tracking, and construction status of building instances. Our approach incorporates a spatial-temporal transformer to model relationships between images acquired at different times. Its windowed attention module captures spatial-temporal context, enabling feature aggregation over a larger scope. To improve performance on both tasks, we use ground truth masks together with semantic change information as supervisory signals for BuildMon. This supervision is complemented by a change-guided loss function, which highlights regions of change and assigns targeted weights to building areas within the imagery. To validate the effectiveness of our method, we conduct comprehensive experiments on the SpaceNet 7 dataset. The results show that our approach achieves state-of-the-art performance, with an mIoU of 67.90 and a SCOT score of 39.73.
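
The abstract does not give the exact form of the change-guided loss, so the following is only a minimal sketch of one plausible reading: a per-pixel binary cross-entropy in which building pixels and changed pixels receive larger weights. The function name `change_guided_bce`, the weight values `w_building` and `w_change`, and the rule that the change weight takes precedence are all assumptions made for illustration, not the formulation used by BuildMon.

```python
import math

def change_guided_bce(pred, target, change,
                      w_building=2.0, w_change=4.0, eps=1e-7):
    """Weighted binary cross-entropy over a flattened mask (illustrative only).

    pred   : predicted building probabilities, each in (0, 1)
    target : ground-truth building labels (0 or 1)
    change : 1 where the pixel's label changed between time steps, else 0
    w_building, w_change : hypothetical weights, not taken from the paper
    """
    total, weight_sum = 0.0, 0.0
    for p, y, c in zip(pred, target, change):
        # Clamp the probability to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        # Background pixels keep weight 1; building pixels are up-weighted;
        # changed pixels get the largest weight (assumed precedence).
        w = 1.0
        if y == 1:
            w = w_building
        if c == 1:
            w = w_change
        total += w * bce
        weight_sum += w
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum

# The same prediction error costs more on a changed pixel than on a static one.
target = [1, 1]
change = [1, 0]
loss_error_on_changed = change_guided_bce([0.2, 0.9], target, change)
loss_error_on_static = change_guided_bce([0.9, 0.2], target, change)
print(loss_error_on_changed > loss_error_on_static)
```

Under these assumed weights, a model is penalized more for missing a newly changed building pixel than for an equally wrong prediction on an unchanged one, which is the behavior the abstract attributes to the change-guided loss.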