When 2D drawings are unavailable or differ significantly from the actual site, scan-to-BIM (Building Information Modeling) technology is employed to generate 3D models from point cloud data. This process is still predominantly manual, and ongoing research aims to automate it. However, compared with 2D image data, 3D point cloud data remain persistently scarce, which limits the ability of deep learning models to learn diverse data characteristics and reduces their generalization performance. To address this data scarcity, this paper proposes a semi-automated framework for generating semantic segmentation datasets from 3D point clouds and BIM models. The framework includes a preprocessing method that spatially segments whole-building datasets and uses the boundary representations of BIM objects to detect intersections with the point cloud, enabling automated labeling. Using this framework, data from five buildings were processed into ten areas. In addition, six datasets were constructed by combining the Stanford 3D Indoor Scene Dataset (S3DIS) with the newly generated data, and both quantitative and qualitative evaluations were conducted on various areas. Models trained on datasets incorporating diverse domains consistently achieved the highest performance across most areas, demonstrating that diverse domain data significantly enhance model generalization. The proposed framework facilitates the generation of high-quality 3D point cloud datasets from various domains, supporting improved generalization of deep learning models.
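The automated labeling step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, for simplicity, that each BIM object's boundary representation is approximated by an axis-aligned bounding box and that any point falling inside a box inherits that object's semantic class. The names `BimObject` and `label_points` are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BimObject:
    """Hypothetical BIM object: a semantic class plus an axis-aligned
    bounding box approximating its boundary representation."""
    label: int              # semantic class id (e.g. wall, door, ceiling)
    bbox_min: np.ndarray    # (3,) minimum x, y, z of the box
    bbox_max: np.ndarray    # (3,) maximum x, y, z of the box

def label_points(points: np.ndarray, objects: list[BimObject],
                 unlabeled: int = -1) -> np.ndarray:
    """Assign each point the class of the first BIM object whose
    (box-approximated) boundary encloses it; `unlabeled` if none."""
    labels = np.full(len(points), unlabeled, dtype=np.int64)
    for obj in objects:
        inside = np.all((points >= obj.bbox_min) &
                        (points <= obj.bbox_max), axis=1)
        # only overwrite points that have not yet received a label
        labels[(labels == unlabeled) & inside] = obj.label
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 5.0, size=(1000, 3))  # toy point cloud
    wall = BimObject(label=1,
                     bbox_min=np.array([0.0, 0.0, 0.0]),
                     bbox_max=np.array([0.2, 5.0, 3.0]))
    labels = label_points(pts, [wall])
    print("points labeled as wall:", int((labels == 1).sum()))
```

In a real scan-to-BIM pipeline the boundary representation would be a full B-rep solid rather than a box, so the containment test would be replaced by a point-in-solid query, but the overall labeling flow (iterate over BIM objects, mark enclosed points with the object's class) is the same.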