This study addresses the need for design guidelines for cultural-heritage-based metaverse educational systems. A theoretical framework is constructed from the Unified Theory of Acceptance and Use of Technology (UTAUT), the Task-Technology Fit (TTF) model, and Flow Theory. Through qualitative research based on grounded theory (GT), three user perception factors (presence, interactivity, and narrativity) are introduced as external variables, and their relationship to users’ willingness to adopt a cultural heritage metaverse system is examined from the dual perspectives of user perception and technology acceptance. A scale was designed to test the theoretical model empirically, and 298 valid responses were collected through a structured process involving GT coding, pre-testing, and formal surveys. The findings indicate that interactivity, narrativity, and presence significantly enhance the flow experience, while performance expectancy, effort expectancy, social influence, facilitating conditions, technology–task fit, and flow positively influence users’ intention to adopt the system. Among these, technology–task fit emerged as the most influential factor. This integrated approach reduces subjectivity and bias in criteria determination, improving the objectivity and precision of cultural heritage metaverse system assessments and making the system more responsive to user needs.