Data integration is among the most challenging tasks for digital collections whose data are stored across multiple repositories. Two kinds of heterogeneity drive the difficulty. First, data sources commonly differ in both data schema and data values. Second, data representation and semantics vary: the same data may appear in different repositories under different representations, i.e., different metadata schemas. Recent research has focused on matching related metadata schemas. This paper proposes a metadata integration framework to support digital repositories in socio-cultural anthropology at the Princess Maha Chakri Sirindhorn Anthropology Centre (SAC), Thailand. The framework is based on the Metadata Lifecycle Model (MLM) and uses non-procedural schema mappings to express data relationships across diverse schemas. To validate the framework, a case study of metadata integration was conducted over the SAC digital repositories: the SAC common metadata schema was designed to support data mapping across 13 digital repositories, and the SAC “One Search” system was developed as an example implementation of the framework. Evaluation results showed that the proposed framework can support domain experts in socio-cultural anthropology in unified searching across the repositories.
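To make the idea of non-procedural schema mappings concrete, the following is a minimal Python sketch and not the SAC implementation. It assumes that each repository's mapping to a common schema can be written as a plain field-to-field table; the repository names (`repo_a`, `repo_b`), the field names, and the helper functions `to_common` and `unified_search` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical declarative mappings: source field -> common-schema field.
# The mapping is data, not code, so adding a repository means adding a table.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "repo_a": {"name": "title", "maker": "creator", "yr": "date"},
    "repo_b": {"dc_title": "title", "author": "creator", "created": "date"},
}


def to_common(repo: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rename a record's fields using the mapping for `repo`.

    Fields without a mapping entry are dropped.
    """
    mapping = MAPPINGS[repo]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def unified_search(
    records: List[Dict[str, str]], field: str, term: str
) -> List[Dict[str, str]]:
    """Case-insensitive substring search over one common-schema field."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]


if __name__ == "__main__":
    a = to_common("repo_a", {"name": "Spirit house", "maker": "Unknown"})
    b = to_common("repo_b", {"dc_title": "Spirit house ritual", "created": "1998"})
    print(unified_search([a, b], "title", "spirit"))
```

Once records from both hypothetical repositories are in the common schema, a single query covers them, which is the general effect a unified search system aims for.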
Cultural heritage organizations are increasingly adopting digital preservation technologies. Cultural heritage data is often disseminated as digital text through channels such as Wikipedia and cultural heritage archives. Extraction techniques are therefore an important part of acquiring knowledge from this data. However, digital text can be ambiguous and, in languages such as Thai, can have complex grammatical structures, which makes it challenging to extract information with high accuracy. We therefore propose a method for improving the performance of data extraction techniques based on word features, multiple instance learning, and unseen word mapping. Word features improve the quality of each word's definition: the part of speech (POS) and the word position are concatenated to establish an accurate definition of the word, and the result is converted into a vector. In addition, we use multiple instance learning to handle cases where the words do not fully express the meaning of the triple. We also cluster words to find the predicate word, removing irrelevant words between the subject and the object. New words that never appeared in training are handled by unseen word mapping with sub-word and nearest neighbor word mapping. We conducted several experiments on a cultural heritage knowledge graph to show the efficacy of the proposed method. The results demonstrated that the proposed technique outperforms existing models used in relation extraction systems, with precision, recall, and F1 score of 0.89, 0.88, and 0.89, respectively. It also performed well on unseen word prediction, with precision, recall, and F1 score of 0.81, 0.87, and 0.84, respectively.
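The unseen word mapping step can be illustrated with a minimal Python sketch. It is a stand-in, not the authors' method: it assumes that sub-words are character trigrams and that "nearest neighbor" means the known word with the highest Jaccard overlap of trigrams. The function names, the `min_sim` threshold, and the toy vocabulary are all hypothetical.

```python
from typing import Optional, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a word padded with < and >."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_unseen(word: str, vocab: Set[str], min_sim: float = 0.2) -> Optional[str]:
    """Map a word to its nearest known word by sub-word overlap.

    Known words map to themselves. Returns None when no known word
    reaches the (assumed) similarity threshold `min_sim`.
    """
    if word in vocab:
        return word
    target = char_ngrams(word)
    best, best_sim = None, 0.0
    for candidate in sorted(vocab):  # sorted for deterministic tie-breaking
        sim = jaccard(target, char_ngrams(candidate))
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best if best_sim >= min_sim else None


if __name__ == "__main__":
    vocab = {"temple", "festival", "museum"}
    print(map_unseen("temples", vocab))
    print(map_unseen("xyz", vocab))
```

A system using learned embeddings would likely compare vectors instead of trigram sets, but the control flow would be similar: look the word up, fall back to sub-word evidence, and pick the closest known word.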