Purpose
Knowledge graphs (KGs) are structured knowledge bases that represent real-world entities and are used in a variety of applications. Many of them are created and curated from a combination of automated and manual processes. Microdata embedded in Web pages for purposes of facilitating indexing and search engine optimization are a potential source to augment KGs under some assumptions of complementarity and quality that have not been thoroughly explored to date. In that direction, this paper aims to report results on a study that evaluates the potential of using microdata extracted from the Web to augment the large, open and manually curated Wikidata KG for the domain of touristic information. As large corpora of Web text is currently being leveraged via large language models (LLMs), these are used to compare the effectiveness of the microdata enhancement method.
Design/methodology/approach
The Schema.org taxonomy was used as the source to determine the annotation types to be collected. Here, the authors focused on tourism-related pages as a case study, selecting the relevant Schema.org concepts as point of departure. The large CommonCrawl resource was used to select those annotations from a large recent sample of the World Wide Web. The extracted annotations were processed and matched with Wikidata to estimate the degree to which microdata produced for SEO might become a valuable resource to complement KGs or vice versa. The Web pages themselves can also serve as a context to produce additional metadata elements using them as context in pipelines of an existing LLMs. That way, both the annotations and the contents itself can be used as sources.
Findings
The samples extracted revealed a concentration of metadata annotations in only a few of the relevant Schema.org attributes and also revealed the possible influence of authoring tools in a significant fraction of microdata produced. The analysis of the overlapping of attributes in the sample with those of Wikidata showed the potential of the technique, limited by the disbalance of the presence of attributes. The combination of those with the use of LLMs to produce additional annotations demonstrates the feasibility of the approach in the population of existing Wikidata locations. However, in both cases, the effectiveness appears to be lower in the cases of less content in the KG, which are arguably the most relevant when considering the scenario of an automated population approach.
Originality/value
The research reports novel empirical findings on the way touristic annotations with a SEO orientation are being produced in the wild and provides an assessment of their potential to complement KGs, or reuse information from those graphs. It also provides insights on the potential of using LLMs for the task.