Autonomous schema markups based on intelligent computing for search engine optimization

Abbasi, Burhan Ud Din; Fatima, Iram; Mukhtar, Hamid; Khan, Sharifullah; Alhumam, Abdulaziz; Ahmad, Hafiz Farooq

doi:10.7717/peerj-cs.1163

Cited by 6 publications

(3 citation statements)

References 59 publications


“…There are some approaches to extract Schema.org annotations from text (Abbasi et al., 2022) and the terminology has also been reused to some extent in other terminological efforts, as it is the case of the YAGO ontology (Tanon et al., 2020). However, we are chiefly interested here in how KGs as Wikidata may be complemented by Web sources.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing knowledge graphs with microdata and LLMs: the case of Schema.org and Wikidata in touristic information

Gonzalez-Garcia, González-Carreño, Rivas Machota et al. 2024


Purpose: Knowledge graphs (KGs) are structured knowledge bases that represent real-world entities and are used in a variety of applications. Many of them are created and curated through a combination of automated and manual processes. Microdata embedded in Web pages to facilitate indexing and search engine optimization (SEO) are a potential source for augmenting KGs, under assumptions of complementarity and quality that have not been thoroughly explored to date. In that direction, this paper reports the results of a study that evaluates the potential of using microdata extracted from the Web to augment the large, open and manually curated Wikidata KG for the domain of touristic information. Because large corpora of Web text are currently being leveraged via large language models (LLMs), LLMs are used as a point of comparison for the effectiveness of the microdata enhancement method.

Design/methodology/approach: The Schema.org taxonomy was used as the source to determine the annotation types to be collected. The authors focused on tourism-related pages as a case study, selecting the relevant Schema.org concepts as the point of departure. The large CommonCrawl resource was used to select those annotations from a large recent sample of the World Wide Web. The extracted annotations were processed and matched with Wikidata to estimate the degree to which microdata produced for SEO might become a valuable resource to complement KGs, or vice versa. The Web pages themselves can also serve as context for producing additional metadata elements in pipelines built on an existing LLM. That way, both the annotations and the content itself can be used as sources.

Findings: The extracted samples revealed a concentration of metadata annotations in only a few of the relevant Schema.org attributes, and also revealed the possible influence of authoring tools on a significant fraction of the microdata produced. The analysis of the overlap between attributes in the sample and those of Wikidata showed the potential of the technique, limited by the imbalance in the presence of attributes. Combining those annotations with LLMs to produce additional annotations demonstrates the feasibility of the approach for populating existing Wikidata locations. However, in both cases, the effectiveness appears to be lower for entities with less content in the KG, which are arguably the most relevant when considering the scenario of an automated population approach.

Originality/value: The research reports novel empirical findings on the way touristic annotations with an SEO orientation are being produced in the wild and provides an assessment of their potential to complement KGs, or to reuse information from those graphs. It also provides insights on the potential of using LLMs for the task.
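The collection step described in the abstract above (gathering Schema.org annotations from Web pages and observing which attributes dominate) can be illustrated with a minimal sketch. It is not the authors' pipeline: the class `MicrodataCounter`, the function `count_microdata`, and the sample HTML are hypothetical, and the sketch only handles the microdata attributes `itemtype` and `itemprop` on a single in-memory page rather than a CommonCrawl sample.

```python
from collections import Counter
from html.parser import HTMLParser


class MicrodataCounter(HTMLParser):
    """Counts Schema.org item types and property names found in microdata attributes."""

    def __init__(self):
        super().__init__()
        self.types = Counter()
        self.props = Counter()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        itemtype = attr_map.get("itemtype")
        if itemtype and "schema.org" in itemtype:
            # Keep only the type name, e.g. "Hotel" from "https://schema.org/Hotel".
            self.types[itemtype.rstrip("/").rsplit("/", 1)[-1]] += 1
        itemprop = attr_map.get("itemprop")
        if itemprop:
            # itemprop may hold several space-separated property names.
            for name in itemprop.split():
                self.props[name] += 1


def count_microdata(html):
    """Return (type counts, property counts) for one HTML document."""
    parser = MicrodataCounter()
    parser.feed(html)
    parser.close()
    return parser.types, parser.props


# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Hotel">
  <span itemprop="name">Example Hotel</span>
  <span itemprop="telephone">+00 000 000</span>
  <div itemprop="address" itemscope itemtype="https://schema.org/PostalAddress">
    <span itemprop="addressLocality">Example City</span>
  </div>
</div>
"""

types, props = count_microdata(SAMPLE_HTML)
print(types.most_common())
print(props.most_common())
```

Aggregating such counters over many pages is one simple way to see whether annotations cluster in a few attributes, which is the kind of concentration the Findings paragraph reports.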


“…The second part uses intelligent solutions to classify the blocks into predefined types [4]. However, both of these parts currently are facing some challenges, which are related to the hierarchical nature of web page blocks [5]:…”

Section: Introduction (mentioning)

confidence: 99%

Web Page Content Block Identification with Extended Block Properties

Griazev, Ramanauskaitė 2023. Applied Sciences


Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions are focused on segmentation but do not provide a more detailed description of the segment including its range (minimum and maximum HTML code bounds, covering the segment content) and variants (the same segments with different content). Therefore the paper proposes a novel solution designed to find all web page content blocks and detail them for further usage. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates its relation to other blocks, including hierarchical as well as sibling blocks. The evaluation of the method reveals its ability to identify more content blocks in comparison to human labeling (in manual labeling only 24% of blocks were labeled). By using the proposed method, manual labeling effort could be reduced by at least 70%. Better performance was observed in comparison to other analyzed web page segmentation methods, and better recall was achieved due to focus on processing every block present on a page, and providing a more detailed web page division into content block data by presenting block boundary range and block variation data.
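As a rough illustration of the idea of relating sibling blocks in a DOM tree and spotting variants of the same segment, the following sketch builds a simple element tree and compares sibling subtrees by their tag sequences. It is an assumption-laden stand-in, not the method proposed in the paper: the names `Node`, `TreeBuilder`, and `similar_siblings`, the `0.8` threshold, and the sample markup are all hypothetical, and tag-sequence similarity replaces the text similarity measure the abstract mentions.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def signature(self):
        """Tag names of this subtree in document order."""
        tags = [self.tag]
        for child in self.children:
            tags.extend(child.signature())
        return tags


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def similar_siblings(node, threshold=0.8):
    """Return index pairs of children whose subtree tag sequences are similar."""
    pairs = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            ratio = SequenceMatcher(None, kids[i].signature(), kids[j].signature()).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical markup: two list items share a structure, the third differs.
SAMPLE = "<ul><li><a>One</a><span>x</span></li><li><a>Two</a><span>y</span></li><li><p>Other</p></li></ul>"

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
ul = builder.root.children[0]
pairs = similar_siblings(ul)
print(pairs)
```

Here the first two `li` elements are reported as a similar pair, while the third is treated as a separate block; a fuller solution would also record each block's HTML range and compare text content.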


“…Automated data extraction from web pages based on HTML code faces challenges related to recognizing the hierarchical nature of blocks, and to their segmentation and categorization (Cheng et al., 2019; Hashemi, 2020; Abbasi et al., 2022). There is a lack of solutions that can autonomously extract and correctly categorize data from unknown web page structures; therefore, a method is proposed that aims to improve web page segmentation solutions by covering a wider range of content blocks and providing insights into their internal structure and hierarchical relationships.…”

Section: Web Page Content Block Identification with Extended B... (unclassified)

Internet Web page content block dataset and solutions for its data labelling simplification

Griazev


Supervisor: Prof. Dr. Simona RAMANAUSKAITĖ (Vilnius Gediminas Technical University, Informatics Engineering, T 007). Dissertation Defence Council of the Scientific Field of Informatics Engineering of Vilnius Gediminas Technical University: Chair Prof. Dr. Dalius MAŽEIKA (Vilnius Gediminas Technical University, Informatics Engineering, T 007). Members: Dr. Jolita BERNATAVIČIENĖ (Vilnius University, Informatics, N 009), Dr. Robertas DAMAŠEVIČIUS (Kaunas University of Technology, Informatics Engineering, T 007), Prof. Dr. Arnas KAČENIAUSKAS (Vilnius Gediminas Technical University, Informatics Engineering, T 007), Dr. Kristo KARJUST (Tallinn University of Technology, Estonia, Informatics Engineering, T 007). The dissertation will be defended at a public meeting of the Dissertation Defence Council of the Scientific Field of Informatics Engineering on 12 June 2024 at 2 p.m. in the SRA-I meeting hall of Vilnius Gediminas Technical University.
