2020
DOI: 10.1002/spe.2886

On the synthesis of metadata tags for HTML files

Abstract: RDFa, JSON‐LD, Microdata, and Microformats make it possible to endow the data in HTML files with metadata tags that help software agents understand them. Unfortunately, many HTML files do not have any metadata tags, which has motivated many authors to work on proposals to synthesize them. These proposals have some problems: the authors either provide an overall picture of their designs without many details on the techniques behind the scenes, or focus on the techniques but do not describe the design of the…

Cited by 5 publications
(3 citation statements)
References 16 publications
“…[48] The second category of analysis is called static analysis [49] in which the investigation is processed before execution and the analysis is based on the information available in the URL. The extracted attributes may be comprising of lexical parameters in the URL string, host details, and occasionally HTML [50] and JavaScript code content. [51] The distribution of the information existing in benign and malicious URLs are different and with these distributions of features, a prediction framework can be built.…”
Section: Proposed Methodology (mentioning)
confidence: 99%
“…It allows to check them on a collection of well-known datasets and allows to compare the effectiveness results as homogeneously as possible and to rank them as automatically as possible. However, our recent experience with devising new information extractors (Jiménez & Corchuelo, 2016a, 2016b; Jiménez et al., 2020, 2021; Roldán et al., 2017, 2020, 2021) reveals that it can be further improved to take some additional issues into account, namely: (a) whether the validation datasets are completely or partially annotated; (b) whether they contain record values or not and how their structure is taken into account to compute the effectiveness measures; and (c) how the matchings amongst the annotations and the extractions are computed.…”
Section: Related Work (mentioning)
confidence: 99%
“…In the literature, there are many proposals to extract data from HTML documents in general, not specifically tables (Ferrara, de Meo, Fiumara, & Baumgartner, 2014; Sleiman & Corchuelo, 2013a). They rely on text alignment (Sleiman & Corchuelo, 2013b), neural networks (Sleiman & Corchuelo, 2014), learning first-order rules (Jiménez & Corchuelo, 2016a), inferring propositio-relational rules (Jiménez & Corchuelo, 2016b), learning decision trees (Uzun, Agun, & Yerlikaya, 2013), embedding graphs (Jiménez, Roldán, Gallego, & Corchuelo, 2020), or using n-grams and rendering information (Figueiredo, Assis, & Ferreira, 2017), to mention a few. Unfortunately, they do not seem to be appropriate to extract the underlying relationships between the cells in HTML tables (Cafarella et al., 2018), which motivated much work on table-understanding (Roldán et al., 2020; Zhang & Balog, 2020).…”
Section: Context and Motivation (mentioning)
confidence: 99%