2019
DOI: 10.3390/app9235102

A Novel Approach to Data Extraction on Hyperlinked Webpages

Abstract: The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages, providing further detailed information to certain attribute values. Extracting schema of such relational tables is a challenge due to the non-existence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages using our in-house developed web-crawler, from various web sites. Tables from the HTML code were extracted and table rows were labeled with ap…

Cited by 11 publications (4 citation statements)
References: 26 publications
“…Table 5 shows the result of the random undersampling technique with multiple feature selection techniques. The GB model performed poorly, with an accuracy of 68.1% in the case of ReliefF and 68.6% with the OneR feature selections technique [63, 64, 65]. The XGBOOST model performed very well for correlation and information gain feature selection techniques with an accuracy of 82.8%.…”
Section: Results (mentioning)
confidence: 99%
“…The information may be used to build models for predicting academic success, locating at-risk students, and spotting problematic behavior. It is designed for use in research on student behavior and performance [38][39][40][41][42][43][44][45][46][47][48][49].…”
Section: Discussion (mentioning)
confidence: 99%
“…The proposed algorithm [26] is to reduce the stop words that have no semantic implication. This article [27] presents a corpus of 15000 web pages and applies a novel algorithm. The corpus is made by using a web crawler.…”
Section: Literature Review (mentioning)
confidence: 99%