2017
DOI: 10.1016/j.ipm.2016.11.006

Sampling strategies for information extraction over the deep web

Abstract: Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text …

citations
Cited by 10 publications
(3 citation statements)
references
References 41 publications
“…Even though previous studies found it difficult to perform deep web collection, they require multiple keywords in order to gather associated information if available (Barrio and Gravano, 2017). Keywords should be managed according to related themes or ranges.…”
Section: Methods | Citation type: mentioning
confidence: 99%
“…M is the modeling function, gt is the ground truth, s is the modified Kendall's Tau score, and S is the Modified Kendall's Tau scoring function. During learning, scores were backpropagated in a crawl graph with the help of reinforcement learning. Four datasets of Defense Advanced Research Projects Agency (DARPA) Memex project were used. It achieved the relevancy score of value 0.698. Barrio et al. recommended a systematic query-based technique used for building a high-quality document sample (Barrio & Gravano, 2017). A representative sample is always needed, which represents the deep web collection. This technique was based on the query execution order, revision of query order, and filtration of queries. Open directory project was used as a dataset that contains 335 real web collections. Performance metrics such as coverage, sample size, unique tuples, and issue queries were used to evaluate the sample-based technique. Improved coverage and sampling efficiency were reported rather than exact values. This query-based technique can be used in focused crawling for the deep web.…”
Section: Related Work | Citation type: mentioning
confidence: 99%
“…The research concentrates on a particularly respected family of text groups, individually, the so-called deep-web text groups, whose insides aren't crawl-able and are merely obtainable through enquiring. There is a very significant step for effective material extraction over deep-web text groups [22]. Wang and Stewart 2015 studied geographic information science, modeling geographic dynamics found on spatiotemporal material mined from a Web, particularly unconstructed facts like online news reports. Consideration of spatiotemporal besides semantic data from a group of Web forms allows us to shape a rich exemplification of geographic details labeled in the text, taking where, when, or what proceedings have happened.…”
Section: Web Based Content Extraction and Retrieval In Web Engineering | Citation type: mentioning
confidence: 99%