DOI: 10.3990/1.9789036541237

Deep web content monitoring

Abstract: The research reported in this thesis has been carried out under the auspices of SIKS,

Cited by 4 publications (2 citation statements)
References 50 publications (155 reference statements)
“…Another approach uses the TF-IDF as keyword or appropriate word extraction method, link classification method, and classification technique for its crawler. In some approaches, it is proposed to use the TF-IDF algorithm with various modifications based on the specific requirements of the crawler [4][5][6]. An approach discusses the use of TF-ICF (Term Frequency-Inverse Class Frequency), which calculates the popularity score for various categories instead of TF-IDF which calculates the popularity score of words.…”
Section: Literature Review
Confidence: 99%
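The statement above contrasts TF-IDF, which scores a term's importance across documents, with TF-ICF, which replaces document frequency with class (category) frequency. As a minimal sketch of that distinction (the tokenized-document representation and helper names here are illustrative assumptions, not taken from the cited works):

```python
import math

def tf_idf(term, doc, docs):
    """Score a term by its frequency in one document, discounted by
    how many documents in the collection contain it."""
    tf = doc.count(term) / len(doc)          # term frequency in this document
    df = sum(1 for d in docs if term in d)   # document frequency across the collection
    return tf * math.log(len(docs) / df) if df else 0.0

def tf_icf(term, doc, classes):
    """TF-ICF variant: discount by how many *categories* contain the term,
    so terms concentrated in few classes score higher."""
    tf = doc.count(term) / len(doc)
    cf = sum(1 for c in classes if term in c)  # class frequency
    return tf * math.log(len(classes) / cf) if cf else 0.0

# Illustrative usage: "crawler" appears in only one document, so it
# outscores "web", which appears in two of the three documents.
docs = [["deep", "web", "crawler"], ["web", "search"], ["deep", "learning"]]
score_rare = tf_idf("crawler", docs[0], docs)
score_common = tf_idf("web", docs[0], docs)
```

In a focused crawler, such scores would feed keyword extraction or link classification; the exact modifications the cited approaches apply to TF-IDF are specific to their requirements and are not reproduced here.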
“…Second, the different query execution and document retrieval and processing schedules that we discuss in Section 3, and that we evaluate in Section 5, can lead to fundamentally different (e.g., in terms of quality and efficiency) focused crawling executions. Importantly, our sampling strategies are crucial for other important building blocks of deep-web crawling, in general, namely, automatic filling of search forms (Kantorski et al, 2015) and content monitoring (Mohammad Khelghati, 2016), since they require high-quality and efficient document samples from the collection to select which queries to issue and to decide when to update the content summary of the collection, respectively.…”
Section: Related Work
Confidence: 99%