Template-generated Web pages contain most of the structured data on the Web.
Clustering these pages according to their template structure is an important
problem in wrapper-based structured data extraction systems. These systems
extract structured data using wrappers that must be matched only to pages
generated from a particular template. Selecting pages of a single template
type from all crawled Web pages is a time-consuming task. Methods exist to
cluster Web pages according to their structural similarity, but in most cases
they are too computationally expensive to be applicable at Web scale. We
propose a novel, highly scalable approach to structurally cluster Web pages by
employing XPath addresses of inbound inner-site links. We demonstrate the
effectiveness of our method by clustering more than one million Web pages from
many real-world Web sites in a few minutes and achieving more than 90%
accuracy.
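
The core idea, grouping pages by the XPath address of the links that point to them, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the input format (pairs of target URL and anchor XPath), the function names, and the choice to generalize an XPath by dropping its positional indices are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def generalize_xpath(xpath):
    """Drop positional predicates such as [3] so sibling links share one key.

    Assumption: stripping indices is one simple way to generalize an XPath;
    the abstract does not say how addresses are compared.
    """
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_by_inbound_xpath(links):
    """Group target URLs by the generalized XPath of their inbound links.

    links: iterable of (target_url, xpath_of_anchor_on_source_page) pairs
    (a hypothetical input format).
    """
    votes = defaultdict(Counter)
    for target_url, xpath in links:
        votes[target_url][generalize_xpath(xpath)] += 1

    clusters = defaultdict(list)
    for target_url, counter in votes.items():
        # A page may be linked from several places; use the most frequent one.
        key = counter.most_common(1)[0][0]
        clusters[key].append(target_url)
    return dict(clusters)


# Hypothetical inner-site links collected while crawling one site.
links = [
    ("/product/1", "/html/body/div[2]/ul/li[1]/a"),
    ("/product/2", "/html/body/div[2]/ul/li[2]/a"),
    ("/about", "/html/body/div[1]/a[3]"),
]
clusters = cluster_by_inbound_xpath(links)
```

Because the grouping needs only one pass over the link list and never compares the contents of two pages, a scheme of this kind scales roughly linearly with the number of links, which is consistent with the scalability argument made in the abstract.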
The success of a company hinges on identifying and responding to competitive pressures. The main objective of online business intelligence is to collect valuable information from many Web sources to support decision making and thus gain a competitive advantage. However, online business intelligence presents non-trivial challenges to Web data extraction systems, which must deal with technologically sophisticated modern Web pages where traditional manual programming approaches often fail. In this paper, we review commercially available state-of-the-art Web data extraction systems and their technological advances in the context of online business intelligence.
We propose a novel approach to the extraction of structured Web data called ClustVX. It clusters visually similar Web page elements by exploiting their visual formatting and structural features. The clusters are then used to derive extraction rules. In an experimental evaluation on three publicly available benchmark data sets, the ClustVX system outperforms state-of-the-art structured data extraction systems.
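
As a rough illustration of clustering elements on combined structural and visual features, the following Python sketch groups elements by a key built from a generalized XPath plus two style properties. It is a simplified stand-in, not the ClustVX algorithm: the `Element` type, the chosen features (font size and color), and the exact-match grouping are all assumptions.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical record for one rendered page element."""
    xpath: str
    font_size: str
    color: str
    text: str


def feature_key(el):
    """Combine a structural feature with visual ones (assumed feature set)."""
    structural = re.sub(r"\[\d+\]", "", el.xpath)  # tag path without indices
    return (structural, el.font_size, el.color)


def cluster_elements(elements):
    """Group element texts whose structural and visual features match exactly."""
    groups = defaultdict(list)
    for el in elements:
        groups[feature_key(el)].append(el.text)
    return dict(groups)


# Made-up elements from two product records on one page.
elements = [
    Element("/html/body/div[1]/span[1]", "18px", "#000", "Laptop A"),
    Element("/html/body/div[2]/span[1]", "18px", "#000", "Laptop B"),
    Element("/html/body/div[1]/span[2]", "12px", "#c00", "$999"),
    Element("/html/body/div[2]/span[2]", "12px", "#c00", "$1099"),
]
element_clusters = cluster_elements(elements)
```

In this made-up input all four elements share the same tag path once indices are removed, so structure alone would merge titles and prices; the visual features are what separate them into two clusters, each of which could then serve as the basis for one extraction rule.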