Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitationtheir greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
Detecting complex patterns in event streams, i.e., complex event processing (CEP), has become increasingly important for modern enterprises to react quickly to critical situations. In many practical cases business events are generated based on pre-defined business logics. Hence constraints, such as occurrence and order constraints, often hold among events. Reasoning using these known constraints enables us to predict the non-occurrences of certain future events, thereby helping us to identify and then terminate the long running query processes that are guaranteed to not lead to successful matches.In this work, we focus on exploiting event constraints to optimize CEP over large volumes of business transaction streams. Since the optimization opportunities arise at runtime, we develop a runtime query unsatisfiability (RunSAT) checking technique that detects optimal points for terminating query evaluation. To assure efficiency of RunSAT checking, we propose mechanisms to precompute the query failure conditions to be checked at runtime. This guarantees a constant-time RunSAT reasoning cost, making our technique highly scalable. We realize our optimal query termination strategies by augmenting the query with Event-Condition-Action rules encoding the pre-computed failure conditions. This results in an event processing solution compatible with state-of-the-art CEP architectures. Extensive experimental results demonstrate that significant performance gains are achieved, while the optimization overhead is small.
Web performance is a key differentiation among content providers. Snafus and slowdowns at major web sites demonstrate the difficulty that companies face trying to scale to a large amount of web traffic. One solution to this problem is to store web content at server-side and edge-caches for fast delivery to the end users. However, for many e-commerce sites, web pages are created dynamically based on the current state of business processes, represented in application servers and databases . Since application servers, databases, web servers, and caches are independent components, there is no efficient mechanism to make changes in the database content reflected to the cached web pages. As a result, most application servers have to mark dynamically generated web pages as non-cacheable. In this paper, we describe the architectural framework of the CachePortal system for enabling dynamic content caching for database-driven e-commerce sites. We describe techniques for intelligently invalidating dynamically generated web pages in the caches, thereby enabling caching of web pages generated based on database contents. We use some of the most popular components in the industry to illustrate the deployment and applicability of the proposed architecture.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.