Concurrent programs are notorious for containing errors that are difficult to reproduce and diagnose. Two common kinds of concurrency errors are data races and atomicity violations (informally, atomicity means that executing methods concurrently is equivalent to executing them serially). Several static and dynamic (run-time) analysis techniques exist to detect potential races and atomicity violations. Run-time checking may miss errors in unexecuted code and incurs significant run-time overhead. On the other hand, run-time checking generally produces fewer false alarms than static analysis; this is a significant practical advantage, since diagnosing all of the warnings from static analysis of large codebases may be prohibitively expensive.

This paper explores the use of static analysis to significantly decrease the overhead of run-time checking. Our approach is based on a type system for analyzing data races and atomicity. A type discovery algorithm is used to obtain types for as much of the program as possible (complete type inference for this type system is NP-hard, and parts of the program might be untypable). Warnings from the typechecker are used to identify parts of the program from which run-time checking can safely be omitted. The approach is completely automatic, scalable to very large programs, and significantly reduces the overhead of run-time checking for data races and atomicity violations.
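As a rough illustration of the hybrid idea described above (a minimal sketch, not the paper's type system or checker), the Python fragment below skips run-time race checks for fields that a static analysis has already verified, so verified fields carry no monitoring overhead. The field names, the `statically_verified` set, and the simplistic unsynchronized-access heuristic are all illustrative assumptions; a real detector would track locksets or vector clocks.

```python
import threading

# Fields the static typechecker proved race-free (hypothetical names).
statically_verified = {"balance"}

# field name -> name of the thread that last touched it outside any lock
_last_unsynced_accessor = {}


def check_access(field_name, holds_lock):
    """Run-time race check, skipped entirely for statically verified fields."""
    if field_name in statically_verified:
        return  # no instrumentation cost: static analysis already proved this safe
    if holds_lock:
        return  # synchronized access: nothing suspicious to record here
    me = threading.current_thread().name
    prev = _last_unsynced_accessor.get(field_name)
    if prev is not None and prev != me:
        print(f"potential data race on '{field_name}': {prev} then {me}, no lock held")
    _last_unsynced_accessor[field_name] = me
```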
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
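The following Python sketch illustrates one very simple instance of the kind of rule mining described above; it is not the paper's algorithm. It learns "drop this query parameter" rewrite rules from equivalence classes of duplicate URLs, accepting a parameter only if dropping it collapses duplicates within some class and never merges two different classes. The sample URLs and helper names are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def drop_param(url, param):
    """Rewrite a URL by removing one query parameter."""
    p = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(p.query, keep_blank_values=True) if k != param]
    return urlunparse(p._replace(query=urlencode(kept)))


def mine_droppable_params(equivalence_classes):
    """Query params whose removal collapses duplicates within a class but never merges classes."""
    candidates = {k for cls in equivalence_classes for u in cls
                  for k, _ in parse_qsl(urlparse(u).query, keep_blank_values=True)}
    rules = set()
    for param in candidates:
        canon_to_class = {}
        safe, collapses = True, False
        for i, cls in enumerate(equivalence_classes):
            forms = {drop_param(u, param) for u in cls}
            if len(forms) < len(set(cls)):
                collapses = True          # the rule removes duplicates inside this class
            for c in forms:
                if canon_to_class.setdefault(c, i) != i:
                    safe = False          # the rule would merge two different classes
            if not safe:
                break
        if safe and collapses:
            rules.add(param)
    return rules


classes = [
    ["http://example.com/story?id=7&sessionid=a1", "http://example.com/story?id=7&sessionid=b2"],
    ["http://example.com/story?id=9&sessionid=c3", "http://example.com/story?id=9"],
]
print(mine_droppable_params(classes))   # {'sessionid'}
```

Running the fragment prints {'sessionid'}: dropping the session identifier canonicalizes newly crawled URLs without fetching their content, while the content-determining id parameter is retained.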
The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these rules for de-duplication using just the URL strings, without fetching the content explicitly. Our technique mines the crawl logs and utilizes clusters of similar pages to extract transformation rules, which are used to normalize the URLs belonging to each cluster. Preserving every mined rule for de-duplication is inefficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves twice the reduction in duplicates with only half as many rules as the most recent previous approach. The scalability of the framework is demonstrated by a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework.
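As a minimal sketch of the rule-generalization step (an illustration under assumed rule encodings, not the paper's learner or its MapReduce implementation), the fragment below groups site-specific rules by their pattern, in the spirit of a map/reduce aggregation, and keeps only patterns with sufficient cross-site support, shrinking the rule set that must be retained for web-scale de-duplication. The rule strings and the support threshold are hypothetical.

```python
from collections import defaultdict

# (host, rule) pairs mined from crawl-log clusters of duplicate URLs (hypothetical examples)
site_rules = [
    ("news.example.com", "drop-param:sessionid"),
    ("shop.example.com", "drop-param:sessionid"),
    ("blog.example.org", "drop-param:sessionid"),
    ("shop.example.com", "drop-param:ref"),
]


def generalize(rules, min_support=2):
    """Map: emit (rule, host) pairs. Reduce: keep rules supported by >= min_support hosts."""
    support = defaultdict(set)
    for host, rule in rules:            # "map" phase: group by rule pattern
        support[rule].add(host)
    return {rule for rule, hosts in support.items()   # "reduce" phase: threshold on support
            if len(hosts) >= min_support}


print(generalize(site_rules))   # {'drop-param:sessionid'} generalizes across sites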