Concurrent programs are notorious for containing errors that are difficult to reproduce and diagnose. Two common kinds of concurrency errors are data races and atomicity violations (informally, atomicity means that executing methods concurrently is equivalent to executing them serially). Several static and dynamic (run-time) analysis techniques exist to detect potential races and atomicity violations. Run-time checking may miss errors in unexecuted code and incurs significant run-time overhead. On the other hand, run-time checking generally produces fewer false alarms than static analysis; this is a significant practical advantage, since diagnosing all of the warnings from static analysis of large codebases may be prohibitively expensive.

This paper explores the use of static analysis to significantly decrease the overhead of run-time checking. Our approach is based on a type system for analyzing data races and atomicity. A type discovery algorithm is used to obtain types for as much of the program as possible (complete type inference for this type system is NP-hard, and parts of the program might be untypable). Warnings from the typechecker are used to identify parts of the program from which run-time checking can safely be omitted. The approach is completely automatic, scalable to very large programs, and significantly reduces the overhead of run-time checking for data races and atomicity violations.
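As a rough illustration of the hybrid idea described above (a minimal sketch, not the paper's type system or checker), the Python fragment below skips run-time race checks for fields that a static analysis has already verified, so verified fields carry no monitoring overhead. The field names, the `statically_verified` set, and the simplistic unsynchronized-access heuristic are all illustrative assumptions; a real detector would track locksets or vector clocks.

```python
import threading

# Fields the static typechecker proved race-free (hypothetical names).
statically_verified = {"balance"}

# field name -> name of the thread that last touched it outside any lock
_last_unsynced_accessor = {}


def check_access(field_name, holds_lock):
    """Run-time race check, skipped entirely for statically verified fields."""
    if field_name in statically_verified:
        return  # no instrumentation cost: static analysis already proved this safe
    if holds_lock:
        return  # synchronized access: nothing suspicious to record here
    me = threading.current_thread().name
    prev = _last_unsynced_accessor.get(field_name)
    if prev is not None and prev != me:
        print(f"potential data race on '{field_name}': {prev} then {me}, no lock held")
    _last_unsynced_accessor[field_name] = me
```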
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
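The following Python sketch illustrates one very simple instance of the kind of rule mining described above; it is not the paper's algorithm. It learns "drop this query parameter" rewrite rules from equivalence classes of duplicate URLs, accepting a parameter only if dropping it collapses duplicates within some class and never merges two different classes. The sample URLs and helper names are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def drop_param(url, param):
    """Rewrite a URL by removing one query parameter."""
    p = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(p.query, keep_blank_values=True) if k != param]
    return urlunparse(p._replace(query=urlencode(kept)))


def mine_droppable_params(equivalence_classes):
    """Query params whose removal collapses duplicates within a class but never merges classes."""
    candidates = {k for cls in equivalence_classes for u in cls
                  for k, _ in parse_qsl(urlparse(u).query, keep_blank_values=True)}
    rules = set()
    for param in candidates:
        canon_to_class = {}
        safe, collapses = True, False
        for i, cls in enumerate(equivalence_classes):
            forms = {drop_param(u, param) for u in cls}
            if len(forms) < len(set(cls)):
                collapses = True          # the rule removes duplicates inside this class
            for c in forms:
                if canon_to_class.setdefault(c, i) != i:
                    safe = False          # the rule would merge two different classes
            if not safe:
                break
        if safe and collapses:
            rules.add(param)
    return rules


classes = [
    ["http://example.com/story?id=7&sessionid=a1", "http://example.com/story?id=7&sessionid=b2"],
    ["http://example.com/story?id=9&sessionid=c3", "http://example.com/story?id=9"],
]
print(mine_droppable_params(classes))   # {'sessionid'}
```

Running the fragment prints {'sessionid'}: dropping the session identifier canonicalizes newly crawled URLs without fetching their content, while the content-determining id parameter is retained.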
The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these rules for de-duplication using just the URL strings, without fetching the content explicitly. Our technique mines the crawl logs and utilizes clusters of similar pages to extract transformation rules, which are used to normalize the URLs belonging to each cluster. Preserving every mined rule for de-duplication is inefficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves twice the reduction in duplicates with only half as many rules as the most recent previous approach. The scalability of the framework is demonstrated by a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework.
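As a minimal sketch of the rule-generalization step (an illustration under assumed rule encodings, not the paper's learner or its MapReduce implementation), the fragment below groups site-specific rules by their pattern, in the spirit of a map/reduce aggregation, and keeps only patterns with sufficient cross-site support, shrinking the rule set that must be retained for web-scale de-duplication. The rule strings and the support threshold are hypothetical.

```python
from collections import defaultdict

# (host, rule) pairs mined from crawl-log clusters of duplicate URLs (hypothetical examples)
site_rules = [
    ("news.example.com", "drop-param:sessionid"),
    ("shop.example.com", "drop-param:sessionid"),
    ("blog.example.org", "drop-param:sessionid"),
    ("shop.example.com", "drop-param:ref"),
]


def generalize(rules, min_support=2):
    """Map: emit (rule, host) pairs. Reduce: keep rules supported by >= min_support hosts."""
    support = defaultdict(set)
    for host, rule in rules:            # "map" phase: group by rule pattern
        support[rule].add(host)
    return {rule for rule, hosts in support.items()   # "reduce" phase: threshold on support
            if len(hosts) >= min_support}


print(generalize(site_rules))   # {'drop-param:sessionid'} generalizes across sites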