Christy A. Coghlan scite author profile

Regular expressions (regexes) are a popular and powerful means of automatically manipulating text. Regexes are also an understudied denial of service vector (ReDoS). If a regex has super-linear worst-case complexity, an attacker may be able to trigger this complexity, exhausting the victim's CPU resources and causing denial of service. Existing research has shown how to detect these super-linear regexes, and practitioners have identified super-linear regex anti-pattern heuristics that may lead to such complexity. In this paper, we empirically study three major aspects of ReDoS that have hitherto been unexplored: the incidence of super-linear regexes in practice, how they can be prevented, and how they can be repaired. In the ecosystems of two of the most popular programming languages, JavaScript and Python, we detected thousands of super-linear regexes affecting over 10,000 modules across diverse application domains. We also found that the conventional wisdom for super-linear regex anti-patterns has few false negatives but many false positives; these anti-patterns appear to be necessary, but not sufficient, signals of super-linear behavior. Finally, we found that when faced with a super-linear regex, developers favor revising it over truncating input or developing a custom parser, regardless of whether they had been shown examples of all three fix strategies. These findings motivate further research into ReDoS, since many modules are vulnerable to it and existing mechanisms to avoid it are insufficient. We believe that ReDoS vulnerabilities are a larger threat in practice than might have been guessed. "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." (Jamie Zawinski) CCS CONCEPTS • Software and its engineering → Empirical software validation; Software libraries and repositories; • Security and privacy → Denial-of-service attacks.
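As a rough illustration of the super-linear behavior this abstract describes, the sketch below times a backtracking regex engine (Python's `re`) on mismatching inputs of growing length. The pattern `^(a+)+$`, the revised pattern `^a+$`, the helper names, and the input sizes are all assumptions chosen for this example; they are not taken from the paper. The revised pattern corresponds to the "revise the regex" fix strategy mentioned in the abstract.

```python
import re
import time

# Hypothetical example pattern: nested quantifiers over the same character,
# a commonly cited super-linear anti-pattern.
SUPER_LINEAR = re.compile(r"^(a+)+$")
# A revised pattern that accepts the same strings without nested quantifiers.
REVISED = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


def probe(pattern, sizes):
    """Time the pattern on mismatching inputs 'aa...a!' of growing length."""
    timings = []
    for n in sizes:
        result, elapsed = time_match(pattern, "a" * n + "!")
        timings.append((n, result is not None, elapsed))
    return timings


if __name__ == "__main__":
    # Sizes are kept small so the demonstration finishes quickly.
    for name, pattern in [("super-linear", SUPER_LINEAR), ("revised", REVISED)]:
        for n, matched, elapsed in probe(pattern, [14, 16, 18, 20]):
            print(f"{name:>12} n={n:2d} matched={matched} {elapsed:.4f}s")
```

On a backtracking engine, the time for the first pattern would be expected to grow sharply with each added character, while the revised pattern stays fast; exact timings depend on the machine and the engine version.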

Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions

Davis

Michael

Coghlan

et al. 2019

This paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But can regexes be copy/pasted across language boundaries while retaining their semantic and performance characteristics? In our survey of 158 professional software developers, most indicated that they re-use regexes across language boundaries and about half reported that they believe regexes are a universal language. We experimentally evaluated the riskiness of this practice using a novel regex corpus: 537,806 regexes from 193,524 projects written in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. Using our polyglot regex corpus, we explored the hitherto-unstudied regex portability problems: logic errors due to semantic differences, and security vulnerabilities due to performance differences. We report that developers' belief in a regex lingua franca is understandable but unfounded. Though most regexes compile across language boundaries, 15% exhibit semantic differences across languages and 10% exhibit performance differences across languages. We explained these differences using regex documentation, and further illuminate our findings by investigating regex engine implementations. Along the way we found bugs in the regex engines of JavaScript-V8, Python, Ruby, and Rust, and potential semantic and performance regex bugs in thousands of modules. CCS CONCEPTS • Software and its engineering → Reusability; • Social and professional topics → Software selection and adaptation.

MCAT: Motif Combining and Association Tool

Yang

Robertson

Guo

et al. 2019

Journal of Computational Biology

Motivation: De novo motif discovery in biological sequences is always an important and computationally challenging problem. In the past 20 years, a myriad of algorithms have been proposed to solve this problem with varying success. Ensemble algorithms, which combine different individual algorithms, have been introduced in previous studies, and it has been proved that an ensemble strategy can improve the prediction accuracy. However, the performance of these tools has not yet met most people's expectation. One reason for the low performance is failure to adapt to complicated and large data sets. Another existing problem is that fewer motif finding tools are available, and many of them are not maintained. Results: I present a novel and fast tool MCAT (Motif Combining and Association Tool) for de novo motif discovery by combining six state-of-the-art motif discovery tools (MEME, BioProspector, DECOD, XXmotif, Weeder, and CMF). In addition, I developed an innovative motif combining algorithm, VoteRank, which is a position based algorithm that votes, ranks, and combines candidate motifs. By testing against DNA sequences from budding yeast, fission yeast, human, fruit fly, and mouse, I showed that MCAT is able to identify exact match motifs in DNA sequences efficiently and achieves at least 30% improvement in prediction accuracy. I am thankful to all of my group members and former colleagues, Jeff Robertson, Zhen Guo, Christy Coghlan, and Jake Martinez for helping with the MCAT project, Doaa Altarawy for her advice at the beginning of my research, Haitham Elmarakeby for the Beacon project, and Xiao Liang for her valuable ideas and encouragement during my research.

Assigning Bus Delay and Predicting Travel Times using Automated Vehicle Location Data

Coghlan

Dabiri

Mayer

et al. 2019

Transportation Research Record: Journal of the Transportation R

The Washington Metropolitan Area Transit Authority (WMATA) operates 1,250 buses on 168 different routes between 10,600 bus stops to support around 370,000 passengers each day. Utilizing sensors on vehicles and analyzing their location and movements throughout an hour, trip, or day can provide valuable information to a transit authority as well as to the users of a transit system. This amount of information can be overwhelming, but utilizing big data techniques can empower the data and the transit agency. First, this paper develops a methodology for assessing previous delays in the system by applying big data structure and statistical analysis to the data constantly collected by WMATA buses. This method of analysis also helps quantify the impact of potential transit system improvements. Second, the paper describes a model that uses the real-time data, that represents potential delays, to provide future passengers with more accurate arrival predictions despite delays. These analyses are powerful tools for agencies and planners to assess and improve transit service performance using big data analytics and real-time predictions.

VTLeeLab/LinguaFranca-FSE19: Artifact for the Lingua Franca paper, appearing at ESEC/FSE'19

Davis

Michael

Coghlan

et al. 2019
