Finding compiler bugs via live code mutation

Sun, Chengnian; Le, Vu; Su, Zhendong

doi:10.1145/2983990.2984038

Cited by 114 publications

(48 citation statements)

References 22 publications


“…While very different from P in general, each such program should behave functionally identically to P when executed on input x; discrepancies indicate miscompilations. Follow-on tools, Athena [Le et al 2015a] and Hermes [Sun et al 2016a], extend the EMI idea using more advanced profiling and mutation techniques; we refer to the three tools collectively as EMI. To date, the project has enabled the discovery of more than 1,600 bugs in LLVM and GCC, of which about 550 are miscompilations.…”

Section: Fuzzers and Compiler Verification Tools Studied (mentioning)

confidence: 99%
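The check described in this statement (variants must agree with P on input x, and a discrepancy indicates a miscompilation) can be illustrated with a toy sketch. This is not the EMI, Athena, or Hermes implementation: the tuple-based mini-language, the `run` interpreter, and the deliberately buggy `buggy_compile` pass are hypothetical stand-ins. The sketch profiles a program on one input, randomly deletes statements that input never executes, and compares the "compiled" variant against the reference output.

```python
import random

# Hypothetical toy language (not C). A program is a list of statements:
#   ("set", var, k)         var = k
#   ("add", var, k)         var += k
#   ("if_pos", var, body)   run body (a list of statements) if var > 0
#   ("out", var)            append var's value to the output

def run(prog, x, trace=None):
    """Reference interpreter; records the path of each executed statement."""
    env, out = {"x": x}, []

    def exec_block(block, path):
        for i, stmt in enumerate(block):
            here = path + (i,)
            if trace is not None:
                trace.add(here)
            op = stmt[0]
            if op == "set":
                env[stmt[1]] = stmt[2]
            elif op == "add":
                env[stmt[1]] = env.get(stmt[1], 0) + stmt[2]
            elif op == "if_pos" and env.get(stmt[1], 0) > 0:
                exec_block(stmt[2], here)
            elif op == "out":
                out.append(env.get(stmt[1], 0))

    exec_block(prog, ())
    return out

def prune(block, executed, rng, path=()):
    """Randomly delete statements the profiled input never executed."""
    kept = []
    for i, stmt in enumerate(block):
        here = path + (i,)
        if here not in executed:
            if rng.random() < 0.5:
                continue                      # drop a dead statement
            kept.append(stmt)
        elif stmt[0] == "if_pos":
            kept.append(("if_pos", stmt[1], prune(stmt[2], executed, rng, here)))
        else:
            kept.append(stmt)
    return kept

def buggy_compile(block):
    """Stand-in optimizer with a planted bug: when it removes an empty
    `if_pos`, it also drops the statement that follows it."""
    out, skip = [], False
    for stmt in block:
        if skip:
            skip = False
            continue
        if stmt[0] == "if_pos":
            body = buggy_compile(stmt[2])
            if not body:
                skip = True                   # BUG: only the branch should go
                continue
            out.append(("if_pos", stmt[1], body))
        else:
            out.append(stmt)
    return out

def emi_check(prog, x, trials=20, seed=0):
    """Return a variant that exposes a miscompilation on input x, or None."""
    executed = set()
    expected = run(prog, x, executed)
    rng = random.Random(seed)
    for _ in range(trials):
        variant = prune(prog, executed, rng)
        assert run(variant, x) == expected    # equivalent on x by construction
        if run(buggy_compile(variant), x) != expected:
            return variant
    return None

prog = [
    ("set", "y", 1),
    ("if_pos", "x", [("add", "y", 10)]),      # body is dead when x <= 0
    ("add", "y", 1),
    ("out", "y"),
]
```

In this sketch the original program passes through `buggy_compile` unharmed on input 0, and only a pruned variant triggers the planted bug, which is the situation the variant generation is meant to provoke.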

“…Automated Compiler Testing. The idea of randomly generating or mutating programs to induce errors in production compilers and interpreters has a long history, with grammar- or mutation-based fuzzers having been designed to test implementations of languages such as COBOL [Sauder 1962], PL/I [Hanford 1970], FORTRAN [Burgess and Saidi 1996], Ada and Pascal [Wichmann 1998], and more recently C [Le et al 2014, 2015a; Nagai et al 2014; Nakamura and Ishiura 2016; Sun et al 2016a; Yang et al 2011; Yarpgen 2018], JavaScript and PHP [Holler et al 2012], Java byte-code [Chen et al 2016], OpenCL [Lidbury et al 2015], GLSL [Donaldson et al 2017; Donaldson and Lascu 2016] and C++ [Sun et al 2016b] (see also two surveys on the topic [Boujarwah and Saleh 1997; Kossatchev and Posypkin 2005]). Related approaches have been used to test other programming language processors, such as static analysers, refactoring engines [Daniel et al 2007], and symbolic executors [Kapus and Cadar 2017].…”

Section: Related Work (mentioning)

confidence: 99%

“…Regarding the fuzzers of our study, Orange3 takes the approach of generating programs with known results [Nagai et al 2014]; Csmith [Yang et al 2011] and Yarpgen [Yarpgen 2018] are intended to be applied for differential testing; while the equivalence modulo inputs family of tools [Le et al 2014, 2015a; Sun et al 2016a] as well as Orange4 [Nakamura and Ishiura 2016] represent a successful application of metamorphic testing (earlier explored with only limited success [Tao et al 2010]).…”

Section: Related Work (mentioning)

confidence: 99%
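Two of the oracle styles named in this statement, programs with known results and differential testing, can be contrasted in a small sketch. Nothing here reproduces Orange3, Csmith, or Yarpgen: the tuple expression format, `gen_known`, `evaluate`, the planted bug in `buggy_evaluate`, and the majority vote in `differential` are all assumptions made for illustration.

```python
import random
from collections import Counter

def gen_known(rng, depth=3):
    """Build a random expression together with its value, so the expected
    result is known by construction (no second implementation needed)."""
    if depth == 0 or rng.random() < 0.3:
        k = rng.randint(-9, 9)
        return k, k
    op = rng.choice("+*")
    left, lv = gen_known(rng, depth - 1)
    right, rv = gen_known(rng, depth - 1)
    return (op, left, right), (lv + rv if op == "+" else lv * rv)

def evaluate(e):
    """Stand-in for a correct compiler: evaluate the expression tree."""
    if isinstance(e, int):
        return e
    op, left, right = e
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

def buggy_evaluate(e):
    """Stand-in for a faulty compiler: 'folds' e * 1 to 1 instead of e."""
    if isinstance(e, int):
        return e
    op, left, right = e
    if op == "*" and right == 1:
        return 1                              # planted bug
    a, b = buggy_evaluate(left), buggy_evaluate(right)
    return a + b if op == "+" else a * b

def differential(e, impls):
    """Differential oracle: report implementations that disagree with the
    majority result. No expected value is required."""
    results = {name: f(e) for name, f in impls.items()}
    majority = Counter(results.values()).most_common(1)[0][0]
    return sorted(name for name, value in results.items() if value != majority)

impls = {"A": evaluate, "B": lambda e: evaluate(e), "C": buggy_evaluate}
```

The known-result oracle needs only one implementation under test, while the differential oracle needs several implementations but no precomputed answer; the metamorphic approach mentioned in the statement instead compares a program against transformed versions of itself.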


Compiler fuzzing: how much does it matter?

Marcozzi, Tang, Donaldson, et al. 2019. Proc. ACM Program. Lang.

Despite much recent interest in randomised testing (fuzzing) of compilers, the practical impact of fuzzer-found compiler bugs on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of miscompilation bugs in a mature compiler. We follow a rigorous methodology where the bug impact over the compiled application is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) whether such changes cause regression test suite failures, or whether we can manually find application inputs that trigger execution divergence due to such changes. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 bugs found by human users compiling real code or as a by-product of formal verification efforts. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for at least one package, in which case only a very small part of the binary is typically affected, yet causing two failures when running the test suites of all the impacted packages. User-reported and formal verification bugs do not exhibit a higher impact, with a lower rate of triggered bugs and one test failure. The manual analysis of a selection of the syntactic changes caused by some of our bugs (fuzzer-found and non fuzzer-found) in package assembly code, shows that either these changes have no semantic impact or that they would require very specific runtime circumstances to trigger execution divergence. CCS Concepts: • Software and its engineering → Compilers; Software verification and validation.


“…As common compilers require a program or a function as their input, we also categorize compiler testing [Pałka et al 2011; Sun et al 2016; Yang et al 2011] as a part of higher-order program testing. While existing works on compiler testing targeted to generate relatively sizable yet expressive programs to make divergent behaviors of several compilers, our goal is to detect a concise counterexample which causes a behavioral difference between two programs.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic and scalable detection of logical errors in functional programming assignments

Song, Lee. 2019. Proc. ACM Program. Lang.

This is remarkable because those test cases have been carefully designed, refined, and used in the course over the last three years. Moreover, we found that the test cases generated by our technique are more concise and easier to understand the cause of logical errors than the manual test cases. Our experiments demonstrate that our approach is more effective and efficient than an existing property-based test case generator QCheck, an OCaml version of QuickCheck [Claessen and Hughes 2000], without any human effort. Furthermore, we show that our approach is useful in the context of automated program repair. When we used our counterexample generation algorithm in combination with an existing repair system for functional programs [Lee et al. 2018b], the number of test-suite-overfitted patches reduced significantly.

Contributions. In this paper, we make the following contributions:

• We propose a technique for detecting logical errors in functional programming assignments, which combines enumerative search and symbolic execution in a novel way. Our approach is fully automatic and is able to handle functional features such as higher-order functions effectively.
• We conduct extensive evaluations with real students' submissions. The evaluation results demonstrate that our approach is effective both in error detection of real-world submissions and alleviating test-suite-overfitted patches in automated program repair.
• We provide our counterexample algorithm as a tool, called TestML. Our tool and benchmarks used in the experiments are publicly available.

2 MOTIVATING EXAMPLES

In this section, we motivate our technique with examples. We consider three programming exercises used in our undergraduate course on functional programming.

Example 1. Let us consider a programming exercise, where students are asked to write a function, called diff, which symbolically differentiates arithmetic expressions.
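The arithmetic expressions are defined as an OCaml datatype as follows:

type aexp =
  | Const of int
  | Var of string
  | Power of (string * int)
  | Sum of aexp list
  | Times of aexp list

\[
\frac{\Delta' = \mathit{unify}(\Upsilon(l), \mathit{int}, \Delta) \qquad \text{new } \alpha_{\mathit{int}}}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle \alpha_{\mathit{int}}, \Delta'(\Upsilon), \Delta'(\Gamma), \Delta' \rangle}
\ \text{E-Num}
\]

\[
\frac{\Delta' = \mathit{unify}(\Upsilon(l), \mathit{string}, \Delta) \qquad \text{new } \alpha_{\mathit{string}}}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle \alpha_{\mathit{string}}, \Delta'(\Upsilon), \Delta'(\Gamma), \Delta' \rangle}
\ \text{E-Str}
\]

\[
\frac{c \in C \qquad \Lambda(c) = (\tau_1 * \tau_2) \to T \qquad \Delta' = \mathit{unify}(\Upsilon(l), T, \Delta) \qquad \text{new } l_1, l_2}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle c(l_1, l_2), \Delta'(\Upsilon[l_i \to \tau_i]_{i=1}^{2}), \Delta'(\Gamma[l_i \to \Gamma(l)]_{i=1}^{2}), \Delta' \rangle}
\ \text{E-Cnstr}
\]

\[
\frac{\Delta' = \mathit{unify}(\Upsilon(l), t_1 \to t_2, \Delta) \qquad \text{new } t_1, t_2, l'}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle \lambda x.\, l', \Delta'(\Upsilon[l' \to t_2]), \Delta'(\Gamma[l' \to \Gamma(l)[x \to t_1]]), \Delta' \rangle}
\ \text{E-Fun}
\]

\[
\frac{x \in \mathit{dom}(\Gamma(l)) \qquad \Delta' = \mathit{unify}(\Upsilon(l), \Gamma(l)(x), \Delta)}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle x, \Delta'(\Upsilon), \Delta'(\Gamma), \Delta' \rangle}
\ \text{E-Var}
\]

\[
\frac{\mathit{dom}(\Gamma(l)) \neq \emptyset \qquad \Delta' = \mathit{unify}(\Upsilon(l), \mathit{int}, \Delta) \qquad \text{new } l_1, l_2}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle l_1 \oplus l_2, \Delta'(\Upsilon[l_1 \to \mathit{int}, l_2 \to \mathit{int}]), \Delta'(\Gamma[l_1 \to \Gamma(l), l_2 \to \Gamma(l)]), \Delta' \rangle}
\ \text{E-Binop}
\]

\[
\frac{\mathit{dom}(\Gamma(l)) \neq \emptyset \qquad \Delta' = \mathit{unify}(\Upsilon(l), \mathit{string}, \Delta) \qquad \text{new } l_1, l_2}
     {\langle l, \Upsilon, \Gamma, \Delta \rangle \leadsto \langle l_1 \mathbin{\hat{\ }} l_2, \Delta'(\Upsilon[l_1 \to \mathit{string}, l_2 \to \mathit{string}]), \Delta'(\Gamma[l_1 \to \Gamma(l), l_2 \to \Gamma(l)]), \Delta' \rangle}
\]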
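The `diff` exercise and the idea of a concise counterexample can be made concrete with a small sketch. It is written in Python rather than OCaml, mirrors the `aexp` datatype as tagged tuples, and uses plain enumeration of small inputs only; it does not implement the combination of enumerative search and symbolic execution that the excerpt attributes to TestML. The names `evaluate`, `buggy_diff`, `small_inputs`, and `find_counterexample` are hypothetical, and exponents are assumed to be non-negative.

```python
from itertools import product

# Python mirror of the OCaml datatype, as tagged tuples:
#   ("Const", n) | ("Var", x) | ("Power", x, n) | ("Sum", [aexp]) | ("Times", [aexp])

def diff(e, x):
    """Reference solution: symbolic derivative of e with respect to x."""
    tag = e[0]
    if tag == "Const":
        return ("Const", 0)
    if tag == "Var":
        return ("Const", 1 if e[1] == x else 0)
    if tag == "Power":
        _, y, n = e
        if y != x or n == 0:
            return ("Const", 0)
        return ("Times", [("Const", n), ("Power", y, n - 1)])
    if tag == "Sum":
        return ("Sum", [diff(t, x) for t in e[1]])
    # Times: product rule, one summand per differentiated factor.
    ts = e[1]
    return ("Sum", [("Times", [diff(t, x)] + ts[:i] + ts[i + 1:])
                    for i, t in enumerate(ts)])

def buggy_diff(e, x):
    """Hypothetical student submission: differentiates each factor of a
    product and multiplies the results (ignores the product rule)."""
    if e[0] == "Times":
        return ("Times", [buggy_diff(t, x) for t in e[1]])
    if e[0] == "Sum":
        return ("Sum", [buggy_diff(t, x) for t in e[1]])
    return diff(e, x)

def evaluate(e, env):
    """Evaluate an aexp under a variable assignment."""
    tag = e[0]
    if tag == "Const":
        return e[1]
    if tag == "Var":
        return env[e[1]]
    if tag == "Power":
        return env[e[1]] ** e[2]
    if tag == "Sum":
        return sum(evaluate(t, env) for t in e[1])
    result = 1
    for t in e[1]:
        result *= evaluate(t, env)
    return result

def small_inputs():
    """Enumerate inputs smallest first, so the first hit is concise."""
    atoms = [("Const", 2), ("Var", "x"), ("Power", "x", 2)]
    yield from atoms
    for a, b in product(atoms, repeat=2):
        yield ("Sum", [a, b])
        yield ("Times", [a, b])

def find_counterexample(student, reference=diff):
    """Return the first small input on which the two derivatives disagree."""
    for e in small_inputs():
        for v in (0, 1, 2, 3):
            env = {"x": v}
            if evaluate(student(e, "x"), env) != evaluate(reference(e, "x"), env):
                return e
    return None
```

For `buggy_diff`, the search returns `("Times", [("Const", 2), ("Var", "x")])`, that is, the expression 2 * x: the reference derivative evaluates to 2 while the buggy one evaluates to 0.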


“…Specifically, for a compiler, a program, and an input, it generates input-equivalent variants of the program by altering unexecuted statements and checks that these variants compiled by the compiler produce the same outputs. A recent extension of this approach [34] modifies executed statements by inserting random code guarded by a condition that is evaluated into false in the context of the given test, which can be considered as an application of value-based test-equivalence relation. Our dependency-based test-equivalence relation and the proposed composition of several analyses might be used to increase the effectiveness of compiler testing by synthesizing non-trivial input-equivalent program modifications.…”

Section: Related Work (mentioning)

confidence: 99%
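The extension described in this statement, inserting code into executed regions behind a condition that is false for the given test, can be sketched as follows. This is a Python toy, not the tool from the cited paper, which targets C compilers: the program `SRC`, the helpers `run`, `state_before`, and `mutate`, and the way the guard is synthesized from one observed variable value are all assumptions.

```python
import random

# A straight-line toy program over the input x; `result` is its output.
SRC = [
    "y = x * 2",
    "z = y + 3",
    "result = z - x",
]

def run(lines, x):
    """Execute the program on input x and return `result`."""
    env = {"x": x}
    exec("\n".join(lines), {}, env)
    return env["result"]

def state_before(lines, x, k):
    """Variable values observed just before line k when running on x."""
    env = {"x": x}
    exec("\n".join(lines[:k]), {}, env)
    return dict(env)

def mutate(lines, x, rng):
    """Insert guarded code at a point that input x actually executes."""
    k = rng.randrange(1, len(lines))          # a point inside live code
    state = state_before(lines, x, k)
    var = rng.choice(sorted(state))
    # The guard is false on input x, because var equals state[var] here.
    guard = f"if {var} != {state[var]!r}:"
    junk = f"    {var} = {var} * 31 + {rng.randint(1, 9)}"
    return lines[:k] + [guard, junk] + lines[k:]

variant = mutate(SRC, 5, random.Random(0))
```

On the profiled input the variant and the original must agree, so `run(variant, 5) == run(SRC, 5)` holds for any seed; on other inputs the inserted code may run and change the result, which is why the equivalence is only modulo the chosen input.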

Untitled

2018. TOSEM

Automated program repair is a problem of finding a transformation (called a patch) of a given incorrect program that eliminates the observable failures. It has important applications such as providing debugging aids, automatically grading student assignments, and patching security vulnerabilities. A common challenge faced by existing repair techniques is scalability to large patch spaces, since there are many candidate patches that these techniques explicitly or implicitly consider. The correctness criteria for program repair is often given as a suite of tests. Current repair techniques do not scale due to the large number of test executions performed by the underlying search algorithms. In this work, we address this problem by introducing a methodology of patch generation based on a test-equivalence relation (if two programs are "test-equivalent" for a given test, they produce indistinguishable results on this test). We propose two test-equivalence relations based on runtime values and dependencies, respectively, and present an algorithm that performs on-the-fly partitioning of patches into test-equivalence classes. Our experiments on real-world programs reveal that the proposed methodology drastically reduces the number of test executions and therefore provides an order of magnitude efficiency improvement over existing repair techniques, without sacrificing patch quality. CCS Concepts: • Software and its engineering → Automatic programming; Software testing and debugging; Dynamic analysis;
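The value-based test-equivalence idea in this abstract can be illustrated with a minimal sketch: candidate patches whose patched expression takes the same runtime value on a test are grouped, and the test is executed once per group instead of once per patch. The `program`, the candidate `patches`, and both helper functions are hypothetical; the sketch assumes the patched condition is evaluated exactly once per run, and it groups candidates up front rather than partitioning on the fly as the abstract describes.

```python
from collections import defaultdict

def program(x, cond):
    """Toy program with one patchable condition, evaluated once."""
    return "neg" if cond(x) else "nonneg"

# Candidate patches for the condition (hypothetical patch space).
patches = {
    "x < 0": lambda x: x < 0,
    "x <= 0": lambda x: x <= 0,
    "x < 1": lambda x: x < 1,
    "x != 0": lambda x: x != 0,
    "x > 0": lambda x: x > 0,
}

def partition(patches, test_input):
    """Group patches by the value their expression takes on this test."""
    classes = defaultdict(list)
    for name, cond in patches.items():
        classes[cond(test_input)].append(name)
    return dict(classes)

def passing_patches(patches, test_input, expected):
    """Run the test once per class; return passing patches and run count."""
    passing, runs = [], 0
    for value, names in partition(patches, test_input).items():
        runs += 1
        # One execution stands in for the whole class: every member makes
        # the condition evaluate to `value` on this test.
        if program(test_input, lambda _x, v=value: v) == expected:
            passing.extend(names)
    return passing, runs
```

With five candidates and a boolean condition, each test needs at most two executions in this sketch, for example `passing_patches(patches, -3, "neg")` runs the program twice and keeps four patches.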
