A case study on machine learning for synthesizing benchmarks

Goens, Andrés; Brauckmann, Alexander; Ertel, Sebastian; Cummins, Chris; Leather, Hugh; Castrillón, Jerónimo

doi:10.1145/3315508.3329976

Cited by 8 publications

(5 citation statements)

References 28 publications

“…We analyze the dataset to explain this phenomenon and find CLgen generates a lot of comments, repeated dead statements and awkward nonhuman-like code such as multiple semi-colons. These results agree with the case study by Goens et al [14] that shows the AST depth distribution of CLgen's code is significantly narrower compared to code from GitHub or standard benchmarks.…”

Section: Analysis Of Benchpress and Clgen Language Models (supporting)

confidence: 91%

“…Its synthetic benchmarks improve the accuracy of Grewe's et al predictive model [16] by 1.27×. However, Goens et al [14] perform a case study and show evidence that CLgen's synthetic benchmarks do not improve the quality of training data and, consequently, performance of predictive models. They show that a predictive model in fact performs worse with synthetic benchmarks as opposed to human written benchmarks or code from GitHub.…”

Section: Analysis Of Benchpress and Clgen Language Models (mentioning)

confidence: 97%

“…The authors present the Grewe et al [16] heuristic model improved its performance by 1.27× when trained on their synthetic benchmarks. However, Goens et al [14] show that training with CLgen's synthetic samples lead to a slowdown compared to training on human-written benchmarks only. To explain this, they measure the AST depth of CLgen's samples and show it is 3× smaller compared to human-written benchmarks and code from GitHub and poor in features, therefore unrealistic.…”

Section: Related Work (mentioning)

confidence: 99%

“…There have been some recent generative approaches that leverage the rise of deep learning and language modeling to mitigate this shortage by automatically generating synthetic programs to enhance existing human-written benchmarks [1,5,7]. While they could provide elegant solutions to improve training data for predictive models, these synthetic benchmarks seem to be short, repetitive with little new features compared to existing benchmarks [14]. To generate programs, they either use static programming language specifications with fuzzing or sample programs from learnt distributions, e.g., machine learning algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

BenchPress

Tsimpourlas, Petoumenos, Xu, et al. 2022

Proceedings of the International Conference on Parallel Architectures and Compilation Techniques

The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near optimal heuristics with little human effort but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers to synthesize benchmarks into existing datasets. However, the synthetic programs are short, exceedingly simple and lacking diversity in their features. We develop BenchPress, the first ML compiler benchmark generator that can be directed within source code feature representations. BenchPress synthesizes executable functions by infilling code that conditions on the program's left and right context. BenchPress uses active learning to introduce new benchmarks with unseen features into the dataset of Grewe et al.'s CPU vs GPU heuristic, improving its acquired performance by 50%. BenchPress targets features that have been impossible for other synthesizers to reach. In 3 feature spaces, we outperform human-written code from GitHub, CLgen, CLSmith and the SRCIROR mutator in targeting the features of Rodinia benchmarks. BenchPress steers generation with beam search over a feature-agnostic language model. We improve this with BenchDirect, which utilizes a directed LM that infills programs by jointly observing source code context and the compiler features that are targeted. BenchDirect achieves up to 36% better accuracy in targeting the features of Rodinia benchmarks; it is 1.8× more likely to give an exact match and it speeds up execution time by up to 72% compared to BenchPress. Both our models produce code that is difficult to distinguish from human-written code. We conduct a Turing test which shows our models' synthetic benchmarks are labelled as 'human-written' as often as human-written code from GitHub.

“…The various design dimensions such as applied algorithm, architecture, or quantization span a large design space for cascaded classifiers. Adding the substantial variety of existing datasets and their complex feature space, benchmarking different solutions becomes a challenge by itself [7].…”

Section: Introduction (mentioning)

confidence: 99%

Cascaded Classifier for Pareto-Optimal Accuracy-Cost Trade-Off Using off-the-Shelf ANNs

Latotzke, Loh, Gemmeke 2021

Preprint


Machine-learning classifiers provide high quality of service in classification tasks. Research now targets cost reduction measured in terms of average processing time or energy per solution. Revisiting the concept of cascaded classifiers, we present a first-of-its-kind analysis of optimal pass-on criteria between the classifier stages. Based on this analysis, we derive a methodology to maximize accuracy and efficiency of cascaded classifiers. On the one hand, our methodology allows a cost reduction of 1.32× while preserving the reference classifier's accuracy. On the other hand, it allows scaling cost over two orders of magnitude while gracefully degrading accuracy. Thereby, the final classifier stage sets the top accuracy. Hence, the multi-stage realization can be employed to optimize any state-of-the-art classifier.
