AMS: generating AutoML search spaces from weak specifications

Cambronero, José; Cito, Jürgen; Rinard, Martin

doi:10.1145/3368089.3409700

Cited by 12 publications

(10 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…AL [10] uses language models learned from human-written pipelines, in combination with aggressive dynamic evaluation of partial pipelines, to explore the pipeline space. AMS [7] mines constraints from corpora of human-written pipelines to help warmstart search-based AutoML like TPOT. SapientML shares AL and AMS's goal of learning from human-written pipelines.…”

Section: Related Workmentioning

confidence: 99%

“…Simply put, the pipeline is a sequence of ML operators that processes data to make it suitable for learning (feature engineering (FE)), fits a suitable ML model on it (model selection), and calculates the predictive performance of the model. One of the prominent instances of AutoML, the subject of much research recently, is creating supervised ML pipelines for tabular data [7,10,15,30,40,49,50]. This paper also focuses on this formulation of AutoML.…”

Section: Introductionmentioning

confidence: 99%

“…Indeed, program synthesis by mining or learning from existing program corpora has been successfully deployed for other endpoints of synthesis [3,25,28,29,33]. Emerging research [7,10] demonstrates the promise of this perspective for ML pipeline synthesis. Our work also follows this philosophy but offers a novel take on the core challenge of AutoML.…”

Section: Introductionmentioning

confidence: 99%

“…Some try to search a restricted search space by excluding FE from consideration [15,43,49], searching specific pipeline topologies [30], or a pre-compiled explicit corpus of synthetic pipelines [15,49,50]. Others try to prune the search space by using learned language models coupled with aggressive dynamic evaluation of partial pipelines [10] or by warm-starting search using constraints mined from human-written pipelines [7]. However, navigating the huge combinatorial search space of AutoML remains an open problem.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Saha,

Ura,

Mahajan

et al. 2022

Preprint

View full text Add to dashboard Cite

Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses meta-learning to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using a pipeline dataflow model derived from the corpus. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1,094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 4 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances. This difference is amplified on the 10 most challenging benchmarks, where Sapi-entML wins on 9 instances with the other tools failing to produce pipelines on 4 or more benchmarks.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Saha,

Ura,

Mahajan

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…A survey of techniques used in AutoML, such as hyperparameter tuning, model selection, and neural architecture search, can be found in Yao et al [100]. On the other hand, researchers and practitioners are increasingly realizing that AutoML does not solve all problems and that human factors such as design, monitoring, and configuration are still required [16,91,94,97]. In our experiments, we use an AutoML system to evaluate the performance of different feature sets without otherwise incorporating these powerful techniques into our framework.…”

Section: Testing and End-to-end MLmentioning

confidence: 99%

Enabling Collaborative Data Science Development with the Ballet Framework

Smith

Cito

et al. 2021

Proc. ACM Hum.-Comput. Interact.

Self Cite

View full text Add to dashboard Cite

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, the first lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to software and ML performance validation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.CCS Concepts: • Human-centered computing → Collaborative and social computing systems and tools; Empirical studies in collaborative and social computing; • Computing methodologies → Machine learning; • Software and its engineering → Collaboration in software development.

show abstract

Automated machine learning: past, present and future

Baratchi,

Wang,

Limmer

et al. 2024

Artif Intell Rev

View full text Add to dashboard Cite

Automated machine learning (AutoML) is a young research area aiming at making high-performance machine learning techniques accessible to a broad set of users. This is achieved by identifying all design choices in creating a machine-learning model and addressing them automatically to generate performance-optimised models. In this article, we provide an extensive overview of the past and present, as well as future perspectives of AutoML. First, we introduce the concept of AutoML, formally define the problems it aims to solve and describe the three components underlying AutoML approaches: the search space, search strategy and performance evaluation. Next, we discuss hyperparameter optimisation (HPO) techniques commonly used in AutoML systems design, followed by providing an overview of the neural architecture search, a particular case of AutoML for automatically generating deep learning models. We further review and compare available AutoML systems. Finally, we provide a list of open challenges and future research directions. Overall, we offer a comprehensive overview for researchers and practitioners in the area of machine learning and provide a basis for further developments in AutoML.

show abstract

AMS: generating AutoML search spaces from weak specifications

Cited by 12 publications

References 29 publications

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Enabling Collaborative Data Science Development with the Ballet Framework

Automated machine learning: past, present and future

Contact Info

Product

Resources

About