API design for machine learning software: experiences from the scikit-learn project

Buitinck, Lars; Louppe, Gilles; Blondel, Mathieu; Pedregosa, Fabián; Müller, Andreas; Grisel, Olivier; Niculae, Vlad; Prettenhofer, Peter; Gramfort, Alexandre; Grobler, Jaques; Layton, Robert; Vanderplas, Jake; Joly, Arnaud; Holt, Brian D.; Varoquaux, Gaël

doi:10.48550/arxiv.1309.0238

Cited by 230 publications

(234 citation statements)

References 11 publications

Supporting

Mentioning

231

Contrasting

Unclassified


“…Parallelization for multi-core execution is also available for a set of algorithms using joblib. Inspired by scikit-learn's API design (Buitinck et al, 2013), all implemented outlier detection algorithms inherit from a base class with the same interface: (i) fit processes the train data and computes the necessary statistics; (ii) decision_function generates raw outlier scores for unseen data after the model is fitted; (iii) predict returns a binary class label corresponding to each input sample instead of the raw outlier score and (iv) predict_proba offers the result as a probability using either normalization or Unification (Kriegel et al, 2011). Within this framework, new models are easy to implement by taking advantage of inheritance and polymorphism.…”

Section: Library Design and Implementation (mentioning)

confidence: 99%

PyOD: A Python Toolbox for Scalable Outlier Detection

Zhao, Nasrullah. 2019. Preprint.


PyOD is an open-source Python toolbox for performing scalable outlier detection on multivariate data. Uniquely, it provides access to a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API designed for use by both practitioners and researchers. With robustness and scalability in mind, best practices such as unit testing, continuous integration, code coverage, maintainability checks, interactive examples and parallelization are emphasized as core components in the toolbox's development. PyOD is compatible with both Python 2 and 3 and can be installed through Python Package Index (PyPI) or https://github.com/yzhao062/pyod.
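The four-method interface described in the citation statement above can be illustrated with a minimal, dependency-free sketch. This is not PyOD's code: `BaseDetector`, `ZScoreDetector`, the `contamination` parameter, and the trailing-underscore attributes are hypothetical names chosen for illustration, the scoring rule is a simple z-score, and only min-max normalization is shown for `predict_proba` (the Unification method of Kriegel et al. is not implemented here).

```python
from abc import ABC, abstractmethod
from statistics import mean, pstdev


class BaseDetector(ABC):
    """Hypothetical base class exposing the four methods named in the quote."""

    def __init__(self, contamination=0.1):
        # Assumed fraction of outliers in the training data (illustrative).
        self.contamination = contamination

    @abstractmethod
    def fit(self, X):
        """Process the training data and compute the necessary statistics."""

    @abstractmethod
    def decision_function(self, X):
        """Return one raw outlier score per sample (higher = more abnormal)."""

    def _finalize(self, train_scores):
        # Derive a label threshold and score range from the training scores.
        ordered = sorted(train_scores)
        cut = min(len(ordered) - 1, int((1 - self.contamination) * len(ordered)))
        self.threshold_ = ordered[cut]
        self.min_score_, self.max_score_ = ordered[0], ordered[-1]

    def predict(self, X):
        # Binary label per sample: 1 = outlier, 0 = inlier.
        return [1 if s >= self.threshold_ else 0 for s in self.decision_function(X)]

    def predict_proba(self, X):
        # Min-max normalization against the training score range, clipped to [0, 1].
        span = (self.max_score_ - self.min_score_) or 1.0
        return [
            min(1.0, max(0.0, (s - self.min_score_) / span))
            for s in self.decision_function(X)
        ]


class ZScoreDetector(BaseDetector):
    """Toy detector: score = largest absolute z-score over the features."""

    def fit(self, X):
        columns = list(zip(*X))
        self.means_ = [mean(c) for c in columns]
        self.stds_ = [pstdev(c) or 1.0 for c in columns]
        self._finalize(self.decision_function(X))
        return self

    def decision_function(self, X):
        return [
            max(abs(v - m) / s for v, m, s in zip(row, self.means_, self.stds_))
            for row in X
        ]


train = [[0.0], [0.1], [-0.1], [0.2], [-0.2], [0.0], [0.1], [-0.1], [0.05], [5.0]]
detector = ZScoreDetector(contamination=0.1).fit(train)
labels = detector.predict([[0.0], [100.0]])
probabilities = detector.predict_proba([[0.0], [100.0]])
```

A new detector in this sketch only has to supply `fit` and `decision_function`; `predict` and `predict_proba` are inherited, which is the inheritance-and-polymorphism benefit the statement refers to.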



“…The pipelines in the motivating examples are depicted in Figure 1, which follows the representation provided by Yang et al [71]. In this paper, we adapted the canonical definition of pipeline from Scikit-Learn pipeline specification [14,63], which is aligned with the ML models studied in the literature for fair classification tasks [3,8,10,26,27,71]. We are interested in investigating the fairness of the data preprocessing stages in the pipeline, which is depicted with grey boxes in Figure 1.…”

Section: Pipeline (mentioning)

confidence: 99%

“…A data transformer is a well-known algorithm or method to perform a specific operation such as variable encoding, feature selection, feature extraction, dimensionality reduction, etc. on the data [14]. For example, in the second motivating example, two transformers (PCA and SelectKBest) have been used.…”

Section: Pipeline (mentioning)

confidence: 99%


Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline

Biswas, Rajan. 2021.

Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Softw


In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes in the global fairness of the pipeline. We used the fairness composition to choose appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.

CCS Concepts: Software and its engineering → Software creation and management; Computing methodologies → Machine learning.
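The pipeline notion used in the statements above (an ordered set of data transformers followed by a classifier) can be sketched without any library. Everything below is hypothetical and written for illustration: `SimplePipeline` is not scikit-learn's `Pipeline`, and `MinMaxScaler`, `MeanGapSelector`, and `NearestMeanClassifier` are simplified stand-ins rather than the PCA and SelectKBest transformers mentioned in the citing paper.

```python
from statistics import mean


class MinMaxScaler:
    """Stand-in transformer: rescale every feature to the range [0, 1]."""

    def fit(self, X, y=None):
        columns = list(zip(*X))
        self.mins_ = [min(c) for c in columns]
        self.spans_ = [(max(c) - min(c)) or 1.0 for c in columns]
        return self

    def transform(self, X):
        return [
            [(v - lo) / span for v, lo, span in zip(row, self.mins_, self.spans_)]
            for row in X
        ]


class MeanGapSelector:
    """Stand-in feature selector: keep the k features whose per-class means differ most."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        labels = sorted(set(y))
        scores = []
        for column in zip(*X):
            class_means = [
                mean(v for v, t in zip(column, y) if t == label) for label in labels
            ]
            scores.append(max(class_means) - min(class_means))
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        self.keep_ = sorted(ranked[: self.k])
        return self

    def transform(self, X):
        return [[row[j] for j in self.keep_] for row in X]


class NearestMeanClassifier:
    """Final estimator: assign each sample to the class with the closest centroid."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [row for row, t in zip(X, y) if t == label]
            self.centroids_[label] = [mean(c) for c in zip(*rows)]
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: squared_distance(row, self.centroids_[lab]))
            for row in X
        ]


class SimplePipeline:
    """Chain transformers (fit/transform) and end with an estimator (fit/predict)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs, applied in order

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # each stage sees the previous output
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse fitted statistics; never refit here
        return self.steps[-1][1].predict(X)


X_train = [[0.0, 5.0], [1.0, 5.1], [9.0, 5.0], [10.0, 5.1]]
y_train = [0, 0, 1, 1]
pipe = SimplePipeline(
    [("scale", MinMaxScaler()), ("select", MeanGapSelector(k=1)), ("clf", NearestMeanClassifier())]
)
pipe.fit(X_train, y_train)
predictions = pipe.predict([[2.0, 5.05], [8.0, 5.0]])
```

Because every preprocessing stage is a separate object with its own fitted state, a single stage can be swapped or removed and the pipeline refit, which is the kind of per-stage comparison the citing paper's fairness evaluation relies on.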


“…The API closely follows that of scikit-learn [20] to make the package accessible to those with even basic knowledge of machine learning in Python [21]. The main object type in mvlearn is the estimator object, which is modeled after scikit-learn's estimator.…”

Section: API Design (mentioning)

confidence: 99%

mvlearn: Multiview Machine Learning in Python

Perry, Mischler, Guo, et al. 2020.

Preprint

Self Cite


As data are generated more and more from multiple disparate sources, multiview datasets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that of scikit-learn for increased ease-of-use. The package can be installed from Python Package Index (PyPI) or the conda package manager and is released under the Apache 2.0 open-source license. The documentation, detailed tutorials, and all releases are available at https://mvlearn.neurodata.io/.

