2022
DOI: 10.1145/3552490.3552496

Data Science Through the Looking Glass

Abstract: The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and de…

Cited by 17 publications (9 citation statements)
References 11 publications

“…(i.e., no deep neural networks). Traditional methods are the state-of-the-art over structured data [32], and it is still the more widely-used type of ML [25], [26]. Nevertheless, we did test the performance of a shallow neural network in Section 5.8.2.…”
Section: Background: ML Workflow
Citation type: mentioning
confidence: 99%
“…(e.g., in [25] we found that pipelines can have up to hundreds of operators); 2) models are often trained once and served many times (e.g., rendering of web pages based on users' profiles, batch prediction of asset prices based on historical data), and this pattern appears quite amenable for in-DBMS execution; 3) applications where prediction serving will likely be used (e.g., websites, smart BI dashboards) are often backed by a DBMS; 4) the top used operators in practical data science over tabular data are not compute-heavy neural networks, but rather memory-intensive operations (such as one-hot encoding or tree ensemble methods [25], [26]) which should benefit from in-DBMS execution; 5) when data already resides in a database, execution of in-DBMS predictions is a natural choice, whereas a different solution will require pulling the data out of the database. This not only is a path not always practicable, for instance, if for security reasons data cannot be moved outside the database, but it also causes performance costs, while making it difficult to enforce the "Enterprise-grade" features without resorting to bespoken solutions (and likely increasing the technical debt).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Traditional ML is most widely used. According to the latest Kaggle survey [32] and an analysis of publicly available Python notebooks [69], traditional ML algorithms, such as linear/logistic regression and tree-based models (decision trees, random forests, gradient boosting) are the most popular by a large margin. 80% of the Kaggle responders use them, as opposed to 43% for neural networks.…”
Section: Motivation
Citation type: mentioning
confidence: 99%
“…Trained pipelines. We evaluate Raven over four popular traditional ML model types [32,69], namely, logistic regression (LR), decision tree (DT), gradient boosting (GB), and random forest (RF). Each trained pipeline includes featurizers for numerical and categorical inputs: we normalize the former using standard scaling, and encode the latter using one-hot encoding [79,80].…”
Section: Experimental Evaluation
Citation type: mentioning
confidence: 99%