Acquiring reservoir fluid samples through formation testers is critical to asset evaluation in most oil and gas drilling operations. Since this technology was introduced to the industry, the key challenges have been planning the job, estimating contamination during operation, and obtaining clean fluid samples in the shortest time possible. The objective of this paper is to create a new data-driven model that proactively simulates the cleaning process, providing a practical job-planning tool that optimizes fluid sampling. After detailed analysis of formation pumpout cleaning behavior and oil-well sampling, a parametric study was designed to model fluid behavior during sampling. Each simulation scenario uses a multi-component model with radial geometry, capable of handling complex reservoir rock, fluid composition, probe geometry, and sampling conditions. Compositional simulation output is then used to generate a comprehensive database of the fluid sampling and cleaning processes. The study is used to identify the parameters to which sampling and contamination are most sensitive. A full factorial experimental design was used to build nearly one hundred thousand scenarios with more than 10 relevant parameters. Outputs were analyzed through a variety of visualization and statistical techniques to understand cleaning behavior under different initial and operating conditions. One-factorial analysis and statistical tests, including analysis of variance (ANOVA), were used to determine the significance of the different parameters. The most influential parameters were selected and used as inputs to the representative model to predict pumpout volume and the corresponding contamination. In this work, multiple data-driven models, including neural network, random forest, and gradient boosting models, are presented.
Furthermore, multiple mathematical equations have been compared to fit the contamination trend, and methods of estimating their best-fit parameters are presented. Blind testing was performed to evaluate the developed models, with promising results. The workflow, database, and developed models can be used to perform forward modeling of sampling jobs in different reservoirs, drilling muds, and operating conditions for both wireline and logging-while-drilling (LWD) operations. This enables a practical job-planning tool with which a tool string can be optimized to reduce sampling time while improving sample quality. The state-of-the-art workflow, deployed in a commercial reservoir simulator, combines physics, programming, statistical analysis, and machine learning techniques to tackle the challenging problem of sampling. The workflow and data can be used during operations with various wireline formation testing (WFT) and LWD testing tools to optimize cleanup and sampling of formation fluids. Simulations of different realizations of reservoir properties, drilling mud invasion profiles, and cleanup operations also helped develop a useful and diverse pumpout database.
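As a rough illustration of two steps mentioned above, the sketch below enumerates a small full factorial set of scenarios and fits a contamination trend to pumped volume. It is not the model from this work: the parameter names and levels in `LEVELS`, the power-law form c = a·V^(-b), and the synthetic data are all assumptions chosen for illustration, and the fit uses ordinary least squares in log-log space rather than any method the paper reports.

```python
import itertools
import math

# Hypothetical parameters and levels (assumed for illustration only;
# the actual study used more than 10 parameters not listed here).
LEVELS = {
    "permeability_md": [1.0, 10.0, 100.0],
    "invasion_depth_in": [2.0, 6.0, 12.0],
    "pump_rate_cc_s": [5.0, 20.0],
}


def full_factorial(levels):
    """Return one dict per scenario, covering every combination of levels."""
    names = list(levels)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*levels.values())
    ]


def fit_power_law(volumes, contamination):
    """Fit c = a * V**(-b) by least squares in log-log space; return (a, b).

    The power-law form is an assumed candidate equation, not one
    confirmed by the source.
    """
    xs = [math.log(v) for v in volumes]
    ys = [math.log(c) for c in contamination]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# 3 * 3 * 2 combinations of the hypothetical levels above.
scenarios = full_factorial(LEVELS)

# Synthetic, noise-free contamination fractions generated from the
# assumed form, standing in for simulator output.
volumes = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
contamination = [0.8 * v ** -0.4 for v in volumes]
a, b = fit_power_law(volumes, contamination)
```

With a fitted trend of this kind, the pumped volume needed to reach a target contamination level could be estimated by inverting the equation, which is the sort of forward estimate a job-planning tool would need. Real pumpout data are noisy, so a nonlinear fit and a comparison of several candidate equations would be more appropriate in practice.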