DATGAN: Integrating expert knowledge into deep learning for synthetic tabular data

Lederrey, Gael; Hillel, Tim; Bierlaire, Michel

doi:10.48550/arxiv.2203.03489

Cited by 4 publications

(7 citation statements)

References 46 publications

“…• DATGAN [31] proposes DATGAN which is a novel architecture based on GAN using Directed Acyclic Graphs (DAGs) to model the information about the dataset. It uses LSTM cells to model expert knowledge using DAG.…”

Section: B Machine Learning Methods-based Models (mentioning)

confidence: 99%

Rigorous Experimental Analysis of Tabular Data Generated using TVAE and CTGAN

Yadav, Gaur, Madhukar et al. 2024

IJACSA

Synthetic data generation research has been progressing at a rapid pace, and novel methods are being designed regularly. Earlier, statistical methods were used to learn the distributions of real data and then sample synthetic data from those distributions. Recent advances in generative models have led to more efficient modeling of complex high-dimensional datasets. Also, privacy concerns have led to the development of robust models with lower risk of privacy breaches. Firstly, the paper presents a comprehensive survey of existing techniques for tabular data generation and evaluation metrics. Secondly, it elaborates on a comparative analysis of state-of-the-art synthetic data generation techniques, specifically CTGAN and TVAE, for small, medium, and large-scale datasets with varying data distributions. It further evaluates the synthetic data using quantitative and qualitative metrics/techniques. Finally, this paper presents the outcomes and also highlights the issues and shortcomings that still need to be addressed.

“…It is difficult for GAN to control the generation process of data-driven systems; therefore, integrating prior knowledge about data relationships and constraints can assist the generator in generating synopses that are realistic and meaningful. In order to implement this, DATGAN [38] incorporates expert knowledge into the GAN generator by matching the generator structure to the underlying data structure using a Directed Acyclic Graph (DAG). Using DAG, the nodes represent the columns of a data table, while the directed links between them allow the generator to determine the relationship between variables so that one column's generation influences another.…”

Section: Gan-based Tabular Generator (mentioning)

confidence: 99%
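The DAG idea in the statement above can be illustrated with a minimal sketch: columns are nodes, directed links point from a parent column to the columns it influences, and columns are generated in topological order so that each one can see its parents' values. Everything in this example is an assumption made for illustration: the column names, the `PARENTS` mapping, and the toy `sample_column` rules stand in for the learned per-column generator cells and are not taken from DATGAN itself.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical expert DAG: each column maps to the set of columns it depends on.
PARENTS = {
    "age": set(),
    "has_license": {"age"},
    "travel_mode": {"age", "has_license"},
}


def sample_column(name, parent_values, rng):
    """Toy stand-in for a learned per-column generator cell."""
    if name == "age":
        return rng.randint(10, 80)
    if name == "has_license":
        # Depends on the already generated parent column "age".
        return parent_values["age"] >= 18 and rng.random() < 0.8
    if name == "travel_mode":
        if parent_values["has_license"]:
            return rng.choice(["car", "bus"])
        return rng.choice(["bus", "walk"])
    raise KeyError(name)


def generate_row(parents, rng):
    """Generate one row, visiting columns so that parents come first."""
    row = {}
    for col in TopologicalSorter(parents).static_order():
        parent_values = {p: row[p] for p in parents[col]}
        row[col] = sample_column(col, parent_values, rng)
    return row


if __name__ == "__main__":
    print(generate_row(PARENTS, random.Random(0)))
```

Because the order follows the graph, a constraint such as "no license, no car" is enforced by construction rather than left for the model to discover from data.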

“…However, in AQP, it is not necessary to meet this threshold in order to generate realistic data synopses. DATGAN [38] uses the improved version of the Wasserstein loss function in WGAN [41] in addition to the Vanilla GAN loss function with a gradient penalty [42] and also adds the KL-divergence as an extra term to the original loss function. Both of these terms aim to minimize the difference between the probability distributions of real and generated data.…”

Section: Distribution Matching (mentioning)

confidence: 99%
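As a rough illustration of the KL-divergence term mentioned in the statement above, the sketch below compares the marginal distribution of one categorical column in real and generated data. It is only an assumed, simplified form: the helper names `marginal` and `kl_divergence`, the smoothing constant, and the example values are hypothetical, and the statement does not say how DATGAN weights this term or which variables it covers.

```python
import math
from collections import Counter


def marginal(values, categories, eps=1e-9):
    """Smoothed empirical distribution of `values` over a fixed category list."""
    counts = Counter(values)
    smoothed = [counts[c] + eps for c in categories]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    categories = ["car", "bus", "walk"]
    real = ["car", "car", "bus", "walk", "car", "bus"]
    fake = ["car", "bus", "bus", "walk", "walk", "bus"]
    penalty = kl_divergence(marginal(real, categories), marginal(fake, categories))
    print(f"KL penalty for this column: {penalty:.4f}")
```

In a training loop, a penalty of this kind would be added to the adversarial generator loss, so that the generator is pushed toward matching per-column distributions as well as fooling the discriminator.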

GAN-Based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions

Fallahian, Dorodchi, Kreth 2024

MAKE

In data-driven systems, data exploration is imperative for making real-time decisions. However, big data are stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data; this can be useful when an approximate answer to queries is acceptable in a fraction of the real execution time. This study explores the novel utilization of a Generative Adversarial Network (GAN) for the generation of tabular data that can be employed in AQP for synopsis construction. We thoroughly investigate the unique challenges posed by the synopsis construction process, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships, and we then introduce the advancement of tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored for assessing the reliability of GAN-generated synopses. Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.

“…They focus on the generation of high-dimensional discrete variables (binary and count features). Lederrey et al [34] proposed DATGAN model to generate population data. They combined expertise and deep learning methods and used directed acyclic graph to identify the relationships between variables.…”

Section: Preliminaries (mentioning)

confidence: 99%

CTTGAN: Traffic Data Synthesizing Scheme Based on Conditional GAN

Wang, Yan, Liu et al. 2022

Sensors

Most machine learning algorithms only have a good recognition rate on balanced datasets. However, in the field of malicious traffic identification, benign traffic on the network is far greater than malicious traffic, and the network traffic dataset is imbalanced, which makes the algorithm have a low identification rate for small categories of malicious traffic samples. This paper presents a traffic sample synthesizing model named Conditional Tabular Traffic Generative Adversarial Network (CTTGAN), which uses a Conditional Tabular Generative Adversarial Network (CTGAN) algorithm to expand the small category traffic samples and balance the dataset in order to improve the malicious traffic identification rate. The CTTGAN model expands and recognizes feature data, which meets the requirements of a machine learning algorithm for training and prediction data. The contributions of this paper are as follows: first, the small category samples are expanded and the traffic dataset is balanced; second, the storage cost and computational complexity are reduced compared to models using image data; third, discrete variables and continuous variables in traffic feature data are processed at the same time, and the data distribution is described well. The experimental results show that the recognition rate of the expanded samples is more than 0.99 in MLP, KNN and SVM algorithms. In addition, the recognition rate of the proposed CTTGAN model is better than the oversampling and undersampling schemes.
