Data generators are applications that produce synthetic datasets, which are useful for testing data analytics applications such as machine learning algorithms and information visualization techniques. Each data generator takes a different approach to generating data. Consequently, each one has functionality gaps that make it unsuitable for some tasks (e.g., no way to create outliers or non-random noise). This paper presents a data generator application that aims to fill relevant gaps scattered across other applications, providing a flexible tool that helps researchers test their techniques exhaustively and in more diverse ways. The proposed system allows users to define and compose known statistical distributions to produce the desired outcome, and it visualizes the behavior of the data in real time so that users can check whether the data has the characteristics needed for efficient testing. This paper describes the tool's functionalities in detail, explains how to create datasets, and presents a usage scenario that illustrates the data creation process.

INDEX TERMS Synthetic dataset generator, benchmark datasets creation, data creation system.
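To make the idea of composing distributions concrete, the following is a minimal sketch of the general technique, not the proposed tool's interface. The function name `compose_dataset`, its parameters, and the specific distributions and constants are all hypothetical choices made for illustration. The sketch draws a base feature from a normal distribution, derives a second feature with a deterministic periodic offset (one possible form of non-random noise), and shifts a small fraction of rows to create outliers.

```python
import math
import random


def compose_dataset(n_rows, outlier_fraction=0.05, seed=0):
    """Return a list of (x, y, is_outlier) rows.

    Hypothetical illustration only: the distributions and constants
    below are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for i in range(n_rows):
        # Base feature: standard normal distribution.
        x = rng.gauss(0.0, 1.0)
        # Dependent feature: linear in x plus a deterministic sine term,
        # i.e. structured noise that depends on the row index.
        y = 2.0 * x + 0.5 * math.sin(i / 5.0)
        # Outliers: with a small probability, shift y by a large
        # uniformly distributed offset.
        is_outlier = rng.random() < outlier_fraction
        if is_outlier:
            y += rng.uniform(8.0, 12.0)
        rows.append((x, y, is_outlier))
    return rows


if __name__ == "__main__":
    data = compose_dataset(1000, outlier_fraction=0.05, seed=1)
    flagged = sum(1 for _, _, is_outlier in data if is_outlier)
    print(f"{len(data)} rows generated, {flagged} flagged as outliers")
```

Keeping the outlier flag alongside each row gives a technique under test a ground truth to be scored against, which is one reason synthetic data is convenient for benchmarking.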