Security has become a critical concern both for complex, expensive systems and for day-to-day situations. In this regard, the analysis of surveillance camera footage is typically limited by the number of people devoted to the task and by their knowledge and judgment. In recent years, however, different approaches have emerged to automate this task. These approaches are mainly based on machine learning and benefit from neural networks capable of extracting underlying information from input videos. However competent such networks have proved to be, developers still face the challenging task of defining both the architecture and the hyperparameters that allow them to work adequately and to use computational resources efficiently. This work proposes a model that generates, through a genetic algorithm, neural networks for behavior classification in videos. Two types of neural network are evolved: shallow networks built on dense layers and deep networks built on 3D convolutional layers. Each type requires a particular kind of input: the evolution of people's poses in the video and raw video sequences, respectively. The shallow networks use a direct encoding approach, mapping each part of the chromosome to the phenotype. In contrast, the deep networks use indirect encoding, with blueprints representing entire networks and modules representing layers and their connections. Our approach obtained promising results when tested on the Kranok-NV dataset and evaluated with the standard metrics used for similar classification tasks.
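To make the direct-encoding idea concrete, the following is a minimal sketch of how a chromosome might map one-to-one onto the description of a shallow dense network. The gene layout and all names (`random_chromosome`, `decode`) are hypothetical assumptions for illustration only; the paper's actual representation may differ.

```python
import random

# Illustrative gene pool; the paper's actual search space is not specified here.
ACTIVATIONS = ["relu", "tanh", "sigmoid"]

def random_chromosome(max_layers=4, max_units=256):
    """Genes: [number of dense layers, units per layer..., activation index]."""
    n_layers = random.randint(1, max_layers)
    units = [random.randint(8, max_units) for _ in range(n_layers)]
    return [n_layers] + units + [random.randrange(len(ACTIVATIONS))]

def decode(chromosome):
    """Direct encoding: each gene is read back directly as one phenotype property."""
    n_layers = chromosome[0]
    units = chromosome[1:1 + n_layers]
    activation = ACTIVATIONS[chromosome[-1]]
    return {"layers": [{"units": u, "activation": activation} for u in units]}

if __name__ == "__main__":
    chromo = random_chromosome()
    print(chromo, "->", decode(chromo))
```

Under this scheme, crossover and mutation operate on the gene list itself, and every decoded individual is a complete network description; the indirect (blueprint-and-module) encoding used for the deep networks would instead compose reusable modules, which this sketch does not attempt to show.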