Advances in robotics are part of reducing the burden associated with manufacturing tasks in workers. For example, the cobot could be used as a “third-arm” during the assembling task. Thus, the necessity of designing new intuitive control modalities arises. This paper presents a foot gesture approach centered on robot control constraints to switch between four operating modalities. This control scheme is based on raw data acquired by an instrumented insole located at a human’s foot. It is composed of an inertial measurement unit (IMU) and four force sensors. Firstly, a gesture dictionary was proposed and, from data acquired, a set of 78 features was computed with a statistical approach, and later reduced to 3 via variance analysis ANOVA. Then, the time series collected data were converted into a 2D image and provided as an input for a 2D convolutional neural network (CNN) for the recognition of foot gestures. Every gesture was assimilated to a predefined cobot operating mode. The offline recognition rate appears to be highly dependent on the features to be considered and their spatial representation in 2D image. We achieve a higher recognition rate for a specific representation of features by sets of triangular and rectangular forms. These results were encouraging in the use of CNN to recognize foot gestures, which then will be associated with a command to control an industrial robot.