Accurate characterization of keyhole pool behavior is important for improving penetration recognition and quality detection during variable polarity plasma arc welding (VPPAW) of aluminum alloy. However, low-level hand-crafted visual and acoustic features often fail to represent the dynamic characteristics of the keyhole pool adequately under complex welding conditions. In this paper, we developed an end-to-end visual-acoustic penetration recognition (VAPR) framework based on a hybrid of a convolutional neural network (CNN) and an extreme learning machine (ELM). The framework consists of three consecutive phases: (1) visual-acoustic data preparation, (2) multi-feature extraction, and (3) penetration classification. Specifically, we used a flexible dual-sensor acquisition system to synchronously collect and preprocess the visual and acoustic signals, and we explored the correlation between the visual-acoustic features, the keyhole pool behavior, and the weld penetration status. We then established two separate CNNs to learn high-level visual and acoustic features directly from the raw visual images and the 2-D time-frequency spectrograms, respectively. By visualizing the features learned by the CNNs, we also interpreted the physical meaning of the visual-acoustic features. To further improve on the prediction performance of the CNN model, we employed an ELM as the classifier of penetration status. Compared with other state-of-the-art methods, our hybrid CNN-ELM approach with visual-acoustic fusion achieves higher classification accuracy. The VAPR framework can provide guidance for post-process quality inspection and may support adaptive control of the VPPAW process.
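As a rough illustration of the classification phase only, the following minimal sketch shows how an ELM can classify fused feature vectors: a fixed random hidden layer followed by output weights solved in closed form with ridge-regularized least squares. It is not the implementation used in this paper. The names `fuse_features`, `ELMClassifier`, `make_toy_features`, and `run_demo` are hypothetical. The feature dimensions, the three classes, the hidden-layer size, the regularization value, and the fusion by simple concatenation are all assumptions, and the synthetic vectors merely stand in for the outputs of the two CNNs.

```python
import numpy as np


def fuse_features(visual, acoustic):
    """Concatenate per-sample visual and acoustic feature vectors (assumed fusion)."""
    return np.concatenate([visual, acoustic], axis=1)


class ELMClassifier:
    """Single-hidden-layer ELM: random fixed hidden weights, closed-form output weights."""

    def __init__(self, n_hidden=64, reg=1e-2, seed=0):
        self.n_hidden = n_hidden  # assumed hidden-layer size
        self.reg = reg            # assumed ridge regularization strength
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden activations; W and b are never trained.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_features = X.shape[1]
        self.classes_ = np.unique(y)
        # Scale the random weights so that tanh does not saturate.
        self.W = self.rng.normal(size=(n_features, self.n_hidden)) / np.sqrt(n_features)
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Ridge solution: beta = (H^T H + reg * I)^(-1) H^T T
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


def make_toy_features(n_per_class=60, n_classes=3, dim_v=8, dim_a=4, seed=1):
    """Synthetic stand-ins for CNN feature vectors (not real welding data)."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(n_classes), n_per_class)
    class_ids = np.arange(n_classes)[:, None]
    # Each class gets a distinct, deterministic center in each modality.
    centers_v = 2.0 * (np.arange(dim_v)[None, :] % n_classes == class_ids)
    centers_a = 2.0 * (np.arange(dim_a)[None, :] % n_classes == class_ids)
    visual = centers_v[y] + 0.2 * rng.normal(size=(y.size, dim_v))
    acoustic = centers_a[y] + 0.2 * rng.normal(size=(y.size, dim_a))
    return visual, acoustic, y


def run_demo(seed=0):
    """Fuse the toy features, train on 70% of the samples, evaluate on the rest."""
    visual, acoustic, y = make_toy_features()
    X = fuse_features(visual, acoustic)
    order = np.random.default_rng(seed).permutation(len(y))
    n_train = (7 * len(y)) // 10
    train, test = order[:n_train], order[n_train:]
    model = ELMClassifier().fit(X[train], y[train])
    preds = model.predict(X[test])
    return float(np.mean(preds == y[test])), preds


if __name__ == "__main__":
    accuracy, _ = run_demo()
    print(f"held-out accuracy on toy data: {accuracy:.3f}")
```

Because only the output weights are fitted, training reduces to a single linear solve, which is one reason an ELM can be attractive as a classifier on top of learned features. The accuracy printed here reflects the toy data only and says nothing about performance on VPPAW signals.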