House mice communicate through ultrasonic vocalizations (USVs), which are above the range of human hearing (>20 kHz), and several automated methods have been developed for USV detection and classification. Here we evaluate their advantages and disadvantages in a full, systematic comparison. We compared the performance of four detection methods: DeepSqueak (DSQ), MUPET, USVSEG, and the Automatic Mouse Ultrasound Detector (A-MUD). We also compared these methods with manual detection by human observers (treated as ground truth) and evaluated inter-observer reliability. All four methods had comparable rates of detection failure, though A-MUD outperformed the others in terms of true positive rates for recordings with low or high signal-to-noise ratios. We also systematically compared existing classification algorithms, which revealed the need for a new method. We therefore developed BootSnap, which automates the classification of USVs using supervised classification, bootstrapping on Gammatone spectrograms, and convolutional neural networks with Snapshot ensemble learning. It successfully classified calls into 12 types, including a new class of false positives used for detection refinement. BootSnap outperforms state-of-the-art tools, has improved generalizability, and is freely available for scientific use.
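
To make the detection metrics concrete, the following minimal sketch shows one way a detector's output could be scored against manual annotations. It is an illustration only: the temporal-overlap matching rule, the function names, and the example segments are assumptions introduced here and are not taken from the study's evaluation procedure.

```python
from typing import List, Tuple

# A segment is (start_s, end_s) in seconds. This representation is assumed
# for illustration; it is not the file format of any of the tools compared.
Segment = Tuple[float, float]


def overlaps(a: Segment, b: Segment) -> bool:
    """Return True if the two time intervals share any duration."""
    return a[0] < b[1] and b[0] < a[1]


def detection_scores(truth: List[Segment],
                     detected: List[Segment]) -> Tuple[float, float]:
    """Score detections against manually annotated (ground-truth) segments.

    Returns (true_positive_rate, false_discovery_rate):
      - true positive rate: fraction of annotated USVs overlapped by at
        least one detection;
      - false discovery rate: fraction of detections that overlap no
        annotated USV.
    The any-overlap matching rule is a simplifying assumption.
    """
    hits = sum(any(overlaps(t, d) for d in detected) for t in truth)
    false_pos = sum(not any(overlaps(d, t) for t in truth) for d in detected)
    tpr = hits / len(truth) if truth else 0.0
    fdr = false_pos / len(detected) if detected else 0.0
    return tpr, fdr


# Hypothetical example: four annotated calls, three detections.
truth = [(0.0, 0.1), (0.5, 0.6), (1.0, 1.1), (2.0, 2.1)]
detected = [(0.02, 0.09), (0.55, 0.7), (1.5, 1.6)]
tpr, fdr = detection_scores(truth, detected)
```

In this hypothetical example, two of the four annotated calls are overlapped by a detection, and one of the three detections matches no annotation. A stricter matching rule (for instance, requiring a minimum overlap fraction) would change the scores, so the rule should be stated alongside any reported rates.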