Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audiovisual example per class, it is shown a new set of unseen instances of these objects and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Although unimodal one-shot learning has been studied, where one labelled example in a single modality is given per class, this example motivates multimodal one-shot learning. Our main contribution is to formally define this task and to propose several baseline and advanced models. We use a dataset of paired spoken and visual digits to specifically investigate recent advances in Siamese convolutional neural networks. In 11-way cross-modal matching, our best Siamese model achieves twice the accuracy of a nearest neighbour model that uses pixel distance over images and dynamic time warping over speech.
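The nearest neighbour baseline mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a two-step lookup that the abstract does not spell out: the spoken query is first matched to the closest support-set utterance by dynamic time warping, and the image paired with that utterance is then matched to the closest candidate image by Euclidean pixel distance. The function names (`dtw_distance`, `pixel_distance`, `cross_modal_match`) and the toy data layout are hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of seq_a[:i] against seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def pixel_distance(img_a, img_b):
    """Euclidean distance between two flattened images (an assumed metric)."""
    return math.dist(img_a, img_b)


def cross_modal_match(query_speech, support_set, matching_images):
    """Return the index of the candidate image matched to the spoken query.

    support_set holds one (speech, image) pair per class (assumed layout).
    """
    # Step 1: nearest support utterance under DTW.
    _, support_image = min(
        support_set, key=lambda pair: dtw_distance(query_speech, pair[0])
    )
    # Step 2: nearest candidate image to the paired support image.
    return min(
        range(len(matching_images)),
        key=lambda k: pixel_distance(support_image, matching_images[k]),
    )
```

In this sketch no labels are used at any point: the support set only links one utterance to one image per class, which is the setting the abstract describes.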
In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models. For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs. The Triamese network uses a contrastive loss to reduce the distance between frames of the same predicted word type while increasing the distance between negative examples. For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs. We find that, on the two datasets considered here, the CAE outperforms the Triamese network. However, we show that a new hybrid correspondence-Triamese approach (CTriamese) consistently outperforms both the CAE and Triamese models in terms of average precision and ABX error rates on both English and Xitsonga evaluation data.
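The contrastive objective described above can be illustrated with a triplet hinge loss. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the abstract does not give the distance function or the margin, so cosine distance and `margin=0.25` are hypothetical choices, and the function names are made up for illustration.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_hinge_loss(anchor, positive, negative, margin=0.25):
    """Hinge loss over one (anchor, positive, negative) triplet of embeddings.

    The loss is zero once the negative is at least `margin` further from the
    anchor than the positive is; otherwise it grows linearly with the gap.
    """
    d_pos = cosine_distance(anchor, positive)  # same predicted word type
    d_neg = cosine_distance(anchor, negative)  # different word type
    return max(0.0, margin + d_pos - d_neg)
```

Here the anchor and positive would be aligned frames from a discovered word pair, and the negative a frame drawn from a different pair. Minimising the loss pulls the first two together and pushes the third away.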
Condition monitoring of machine tool inserts is important for increasing the reliability and quality of machining operations. Various methods have been proposed for effective tool condition monitoring (TCM), and it is now generally accepted that the indirect sensor-based approach is the best practical solution for reliable TCM. Furthermore, in recent years, neural networks (NNs) have been shown to successfully model the complex relationships between input feature sets of sensor signals and tool wear data. NNs have several properties that make them well suited to handling noisy and even incomplete data sets. There are several NN paradigms that can be combined to model static and dynamic systems. Another powerful method of modeling noisy dynamic systems is the hidden Markov model (HMM), which is commonly employed in modern speech-recognition systems. The use of HMMs for TCM was recently proposed in the literature. Although the results of these studies were quite promising, no comparative results for competing methods such as NNs are currently available. This paper presents a comparative evaluation of the performance of NNs and HMMs for a TCM application. The methods are applied to exactly the same data sets, obtained from an industrial turning operation. The advantages and disadvantages of both methods are described, which will assist the condition-monitoring community in choosing a modeling method for other applications.