Attention-based encoder-decoder models have achieved impressive results on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This approach exploits the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch, without assuming prior knowledge such as alignments. However, such models are prone to overfitting, especially when the amount of training data is limited. Inspired by SpecAugment and BERT, in this paper we propose a semantic-mask-based regularization for training this kind of end-to-end (E2E) model. The idea is to mask the input features corresponding to a particular output token, e.g., a word or a word piece, to encourage the model to fill in the token based on contextual information. While this approach is applicable to the encoder-decoder framework with any type of neural network architecture, we study the Transformer-based model for ASR in this work. We perform experiments on the Librispeech 960h and TedLium2 datasets and achieve state-of-the-art performance on the test sets among E2E models.
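The masking operation itself is straightforward once a token-to-frame alignment is available. The sketch below is a minimal illustration, not the paper's implementation: the helper name `semantic_mask`, the masking probability, and the choice of the utterance-level mean as the fill value are all assumptions.

```python
import numpy as np

def semantic_mask(features, token_spans, mask_prob=0.15):
    """Mask all acoustic frames aligned to randomly selected output tokens.

    features:    (T, F) array of acoustic frames (e.g., log-mel filterbanks).
    token_spans: list of (start_frame, end_frame) pairs, one per output
                 token, obtained from a forced alignment.
    mask_prob:   probability of masking each token's span (assumed value).
    """
    masked = features.copy()
    fill = features.mean(axis=0)  # assumed fill value: utterance-level mean
    for start, end in token_spans:
        if np.random.rand() < mask_prob:
            masked[start:end] = fill
    return masked
```

Unlike SpecAugment's random time masks, each masked span here covers a whole token, so the decoder cannot recover it from partial acoustic evidence and must instead rely on linguistic context, which is the regularization effect the method targets.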
Zero-shot learning aims to recognize unseen categories by learning an embedding space between data samples and semantic representations. For large-scale datasets with thousands of categories, embedding vectors of category labels are often used as the semantic representation, since it is difficult to define semantic attributes for categories manually. To address the underutilization of prior knowledge in the construction of these embedding vectors, this paper first constructs a novel knowledge graph as a supplement to the basic WordNet graph, and then proposes a fast hybrid model, ARGCN-DKG (Attention-based Residual Graph Convolutional Network on Different types of Knowledge Graphs). By introducing residual and attention mechanisms and integrating different knowledge graphs, the model improves the accuracy of knowledge transfer between categories. Our model uses only a 2-layer GCN, pretrained image features, and category semantic features, so training can be completed in minutes on a single GPU, which may make it one of the fastest-training models for large-scale image recognition. Experimental results demonstrate that the ARGCN-DKG model achieves better results on large-scale datasets than state-of-the-art models.
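As a rough illustration of the architecture described above, the PyTorch sketch below implements a 2-layer GCN with residual connections, applied to several normalized adjacency matrices with learned attention weights fusing the per-graph outputs. All class and parameter names are hypothetical, and the exact forms of the residual connection and the attention over graphs are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGCNLayer(nn.Module):
    """One GCN layer with a residual connection: H' = relu(A_hat @ H @ W) + H."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, a_hat):
        # a_hat: (N, N) normalized adjacency matrix; h: (N, dim) node features
        return F.relu(a_hat @ self.linear(h)) + h

class ARGCNSketch(nn.Module):
    """2-layer residual GCN run over several knowledge graphs (e.g., the
    WordNet graph and the supplementary graph), with learned attention
    weights fusing the per-graph outputs."""
    def __init__(self, dim, num_graphs):
        super().__init__()
        self.layers = nn.ModuleList([ResidualGCNLayer(dim) for _ in range(2)])
        self.graph_logits = nn.Parameter(torch.zeros(num_graphs))

    def forward(self, word_emb, adjs):
        # word_emb: (N, dim) category label embeddings
        # adjs: list of (N, N) normalized adjacency matrices, one per graph
        outputs = []
        for a_hat in adjs:
            h = word_emb
            for layer in self.layers:
                h = layer(h, a_hat)
            outputs.append(h)
        weights = torch.softmax(self.graph_logits, dim=0)  # attention over graphs
        return sum(w * o for w, o in zip(weights, outputs))
```

In a typical GCN-based zero-shot pipeline, the fused output rows would serve as per-category classifier weights, trained to regress the pretrained image features of seen classes and then applied directly to unseen classes.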