Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, with each of these tasks involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of the models and learning methods for multimodal intelligence. The main focus is the combination of vision and natural language, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three angles: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embeddings constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures that integrate representations of unimodal signals for a particular task. On applications, selected areas of broad interest in the current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.
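To make the embedding idea above concrete, the sketch below is an illustrative example rather than anything taken from the review: it projects unimodal image and text features into a shared vector space and trains the projections with a contrastive objective, in the spirit of joint vision-language embeddings. The dimensions, names, and the use of PyTorch are all assumptions made for this example.

```python
# Illustrative sketch only: map two modalities into one shared embedding space
# so that matched image-text pairs are close and mismatched pairs are far apart.
import torch
import torch.nn.functional as F


class SharedEmbedding(torch.nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Linear projections map each modality into the same vector space.
        self.image_proj = torch.nn.Linear(image_dim, embed_dim)
        self.text_proj = torch.nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # L2-normalise so that the dot product is a cosine similarity.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt


def contrastive_loss(img, txt, temperature=0.07):
    # Similarity matrix over all image-text pairs in the batch;
    # matched pairs lie on the diagonal and are treated as the targets.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Usage with random features standing in for image- and text-encoder outputs.
model = SharedEmbedding()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = contrastive_loss(img, txt)
```

Once both modalities live in the same space, cross-modal retrieval reduces to nearest-neighbour search over the shared embeddings, which is one reason this kind of representation is reused across many downstream tasks.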
Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained from external contextual information. With only a small overhead in memory use and computation cost, TCPGen efficiently structures thousands of biasing words into a symbolic prefix tree and creates a neural shortcut between the tree and the final ASR output to facilitate the recognition of the biasing words. To enhance TCPGen, we further propose a novel minimum biasing word error (MBWE) loss that directly optimises biasing word errors during training, along with a biasing-word-driven language model discounting (BLMD) method at test time. All contextual ASR systems were evaluated on the public LibriSpeech audiobook corpus and on data from the dialogue state tracking challenges (DSTC), with the biasing lists extracted from the dialogue-system ontology. Consistent word error rate (WER) reductions were achieved with TCPGen, and these were particularly significant on the biasing words, with around 40% relative reductions in recognition error rates. MBWE and BLMD further improved the effectiveness of TCPGen and achieved even larger WER reductions on the biasing words. TCPGen also achieved zero-shot learning of words not in the audio training set, with large WER reductions on the out-of-vocabulary words in the biasing list.
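To illustrate the prefix-tree and neural-shortcut idea, the sketch below is a deliberately simplified, hypothetical stand-in and not the paper's implementation: it builds a symbolic prefix tree over biasing words and interpolates a model distribution with a distribution restricted to the valid tree continuations using a generation probability. In the actual TCPGen, the pointer distribution and the generation probability are produced by neural components over subword units; here characters and a fixed weight are used purely for illustration.

```python
# Conceptual sketch only: prefix tree over biasing words plus a pointer-generator
# style interpolation at one decoding step. Not the authors' code.

def build_prefix_tree(biasing_words):
    """Each node maps a token to a child node; '*' marks the end of a word."""
    root = {}
    for word in biasing_words:
        node = root
        for token in word:  # characters stand in for subword units here
            node = node.setdefault(token, {})
        node['*'] = True
    return root


def tcpgen_step(model_probs, tree_node, gen_prob):
    """Combine the model distribution with a distribution restricted to the
    tokens that are valid continuations in the prefix tree at this step."""
    valid = [t for t in tree_node if t != '*']
    valid_mass = sum(model_probs[t] for t in valid)
    combined = {}
    for token, p in model_probs.items():
        # Toy pointer distribution: renormalised model mass over valid tokens.
        ptr_p = p / valid_mass if token in valid and valid_mass > 0 else 0.0
        combined[token] = (1 - gen_prob) * p + gen_prob * ptr_p
    return combined


tree = build_prefix_tree(["turner", "turin"])
# At the start of a word, only 't' is a valid biasing continuation, so its
# probability is boosted relative to the unbiased model distribution.
probs = tcpgen_step({"t": 0.2, "a": 0.5, "b": 0.3}, tree, gen_prob=0.5)
```

Because the shortcut only redistributes probability towards tokens that can still complete a biasing word, the overhead grows with the tree depth rather than with the number of biasing words, which is what allows thousands of words to be handled efficiently.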