2004
DOI: 10.1007/978-3-540-24630-5_38
Two-Level Alignment by Words and Phrases Based on Syntactic Information

Cited by 3 publications (6 citation statements)
References 9 publications
“…For example, autoregressive (AR) TTS models [1,2,3] find alignments by themselves using attention mechanisms. On the contrary, non-autoregressive (NAR) TTS family [4,5,6] uses external alignment search algorithms [7,8] and phoneme-wise duration predictors for length regulation. As a sentence can be spoken in various ways, representing and controlling the diversity of speech are also crucial issues for TTS.…”
Section: Introduction
confidence: 99%
“…For practical applicability, we extend Fre-Painter to a two-stage TTS system. In a conventional two-stage TTS system, an acoustic model generates a Mel-spectrogram as an intermediate representation [68], [69], and a neural vocoder then synthesizes an audio waveform from the Mel-spectrogram. Additionally, if audio super-resolution is performed using models that take an audio waveform as input, a total of three stages are involved.…”
Section: F Text-to-Speech Synthesis With Audio Super-Resolution
confidence: 99%
“…Within this approach, text embeddings are duplicated according to their pre-determined durations to align with speech frames. For training, ground truth durations are obtained from the pairs of text and speech using external monotonic alignment algorithms [7], [8]. During inference, when ground truth durations are inaccessible, an explicit duration predictor infers durations from text representations instead.…”
Section: B Alignment Modeling in Neural TTS
confidence: 99%
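The duplication step described in the statement above (a length regulator, as used in duration-based NAR TTS models) can be sketched as follows. This is a minimal illustration, not code from any of the cited papers; the function name `length_regulate` and the use of NumPy are assumptions for the example.

```python
import numpy as np

def length_regulate(text_emb: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Duplicate each text embedding by its predicted duration (in frames).

    text_emb:  (T, D) array, one embedding per phoneme
    durations: (T,) integer array, frames assigned to each phoneme
    Returns a (sum(durations), D) frame-level sequence.
    """
    # np.repeat with axis=0 repeats row i durations[i] times,
    # expanding the phoneme sequence to speech-frame length.
    return np.repeat(text_emb, durations, axis=0)

# Example: 3 phonemes with embedding dim 2, durations 2 + 1 + 3 = 6 frames
emb = np.arange(6, dtype=float).reshape(3, 2)
dur = np.array([2, 1, 3])
frames = length_regulate(emb, dur)
print(frames.shape)  # → (6, 2)
```

During training the durations come from an external alignment search; at inference they are produced by the duration predictor mentioned in the excerpt.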
“…Attention-based AR models, like those proposed in [1]- [3], operate using an AR model that predicts speech in a frame-by-frame manner and utilizes an attention mechanism to establish alignment. In contrast, duration-based NAR models, such as [4]- [6], require phoneme-wise duration to regulate speech frame length and generate frames in parallel, necessitating external alignment search algorithms [7], [8] and explicit duration predictors to obtain durations.…”
Section: Introduction
confidence: 99%