Object skeletons are a useful cue for object detection, complementary to object contours, as they provide a structural representation that describes the relationship among object parts. However, object skeleton extraction in natural images is a very challenging problem, as it requires the extractor to capture both local and global image context to determine the intrinsic scale of each skeleton pixel. Existing methods rely on per-pixel multi-scale feature computation, which results in difficult modeling and high time consumption. In this paper, we present a fully convolutional network with multiple scale-associated side outputs to address this problem. By observing the relationship between the receptive field sizes of the sequential stages in the network and the skeleton scales they can capture, we introduce a scale-associated side output to each stage. We impose supervision on different stages by guiding the scale-associated side outputs toward groundtruth skeletons of different scales. The responses of the multiple scale-associated side outputs are then fused in a scale-specific way to localize skeleton pixels at multiple scales effectively. Our method achieves promising results on two skeleton extraction datasets and significantly outperforms other competitors.
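As a rough illustration of the fusion step described above, the following PyTorch sketch attaches a scale-associated side output to each backbone stage and fuses the upsampled responses with per-scale weights. The stage channel widths, the number of scale classes handled by each stage, and the learnable fusion weights are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: scale-associated side outputs per stage, fused per scale class.
# Stage channels, per-stage scale counts, and fusion weights are illustrative
# assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleAssociatedSideOutputs(nn.Module):
    def __init__(self, stage_channels=(128, 256, 512, 512), num_scales=4):
        super().__init__()
        # Assumption: stage i has a receptive field that covers scales up to i+1,
        # so its side output predicts background plus (i+1) scale classes.
        self.side_heads = nn.ModuleList(
            nn.Conv2d(c, i + 2, kernel_size=1)
            for i, c in enumerate(stage_channels)
        )
        # One learnable fusion weight per (stage, scale class) pair.
        self.fusion = nn.Parameter(torch.ones(len(stage_channels), num_scales + 1))

    def forward(self, stage_feats, out_size):
        # stage_feats: list of feature maps from the sequential backbone stages.
        fused = 0.0
        for i, (feat, head) in enumerate(zip(stage_feats, self.side_heads)):
            side = head(feat)                                   # (B, i+2, h, w)
            side = F.interpolate(side, size=out_size,
                                 mode="bilinear", align_corners=False)
            # Zero-pad the scale classes this stage cannot capture, then weight
            # each scale channel by the stage-specific fusion weight.
            pad = self.fusion.shape[1] - side.shape[1]
            side = F.pad(side, (0, 0, 0, 0, 0, pad))
            fused = fused + side * self.fusion[i].view(1, -1, 1, 1)
        return fused                                            # (B, num_scales+1, H, W)
```

In this sketch, the fused map would be passed through a softmax over the background and scale classes to localize skeleton pixels and their scales.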
This paper investigates approaches to building WaveNet vocoders with limited training data for voice conversion (VC). Current VC systems using statistical acoustic models often suffer from quality degradation of the converted speech, and one of the major causes is the use of hand-crafted vocoders for waveform generation. Recently, with the emergence of WaveNet for waveform modeling, speaker-dependent WaveNet vocoders have been proposed that can reconstruct speech with better quality than conventional vocoders such as STRAIGHT. Because training a WaveNet vocoder in a speaker-dependent way requires a relatively large training dataset, it remains a challenge to build a high-quality WaveNet vocoder for VC tasks when the training data of the target speakers is limited. In this paper, we propose to build WaveNet vocoders by combining initialization on a multi-speaker corpus with adaptation on a small amount of target data, and evaluate this method on the Voice Conversion Challenge (VCC) 2018 dataset, which contains approximately 5 minutes of recordings for each target speaker. Experimental results show that the WaveNet vocoders built with the proposed method outperform the conventional STRAIGHT vocoder. Furthermore, our system achieves an average naturalness MOS of 4.13 in VCC 2018, the highest among all submitted systems.
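To make the initialization-plus-adaptation recipe concrete, here is a minimal PyTorch sketch, assuming mu-law quantized waveforms and a toy dilated-conv stand-in for the actual WaveNet vocoder; the checkpoint name, model size, learning rate, and step count are hypothetical placeholders rather than the paper's settings.

```python
# Hedged sketch: initialize on a multi-speaker corpus, then adapt (fine-tune)
# on a few minutes of target speech. TinyWaveNet is a toy stand-in, not the
# actual WaveNet vocoder; all names and hyper-parameters are placeholders.
import torch
import torch.nn as nn


class TinyWaveNet(nn.Module):
    """Toy causal dilated-conv model over mu-law sample classes."""

    def __init__(self, classes=256, channels=64, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.embed = nn.Embedding(classes, channels)
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d, padding=d)
            for d in dilations
        )
        self.out = nn.Conv1d(channels, classes, kernel_size=1)

    def forward(self, x):                      # x: (B, T) integer mu-law codes
        h = self.embed(x).transpose(1, 2)      # (B, C, T)
        for conv in self.layers:
            h = torch.relu(conv(h)[..., : x.shape[1]])  # crop to stay causal
        return self.out(h)                     # (B, classes, T)


def adapt(model, target_loader, steps=5000, lr=1e-4):
    """Fine-tune a multi-speaker-initialized vocoder on limited target data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # small LR for adaptation
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _, wav in zip(range(steps), target_loader):  # wav: (B, T) long tensor
        logits = model(wav[:, :-1])            # predict the next sample
        loss = loss_fn(logits, wav[:, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


# model = TinyWaveNet()
# model.load_state_dict(torch.load("multispeaker_init.pt"))  # hypothetical checkpoint
# model = adapt(model, target_loader)  # target_loader: ~5 min of target speech
```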
Object skeletons are useful for object representation and object detection. They are complementary to the object contour and provide extra information, such as how object scale (thickness) varies among object parts. However, object skeleton extraction from natural images is very challenging, because it requires the extractor to capture both local and non-local image context in order to determine the scale of each skeleton pixel. In this paper, we present a novel fully convolutional network with multiple scale-associated side outputs to address this problem. By observing the relationship between the receptive field sizes of the different layers in the network and the skeleton scales they can capture, we introduce two scale-associated side outputs to each stage of the network. The network is trained by multi-task learning, where one task is skeleton localization, classifying whether a pixel is a skeleton pixel or not, and the other is skeleton scale prediction, regressing the scale of each skeleton pixel. Supervision is imposed at different stages by guiding the scale-associated side outputs toward the groundtruth skeletons at the appropriate scales. The responses of the multiple scale-associated side outputs are then fused in a scale-specific way to detect skeleton pixels at multiple scales effectively. Our method achieves promising results on two skeleton extraction datasets and significantly outperforms other competitors. Additionally, the usefulness of the obtained skeletons and scales (thickness) is verified in two object detection applications: foreground object segmentation and object proposal detection.
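The two side outputs per stage described above could be sketched as a pair of per-pixel heads trained jointly, one for skeleton localization and one for scale regression. In the sketch below, the channel counts, the loss weighting, and the restriction of the regression loss to skeleton pixels are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch: multi-task side output (localization classification + scale
# regression) and a joint loss. Weights and masking are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskSideOutput(nn.Module):
    def __init__(self, in_channels, num_scale_classes):
        super().__init__()
        # Task 1: skeleton localization (per-pixel classification).
        self.cls_head = nn.Conv2d(in_channels, num_scale_classes + 1, kernel_size=1)
        # Task 2: skeleton scale prediction (per-pixel regression).
        self.reg_head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat):
        return self.cls_head(feat), self.reg_head(feat)


def side_output_loss(cls_logits, scale_pred, cls_gt, scale_gt, reg_weight=0.5):
    """Joint loss: cross-entropy for localization plus a regression term
    evaluated on skeleton pixels only (background pixels have no scale label)."""
    loc_loss = F.cross_entropy(cls_logits, cls_gt)          # cls_gt: (B, H, W), 0 = background
    skel_mask = (cls_gt > 0).float().unsqueeze(1)            # (B, 1, H, W)
    reg_loss = (skel_mask * (scale_pred - scale_gt) ** 2).sum() / skel_mask.sum().clamp(min=1)
    return loc_loss + reg_weight * reg_loss
```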