Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared with the generation of these domain-specific sounds, generating general sounds (such as sirens and gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds, as it only back-propagates through a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers the advantage of efficiently modelling long-range dependencies while retaining local fine-grained structures within sound clips. We evaluate our approach against SampleRNN on the UrbanSound8K dataset, using performance metrics that measure the quality and diversity of the generated sounds. Experimental results show that our method offers comparable quality and significantly better diversity.
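To make the idea concrete, below is a minimal, self-contained PyTorch sketch of the kind of vector-quantization bottleneck that discrete time-frequency representation learning typically relies on: continuous spectrogram features are snapped to entries of a learned codebook, and the resulting token sequence is what a class-conditioned generative prior could then model. The class name, codebook size, and loss weighting here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: maps continuous features to nearest codebook entries."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (illustrative value)

    def forward(self, z):
        # z: (batch, time, code_dim) continuous features from a spectrogram encoder
        # squared distances from every feature vector to every codebook vector
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        indices = d.argmin(-1)            # discrete token ids, (batch, time)
        z_q = self.codebook(indices)      # quantized features
        # codebook + commitment losses (standard VQ-VAE-style objective terms)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # straight-through estimator: gradients flow back to the encoder through z
        z_q = z + (z_q - z).detach()
        return z_q, indices, loss

# toy usage: quantize 2 clips of 100 time-frequency frames with 64-dim features
vq = VectorQuantizer()
z = torch.randn(2, 100, 64)
z_q, tokens, vq_loss = vq(z)
print(tokens.shape)  # torch.Size([2, 100]) -- a short token sequence a prior can model
```

Because each clip is compressed into a short discrete token sequence, a downstream sequence model can attend over the whole clip at once, which is one way such an approach can capture long-range structure more efficiently than sample-level autoregression.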
Thousands of ionic liquids (ILs) with the potential to efficiently dissolve hemicellulose were screened by COSMO-RS, and the best model of hemicellulose was constructed and verified. This screening method will play an important role in sustainable development.
Although the power conversion efficiency values of perovskite solar cells (PSCs) continue to be refreshed, they are still far from the theoretical Shockley-Queisser limit. Two major issues that limit further improvements in device efficiency need to be addressed: disordered crystallization of the perovskite and unbalanced interfacial charge extraction. Herein, we develop a thermally polymerized additive as the polymer template in the perovskite film, which forms monolithic perovskite grains and a unique “Mortise-Tenon” structure after spin-coating the hole-transport layer. Importantly, the high-quality perovskite crystals and the Mortise-Tenon structure suppress non-radiative recombination and balance interfacial charge extraction, resulting in enhanced open-circuit voltage and fill factor of the device. The PSCs achieve a certified efficiency of 24.55% and maintain >95% of their initial efficiency over 1100 h in accordance with the ISOS-L-2 protocol, as well as excellent endurance in the ISOS-D-3 accelerated aging test.
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.
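For intuition, here is a minimal, self-contained sketch of the general language-queried masking pattern that LASS-style systems follow: a text-query embedding modulates a spectrogram masking network, shown here with feature-wise (FiLM-style) conditioning. The module names, dimensions, and the specific conditioning mechanism are assumptions for illustration only; AudioSep's actual architecture and API are provided in the linked repository.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise modulation of audio features by a query embedding."""
    def __init__(self, query_dim, channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(query_dim, 2 * channels)

    def forward(self, features, query_emb):
        # features: (batch, channels, freq, time); query_emb: (batch, query_dim)
        scale, shift = self.to_scale_shift(query_emb).chunk(2, dim=-1)
        return features * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

class QueriedMasker(nn.Module):
    """Toy mask predictor: conv features, FiLM-conditioned on the query, sigmoid mask."""
    def __init__(self, query_dim=512, channels=32):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.film = FiLM(query_dim, channels)
        self.decode = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, mixture_spec, query_emb):
        h = torch.relu(self.encode(mixture_spec))
        h = self.film(h, query_emb)
        mask = torch.sigmoid(self.decode(h))   # (batch, 1, freq, time)
        return mask * mixture_spec             # masked magnitude estimate of the target

# usage with random tensors standing in for a mixture spectrogram and a text embedding
model = QueriedMasker()
mix = torch.rand(1, 1, 257, 128)    # |STFT| of an audio mixture
query = torch.randn(1, 512)         # e.g., an embedding of the caption "a siren wailing"
est = model(mix, query)
print(est.shape)                    # torch.Size([1, 1, 257, 128])
```

The key property this illustrates is that the separation network itself is source-agnostic: changing the natural language query changes the conditioning signal, and hence the predicted mask, without retraining a per-class model.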