Automatic text summarization is widely regarded as a highly difficult problem, partly because of the lack of large text summarization datasets. Because constructing large-scale summaries for full texts is very challenging, in this paper we introduce a large-scale Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public. This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text. We also manually tagged the relevance of 10,666 short summaries to their corresponding short texts. Based on the corpus, we apply a recurrent neural network to summary generation and achieve promising results, which not only show the usefulness of the proposed corpus for short text summarization research but also provide a baseline for further research on this topic.
Chromophoric water-soluble organic matter in atmospheric aerosols potentially plays an important role in aqueous reactions and light absorption by organics. The fluorescence and chemical-structural characteristics of the chromophoric water-soluble organic matter in submicron aerosols collected in urban, forest, and marine environments (Nagoya, Kii Peninsula, and the tropical Eastern Pacific) were investigated using excitation-emission matrices (EEMs) and a high-resolution aerosol mass spectrometer. Three types of water-soluble chromophores, two with fluorescence characteristics similar to those of humic-like substances (HULIS-1 and HULIS-2) and one with fluorescence characteristics similar to those of protein compounds (PLOM), were identified in atmospheric aerosols by parallel factor analysis (PARAFAC) of the EEMs. We found that the chromophore components of HULIS-1 and HULIS-2 were associated with highly oxygenated and less-oxygenated structures, respectively, which may provide a clue to understanding the chemical formation or loss of organic chromophores in atmospheric aerosols. Whereas HULIS-1 was ubiquitous in water-soluble chromophores across the different environments, HULIS-2 was abundant only in terrestrial aerosols, and PLOM was abundant in marine aerosols. These findings are useful for further studies regarding the classification and source identification of chromophores in atmospheric aerosols.
The present study used a combination of solvent and solid-phase extractions to fractionate organic compounds with different polarities from total suspended particulates in Nagoya, Japan, and their optical characteristics were obtained on the basis of their UV-visible absorption spectra and excitation-emission matrices (EEMs). The relationship between their optical characteristics and chemical structures was investigated based on high-resolution aerosol mass spectra (HR-AMS spectra), soft ionization mass spectra, and Fourier transform infrared (FT-IR) spectra. The major light-absorbing organics were the less polar organic fractions, which tended to have higher mass absorption efficiencies (MAEs) and lower wavelength-dependent Ångström exponents (Å) than the more polar organic fractions. Correlation analyses indicate that organic compounds with O and N atoms may contribute largely to the total light absorption and fluorescence of the organic aerosol components. The extracts from the aerosol samples were further characterized by classifying the EEM profiles using a PARAFAC model. Different fluorescence components in the aerosol organic EEMs were associated with specific AMS ions and with different functional groups from the FT-IR analysis. These results may be useful for determining and further classifying the chromophores in atmospheric organic aerosols using EEM spectroscopy.
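As a rough illustration of the Ångström exponent mentioned in the abstract above, the sketch below assumes the common power-law form in which absorbance scales as wavelength to the power of minus Å, and estimates Å from absorbance at two wavelengths. The function name, the choice of wavelengths, and the synthetic values are assumptions for illustration only; the abstract does not state how the authors computed Å.

```python
import math


def angstrom_exponent(abs_1, abs_2, wavelength_1_nm, wavelength_2_nm):
    """Two-wavelength estimate of the absorption Angstrom exponent.

    Assumes absorbance follows A(lambda) = K * lambda**(-AAE), so that
    AAE = -ln(A1 / A2) / ln(lambda1 / lambda2).
    All inputs must be positive and the two wavelengths must differ.
    """
    return -math.log(abs_1 / abs_2) / math.log(wavelength_1_nm / wavelength_2_nm)


# Synthetic (hypothetical) absorbances generated from an exact power law
# with exponent 5, only to show that the estimate recovers the exponent.
absorbance_300 = 300.0 ** -5
absorbance_600 = 600.0 ** -5
estimated = angstrom_exponent(absorbance_300, absorbance_600, 300.0, 600.0)
```

A lower exponent means absorbance falls off more slowly toward longer wavelengths, which is the sense in which the less polar fractions are described above.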
Drug-drug interaction (DDI) extraction, a typical relation extraction task in natural language processing (NLP), has long attracted great attention. Most state-of-the-art DDI extraction systems are based on support vector machines (SVMs) with a large number of manually defined features. Recently, convolutional neural networks (CNNs), a robust machine learning method that needs almost no manually defined features, have exhibited great potential for many NLP tasks. It is therefore worth employing CNNs for DDI extraction, which has not previously been investigated. We propose a CNN-based method for DDI extraction. Experiments conducted on the 2013 DDIExtraction challenge corpus demonstrate that a CNN is a good choice for DDI extraction. The CNN-based DDI extraction method achieves an F-score of 69.75%, which outperforms the existing best-performing method by 2.75%.
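For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a balanced F-score is usually derived from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical, and the exact evaluation protocol of the DDIExtraction challenge (for example, how scores are averaged over interaction types) is not described in the abstract, so this should not be read as the authors' scoring procedure.

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which also avoids
    division by zero for empty predictions or empty gold sets.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration: 8 correctly extracted interactions,
# 2 spurious extractions, and 2 missed interactions.
example_score = f_score(true_pos=8, false_pos=2, false_neg=2)
```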
For users' convenience, the source code for generating the profile-based proteins and for the multiple kernel learning is also provided at http://bioinformatics.hitsz.edu.cn/main/~binliu/remote/