Multilingual spoken language corpora are indispensable for research on areas of spoken language communication, such as speech-to-speech translation. The speech and natural language processing essential to multilingual spoken language research requires unified structure and annotation, such as tagging. In this study, we describe an experience with multilingual spoken language corpus development at our research institution, focusing in particular on speech recognition and natural language processing for speech translation of travel conversations. An integrated speech and language database, Spoken Language DataBase (SLDB) was planned and constructed. Basic Travel Expression Corpus (BTEC) was planned and constructed to cover a variety of situations and expressions. BTEC and SLDB are designed to be complementary. BTEC is a collection of Japanese sentences and their translations, and SLDB is a collection of transcriptions of bilingual spoken dialogs. Whereas BTEC covers a wide variety of travel domains, SLDB covers a limited domain, i.e., hotel situations. BTEC contains approximately 588k utterance-style expressions, while SLDB contains about 16k utterances. Machine-aided Dialogs (MAD) was developed as a development corpus, and both BTEC and SLDB can be used to handle MAD-type tasks. Field Experiment Data (FED) was developed as the evaluation corpus. We conducted an experiment, and based on analysis of our follow-up questionnaire, roughly half the subjects of the * ATR Spoken
In this paper, we propose a method for compiling travel information automatically. For the compilation, we focus on travel blogs, which are defined as travel journals written by bloggers in diary form. We consider that travel blogs are a useful information source for obtaining travel information, because many bloggers' travel experiences are written in this form. Therefore, we identified travel blogs in a blog database and extracted travel information from them. We have confirmed the effectiveness of our method by experiment. For the identification of travel blogs, we obtained scores of 38.1% for Recall and 86.7% for Precision. In the extraction of travel information from travel blogs, we obtained 74.0% for Precision at the top 100 extracted local products, thereby confirming that travel blogs are a useful source of travel information.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.