The Japanese-English aligned Basic Travel Expression Corpus (BTEC) has served as a basic dataset for developing real-world Speech-to-Speech Translation (S2ST) systems in prior studies. This paper presents a detailed statistical analysis of the Bengali translation of the BTEC text and its phonetic transcriptions, aimed at developing English-Bengali speech translation applications in the travel domain. At different levels of the analysis hierarchy, the study examines the lexical and phonetic status of the corpus based on frequency spectra, estimated population size, coverage ratio, goodness of fit of the Large Number of Rare Events (LNRE) model, and transition patterns. The experimental observations provide insight into the sufficiency of the analyzed corpus with respect to the travel domain and for building the basic components of an English-Bengali S2ST system.

The Universal Speech Translation Advanced Research (U-STAR) Consortium is an international research consortium comprising 29 institutes from 23 countries (as of June 2014). It provides a uniform framework for pursuing collaborative research, sharing resources, and developing infrastructure for building web-based speech-to-speech translation (S2ST) applications from English to the native languages of the participating institutes. Under this framework, the Centre for Development of Advanced Computing, Kolkata, India, has since 2014 been developing the resources and components required for English-to-Bengali speech translation, focusing on the travel domain.
The Basic Travel Expression Corpus (BTEC) has been used as the primary common dataset for building the language-specific components of cross-language S2ST applications under the U-STAR framework, and the respective member institutes have translated it into several native languages. As reported in earlier studies, the BTEC, developed by the Advanced Telecommunication Research (ATR) laboratory, Japan, is an English-Japanese aligned corpus [1,2]. It is qualitatively well examined, consistent, and has wide coverage of basic travel and tourism expressions. These expressions are strongly oriented towards real-world conversations on topics such as transportation, travel activities, sightseeing, shopping, dining, accommodation, asking for help in trouble, basic queries, and greetings, which tourists travelling to other countries are likely to speak. In prior work, the corpus has been quantitatively evaluated and statistically analyzed in languages such as Japanese [2] and Hindi [3]. The statistical analysis presented in this paper is carried out on the Bengali translation of the BTEC and is reported here for the first time. By way of introduction, Bengali belongs to the eastern Indo-Aryan branch of the Indo-European family of languages. It is the official state language of the East Indian state of West Bengal and the national language of Bangladesh. According to Wikipedia, it is one of the most widely spoken languages, ranked seventh in the world.
II. PREPARING B EN...