The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is Japan's first 100-million-word balanced corpus. It consists of three subcorpora (publication subcorpus, library subcorpus, and special-purpose subcorpus) and covers a wide range of text registers, including books in general, magazines, newspapers, governmental white papers, best-selling books, an internet bulletin board, a blog, school textbooks, minutes of the National Diet, publicity newsletters of local governments, laws, and poetry verses. A random sampling technique is used whenever possible in order to maximize the representativeness of the corpus. The corpus is annotated in terms of dual POS analysis, document structure, and bibliographical information. The BCCWJ is currently accessible in three different ways, including Chunagon, a web-based interface to the dual POS analysis data. Lastly, results of a pilot evaluation of the corpus with respect to textual diversity are reported. The analyses include POS distribution, word-class distribution, entropy of orthography, sentence length, and variation of the adjective predicate. High textual diversity is observed in all these analyses.
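The abstract does not define "entropy of orthography". One plausible reading, offered here only as an assumption, is the Shannon entropy of the written forms observed for a single lexeme: the more evenly a word is spread over kanji, hiragana, and katakana spellings, the higher the value. The function name `orthographic_entropy` and the sample data are hypothetical and do not come from the BCCWJ tools.

```python
import math
from collections import Counter
from typing import Iterable


def orthographic_entropy(variants: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the observed spellings of one lexeme.

    Assumption: each item in `variants` is one attested written form of the
    same lexeme, e.g. one entry per corpus occurrence.
    """
    counts = Counter(variants)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over variants of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical occurrences of the first-person pronoun in three scripts.
sample = ["私", "私", "わたし", "ワタシ"]
entropy_bits = orthographic_entropy(sample)
```

A lexeme that is always written the same way yields 0 bits under this reading, so a register with many competing spellings would show a higher average.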
Spoken monologues feature greater sentence length and structural complexity than do spoken dialogues. To achieve high parsing performance for spoken monologues, it could prove effective to simplify the structure by dividing a sentence into suitable language units. This paper proposes a method for dependency parsing of Japanese monologues based on sentence segmentation. In this method, the dependency parsing is executed in two stages: at the clause level and the sentence level. First, the dependencies within a clause are identified by dividing a sentence into clauses and executing stochastic dependency parsing for each clause. Next, the dependencies over clause boundaries are identified stochastically, and the dependency structure of the entire sentence is thus completed. An experiment using a spoken monologue corpus shows this method to be effective for efficient dependency parsing of Japanese monologue sentences.
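The two-stage procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence is assumed to be already split into clauses, each unit is represented by its index, every dependent takes a head to its right (as is usual for Japanese), and the hypothetical `score` callback stands in for the stochastic model. The sketch picks the best-scoring head greedily and does not enforce the non-crossing constraint that a real parser would need.

```python
from typing import Callable, Dict, List

# (dependent index, candidate head index) -> score; a stand-in for the
# probability model, which the abstract does not specify.
Score = Callable[[int, int], float]


def parse_two_stage(clauses: List[List[int]], score: Score) -> Dict[int, int]:
    """Return a map from each dependent unit index to its head index."""
    heads: Dict[int, int] = {}

    # Stage 1 (clause level): every unit except the clause-final one
    # chooses its head among the later units of the same clause.
    for clause in clauses:
        for pos, dep in enumerate(clause[:-1]):
            candidates = clause[pos + 1:]
            heads[dep] = max(candidates, key=lambda h: score(dep, h))

    # Stage 2 (sentence level): the final unit of each non-final clause
    # chooses its head among the units of the following clauses.
    for ci, clause in enumerate(clauses[:-1]):
        dep = clause[-1]
        candidates = [u for later in clauses[ci + 1:] for u in later]
        heads[dep] = max(candidates, key=lambda h: score(dep, h))

    return heads


def nearest_score(dep: int, head: int) -> float:
    """Toy scorer (assumption): prefer the closest candidate head."""
    return -float(head - dep)


# Seven units grouped into three clauses; unit 6 ends the sentence.
clauses = [[0, 1, 2], [3, 4], [5, 6]]
result = parse_two_stage(clauses, nearest_score)
```

Restricting stage 1 to candidates inside the clause is what keeps the search space small; only one unit per clause remains to be attached in stage 2.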
There is an ongoing debate over whether disfluency phenomena (such as filled pauses) are produced communicatively. Clark and Fox Tree (Cognition 84(1):73-111, 2002) propose that filled pauses are words, and that different forms signal different lengths of delay. This paper evaluates this Filler-As-Words hypothesis by analyzing the distribution of self-addressed questions, or SAQs (such as "what's the word"), in relation to filled pauses. We found that SAQs address different problems in different languages (most frequently memory retrieval in English and Chinese, and appropriateness in Japanese). In relation to filled pauses, British but not American English uses "um" to signal a more severe problem than "uh". Chinese uses different filled pauses to signal the syntactic category of the problem constituent. Japanese uses different filled pauses to signal levels of interaction with the interlocutor. Overall, our data support the Filler-As-Words hypothesis that filled pauses are used communicatively. However, the dimensions of their meaning vary across languages and dialects.