Abstract-In this paper, we propose a method for extracting topics of interest over the past 18 months from a closed-caption TV corpus. Each TV program is assigned one of the following genres: drama, informational or tabloid-style program, music, movie, culture, news, variety, welfare, and sport. We focus on dramas and informational/tabloid-style programs in this paper. As a result, we extracted words and bigrams that form part of a heroine's signature phrase and a hero's name in a popular drama.

Index Terms-Topic detection, spoken language corpus, closed-caption TV data, word frequency, Pearson's r.
I. INTRODUCTION

Corpora have become the most important resources for research and applications related to natural language, and a variety of studies and applications in corpus-based computational linguistics, knowledge engineering, and language education have been reported in recent years [1], [2]. Corpora are growing larger as machine-readable language resources such as Web pages, online newspapers, and social media increase.

Almost all existing corpora are "written language corpora," and only a few "spoken language corpora," such as the Corpus of Spontaneous Japanese (CSJ) [3], can be used for research purposes. Building a spoken language corpus generally requires recording and transcribing voice data. Collecting and maintaining a spoken language corpus therefore takes significantly more time and effort than a written corpus, which can be collected directly from Web pages, newspaper articles, and other written materials. Spoken language is the primary means of communication in most of our intellectual activities. In computational linguistics, social science, and language education, spoken corpora are highly significant as a fundamental data type, and collections of spoken language corpora are currently in great demand.

For our project, we are constructing a large-scale spoken language corpus from closed-caption data transmitted through digital terrestrial broadcasting [4]. Over 70% of the Manuscript