With the growing availability of digitized text data, both public and private, there is a great need for effective computational tools that automatically extract information from texts. Because Chinese differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.

Due to the explosive growth of Internet technology and the public adoption of the Internet as a main cultural medium, a large amount of text data is now available. Extracting information from diverse text data to create new knowledge is increasingly attractive to researchers. Biomedical researchers can learn how diseases, symptoms, and other features are spatially, temporally, and ethnically distributed and associated with each other by mining research articles and electronic medical records. Marketers can learn what consumers say about their products and services by analyzing online reviews and comments. Social scientists can discover hot events from news articles, web pages, blogs, and tweets and infer the driving forces behind them.
Historians can extract information about historical figures from historical documents: who they were, what they did, and what social relationships they had with other historical figures.

For alphabet-based languages such as English, many successful learning methods have been proposed (see ref. 1 for a review). For character-based languages such as Chinese and other East Asian languages, effective learning algorithms are still limited. Chinese has a much larger "alphabet" and vocabulary than English: Zhonghua Zihai Dictionary (2) lists 87,019 distinct Chinese characters, of which 3,000 are commonly used; and the vocabulary of Chinese is an open set when named entities are included. Additionally, morphological variations in Latin-derived languages (e.g., uppercase or lowercase letters, tense and voice changes), which provide useful hints for text mining, do not exist in Chinese. Because there are no spaces between the characters of a Chinese sentence, deciphering its meaning involves significant ambiguity.

There are two critical challenges in processing Chinese texts: (i) word segment...