2016
DOI: 10.1073/pnas.1516510113

On the unsupervised analysis of domain-specific Chinese texts

Abstract: With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-…

Cited by 23 publications (19 citation statements)
References 35 publications
“…Further research is needed to verify whether this result holds when a greater number of countries are included in the analysis. However, it potentially offers a powerful approach for rapidly assessing country commitments around the globe on a repeat-basis based on openly accessible data sources, provided language barriers in text-mining applications can be overcome [40] notably for Chinese characters [41]. Developing national-level, globally applicable measures such as these is critical for tracking national progress on SDG attainment and stimulating business buy-in [42].…”
Section: Semi and Fully Automated Methodologies (mentioning)
confidence: 99%
“…In this section, we propose a Domain Top-Words model. We introduce the Word Dictionary Model (Ge et al, 1999; Chang and Su, 1997; Cohen et al, 2007) and TopWords model proposed by Deng et al (2016) in subsection 3.1 and 3.2. Then we introduce our Domain TopWords model in subsection 3.3, 3.4 and 3.5.…”
Section: Methods (mentioning)
confidence: 99%
“…These supervised models cannot be used in domain-specific words detection directly, due to the lack of annotated domain-specific data. In addition, there are also some unsupervised models, such as Top-Words proposed by Deng et al (2016). However, it needs time-consuming post-processing to extract the second type of domain-specific words.…”
Section: Related Work (mentioning)
confidence: 99%
“…The frequent patterns are itemsets that appear in a dataset no less than a user-specified threshold. In recent years, frequent pattern mining has been introduced in NLP tasks as a necessary preprocessing method for extracting quality phrases from corpora [14,35]. In our work, the task of frequent pattern mining is conducted on the transaction dataset of T. Given the database T and the predefined threshold of minimum support = min_s, the object of frequent pattern mining is to collect all the n-itemsets whose frequency is larger than min_s into set FP(n)(T).…”
Section: Mining Frequent N-gram Word Strings (mentioning)
confidence: 99%
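
The frequent-pattern step described in the last statement can be illustrated with a minimal sketch. This is not the citing authors' implementation: the function name `frequent_ngrams`, the toy database `T`, and the choice to treat each transaction as a token list and each n-itemset as a contiguous n-gram are all assumptions made here for illustration. Support is taken as the number of transactions containing the n-gram, and the threshold is applied as "no less than" (`>=`), following the first sentence of the quoted passage.

```python
from collections import Counter


def frequent_ngrams(transactions, n, min_support):
    """Return {n-gram: support} for n-grams found in at least min_support transactions.

    Hypothetical helper: each transaction is a list of tokens, and an
    n-itemset is assumed to be a contiguous n-gram of tokens.
    """
    support = Counter()
    for tokens in transactions:
        # Use a set so each n-gram is counted at most once per transaction.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        support.update(grams)
    # Keep only n-grams whose support meets the threshold.
    return {gram: count for gram, count in support.items() if count >= min_support}


# Toy transaction database (illustrative data only).
T = [
    ["text", "mining", "method"],
    ["text", "mining", "tool"],
    ["chinese", "text", "mining"],
]

fp2 = frequent_ngrams(T, n=2, min_support=2)
print(fp2)
```

With this toy data, only the bigram ("text", "mining") occurs in at least two transactions, so it is the sole entry of the resulting set for n = 2.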