Proceedings of the 18th ACM Conference on Information and Knowledge Management 2009
DOI: 10.1145/1645953.1646170

Text segmentation via topic modeling

Abstract: In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of the latent Dirichlet allocation (LDA) topic model to segment a text into semantically coherent segments. A major benefit of the proposed approach is that along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications like segment retrieval and discourse analysis. The new approach outperforms a standard baseli…

Cited by 76 publications (65 citation statements)
References 13 publications
“…For the evaluation on the Choi dataset, the GRAPHSEG algorithm made use of the publicly available word embeddings built from a Google News dataset. Both LDA-based models (Misra et al, 2009; Riedl and Biemann, 2012) and GRAPHSEG rely on corpus-derived word representations. Thus, we evaluated on the Manifesto dataset both the domain-adapted and domain-unadapted variants of these methods.…”
Section: Experimental Setting (mentioning)
confidence: 99%
“…(Sun, Li, Luo & Wu, 2008; Zhang, Kang, Qian & Huang, 2014; Rangel, Faria, Lima & Oliveira, 2016) use LDA on a corpus of segments, inter-segment cipher similarities via a Fisher kernel, and optimize segmentation via dynamic programming. (Misra, Yvon, Jose, & Cappe, 2009; Glavaš, Nanni & Ponzetto, 2016) use a document-level LDA model, treat sections as new documents and predict their LDA models, and so do segmentation via dynamic programming with probabilistic scores. It is together a challenge to look out the useful data from the large documents (Aggarwal & Zhai, 2012; Zhai, & Massung, 2016).…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…The traditional document cluster unit high-dimensional about texts. (Misra et al, 2009; Glavaš, Nanni & Ponzetto, 2016). The presence of logical structure clues within the document, scientific criteria and applied math similarity measures chiefly accustomed figure thematically coherent, contiguous text blocks in unstructured documents (Sun et al, 2008; Zhang et al, 2014; Rangel et al, 2016).…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…The proposed solutions differ widely in the way of calculating the sentence-pair similarity (i.e., topical cohesiveness). Measures based on word co-occurrence [2,9,10] and generative models [1,18,20,23] have been extensively studied. The determination of the segment boundaries may not only be purely based on the local sentence-pair similarities but also be based on the global information derived from the distribution of the lexical similarities of the far neighboring sentences [2,10].…”
Section: Related Work (mentioning)
confidence: 99%