Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures

Doogan, Caitlin; Buntine, Wray L.

doi:10.18653/v1/2021.naacl-main.300

Cited by 30 publications

(19 citation statements)

References 64 publications

Section: Methods (mentioning)

confidence: 99%

“…For many of the learned thematic vectors we observed a strong degree of semantic overlap between the words/tokens (summarizing fitted topical vectors) and the ICD-9 diagnostic codes identified as being most strongly associated with the thematic vector. Subjectively, the following topical vectors demonstrated reasonable convergent/discriminant validity: (21,27,13,18,26,15,47,48,11,50,41,23,39,3,14,32,38,4,5,8,46,25,7,9,45). Below, we identified a subset of thematic vectors for which the words/tokens loading strongly on topical basis appeared semantically associated with assigned primary diagnostic codes, suggesting they may be measuring the same latent construct:…”

Section: Topic Model Summarization and Association Between Learned To... (mentioning)

confidence: 95%

“…This type of post-hoc expert informed inspection builds faith in the face validity of the learned model; however, it is criticized for lacking rigor compared with alternative approaches based on quantitative evaluation metrics. 12,13 Internal validation is another common approach for validating a fitted topic model (where model learned quantities are inspected with respect to internal robustness, stability, predictive, geometric or semantic properties). Several sensible internal validation schemes exist, for demonstrating topic model validity.…”

Section: Topic Model Validation (mentioning)

confidence: 99%

Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data

Meaney, Escobar, Stukel, et al. 2023. Health Informatics J

Background/Objectives: Unsupervised topic models are often used to facilitate improved understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish concurrent, convergent, and discriminant validity of learned topic models. Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada between 01/01/2017 and 12/31/2020. Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector, and qualitatively interpreted how these codes could be used for external validation of the learned topic model. Results: The DTM consisted of 382,666 documents and 2210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions). Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.
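
The association step described in this abstract (a Boolean-valued code against a continuous topical vector) can be illustrated with a small sketch. The abstract does not say which association measure was used, so the point-biserial correlation below is an assumption chosen for illustration. The function names, the labels `code_A` and `code_B`, and the toy data are all hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def point_biserial(flags, weights):
    """Point-biserial correlation between a Boolean indicator and a
    continuous value, one pair per document.

    Assumption: this measure stands in for the unspecified
    "magnitude of association" mentioned in the abstract.
    """
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    with_code = [w for f, w in zip(flags, weights) if f]
    without_code = [w for f, w in zip(flags, weights) if not f]
    spread = pstdev(weights)  # population standard deviation
    # The correlation is undefined if one group is empty or the weights
    # are constant; return 0.0 in that case to keep the ranking simple.
    if not with_code or not without_code or spread == 0:
        return 0.0
    p = len(with_code) / len(flags)
    return (mean(with_code) - mean(without_code)) / spread * (p * (1 - p)) ** 0.5


def top_codes_per_topic(code_flags, topic_weights, top_n=3):
    """For each topic, list the codes with the highest correlation.

    code_flags: dict mapping a code label to a list of booleans (one per document).
    topic_weights: list of topics, each a list of weights (one per document).
    """
    ranked = []
    for weights in topic_weights:
        scores = {code: point_biserial(flags, weights)
                  for code, flags in code_flags.items()}
        ranked.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return ranked


# Toy data: four documents, two hypothetical codes, two hypothetical topics.
code_flags = {
    "code_A": [True, True, False, False],
    "code_B": [False, False, True, True],
}
topic_weights = [
    [0.9, 0.8, 0.1, 0.2],  # topic 0 loads on the first two documents
    [0.1, 0.2, 0.7, 0.9],  # topic 1 loads on the last two documents
]
best = top_codes_per_topic(code_flags, topic_weights, top_n=1)
```

In this toy setup each topic is paired with the code whose documents it loads on most heavily. Judging whether such a pairing is semantically meaningful would still be a qualitative step, as the abstract notes.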

“…On the other hand, the topics discovered by ETM are more stable but have a lower coherence on average. As already observed in previous work (Al-Sumait et al, 2009; Doogan and Buntine, 2021), obtaining junk or mixed topics is common in topic models and this problem can be addressed by filtering out the topics that are less relevant.…”

Section: Qualitative Results (mentioning)

confidence: 85%

A Common Derivation for Parsing and Generation with Expectation-Based Minimalist Grammars (e-MGs)

Chesi. 2022. Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021

The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held fully virtually due to the Covid-19 health emergency, CLiC-it 2021 was the first opportunity for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown. Although the conference was held in dual mode, we strongly encouraged participants to attend in person in Milan. Indeed, we received strong feedback on this from the community, which was eager to meet in person and enjoy both the scientific and social events with colleagues. In total, 99 participants registered for the conference at the early registration fee, 91 of whom expressed their intention to attend in person, which we consider a very positive sign of enthusiasm from the community, given the uncertainty caused by the evolution of the pandemic in Italy. In total, we received 68 proposals, organized in the following specific tracks: Information Extraction,

“…2018; Hoyle et al. 2020; Doogan and Buntine 2021). Recent approaches to modelling short text datasets include the use of auxiliary metadata (Zhao et al.…”

Section: Introduction and Motivations (mentioning)

confidence: 99%

A systematic review of the use of topic models for short text social media analysis

2023

Recently, research on short text topic models has addressed the challenges of social media datasets. These models are typically evaluated using automated measures. However, recent work suggests that these evaluation measures do not inform whether the topics produced can yield meaningful insights for those examining social media data. Efforts to address this issue, including gauging the alignment between automated and human evaluation tasks, are hampered by a lack of knowledge about how researchers use topic models. Further problems could arise if researchers do not construct topic models optimally or use them in a way that exceeds the models’ limitations. These scenarios threaten the validity of topic model development and the insights produced by researchers employing topic modelling as a methodology. However, there is currently a lack of information about how and why topic models are used in applied research. As such, we performed a systematic literature review of 189 articles where topic modelling was used for social media analysis to understand how and why topic models are used for social media analysis. Our results suggest that the development of topic models is not aligned with the needs of those who use them for social media analysis. We have found that researchers use topic models sub-optimally. There is a lack of methodological support for researchers to build and interpret topics. We offer a set of recommendations for topic model researchers to address these problems and bridge the gap between development and applied research on short text topic models.
