Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics 2016
DOI: 10.18653/v1/s16-2016
Unsupervised Text Segmentation Using Semantic Relatedness Graphs

Abstract: Segmenting text into semantically coherent fragments improves readability of text and facilitates tasks like text summarization and passage retrieval. In this paper, we present a novel unsupervised algorithm for linear text segmentation (TS) that exploits word embeddings and a measure of semantic relatedness of short texts to construct a semantic relatedness graph of the document. Semantically coherent segments are then derived from maximal cliques of the relatedness graph. The algorithm performs competitively…
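The abstract outlines a pipeline of sentence relatedness graph, maximal cliques, and segments. The snippet below is a minimal sketch of that idea, not the authors' implementation: toy vectors and plain cosine similarity stand in for real word embeddings and the paper's relatedness measure of short texts, and a simplified rule turns maximal cliques into contiguous segments (adjacent sentences that share a maximal clique stay in one segment). The threshold and the toy corpus are illustrative assumptions.

```python
# Minimal sketch of graph-based linear text segmentation (not the authors' code).
import numpy as np
import networkx as nx


def sentence_vector(tokens, embeddings, dim=50):
    """Average the word vectors of a sentence (stand-in for real embeddings)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def relatedness(u, v):
    """Cosine similarity as a placeholder for the paper's relatedness measure."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def segment(sentences, embeddings, threshold=0.3):
    """Build a sentence relatedness graph and cut the document wherever no
    maximal clique covers two adjacent sentences."""
    vectors = [sentence_vector(s, embeddings) for s in sentences]

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if relatedness(vectors[i], vectors[j]) >= threshold:
                graph.add_edge(i, j)

    # Adjacent sentences that share a maximal clique stay in the same segment.
    covered = set()
    for clique in nx.find_cliques(graph):
        clique = sorted(clique)
        for a, b in zip(clique, clique[1:]):
            if b == a + 1:
                covered.add(a)

    segments, current = [], [0]
    for i in range(1, len(sentences)):
        if i - 1 in covered:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments


if __name__ == "__main__":
    # Toy "embeddings": two topics, each a base vector plus small noise.
    rng = np.random.default_rng(0)
    animal, finance = rng.normal(size=50), rng.normal(size=50)
    vocab = {w: animal + 0.1 * rng.normal(size=50) for w in ("cats", "dogs", "pets")}
    vocab.update({w: finance + 0.1 * rng.normal(size=50) for w in ("economy", "stocks", "markets")})

    doc = [["cats", "dogs"], ["dogs", "pets"], ["stocks", "markets"], ["economy", "stocks"]]
    print(segment(doc, vocab))  # expected: two segments, [[0, 1], [2, 3]]
```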

Cited by 77 publications (84 citation statements); References 17 publications.
“…Alemi and Ginsparg (2015) and Naili et al. (2017) studied how word embeddings can improve classical segmentation approaches. Glavaš et al. (2016) utilized semantic relatedness of word embeddings by identifying cliques in a graph.…”
Section: Related Work
confidence: 99%
“…By treating source code (https://stackoverflow.com) as a body of text akin to an NLP problem, we avoid any programming-language-specific challenges posed by other methods. Text segmentation has been researched more thoroughly than the source code analogue, with methods ranging from LDA [10], to semantic relatedness graphs [3], to deep learning approaches [1]. Of particular note is the use of bidirectional LSTMs to identify the breaks between segments of Wikipedia articles [8].…”
Section: Related Work
confidence: 99%
“…It calculates an error rate between 0 and 1 for predicting borders (0 indicates a perfect prediction), penalizing near-misses less than complete misses or extra borders. Depending on the problem types and data sets used, text segmentation approaches report near-perfect WindowDiff values of less than 0.01, while in other circumstances the error rate can exceed 0.6 [6]. A more recent adaptation of the WindowDiff metric is the WinPR metric [28].…”
Section: Style Breach Detection
confidence: 99%
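The last citation statement refers to the WindowDiff error rate (Pevzner and Hearst, 2002) commonly used to evaluate text segmentations. Below is a small, self-contained sketch of the metric; the 0/1 boundary encoding, the window indexing, and the example sequences are illustrative assumptions, since published implementations differ slightly in these conventions.

```python
# Sketch of the WindowDiff error rate for comparing two segmentations.
# Segmentations are 0/1 sequences, where 1 marks a boundary after that unit.

def windowdiff(reference, hypothesis, k):
    """Fraction of length-k windows in which the two segmentations
    disagree on the number of boundaries (0 = perfect, 1 = worst)."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have equal length")
    n = len(reference)
    if n <= k:
        raise ValueError("window size k must be smaller than the sequence length")

    errors = 0
    for i in range(n - k):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    gold = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    near_miss = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # one boundary shifted by one unit
    far_miss = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # boundaries far from the gold ones

    k = 3  # typically half the average gold segment length
    print(windowdiff(gold, near_miss, k))  # lower error: near-miss penalised lightly
    print(windowdiff(gold, far_miss, k))   # higher error: boundaries badly misplaced
```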