Abstract: In this study, we address the interesting task of classifying historical texts by their assumed period of writing. This task is useful in digital humanities studies, where many texts have unidentified publication dates. For years, the typical approach to temporal text classification was supervised, using machine-learning algorithms. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor that transforms the raw text into a feature vector from which the clas…
“…Traditional machine learning methods focus on statistical features and learning models, such as Naïve Bayes (Boldsen and Wahlberg, 2021), SVM (Garcia-Fernandez et al., 2011) and Random Forests (Ciobanu et al., 2013). Recent studies turn to deep learning methods, and experiments show their superior performance compared to traditional machine learning approaches (Kulkarni et al., 2018; Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019; Ren et al., 2022). Pre-trained models are also leveraged to represent texts for the dating task, such as Sentence-BERT (Massidda, 2020; Tian and Kübler, 2021) and RoBERTa.…”
Section: Related Work
mentioning, confidence: 99%
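For illustration only, here is a minimal sketch of such a classical supervised baseline for text dating (not the pipelines of the cited papers): TF-IDF features paired with the classifiers named above. The toy documents and period labels are placeholder assumptions.

# Minimal sketch of a classical supervised baseline for text dating:
# TF-IDF features fed to the classifier families named above.
# The toy corpus and period labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

texts = ["thou art most welcome", "the train departs at noon"]   # documents
periods = ["1600-1700", "1900-2000"]                              # period labels

for clf in (MultinomialNB(), LinearSVC(), RandomForestClassifier()):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(texts, periods)
    print(type(clf).__name__, model.predict(["whence comest thou"]))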
“…One is to learn word representations from diachronic documents. Current research on word representation either learns static word embeddings over the whole corpus (Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019), or learns dynamic word representations using pre-trained models (Tian and Kübler, 2021). However, neither of these takes into account the relation between time and word meaning.…”
Section: Introduction
mentioning, confidence: 99%
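As a hedged illustration of the static-versus-temporal contrast drawn above (not a method from the cited works), one simple way to expose diachronic change with static embeddings is to train a separate Word2Vec model per time slice, so the same word can receive different vectors in different periods. The tokenized slices below are toy placeholders.

# Sketch: one static Word2Vec model per time slice, so a word's vector can
# differ across periods. The tokenized slices are toy placeholders.
from gensim.models import Word2Vec

slices = {
    "19th_century": [["the", "carriage", "arrived"], ["a", "gay", "gathering"]],
    "21st_century": [["the", "car", "arrived"], ["a", "fun", "gathering"]],
}

models = {
    period: Word2Vec(sentences=sents, vector_size=50, window=3, min_count=1, epochs=20)
    for period, sents in slices.items()
}

# The vector of "gathering" now depends on which period's model is queried.
for period, model in models.items():
    print(period, model.wv["gathering"][:3])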
“…Initial work on neural network-based document modeling employs convolutional neural networks (CNNs) or recurrent neural networks (RNNs) (Liebeskind and Liebeskind, 2020; Yu and Huangfu, 2019), while recent research turns to pre-trained models such as BERT (Tian and Kübler, 2021) or RoBERTa. However, these studies treat time only as a prediction target rather than as a variable in modeling, which does not help capture the temporal characteristics of diachronic documents.…”
Automatic text dating (ATD) is a challenging task since explicit temporal mentions usually do not appear in texts. Existing state-of-the-art approaches learn word representations via language models, whereas most of them ignore the diachronic change of words, which may limit the effectiveness of text modeling. Meanwhile, few of them consider text modeling for long diachronic documents. In this paper, we present a time-aware language model named TALM, which learns temporal word representations by transferring language models of general domains to those of time-specific ones. We also build a hierarchical modeling approach that represents diachronic documents by encoding them with temporal word representations. Experiments on a Chinese diachronic corpus show that our model effectively captures implicit temporal information of words, and outperforms state-of-the-art approaches in historical text dating as well. Our code is available at: https://github.com/coderlihong/text-dating.
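The authors' implementation is in the linked repository; the following is only a hedged PyTorch sketch of the two ideas named in the abstract: a period embedding that conditions word representations on time, and a hierarchical words-to-sentences-to-document encoder ending in a period classifier. The dimensions, GRU encoders, and toy inputs are assumptions for illustration, not the TALM architecture.

# Hedged sketch (not the authors' TALM code): period-conditioned word
# embeddings plus a hierarchical document encoder for period classification.
import torch
import torch.nn as nn

class HierarchicalDater(nn.Module):
    """Words -> sentence vectors -> document vector -> period logits."""

    def __init__(self, vocab_size, num_periods, dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.period_emb = nn.Embedding(num_periods, dim)  # time as a modeling variable
        self.sent_enc = nn.GRU(dim, dim, batch_first=True)
        self.doc_enc = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_periods)

    def forward(self, doc_ids, period_id=None):
        # doc_ids: (num_sentences, max_words) token ids for one document.
        x = self.word_emb(doc_ids)
        if period_id is not None:
            # When the period is known (e.g. while adapting word representations
            # to a time slice), condition the word vectors on it.
            x = x + self.period_emb(period_id).view(1, 1, -1)
        _, sent_h = self.sent_enc(x)                   # (1, num_sentences, dim)
        sent_vecs = sent_h.squeeze(0)                  # (num_sentences, dim)
        _, doc_h = self.doc_enc(sent_vecs.unsqueeze(0))
        return self.classifier(doc_h.squeeze())        # logits over periods

# Toy usage: a 3-sentence document with 7 tokens per sentence, 5 candidate periods.
model = HierarchicalDater(vocab_size=1000, num_periods=5)
doc = torch.randint(0, 1000, (3, 7))
print(model(doc).shape)  # torch.Size([5])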
“…Existing NLP studies on historical documents primarily focus on tasks such as spelling normalization [18], [23], machine translation [24], and sequence labelling, including part-of-speech tagging [25] and named entity recognition [19], [26]. Recently, the success of deep neural networks has introduced new applications in this domain, including sentiment analysis [27], information retrieval [28], event extraction [29], [30], and text classification [31]. However, only a limited amount of research has been conducted on historical text summarization.…”
Section: Historical Natural Language Processing Applications
mentioning
In recent years, pre-trained language models (PLMs) have shown remarkable advancements in the extractive summarization task across diverse domains. However, there remains a lack of research specifically in the historical domain. In this paper, we propose a novel method for extractive historical single-document summarization that leverages the potential of a domain-aware historical bidirectional language model, pre-trained on a large-scale historical corpus. Subsequently, we fine-tune the language model specifically for the task of extractive historical single-document summarization. One major challenge for this task is the lack of annotated datasets for historical summarization. To address this issue, we construct a dataset by collecting archived historical documents from the Centre Virtuel de la Connaissance sur l'Europe (CVCE) group at the University of Luxembourg. Furthermore, to better learn the structural features of the input documents, we use a sentence position embedding mechanism that enables the model to learn the positional information of sentences. The overall experimental results on this historical dataset show that our method outperforms recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on extractive historical text summarization.
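The following is a hedged sketch (not the paper's model) of the sentence position embedding idea: sentence vectors, assumed here to be precomputed by some pre-trained encoder, are combined with a learned embedding of each sentence's position before a binary head scores the sentence for inclusion in the extractive summary. The dimensions and toy input are assumptions.

# Hedged sketch of an extractive head with sentence position embeddings.
# Sentence vectors from any PLM are assumed to be precomputed.
import torch
import torch.nn as nn

class ExtractiveHead(nn.Module):
    def __init__(self, dim=768, max_sentences=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sentences, dim)  # sentence position embeddings
        self.scorer = nn.Linear(dim, 1)

    def forward(self, sent_vecs):
        # sent_vecs: (num_sentences, dim) sentence representations from a PLM.
        positions = torch.arange(sent_vecs.size(0), device=sent_vecs.device)
        scores = self.scorer(sent_vecs + self.pos_emb(positions)).squeeze(-1)
        return scores  # one "belongs in the summary" logit per sentence

# Toy usage: 10 sentences with 768-dim vectors; pick the top-3 as the summary.
head = ExtractiveHead()
scores = head(torch.randn(10, 768))
print(scores.topk(3).indices)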
Research in computational textual aesthetics has shown that there are textual correlates of preference in prose texts. The present study investigates whether textual correlates of preference vary across different time periods (contemporary texts versus texts from the 19th and early 20th centuries). Preference is operationalized in different ways for the two periods, in terms of canonization for the earlier texts, and through sales figures for the contemporary texts. As potential textual correlates of preference, we measure degrees of (un)predictability in the distributions of two types of low-level observables, parts of speech and sentence length. Specifically, we calculate two entropy measures, Shannon Entropy as a global measure of unpredictability, and Approximate Entropy as a local measure of surprise (unpredictability in a specific context). Preferred texts from both periods (contemporary bestsellers and canonical earlier texts) are characterized by higher degrees of unpredictability. However, unlike canonicity in the earlier texts, sales figures in contemporary texts are reflected in global (text-level) distributions only (as measured with Shannon Entropy), while surprise in local distributions (as measured with Approximate Entropy) does not have an additional discriminating effect. Our findings thus suggest that there are both time-invariant correlates of preference, and period-specific correlates.
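As a hedged illustration of the two measures named above (not the study's exact settings), Shannon entropy can be computed over a categorical distribution such as part-of-speech frequencies, and Approximate Entropy over a sequence such as sentence lengths. The tolerance r, embedding dimension m, and toy data below are assumptions.

# Shannon entropy over a categorical distribution and Approximate Entropy
# over a sequence; parameter choices and toy data are assumptions.
import numpy as np

def shannon_entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))           # bits

def approximate_entropy(seq, m=2, r=None):
    x = np.asarray(seq, dtype=float)
    if r is None:
        r = 0.2 * x.std()                     # a common default tolerance
    def phi(m):
        n = len(x) - m + 1
        windows = np.array([x[i:i + m] for i in range(n)])
        # fraction of windows within tolerance r of each template window
        counts = [
            np.mean(np.max(np.abs(windows - w), axis=1) <= r) for w in windows
        ]
        return np.mean(np.log(counts))
    return phi(m) - phi(m + 1)

pos_counts = [120, 80, 60, 25, 15]            # e.g. NOUN, VERB, ADJ, ADV, PRON counts
sentence_lengths = [12, 31, 9, 27, 14, 40, 8, 22, 16, 35, 11, 29]
print(shannon_entropy(pos_counts))
print(approximate_entropy(sentence_lengths))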