We investigate the identification and analysis of linguistic (lexico-grammatical) features that are characteristic of articles from a specific year of publication. Linguistic features differ from shallow features in that they represent authors' lexico-grammatical writing styles and do not rely on the well-known bag-of-words model. The current literature focusses on shallow features rather than on linguistic features, and existing methods for identifying linguistic features use well-known knowledge-structure based approaches. In contrast, we advance these existing methods by applying semantic clustering instead of knowledge-structure based approaches. For evaluation purposes, a linguistic feature-based prediction model is built to enable the automated assignment of articles to their years of publication. In a case study, the proposed methodology is applied to articles of the Springer book series 'Communications in Computer and Information Science' published from 2009 to 2013. The case study results show the feasibility of the proposed approach compared with a frequently used baseline.

Keywords: Scientific articles, Linguistic features, Latent semantic indexing, Text mining.
INTRODUCTION

We investigate the occurrence of linguistic (lexico-grammatical) features in articles to show that they can be used for assigning articles to their years of publication. The literature shows related approaches that can be used to assign articles to a pre-defined class. A domain-specific vocabulary (key words) is often used for this classification task. Different domains can be well distinguished by the distribution of specific key words, as shown by existing bag-of-words approaches [1]-[5]. Further, trend analysis and bibliometric research also show that key word distributions can be used to identify a time period [6]. Such analyses trace topic changes over time within a domain. Thus, these approaches can estimate an article's publication year based on the topics it covers.

The approaches mentioned above are based on shallow (bag-of-words) features. These stand in contrast to linguistic features, such as specific word class distributions, that indicate authors' lexico-grammatical writing styles. The literature also shows the possibilities of using linguistic features for classification. The work in [7] investigates the impact of linguistic features on different scientific disciplines and on different points in time. A further approach uses linguistic features for spam detection [8]. Both approaches are based on systemic functional linguistics, in which a knowledge-structure based classifier (e.g. a support vector machine) is used.

We provide a new approach that identifies articles' linguistic features and investigates their usage at different points in time. In contrast to previous work, clustering is used instead of classification. Text classification assigns a text to one of a set of pre-defined classes. Classes are normally defined so that they cover all known linguistic features that are expected to occur within the given texts. Text clusteri...