Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods both in their accuracy and the ability to consider more difficult problems. Research into authorship attribution in the 19 ℎ century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this values had decreased to less than 500 words and in the early 21 century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs) which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter 1 . We show that the SCAP methodology performs extremely well on twitter messages and even with restrictions on the types of information allowed, such as the recipient of directed messages, still perform significantly higher than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy.
Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents.
Phishing fraudsters attempt to create an environment which looks and feels like a legitimate institution, while at the same time attempting to bypass filters and suspicions of their targets. This is a difficult compromise for the phishers and presents a weakness in the process of conducting this fraud. In this research, a methodology is presented that looks at the differences that occur between phishing websites from an authorship analysis perspective and is able to determine different phishing campaigns undertaken by phishing groups. The methodology is named USCAP, for Unsupervised SCAP, which builds on the SCAP methodology from supervised authorship and extends it for unsupervised learning problems. The phishing website source code is examined to generate a model that gives the size and scope of each of the recognized phishing campaigns. The USCAP methodology introduces the first time that phishing websites have been clustered by campaign in an automatic and reliable way, compared to previous methods which relied on costly expert analysis of phishing websites. Evaluation of these clusters indicates that each cluster is strongly consistent with a high stability and reliability when analyzed using new information about the attacks, such as the dates that the attack occurred on. The clusters found are indicative of different phishing campaigns, presenting a step towards an automated phishing authorship analysis methodology.
The Internet is a decentralized structure that offers speedy communication, has a global reach and provides anonymity, a characteristic invaluable for committing illegal activities. In parallel with the spread of the Internet, cybercrime has rapidly evolved from a relatively low volume crime to a common high volume crime. A typical example of such a crime is the spreading of spam emails, where the content of the email tries to entice the recipient to click a URL linking to a malicious Web site or downloading a malicious attachment. Analysts attempting to provide intelligence on spam activities quickly find that the volume of spam circulating daily is overwhelming; therefore, any intelligence gathered is representative of only a small sample, not of the global picture. While past studies have looked at automating some of these analyses using topic-based models, i.e. separating email clusters into groups with similar topics, our preliminary research investigates the usefulness of applying authorship-based models for this purpose. In the first phase, we clustered a set of spam emails using an authorship-based clustering algorithm. In the second phase, we analysed those clusters using a set of linguistic, structural and syntactic features. These analyses reveal that emails within each cluster were likely written by the same author, but that it is unlikely we have managed to group together all spam produced by each group. This problem of high purity with low recall, has been faced in past authorship research. While it is also a limitation of our research, the clusters themselves are still useful for the purposes of automating analysis, because they reduce the work needing to be performed. Our second phase revealed useful information on the group that can be utilized in future research for further analysis of such groups, for example, identifying further linkages behind spam campaigns.
PurposeEthnographic studies of cyber attacks typically aim to explain a particular profile of attackers in qualitative terms. The purpose of this paper is to formalise some of the approaches to build a Cyber Attacker Model Profile (CAMP) that can be used to characterise and predict cyber attacks.Design/methodology/approachThe paper builds a model using social and economic independent or predictive variables from several eastern European countries and benchmarks indicators of cybercrime within the Australian financial services system.FindingsThe paper found a very strong link between perceived corruption and GDP in two distinct groups of countries – corruption in Russia was closely linked to the GDP of Belarus, Moldova and Russia, while corruption in Lithuania was linked to GDP in Estonia, Latvia, Lithuania and Ukraine. At the same time corruption in Russia and Ukraine were also closely linked. These results support previous research that indicates a strong link between been legitimate economy and the black economy in many countries of Eastern Europe and the Baltic states. The results of the regression analysis suggest that a highly skilled workforce which is mobile and working in an environment of high perceived corruption in the target countries is related to increases in cybercrime even within Australia. It is important to note that the data used for the dependent and independent variables were gathered over a seven year time period, which included large economic shocks such as the global financial crisis.Originality/valueThis is the first paper to use a modelling approach to directly show the relationship between various social, economic and demographic factors in the Baltic states and Eastern Europe, and the level of card skimming and card not present fraud in Australia.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.