Abstract. This paper explores differences between male and female writing in a large subset of the British National Corpus covering a range of genres. Several classes of simple lexical and syntactic features that differ substantially according to author gender are identified, both in fiction and in non-fiction documents. In particular, we find significant differences between male- and female-authored documents in the use of pronouns and certain types of noun modifiers: although the total number of nominals used by male and female authors is virtually identical, females use many more pronouns and males use many more noun specifiers. More generally, it is found that even in formal writing, female writing exhibits greater usage of features identified by previous researchers as "involved" while male writing exhibits greater usage of features which have been identified as "informational". Finally, a strong correlation between the characteristics of male (female) writing and those of nonfiction (fiction) is demonstrated.
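To make the feature classes concrete, the sketch below computes per-token rates of pronouns and noun specifiers, approximated here by Penn Treebank tags for pronouns (PRP, PRP$) and determiners/quantifiers (DT, PDT, CD). This is a minimal illustration, not the paper's actual feature extractor, and it assumes NLTK with its tokenizer and tagger models installed.

```python
# Minimal sketch (an approximation, not the paper's exact feature set):
# per-token rates of pronouns vs. noun specifiers.
# Assumes NLTK plus its 'punkt' and part-of-speech tagger models.
import nltk

def pronoun_specifier_rates(text: str) -> dict:
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    n = len(tags) or 1  # avoid division by zero on empty input
    pronouns = sum(tag in ("PRP", "PRP$") for tag in tags)
    specifiers = sum(tag in ("DT", "PDT", "CD") for tag in tags)
    return {"pronouns_per_token": pronouns / n,
            "specifiers_per_token": specifiers / n}

print(pronoun_specifier_rates("She said that the three old books were hers."))
```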
The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80% accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98% accuracy.
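The categorization step can be illustrated with a standard pipeline: counts of a fixed set of function words fed to a linear classifier. The word list, documents, labels, and learner below are toy placeholders, not the study's actual configuration.

```python
# Hedged illustration of the general approach (simple lexical features plus
# a linear text classifier); everything here is a toy placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

FUNCTION_WORDS = ["the", "a", "of", "with", "not", "she", "he", "it", "you"]

clf = make_pipeline(
    CountVectorizer(vocabulary=FUNCTION_WORDS),  # counts of fixed function words
    LogisticRegression(max_iter=1000),
)

docs = ["she said the cat was hers", "the report of the committee"]  # toy texts
labels = ["female", "male"]  # illustrative labels only
clf.fit(docs, labels)
print(clf.predict(["she wrote it for you"]))  # label for an unseen document
```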
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following a detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.
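A deliberately simple baseline conveys the shape of the verification variant: represent texts as character-trigram tf-idf vectors and accept authorship only if the questioned document is similar enough to the suspect's known writing. This is an illustration only, not the method developed in the paper; in practice the threshold would be tuned on held-out data.

```python
# Simple similarity-threshold baseline for authorship verification
# (one suspect, no closed candidate set). Illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def same_author(known_docs: list[str], questioned: str,
                threshold: float = 0.5) -> bool:
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))  # char trigrams
    matrix = vec.fit_transform(known_docs + [questioned])
    sims = cosine_similarity(matrix[-1], matrix[:-1])  # questioned vs. each known doc
    return float(sims.mean()) >= threshold

print(same_author(["a known writing sample", "another known sample"],
                  "a questioned document"))
```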
Little work to date in sentiment analysis (classifying texts by 'positive' or 'negative' orientation) has attempted to use fine-grained semantic distinctions in features used for classification. We present a new method for sentiment classification based on extracting and analyzing appraisal groups such as "very good" or "not terribly funny". An appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on Appraisal Theory. Semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. We classify movie reviews using features based upon these taxonomies combined with standard "bag-of-words" features, and report state-of-the-art accuracy of 90.2%. In addition, we find that some types of appraisal appear to be more significant for sentiment classification than others.
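The notion of an appraisal group can be made concrete with a toy scoring routine: an appraising adjective plus its preceding modifiers, with negators flipping orientation and intensifiers scaling its force. The lexicon values below are invented for illustration; the paper's semi-automatically built lexicon and Appraisal Theory taxonomies are far richer.

```python
# Toy sketch of appraisal-group scoring; lexicon values are invented.
ORIENTATION = {"good": 1.0, "funny": 1.0, "dull": -1.0}         # toy adjective lexicon
INTENSIFIERS = {"very": 1.5, "terribly": 1.5, "slightly": 0.5}  # toy modifier weights
NEGATORS = {"not", "never"}

def score_group(words: list[str]) -> float:
    score = ORIENTATION.get(words[-1], 0.0)  # the appraising adjective is the head
    for w in words[:-1]:                     # modifiers preceding the head
        if w in NEGATORS:
            score = -score                   # negation flips orientation
        elif w in INTENSIFIERS:
            score *= INTENSIFIERS[w]         # intensification scales force
    return score

print(score_group(["very", "good"]))              # 1.5
print(score_group(["not", "terribly", "funny"]))  # -1.5
```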
Imagine that you have been given an important text of unknown authorship, and wish to know as much as possible about the unknown author (demographics, personality, cultural background, etc.), just by analyzing the given text. This authorship profiling problem is of growing importance in the current global information environment; applications abound in forensics, security, and commercial settings. For example, authorship profiling can help police identify characteristics of the perpetrator of a crime when there are too few (or too many) specific suspects to consider. Similarly, large corporations may be interested in knowing what types of people like or dislike their products, based on analysis of blogs and online product reviews. The question we therefore ask is: How much can we discern about the author of a text simply by analyzing the text itself? It turns out that, with varying degrees of accuracy, we can say a great deal indeed. Unlike the problem of authorship attribution (determining the author of a text from a given candidate set), discussed recently in these pages by Li, Zheng, and Chen (2006), authorship profiling does not begin with a set of writing samples from known candidate authors. Instead, we exploit the sociolinguistic observation that different groups of people speaking or writing in a particular genre and in a particular language use that language differently (Chambers et al. 2004). That is, they vary in how often they use certain words or syntactic constructions (in addition to variation in, e.g., pronunciation or intonation). The particular profile dimensions we consider here are author gender (Argamon et al. 2003), age (Koppel et al. 2006), and native language.
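A rough sketch of the profiling setup: one shared feature representation feeds a separate classifier per profile dimension. All names, texts, and labels below are hypothetical placeholders, not the authors' system or data.

```python
# Sketch of multi-dimension author profiling; all data are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["first sample blog post", "second sample blog post"]  # placeholder corpus
profiles = {"gender": ["female", "male"], "age": ["teen", "adult"]}

vec = TfidfVectorizer()
X = vec.fit_transform(texts)  # one shared feature matrix

# one classifier per profile dimension, trained on the same features
models = {dim: LogisticRegression(max_iter=1000).fit(X, labels)
          for dim, labels in profiles.items()}

unseen = vec.transform(["an anonymous text of unknown authorship"])
print({dim: model.predict(unseen)[0] for dim, model in models.items()})
```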