This chapter presents a theoretical framework and preliminary results for manual categorization of explicit certainty information in 32 English newspaper articles. Our contribution is in a proposed categorization model and analytical framework for certainty identification. Certainty is presented as a type of subjective information available in texts. Statements with explicit certainty markers were identified and categorized according to four hypothesized dimensions: level, perspective, focus, and time of certainty.

The preliminary results reveal an overall promising picture of the presence of certainty information in texts, and establish its susceptibility to manual identification within the proposed four-dimensional certainty categorization analytical framework. Our findings are that the editorial sample group had a significantly higher frequency of markers per sentence than did the sample group of news stories. For editorials, high level of certainty, writer's point of view, and future and present time were the most populated categories. For news stories, the most common were high and moderate levels, directly involved third party's point of view, and past time. These patterns have positive practical implications for automation.

Keywords: certainty, certainty identification, certainty categorization model, subjectivity, manual tagging, natural language processing, linguistics, information extraction, information retrieval; uncertainty, doubt, epistemic comments, evidentials, hedges, hedging, certainty expressions; levels of certainty, point of view, annotating opinions; newspaper article analysis, analysis of editorials.

1 Analytical Framework
Introduction: What is Certainty Identification and Why is it Important?

The fields of Information Extraction (IE) and Natural Language Processing (NLP) have not yet addressed the task of certainty identification. It presents an ongoing theoretical and implementation challenge. Even though the linguistics literature has abundant intellectual investigations of closely related concepts, it has not yet provided NLP with a holistic certainty identification approach that would include clear definitions, theoretical underpinnings, validated analysis results, and a vision for practical applications. Unravelling the potential and demonstrating the usefulness of certainty analysis in an information-seeking situation is the driving force behind this preliminary research effort.

Certainty identification is defined here as an automated process of extracting information from certainty-qualified texts or individual statements along four hypothesized dimensions of certainty, namely:
• what degree of certainty is indicated (LEVEL),
• whose certainty is involved (PERSPECTIVE),
• what the object of certainty is (FOCUS), and
• what time the certainty is expressed (TIME).

Some writers consciously strive to produce a particular effect of certainty due to training or overt instructions. Others may do it inadvertently. A writer's certainty level may remain constant in a text and be unnoticed by...