Companies regularly spend millions of dollars producing electronically stored documents in legal matters. Over the past two decades, attorneys have used a variety of technologies for this exercise, and most recently, parties on both sides of the 'legal aisle' have accepted the use of machine learning techniques like text classification to cull massive volumes of data and to identify responsive documents for use in these matters. While text classification is regularly used to reduce discovery costs in legal matters, it faces a peculiar perception challenge: amongst lawyers, the technology is sometimes looked upon as a "black box." Put simply, very little information is provided to help attorneys understand why documents are classified as responsive. In recent years, AI and machine learning researchers have been actively investigating explainable AI, in which a system's actions or decisions are human understandable. In legal 'document review' scenarios, a document can be identified as responsive as long as one or more of its text snippets (small passages of text) are deemed responsive. In these scenarios, if text classification can be used to locate these responsive snippets, then attorneys could easily evaluate the model's document classification decisions. When deployed with explainable results, text classification can drastically improve both the quality and the speed of the document review process, and explainable predictive coding gives lawyers greater confidence in the results of that supervised learning task. This paper describes a framework for explainable text classification as a valuable tool in legal services: for enhancing the quality and efficiency of legal document review and for assisting in locating responsive snippets within responsive documents.
This framework has been implemented in our legal analytics product, which has been used in hundreds of legal matters. We also report our experimental results using the data from an actual legal matter that used this type of document review.
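The snippet-level explanation idea described above can be illustrated with a minimal sketch. All names here are hypothetical, and the toy keyword-weight scorer is a stand-in for a trained text classifier; the point is only the mechanism: score each snippet, call the document responsive if any snippet clears the threshold, and return the high-scoring snippets as the explanation.

```python
def split_into_snippets(text, size=30):
    """Split a document into overlapping word-window snippets."""
    words = text.split()
    step = max(1, size // 2)
    return [" ".join(words[i:i + size])
            for i in range(0, max(1, len(words) - size + 1), step)]

def snippet_score(snippet, weights):
    """Toy linear scorer: sum of per-term weights.
    A real deployment would use a trained classifier here."""
    return sum(weights.get(w.lower(), 0.0) for w in snippet.split())

def classify_with_explanation(text, weights, threshold=1.0, top_k=2):
    """Classify a document as responsive and return the snippets
    that justify the decision (the 'explanation')."""
    scored = sorted(((snippet_score(s, weights), s)
                     for s in split_into_snippets(text)), reverse=True)
    responsive = scored[0][0] >= threshold
    explanation = [s for score, s in scored[:top_k] if score >= threshold]
    return responsive, explanation
```

An attorney reviewing the output sees not just a responsive/non-responsive label but the specific passages that drove the decision, which is the explainability property the framework targets.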
US corporations regularly spend millions of dollars reviewing electronically stored documents in legal matters. Recently, attorneys have begun applying text classification to efficiently cull massive volumes of data and identify responsive documents for use in these matters. While text classification is regularly used to reduce the discovery costs of legal matters, it also faces a perception challenge: amongst lawyers, this technology is sometimes looked upon as a "black box," because no extra information is provided for attorneys to understand why documents are classified as responsive. In recent years, explainable machine learning has emerged as an active research area, in which the predictions or decisions made by a machine learning model are made human understandable. In legal 'document review' scenarios, a document is responsive because one or more of its small text snippets are deemed responsive. In these scenarios, if these responsive snippets can be located, then attorneys can easily evaluate the model's document classification decisions, which is especially important in the field of responsible AI. Our prior research found that predictive models trained on annotated text snippets achieved higher precision than models trained on documents' full text. While interesting, manually annotating training text snippets is not generally practical during a legal document review. However, small increases in precision can drastically decrease the cost of large document reviews, so automating the identification of training text snippets without human review could make training snippet-based models a practical approach. This paper proposes two simple machine learning methods to locate responsive text snippets within responsive documents without using human-annotated training snippets.
The two methods were evaluated and compared with a document classification method using three datasets from actual legal matters. The results show that the two proposed methods outperform the document-level training classification method in identifying responsive text snippets in responsive documents. Additionally, the results suggest that we can automate the successful identification of training text snippets to improve the precision of our predictive models in legal document review and thereby help reduce the overall cost of review.
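The core automation step the abstract describes — replacing manual snippet annotation with an automatic selection of training snippets — can be sketched as follows. This is not the paper's actual method (the abstract does not specify it); it is an illustrative sketch in which a provisional scorer, trained at the document level, picks each responsive document's highest-scoring snippet to serve as snippet-level training text.

```python
def select_training_snippets(responsive_docs, provisional_scorer, size=25):
    """For each responsive document, keep its single highest-scoring snippet
    under a provisional scorer, yielding snippet-level training text
    without any manual annotation (illustrative sketch)."""
    selected = []
    step = max(1, size // 2)
    for doc in responsive_docs:
        words = doc.split()
        # Overlapping word windows; short documents yield one snippet.
        snippets = [" ".join(words[i:i + size])
                    for i in range(0, max(1, len(words) - size + 1), step)]
        selected.append(max(snippets, key=provisional_scorer))
    return selected
```

The selected snippets would then replace full documents as the positive training examples, which is the mechanism the paper credits for the precision gains.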
Active learning is a popular methodology in text classification, known in the legal domain as 'predictive coding' or 'Technology Assisted Review' ('TAR'), due to its potential to minimize the review effort required to build effective classifiers. It is generally assumed that when building a classifier of data for legal purposes (such as production to an opposing party or identification of attorney-client privileged data), the seed set matters less as additional learning rounds are performed; thus, in most existing seed set studies the seed set is built either from a random document sample or from synthetic documents. However, our recent empirical evaluation of a range of seed set selection strategies demonstrated that the seed set selection strategy can significantly impact predictive coding performance, and it is unclear whether that conclusion applies to active learning for predictive coding. In this study, we try to answer that question through extensive experimentation examining the impact of popular seed set selection strategies in active learning within a predictive coding exercise. Additionally, significant research has been devoted to achieving high levels of recall efficiently through continuous active learning strategies under the assumption that human review will continue until a certain recall is achieved. However, for reasons such as monetary cost, data sensitivity (or the lack thereof), or the time needed to classify a population, this heavy human lift is often less than ideal for lawyers classifying a population for production to an opposing party or for attorney-client privilege. Often the strategy is, instead, to minimize the human review effort and to classify a population efficiently with minimal human intervention. In these instances, the best selection strategy may differ from what prior research suggests.
In this study, we evaluate different active learning strategies against well-researched continuous active learning strategies to determine efficient training methods for classifying large populations quickly and precisely. We study how random-sampling, keyword-model, and clustering-based seed set selection strategies, combined with top-ranked, uncertain, random, recall-inspired, and hybrid active learning document selection strategies, affect the performance of active learning for predictive coding. For the purposes of this study, we use the percentage of documents requiring review to reach 75% recall as the 'benchmark' metric for evaluating and comparing our approaches; 75% is a commonly used recall threshold in the legal domain when using classifiers to designate documents for production. In most cases we find that seed set selection methods have a minor impact, though they do show significant impact on lower-richness data sets or when choosing a top-ranked active learning selection strategy. Our results also show that active learning selection strategies implementing uncertainty, random, or 75% recall selection strategies have the p...
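The moving parts the abstract names — a seed set selection step followed by per-round document selection strategies such as top-ranked and uncertainty sampling — can be sketched in miniature. Every function and the term-overlap scorer below are hypothetical simplifications (a real TAR system would use a trained classifier and far larger batches); the sketch only shows how the two selection strategies differ in which documents they route to human review.

```python
import random

def random_seed(pool, k, rng=random.Random(0)):
    """Random seed set selection: sample k document indices to label first."""
    return rng.sample(range(len(pool)), k)

def toy_score(doc, positive_terms):
    """Toy responsiveness score: fraction of the document's words that
    appear in terms drawn from labeled responsive documents."""
    words = doc.split()
    return sum(w in positive_terms for w in words) / max(1, len(words))

def active_learning_round(pool, labeled, select="uncertain", k=2):
    """One selection round: score unlabeled docs, then choose k to review.
    'top' sends the most-likely-responsive docs; 'uncertain' sends the
    docs nearest the 0.5 decision boundary."""
    positive_terms = {w for doc, lab in labeled if lab for w in doc.split()}
    scores = [(toy_score(doc, positive_terms), i) for i, doc in enumerate(pool)]
    if select == "top":
        ranked = sorted(scores, reverse=True)
    else:
        ranked = sorted(scores, key=lambda si: abs(si[0] - 0.5))
    return [i for _, i in ranked[:k]]
```

Iterating this round — label the selected documents, retrain, select again — is the active learning loop; the study's question is which combination of seed and per-round strategy reaches 75% recall with the least review effort.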