In today's legal environment, lawsuits and regulatory investigations require companies to embark upon increasingly intensive data-focused engagements to identify, collect and analyze large quantities of data. When documents are staged for reviewwhere they are typically assessed for relevancy or privilegethe process can require companies to dedicate an extraordinary level of resources, both with respect to human resources, but also with respect to the use of technology-based techniques to intelligently sift through data. Companies regularly spend millions of dollars producing 'responsive' electronically-stored documents for these types of matters. For several years, attorneys have been using a variety of tools to conduct this exercise, and most recently, they are accepting the use of machine learning techniques like text classification (referred to as predictive coding in the legal industry) to efficiently cull massive volumes of data to identify responsive documents for use in these matters. In recent years, a group of AI and Machine Learning researchers have been actively researching Explainable AI. In an explainable AI system, actions or decisions are human understandable. In typical legal 'document review' scenarios, a document can be identified as responsive, as long as one or more of the text snippets (small passages of text) in a document are deemed responsive. In these scenarios, if predictive coding can be used to locate these responsive snippets, then attorneys could easily evaluate the model's document classification decision. When deployed with defined and explainable results, predictive coding can drastically enhance the overall quality and speed of the document review process by reducing the time it takes to review documents. Moreover, explainable predictive coding provides lawyers with greater confidence in the results of that supervised learning task. The authors of this paper propose the concept of explainable predictive coding and simple explainable predictive coding methods to locate responsive snippets within responsive documents. We also report our preliminary experimental results using the data from an actual legal matter that entailed this type of document review. The purpose of this paper is to demonstrate the feasibility of explainable predictive coding in the context of professional services in the legal space. Keywords-machine learning, text categorization, explainable AI, predictive coding, explainable predictive coding, legal document reviewI.
One type of machine learning, text classification, is now regularly applied in legal matters involving voluminous document populations because it can reduce the time and expense associated with the review of those documents. One form of machine learning -Active Learning -has drawn attention from the legal community because it offers the potential to make the machine learning process even more effective. Active Learning, applied to legal documents, is considered a new technology in the legal domain and is continuously applied to all documents in a legal matter until an insignificant number of relevant documents are left for review. This implementation is slightly different than traditional implementations of Active Learning where the process stops once achieving acceptable model performance. The purpose of this paper is twofold: (i) to question whether Active Learning actually is a superior learning methodology and (ii) to highlight the ways that Active Learning can be most effectively applied to real legal industry data. Unlike other studies, our experiments were performed against large data sets taken from recent, real-world legal matters covering a variety of areas. We conclude that, although these experiments show the Active Learning strategy popularly used in legal document review can quickly identify informative training documents, it becomes less effective over time. In particular, our findings suggest this most popular form of Active Learning in the legal arena, where the highest-scoring documents are selected as training examples, is in fact not the most efficient approach in most instances. Ultimately, a different Active Learning strategy may be best suited to initiate the predictive modeling process but not to continue through the entire document review.
Research has shown that Convolutional Neural Networks (CNN) can be effectively applied to text classification as part of a predictive coding protocol. That said, most research to date has been conducted on data sets with short documents that do not reflect the variety of documents in real world document reviews. Using data from four actual reviews with documents of varying lengths, we compared CNN with other popular machine learning algorithms for text classification, including Logistic Regression, Support Vector Machine, and Random Forest. For each data set, classification models were trained with different training sample sizes using different learning algorithms. These models were then evaluated using a large randomly sampled test set of documents, and the results were compared using precision and recall curves. Our study demonstrates that CNN performed well, but that there was no single algorithm that performed the best across the combination of data sets and training sample sizes. These results will help advance research into the legal profession's use of machine learning algorithms that maximize performance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.