Empirical evaluations of preprocessing parameters' impact on predictive coding's effectiveness

Chhatwal, Rishi; Huber-Fliflet, Nathaniel; Keeling, Robert; Zhang, Jianping; Zhao, Haixing

doi:10.1109/bigdata.2016.7840747

Cited by 21 publications

(17 citation statements)

References 1 publication


“…The machine learning algorithm we used to generate models was Logistic Regression. One of our prior studies demonstrated that predictive models generated with Logistic Regression perform very well on legal matter documents [13]. Other parameters we used for modeling were, bag of words with 1-gram and normalized frequency, and 20,000 tokens were used as features.…”

Section: Methods (mentioning)

confidence: 99%
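The feature setup described in the statement above (1-gram bag of words, normalized frequency, a capped token vocabulary) can be illustrated with a minimal sketch. The tokenizer, the choice of the most frequent corpus tokens as the vocabulary, and the definition of normalized frequency as a token count divided by the document's in-vocabulary token count are all assumptions made for illustration; the quoted text does not specify them. The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word unigrams (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents, max_features=20000):
    """Map the max_features most frequent corpus tokens to feature indices.

    Selecting features by corpus frequency is an assumption; the quoted
    statement only says that 20,000 tokens were used as features.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    most_common = counts.most_common(max_features)
    return {token: index for index, (token, _) in enumerate(most_common)}


def normalized_frequency_vector(text, vocabulary):
    """Return a sparse vector {feature_index: count / total in-vocabulary count}."""
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vocabulary[t]: c / total for t, c in counts.items()}
```

Vectors produced this way could then be passed to a classifier such as Logistic Regression, which is the algorithm the citing paper reports using.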

Empirical evaluations of active learning strategies in legal document review

Chhatwal, Huber-Fliflet, Keeling et al. 2017

2017 IEEE International Conference on Big Data (Big Data)

Self Cite


One type of machine learning, text classification, is now regularly applied in legal matters involving voluminous document populations because it can reduce the time and expense associated with the review of those documents. One form of machine learning, Active Learning, has drawn attention from the legal community because it offers the potential to make the machine learning process even more effective. Active Learning, applied to legal documents, is considered a new technology in the legal domain and is continuously applied to all documents in a legal matter until an insignificant number of relevant documents are left for review. This implementation is slightly different than traditional implementations of Active Learning where the process stops once achieving acceptable model performance. The purpose of this paper is twofold: (i) to question whether Active Learning actually is a superior learning methodology and (ii) to highlight the ways that Active Learning can be most effectively applied to real legal industry data. Unlike other studies, our experiments were performed against large data sets taken from recent, real-world legal matters covering a variety of areas. We conclude that, although these experiments show the Active Learning strategy popularly used in legal document review can quickly identify informative training documents, it becomes less effective over time. In particular, our findings suggest this most popular form of Active Learning in the legal arena, where the highest-scoring documents are selected as training examples, is in fact not the most efficient approach in most instances. Ultimately, a different Active Learning strategy may be best suited to initiate the predictive modeling process but not to continue through the entire document review.
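To make the contrast in the abstract concrete, the sketch below compares two ways of choosing the next training batch from model scores. The first picks the highest-scoring documents, as the abstract describes. The second picks the documents whose scores are closest to 0.5 (uncertainty sampling); this is a common alternative offered only as an illustrative assumption, since the abstract does not name the other strategies it evaluated. Function names and the score dictionary are hypothetical.

```python
def select_top_scoring(scores, batch_size):
    """Pick the unreviewed documents with the highest predicted relevance scores.

    scores maps a document id to a model probability in [0, 1].
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:batch_size]


def select_most_uncertain(scores, batch_size):
    """Pick the documents whose scores are closest to 0.5.

    Uncertainty sampling is an assumed example of an alternative strategy,
    not one confirmed by the abstract.
    """
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]
```

With scores such as `{"a": 0.95, "b": 0.52, "c": 0.10}`, the first strategy favors "a" (likely relevant, so useful for review) while the second favors "b" (least certain, so potentially more informative for training).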


“…Both document and rationale models were deployed to assign probability scores to each snippet of a responsive document. M (1,2,3,4,5) top scoring snippets were selected as the identified rationales for each model. An identified snippet is a true rationale if it overlaps with the annotated rationale identified by the attorney reviewer.…”

Section: Results of the Experiments (mentioning)

confidence: 99%
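The evaluation rule in the statement above (take the M top-scoring snippets, count one as a true rationale if it overlaps the attorney's annotated rationale) might be sketched as follows. Representing snippets and annotations as half-open character spans, and treating any shared character as an overlap, are assumptions for illustration; the quoted text does not define overlap precisely. All names are hypothetical.

```python
def top_m_snippets(scored_snippets, m):
    """Return the m snippets with the highest model scores."""
    ranked = sorted(scored_snippets, key=lambda s: s["score"], reverse=True)
    return ranked[:m]


def overlaps(span_a, span_b):
    """Half-open (start, end) spans overlap if they share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def count_true_rationales(scored_snippets, annotated_spans, m):
    """Count how many of the top-m snippets overlap any annotated rationale."""
    chosen = top_m_snippets(scored_snippets, m)
    return sum(
        any(overlaps(snippet["span"], annotated) for annotated in annotated_spans)
        for snippet in chosen
    )
```

Running `count_true_rationales` for M = 1 through 5 on each responsive document would give the per-M counts that the statement refers to.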

“…Our prior studies demonstrated that predictive models generated with Logistic Regression perform very well on legal matter documents [2,10]. Other parameters used for modeling were bag of words with 1-gram and normalized frequency [2]. The results reported in the next section are averaged over a fivefold cross validation.…”

Section: B Experiments Design (mentioning)

confidence: 99%

Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding

Chhatwal, Gronvall, Huber-Fliflet et al. 2018

2018 IEEE International Conference on Big Data (Big Data)

Self Cite


In today's legal environment, lawsuits and regulatory investigations require companies to embark upon increasingly intensive data-focused engagements to identify, collect and analyze large quantities of data. When documents are staged for review, where they are typically assessed for relevancy or privilege, the process can require companies to dedicate an extraordinary level of resources, both with respect to human resources, but also with respect to the use of technology-based techniques to intelligently sift through data. Companies regularly spend millions of dollars producing 'responsive' electronically-stored documents for these types of matters. For several years, attorneys have been using a variety of tools to conduct this exercise, and most recently, they are accepting the use of machine learning techniques like text classification (referred to as predictive coding in the legal industry) to efficiently cull massive volumes of data to identify responsive documents for use in these matters. In recent years, a group of AI and Machine Learning researchers have been actively researching Explainable AI. In an explainable AI system, actions or decisions are human understandable. In typical legal 'document review' scenarios, a document can be identified as responsive, as long as one or more of the text snippets (small passages of text) in a document are deemed responsive. In these scenarios, if predictive coding can be used to locate these responsive snippets, then attorneys could easily evaluate the model's document classification decision. When deployed with defined and explainable results, predictive coding can drastically enhance the overall quality and speed of the document review process by reducing the time it takes to review documents. Moreover, explainable predictive coding provides lawyers with greater confidence in the results of that supervised learning task.
The authors of this paper propose the concept of explainable predictive coding and simple explainable predictive coding methods to locate responsive snippets within responsive documents. We also report our preliminary experimental results using the data from an actual legal matter that entailed this type of document review. The purpose of this paper is to demonstrate the feasibility of explainable predictive coding in the context of professional services in the legal space.
Keywords: machine learning, text categorization, explainable AI, predictive coding, explainable predictive coding, legal document review


“…For CNN, we used Keras sequence model tokenizer to prepare inputs for the CNN algorithm from the training sets with the specified vocabulary size and sequence length. The same text preprocessing function [2] was used for the SVM, LR, and FR algorithms. The text preprocessing parameters we used consisted of the following steps: two characters), and long words (e.g., words with more than 20 characters).…”

Section: B Text Preprocessing (mentioning)

confidence: 99%

Empirical Comparisons of CNN with Other Learning Algorithms for Text Classification in Legal Document Review

Keeling, Chhatwal, Huber-Fliflet et al. 2019

2019 IEEE International Conference on Big Data (Big Data)

Self Cite


Research has shown that Convolutional Neural Networks (CNN) can be effectively applied to text classification as part of a predictive coding protocol. That said, most research to date has been conducted on data sets with short documents that do not reflect the variety of documents in real world document reviews. Using data from four actual reviews with documents of varying lengths, we compared CNN with other popular machine learning algorithms for text classification, including Logistic Regression, Support Vector Machine, and Random Forest. For each data set, classification models were trained with different training sample sizes using different learning algorithms. These models were then evaluated using a large randomly sampled test set of documents, and the results were compared using precision and recall curves. Our study demonstrates that CNN performed well, but that there was no single algorithm that performed the best across the combination of data sets and training sample sizes. These results will help advance research into the legal profession's use of machine learning algorithms that maximize performance.
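As a rough illustration of the precision and recall curves mentioned in the abstract, the sketch below computes one (recall, precision) point per rank cutoff from model scores and true labels on a test set. It is a generic construction and not the authors' evaluation code; the function name is hypothetical, and tied scores are not given special treatment, which is a simplifying assumption.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs, one per cutoff in descending score order.

    scores: model probabilities for the test documents.
    labels: 1 for responsive, 0 for not responsive, in the same order.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positive
        precision = true_positives / rank
        points.append((recall, precision))
    return points
```

Plotting these points for each algorithm and training sample size on the same axes would allow the kind of side-by-side comparison the abstract describes.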
