2016 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2016.7840747
Empirical evaluations of preprocessing parameters' impact on predictive coding's effectiveness

Cited by 21 publications (17 citation statements). References 1 publication.
“…The machine learning algorithm we used to generate models was Logistic Regression. One of our prior studies demonstrated that predictive models generated with Logistic Regression perform very well on legal matter documents [13]. The other parameters we used for modeling were a bag of words with 1-grams and normalized frequency, with 20,000 tokens used as features.…”
Section: Methods
confidence: 99%
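The excerpt names the feature setup (1-gram bag of words, normalized term frequency, 20,000 token features) and the classifier, but not an implementation. Below is a minimal sketch of that configuration, assuming scikit-learn; the pipeline structure and solver settings are illustrative, not the authors' code.

```python
# Hypothetical sketch of the cited setup (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Bag of words with 1-grams and length-normalized term frequency
# (no IDF weighting), capped at 20,000 token features, feeding
# a Logistic Regression classifier.
model = Pipeline([
    ("bow", TfidfVectorizer(ngram_range=(1, 1), max_features=20000,
                            use_idf=False, norm="l1")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# docs: list of document texts; labels: responsive (1) / non-responsive (0)
# model.fit(docs, labels)
# scores = model.predict_proba(new_docs)[:, 1]
```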
“…Both document and rationale models were deployed to assign probability scores to each snippet of a responsive document. The M (M = 1, 2, 3, 4, 5) top-scoring snippets were selected as the identified rationales for each model. An identified snippet is a true rationale if it overlaps with the annotated rationale identified by the attorney reviewer.…”
Section: Results of the Experiments
confidence: 99%
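The selection and matching rule described above is simple enough to sketch directly. The following is a hedged illustration, assuming snippets carry model scores and character spans; the overlap test on spans is an assumption, since the excerpt only says the identified snippet must overlap the annotated rationale.

```python
# Illustrative sketch of top-M rationale selection and the overlap test.
from typing import List, Tuple

def top_m_rationales(snippet_scores: List[Tuple[str, float]], m: int) -> List[str]:
    """Return the M highest-scoring snippets of a responsive document."""
    ranked = sorted(snippet_scores, key=lambda s: s[1], reverse=True)
    return [snippet for snippet, _ in ranked[:m]]

def is_true_rationale(snippet_span: Tuple[int, int],
                      annotated_span: Tuple[int, int]) -> bool:
    """A selected snippet counts as a true rationale if its character span
    overlaps the rationale span annotated by the attorney reviewer."""
    return snippet_span[0] < annotated_span[1] and annotated_span[0] < snippet_span[1]
```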
“…Our prior studies demonstrated that predictive models generated with Logistic Regression perform very well on legal matter documents [2,10]. The other parameters used for modeling were a bag of words with 1-grams and normalized frequency [2]. The results reported in the next section are averaged over a five-fold cross-validation.…”
Section: B. Experiments Design
confidence: 99%
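For the averaging over five folds mentioned here, a one-line scikit-learn sketch suffices; the scoring metric shown is a placeholder, as the excerpt does not name one.

```python
# Five-fold cross-validation, averaging a score over the folds
# (scikit-learn assumed; "f1" is a placeholder metric).
from sklearn.model_selection import cross_val_score

# "model" is the bag-of-words + Logistic Regression pipeline sketched above.
# scores = cross_val_score(model, docs, labels, cv=5, scoring="f1")
# print(scores.mean())
```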
“…For CNN, we used the Keras sequence model tokenizer to prepare inputs for the CNN algorithm from the training sets with the specified vocabulary size and sequence length. The same text preprocessing function [2] was used for the SVM, LR, and RF algorithms. The text preprocessing steps included removing short words (e.g., words with fewer than two characters) and long words (e.g., words with more than 20 characters).…”
Section: B. Text Preprocessing
confidence: 99%
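The following sketch shows how such CNN inputs can be prepared with the Keras tokenizer, together with the short-word/long-word filter the excerpt describes. The vocabulary size and sequence length values are assumed placeholders, not the paper's settings.

```python
# Sketch of CNN input preparation with the Keras tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 20000   # assumed placeholder, not the paper's value
SEQ_LENGTH = 400     # assumed placeholder, not the paper's value

def drop_extreme_words(text: str) -> str:
    """Preprocessing step from the excerpt: drop very short words
    (fewer than two characters) and very long words (more than 20)."""
    return " ".join(w for w in text.split() if 2 <= len(w) <= 20)

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
# train_texts = [drop_extreme_words(t) for t in raw_train_texts]
# tokenizer.fit_on_texts(train_texts)
# x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts),
#                         maxlen=SEQ_LENGTH)
```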