Recent advances in language modelling have significantly decreased the need for labelled data in text classification tasks. Transformer-based models pre-trained on unlabelled data can outperform models trained from scratch for each task. However, the amount of labelled data needed to fine-tune such models is still considerably high for domains requiring expert-level annotators, such as the legal domain. This paper investigates strategies for making the best use of a small labelled dataset and large amounts of unlabelled data to perform a classification task in the legal area with 50 predefined topics. More specifically, we use the records of demands made to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subjects, a task that currently requires deep legal knowledge to perform manually. Optimizing classifier performance in this scenario is especially challenging given the scarcity of resources available for the Portuguese language, especially in the legal domain. Our results show that classic supervised models such as logistic regression and SVM, as well as the ensemble methods random forest and gradient boosting, achieve better performance with embeddings extracted with word2vec than with the BERT language model. BERT, in turn, performs best when its own architecture is used as the classifier, surpassing all of the previous models in that setting. The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and semi-supervised learning strategies, reaching an accuracy of 80.7% on the aforementioned task.
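To make the classic-baseline setup concrete, below is a minimal sketch of one of the pipelines the abstract describes: averaging word2vec vectors per document and training a logistic regression classifier. It assumes gensim and scikit-learn; the toy corpus, the topic labels, and all hyperparameters are placeholders, not the paper's actual data or configuration.

```python
# Illustrative sketch (not the authors' code): word2vec document embeddings
# fed to a classic supervised classifier (logistic regression).
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy tokenized demand descriptions and topic labels (placeholders for the
# real corpus and the 50 predefined legal subjects).
docs = [["consumidor", "cobranca", "indevida"],
        ["saude", "medicamento", "fornecimento"]]
labels = ["consumer_law", "health"]

# Train word2vec on the (unlabelled) corpus.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, epochs=20)

def doc_vector(tokens, model):
    """Average the word vectors of the tokens found in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Embed each document and fit the classifier on the small labelled set.
X = np.stack([doc_vector(d, w2v) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```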
In intelligent watermarking of document images, evolutionary computing (EC) techniques are employed to determine embedding parameters of digital watermarking systems such that the trade-off between watermark robustness and image fidelity is optimized. However, existing techniques for intelligent watermarking rely on a full optimization of embedding parameters for each image, an approach that does not apply to high data rate applications due to its high computational complexity. In this paper, a novel intelligent watermarking technique based on Dynamic Particle Swarm Optimization (DPSO) is proposed. Intelligent watermarking of bi-tonal image streams is formulated as a dynamic optimization problem. This population-based technique evolves a diversified set of solutions (i.e., embedding parameters) to an optimization problem, and solutions from previous optimizations are archived and reconsidered before new optimizations are triggered. In such cases, a costly optimization may be replaced by the direct recall of quasi-identical solutions. Simulations involving the intelligent watermarking of several long streams of homogeneous PDF document images resulted in a decrease in computational burden (number of fitness evaluations) of up to 97.2%, with a negligible impact on accuracy.
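The recall-or-reoptimize idea can be illustrated with a short sketch. This is not the authors' DPSO implementation: evaluate_fitness and optimize_embedding_params below are toy stand-ins for the robustness/fidelity fitness function and for a full optimization run, and the tolerance threshold is an assumed parameter.

```python
# Minimal sketch of archiving solutions and recalling them for new images in
# a stream, falling back to a costly optimization only when needed.
import random

def evaluate_fitness(params, image):
    # Toy stand-in for the watermarking fitness (robustness/fidelity trade-off).
    random.seed(hash((params, image)) % (2**32))
    return random.random()

def optimize_embedding_params(image):
    # Toy stand-in for a full DPSO optimization of the embedding parameters.
    best = max(range(10), key=lambda p: evaluate_fitness(p, image))
    return best, evaluate_fitness(best, image)

def watermark_stream(images, fitness_tolerance=0.05):
    """Reuse archived solutions when they remain near-optimal; otherwise re-optimize."""
    archive, results = [], []
    for image in images:
        recalled = None
        # Re-evaluate archived solutions on the new image before optimizing.
        for params, past_fitness in archive:
            fitness = evaluate_fitness(params, image)
            if abs(fitness - past_fitness) <= fitness_tolerance:
                recalled = (params, fitness)  # quasi-identical case: reuse it
                break
        if recalled is None:
            params, fitness = optimize_embedding_params(image)  # costly optimization
            archive.append((params, fitness))
            recalled = (params, fitness)
        results.append(recalled)
    return results

print(watermark_stream(["page_001", "page_002", "page_001"]))
```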