Natural Language Processing (NLP) is an interdisciplinary field between linguistics and computer science whose main aim is to process natural (human) language with computer programs. Text classification is one of its main tasks and is widely used in applications such as spam filtering, sentiment analysis, and document categorization. Nonetheless, there is very little text classification work in the legal domain, and even less for the Turkish language. This may be attributed to the complexity of the domain: the length and complexity of its documents and their extensive technical jargon distinguish it from others, and, as in the medical domain, understanding these documents requires considerable specialization. Another reason may be the scarcity of publicly available datasets. In this study, we compile sizeable unsupervised and supervised datasets from publicly available sources and experiment with several classification algorithms, ranging from traditional classifiers to more complex deep learning and transformer-based models, together with different text representations. We focus on classifying Court of Cassation decisions by their crime labels. Interestingly, the majority of the models we experiment with obtain good results. This suggests that although understanding legal documents is difficult for humans and requires expertise, it may be comparatively easy for machine learning models despite the heavy use of technical terms. This is especially the case for transformer-based pre-trained neural language models, which can be adapted to the legal domain and show high potential for future real-world applications.
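As a concrete illustration of the transformer-based approach, the following is a minimal sketch of adapting a pre-trained language model to crime-label classification, assuming the Hugging Face transformers library; the checkpoint name dbmdz/bert-base-turkish-cased (a Turkish BERT) and the label count are illustrative placeholders, since the abstract does not name a specific model or label set.

# Minimal sketch, not the paper's actual setup: a pre-trained Turkish BERT
# with a sequence-classification head for crime labels. The checkpoint name
# and NUM_CRIME_LABELS below are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # assumed Turkish BERT checkpoint
NUM_CRIME_LABELS = 10                         # placeholder label count

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CRIME_LABELS
)
model.eval()  # the new classification head must first be fine-tuned on labeled decisions

def classify(decision_text: str) -> int:
    """Return the predicted crime-label index for one court decision."""
    # Court of Cassation decisions are long; truncate to the encoder's
    # 512-token input limit.
    inputs = tokenizer(decision_text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))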
One of the applications of Natural Language Processing (NLP) is extracting information from free text. Information extraction takes various forms, such as Named Entity Recognition (NER), which detects named entities in free text. The biomedical named-entity extraction task aims to extract entities such as drugs, diseases, and organs from texts in the medical domain. In this study, we improve models commonly used in this domain, such as the biLSTM+CRF model, by using transformer-based language models like BERT and its domain-specific variant BioBERT in the embedding layer. We conduct experiments on several benchmark biomedical datasets with a variety of model and embedding combinations, such as BioBERT+biLSTM+CRF, BERT+biLSTM+CRF, Fasttext+biLSTM+CRF, and Graph Convolutional Networks. Our results show clear improvements of 4% to 13% on several datasets when the baseline biLSTM+CRF model is initialized with pretrained language models such as BERT, and especially with a domain-specific one like BioBERT.
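To make the architecture concrete, here is a minimal sketch of the BERT/BioBERT+biLSTM+CRF tagger in PyTorch. The abstract does not specify an implementation, so the pytorch-crf package, the dmis-lab/biobert-v1.1 checkpoint, and the hidden size below are illustrative assumptions.

# Minimal sketch of a BERT/BioBERT+biLSTM+CRF tagger; library and checkpoint
# choices are assumptions, not the paper's confirmed implementation.
import torch
import torch.nn as nn
from torchcrf import CRF                 # from the pytorch-crf package
from transformers import AutoModel

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int,
                 encoder: str = "dmis-lab/biobert-v1.1",  # assumed checkpoint
                 lstm_hidden: int = 256):                 # assumed size
        super().__init__()
        # Pretrained transformer supplies contextual token embeddings;
        # swap the checkpoint for plain BERT to compare embeddings.
        self.bert = AutoModel.from_pretrained(encoder)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # (batch, seq, hidden) contextual embeddings from the encoder.
        embeddings = self.bert(input_ids,
                               attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(embeddings)
        emissions = self.emit(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence
            # (padded positions must still hold a valid tag index).
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi-decoded tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)

Passing a general-domain BERT checkpoint versus BioBERT as the encoder argument reproduces the embedding-layer comparison described above.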