Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question answering. Recognizing its importance, researchers have developed a plethora of NER techniques for Western and Asian languages. However, although there are over 490 million Urdu speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated named entities (NEs). The corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings by applying three different techniques, fastText, Word2vec, and GloVe, to two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have performed 32 different experiments, each over 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on an analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques, the impact of word embeddings, and variations across datasets.
Business process models are conceptual models that depict the workflow of an organization. Process model matching (PMM) refers to the automatic identification of corresponding activities between a pair of process models, that is, activities that show similar or the same behavior. During the last few years, PMM has received considerable research attention due to its wide range of applications, such as clone detection and harmonization of process models. Consequently, a plethora of PMM techniques has been developed. To evaluate the effectiveness of these techniques, experts have developed three benchmark datasets, formally called the PMMC'15 datasets. Furthermore, the process models in these datasets have been converted into the OAEI'17 ontologies. These resources are a valuable asset for the PMM community to evaluate process model matching techniques. However, these resources (PMMC'15 and OAEI'17) are limited to a small number of models and a handful of corresponding activities among these models, which may not be sufficient to rigorously evaluate PMM techniques. To fill this gap, this paper provides a large, diverse, and carefully handcrafted collection of process models, along with their benchmark correspondences. The process model collection and the benchmark correspondences between these models are freely available to the community [1]. Our newly developed dataset, together with the existing resources, can be used for a thorough evaluation of PMM techniques, especially in the context of the vocabulary mismatch problem. Finally, we have evaluated the characteristics of our dataset through a series of experiments involving similarity measures widely used in PMM research. The results reveal that our dataset is larger, more diverse, and more challenging than existing datasets in the PMM domain.