Cross-lingual information retrieval (CLIR) systems let users query in one language and retrieve relevant documents in another. In general, a CLIR system translates the query from the source language to the target language and retrieves target-language documents based on the keywords in the translated query. However, ambiguity in the source and translated queries reduces system performance. An ontology can be used to address this problem. Current ontology-based CLIR systems rely on manually constructed multilingual ontologies, which are expensive to build. Many methods exist to construct an ontology automatically for any domain in English, but not in other languages such as Tamil. To address these issues, we propose a methodology for a Tamil-English CLIR system that translates the Tamil query to English and retrieves pages in English. Our approach uses a word sense disambiguation module to resolve ambiguity in the Tamil query, and an automatically constructed English ontology to resolve ambiguity in the English query. We have developed a morphological analyser for Tamil, a Tamil-English bilingual dictionary and a named entity database to translate a Tamil query to English. The translated query is reformulated using the ontology, and the reformulated queries are given to a search engine to retrieve English documents from the Internet. We have evaluated our methodology for the agriculture domain, and the results show that our approach outperforms other approaches in terms of precision.
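The translate-then-reformulate flow described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary entries, the ontology fragment, and the helper names `translate` and `reformulate` are hypothetical and not taken from the paper. The sketch also assumes the query has already been reduced to root forms, and it picks the first dictionary sense where the paper uses a word sense disambiguation module.

```python
# Toy data: every entry below is illustrative, not taken from the paper.
BILINGUAL_DICT = {
    "நெல்": ["paddy", "rice"],
    "நோய்": ["disease"],
}

# Hypothetical ontology fragment: English concept -> related concepts.
ONTOLOGY = {
    "paddy": ["rice crop"],
    "disease": ["blast", "blight"],
}


def translate(tokens, dictionary):
    """Take the first listed sense of each token; keep unknown tokens as-is.

    A real system would choose the sense with word sense disambiguation.
    """
    return [dictionary.get(tok, [tok])[0] for tok in tokens]


def reformulate(terms, ontology):
    """Return the base query plus one variant per related ontology concept."""
    queries = [" ".join(terms)]
    for i, term in enumerate(terms):
        for related in ontology.get(term, []):
            variant = terms[:i] + [related] + terms[i + 1:]
            queries.append(" ".join(variant))
    return queries


english_terms = translate(["நெல்", "நோய்"], BILINGUAL_DICT)
reformulated = reformulate(english_terms, ONTOLOGY)
```

Each string in `reformulated` would then be submitted to a search engine as a separate query. Swapping one term at a time keeps the variants close to the original query, which is only one of several plausible expansion strategies.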
Social media has effectively become the prime hub of communication and digital marketing. Because these platforms allow thoughts and facts to be expressed freely in text, images and video, there is a pressing need to screen them to protect individuals and groups from offensive content targeted at them. Our work classifies code-mixed social media comments and posts in the Dravidian languages Tamil, Kannada, and Malayalam. We aim to improve offensive language identification by generating pseudo-labels on the dataset. We transliterate all the code-mixed texts into the respective Dravidian language (Kannada, Malayalam, or Tamil) and then generate pseudo-labels for the transliterated dataset. The original and transliterated datasets are combined using the generated pseudo-labels to create a custom dataset called CM-TRA. As Dravidian languages are under-resourced, our approach increases the amount of training data for the language models. We fine-tune several recent pretrained language models on the newly constructed dataset. We also extract the pretrained language embeddings and pass them to recurrent neural networks. We observe that fine-tuning ULMFiT on the custom dataset yields the best results on the code-mixed test sets of all three languages. Our approach yields the best results among the benchmarked models on Tamil-English, achieving a weighted F1-score of 0.7934, while scoring competitive weighted F1-scores of 0.9624 and 0.7306 on the code-mixed test sets of Malayalam-English and Kannada-English, respectively. The data and code for the approaches discussed in our work have been released at https://github.com/adeepH/Dravidian-OLI.
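For readers unfamiliar with the metric reported in this abstract, the weighted F1-score is the per-class F1 averaged with each class weighted by its share of the true labels. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `weighted_f1` and the example labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp  # true instances of this class that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Hypothetical labels for four comments.
y_true = ["offensive", "not", "not", "not"]
y_pred = ["offensive", "offensive", "not", "not"]
example_score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a large majority class, such as non-offensive comments, from being under-represented in the score, which is why this average is common on imbalanced datasets.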
Marine species recognition is the process of identifying marine species, which helps in estimating populations and in identifying endangered species so that remedial action can be taken. Deep learning performs well at classification because it estimates millions of parameters, which must be learned from large annotated datasets. However, many fish species are becoming extinct, which may reduce the number of available samples. The lack of a large dataset is a significant hurdle for applying a deep neural network, but it can be overcome using transfer learning. We therefore propose a transfer learning technique that takes underwater fish images as input and detects the fish species using a pre-trained Google Inception-v3 model. We evaluated the proposed method on the Fish4Knowledge (F4K) dataset and obtained an accuracy of 95.37%. This research would help marine biologists identify the presence and quantity of fish, understand the underwater environment and encourage its preservation, and study the behavior and interactions of marine animals.