A multilingual document may contain text in more than one language. In a multilingual country like India, documents often need to combine text in different languages to reach a larger cross-section of people. This, however, makes OCR (Optical Character Recognition) of such documents difficult in practice, because the language of the text must be determined before a particular OCR system is applied. It is perhaps impossible to design a single recognizer that can handle a large number of scripts and languages, so the language regions of a document must be identified before each is passed to the corresponding OCR system. Script identification aims to extract information presented in digital documents such as articles, newspapers, magazines and e-books, and this need has given rise to many language identification systems. The objective of this paper is to propose a model that identifies the script type of different text portions using visual clues. Seven features, namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes, are used to identify the script type. The experiments use multilingual documents containing Telugu, English and Hindi scripts, and an identification accuracy of above 93% is achieved.
Dimensionality reduction continues to be a challenging problem, with huge amounts of data being generated in domains such as bioinformatics and social networks. We propose a novel dimensionality reduction algorithm based on the idea of consensus clustering using genetic algorithms. Classification is used for validation, and the algorithm is evaluated on benchmark data sets with dimensionality ranging from 8 to 617 features. The results are on par with the latest approaches proposed in the literature.
Feature selection is an essential technique for high-dimensional data. Feature selection has mainly focused on removing irrelevant features, but removing redundant features is equally important. We propose a novel feature subset selection algorithm based on the idea of consensus clustering. Our algorithm constructs a complete graph on the feature space and partitions the graph using various graph partitioning algorithms from social network analysis. Consensus clustering is applied to find the best partitioning, and the final feature subset is formed by selecting from each cluster the most 'representative' feature, that is, the one with the highest correlation to the target class. Classification is used for validation, and the algorithm is evaluated on benchmark data sets with dimensionality ranging from 8 to 168 features. The results show that the proposed approach is efficient in removing irrelevant and redundant features. The number of features selected by the proposed method is small, and classifier accuracies using the selected features are on par with those of the latest approaches proposed in the literature.
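The final selection step described in this abstract, keeping one representative feature per cluster, can be illustrated with a minimal sketch. It is not the authors' implementation. The graph partitioning and consensus clustering stages are replaced here by a much simpler stand-in (connected components of a thresholded feature-correlation graph), and the names `pearson`, `cluster_features`, `select_representatives` and the `threshold` parameter are hypothetical. Pearson correlation is assumed as the correlation measure, and no feature column is assumed to be constant.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def cluster_features(columns, threshold=0.8):
    """Assign a cluster label to each feature column.

    Stand-in for the partitioning/consensus step: two features are linked when
    their absolute correlation is at least `threshold`, and each connected
    component of that graph becomes one cluster.
    """
    n = len(columns)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and abs(pearson(columns[i], columns[j])) >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def select_representatives(columns, target, labels):
    """From each cluster, keep the feature with the highest |correlation| to the target."""
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    for cluster in sorted(set(labels)):
        members = [j for j, lab in enumerate(labels) if lab == cluster]
        selected.append(max(members, key=lambda j: relevance[j]))
    return sorted(selected)


# Synthetic demo data: features 0 and 1 are near-duplicates, as are 2 and 3.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(200)]
columns = [
    a,
    [x + 0.01 * rng.gauss(0, 1) for x in a],
    b,
    [x + 0.01 * rng.gauss(0, 1) for x in b],
]
target = [1.0 if x + z > 0 else 0.0 for x, z in zip(a, b)]

labels = cluster_features(columns)
selected = select_representatives(columns, target, labels)
print(labels, selected)
```

On this synthetic data the two redundant pairs fall into separate clusters, so one feature from each pair is kept. A thresholded graph is only one of many ways to group features. The abstract's approach instead combines several partitionings of a complete feature graph through consensus clustering.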