This paper presents an approach to multilingual news document clustering in comparable corpora. We have implemented two heuristic algorithms that follow the approach. As their only evidence for clustering, they use the identification of cognate named entities between both sides of the comparable corpus. In addition, the algorithms do not need to be told the correct number of clusters. The applicability of the approach depends only on the possibility of identifying cognate named entities between the languages involved in the corpus. The main difference between the two algorithms is whether or not a monolingual clustering phase is applied first. We have tested both algorithms on a comparable corpus of news written in English and Spanish. The two algorithms perform slightly differently; the one that does not apply the monolingual phase achieves better results. In any case, the results obtained with both algorithms are encouraging and show that cognate named entities can provide enough knowledge to deal with multilingual clustering of news documents.
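As an illustration of the cognate-matching idea in the abstract above, the following is a minimal sketch and not the authors' method: the abstract does not say how cognates are identified, so string similarity from Python's `difflib` is used as a stand-in. The function names, the accent-stripping step, and the 0.8 threshold are all assumptions made for this example.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(entity: str) -> str:
    """Lowercase an entity and strip diacritics (an assumed preprocessing step)."""
    decomposed = unicodedata.normalize("NFKD", entity.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def are_cognates(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two named entities as cognates if their spellings are similar enough.

    The similarity measure (difflib ratio) and the threshold are illustrative
    assumptions, not values taken from the paper.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def shared_cognates(entities_en, entities_es, threshold: float = 0.8):
    """Return the set of (English, Spanish) entity pairs judged to be cognates."""
    return {
        (a, b)
        for a in entities_en
        for b in entities_es
        if are_cognates(a, b, threshold)
    }


# Hypothetical entity lists extracted from one English and one Spanish news item.
pairs = shared_cognates(["Brazil", "Jose Saramago"], ["Brasil", "José Saramago", "Madrid"])
print(sorted(pairs))
```

Under these assumptions, the number of cognate pairs two documents share could serve as the evidence for placing them in the same cluster.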
Objective: Medical coding is used to identify and standardize clinical concepts in the records collected from healthcare services. The tenth revision of the International Classification of Diseases (ICD-10) is the most widely used coding system, with more than 11,000 different diagnoses, and it affects research, reporting, and funding. Unfortunately, ICD-10 code sets tend to follow biased, unbalanced, and scattered distributions. These distribution attributes, along with high lexical variability, severely restrict performance when coded clinical records are used to infer code sets in uncoded records. To improve that inference, we explore a combination of example-based methods optimized to capture codes with different appearance frequencies in data sets. Materials and Methods: The proposed exploration has been carried out on Spanish hospital discharge reports coded by experts, excluding all sentences without any biomedical concept. Representations based on semantic and lexical features are explored, using both global and label-specific attributes. In turn, algorithms based on binary outputs, groups of subsets, and extreme classification are compared. Each method suggests a list of codes together with their confidence values (certainty probabilities). Results: Diverse spectral behaviors are shown for each method. Binary classifiers seem to maximize the capture of more popular codes, while extreme classifiers promote infrequent ones. To exploit such differences, ensemble approaches are proposed that weight every output code according to the method, confidence value, and appearance frequency. The rule-based combination reaches a 46% Precision at 10 (P@10), which means a 15% improvement over the best individual proposal. Conclusion: Combining methods by weighting each code according to training frequency and performance can achieve better overall Precision scores on extreme distributions, such as ICD-10 coding.
INDEX TERMS Extreme classification, XMTC, ICD-10 coding, text mining.
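For readers unfamiliar with the Precision at 10 (P@10) metric reported above, here is a minimal sketch of how it is commonly computed in extreme multi-label classification: the share of the top 10 suggested codes that appear in the expert-assigned code set. The function name and the example record are hypothetical and do not come from the paper's data.

```python
from typing import Sequence, Set


def precision_at_k(ranked_codes: Sequence[str], gold_codes: Set[str], k: int = 10) -> float:
    """Fraction of the top-k suggested codes that appear in the gold code set.

    `ranked_codes` is assumed to be sorted by decreasing confidence value.
    The denominator is k, following the usual convention for P@k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_codes[:k]
    hits = sum(1 for code in top_k if code in gold_codes)
    return hits / k


# Hypothetical suggestions for one discharge report, best first.
ranked = ["I10", "E11.9", "J18.9", "N39.0", "I50.9",
          "E78.5", "J44.9", "K21.9", "F32.9", "M54.5"]
# Hypothetical expert-assigned codes for the same report.
gold = {"I10", "E11.9", "I50.9", "J44.9"}

score = precision_at_k(ranked, gold)
print(score)
```

A corpus-level P@10 would then be the mean of this score over all evaluated reports.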
Abstract. This paper focuses on discovering bilingual news clusters in a comparable corpus. In particular, we deal with the representation of news items and with the calculation of the similarity between documents. As representative features of the news, we use the cognate named entities they contain. One of our main goals is to test whether named entities alone are a good source of knowledge for multilingual news clustering. In the vector representation of the news, we take into account the category of the named entities. To determine the similarity between two documents, we propose a new approach based on a fuzzy system, with a knowledge base that tries to incorporate human knowledge about the importance of each named entity category in the news. We have compared our approach with a traditional one, obtaining better results on a comparable corpus of news in Spanish and English.