Ranking functions from information retrieval are used primarily in search engines, and they are often adopted for various language processing applications. This paper introduces novel heuristics that are combined with probabilistic retrieval functions and applied to the approximate string similarity problem. Various algorithms have been proposed in the literature to solve approximate string similarity problems; however, none of them makes use of probabilistic retrieval functions. We are the first to explore the intersection between these two areas, string similarity and information retrieval, and we propose heuristic designs to address the problem. First, we propose a chunking heuristic function called BREAK. We present the variants BREAK-1, BREAK-2, and BREAK-OFF, which split terms in a way that incorporates sequential information. Then we propose BREAK-n, which generalizes these variants and scales to larger datasets. To relate the resulting splits, we propose MAKE, a graphical error modelling heuristic over the BREAK variants. Finally, we propose the TAKE curve, a novel feature-engineering probability distribution that replaces the prevalent normalization heuristics. Taking advantage of the flexibility in the choice of heuristics, we assess the variants on cognate detection, mutant identification, and isolated spelling correction problems. In an extensive evaluation, we found that our designs perform better than prevalent heuristics and are robust against database characteristics.
Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where our method, 2kenize, is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code mixing and named entities. The code and dataset are available at https:
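The one-to-many problem described above can be illustrated with a minimal Python sketch. This is not the 2kenize model: the mapping table, the word counts standing in for a language model, and the function names `candidates` and `convert` are all hypothetical and chosen only for illustration. It shows why a per-character lookup is insufficient: the simplified character 发 maps to both 發 and 髮, and only the surrounding context decides between them.

```python
from itertools import product

# Toy one-to-many table (simplified -> traditional candidates).
# A real table would cover thousands of characters.
MAPPING = {"头": ["頭"], "发": ["發", "髮"], "现": ["現"]}

# Hypothetical counts of traditional-script words, standing in for a
# language model. These numbers are made up for the example.
TOY_COUNTS = {"頭髮": 5, "發現": 7}


def candidates(text):
    """Enumerate every traditional rendering allowed by the mapping table."""
    options = [MAPPING.get(ch, [ch]) for ch in text]
    return ["".join(chars) for chars in product(*options)]


def convert(text):
    """Pick the candidate with the highest toy count (ties keep the first)."""
    return max(candidates(text), key=lambda cand: TOY_COUNTS.get(cand, 0))


print(candidates("头发"))
print(convert("头发"), convert("发现"))
```

Enumerating all candidates grows exponentially with the number of ambiguous characters, so this brute-force form is only suitable for short inputs; it is meant to show the disambiguation step, not an efficient search.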
Ranking functions in information retrieval are often used in search engines to return the answers most relevant to a query. This paper takes this notion from information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research focuses on the classification problem, that is, distinguishing whether a pair of words are cognates. This paper focuses on a harder problem: whether we can predict a possible cognate from a given input. Our study shows that language modelling smoothing methods, when applied as retrieval functions and used in conjunction with positional segmentation and error modelling, give better results than competing baselines in both classification and prediction of cognates.
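As a rough illustration of how a smoothed language model can act as a retrieval function over segmented words, the following Python sketch ranks candidate words against a query word. It rests on assumptions that are not taken from the paper: positional segmentation is read here as tagging each character n-gram with its start index, the smoothing method is Dirichlet smoothing, and the names `positional_segments`, `dirichlet_score`, and `rank_candidates`, along with the value of `mu`, are hypothetical. Graphical error modelling is not covered.

```python
import math
from collections import Counter


def positional_segments(word, n=1):
    """Split a word into character n-grams tagged with their start index."""
    return [f"{word[i:i + n]}@{i}" for i in range(len(word) - n + 1)]


def dirichlet_score(query_terms, doc_terms, coll_counts, coll_size, mu=1.0):
    """Query log-likelihood under a Dirichlet-smoothed unigram document model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        cf = coll_counts.get(term, 0)
        if cf == 0:
            continue  # terms unseen in the whole collection are skipped
        p_coll = cf / coll_size
        score += math.log((doc_counts[term] + mu * p_coll) / (doc_len + mu))
    return score


def rank_candidates(query, candidate_words, mu=1.0):
    """Rank candidate words by how well each one 'generates' the query."""
    segmented = {w: positional_segments(w) for w in candidate_words}
    coll_counts = Counter(t for terms in segmented.values() for t in terms)
    coll_size = sum(coll_counts.values())
    q_terms = positional_segments(query)
    scored = [
        (dirichlet_score(q_terms, terms, coll_counts, coll_size, mu), w)
        for w, terms in segmented.items()
    ]
    return [w for _, w in sorted(scored, key=lambda pair: pair[0], reverse=True)]


print(rank_candidates("noche", ["water", "nuit", "notte"]))
```

In this toy run, each candidate word plays the role of a document and the query word plays the role of the query, so prediction amounts to returning the top-ranked candidate, while classification could threshold the score of a given pair.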