This work develops a novel conflation method that exploits the phonetic quality of words and uses standard NLP tools such as Levenshtein Distance (LD) and Longest Common Subsequence (LCS) for the stemming process.
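For readers unfamiliar with the two measures named above, the following is a minimal sketch of Levenshtein Distance and Longest Common Subsequence length using textbook dynamic programming. The function names `levenshtein` and `lcs_length` are illustrative assumptions; the abstract does not say how the authors combine these measures with phonetic information, so that step is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two inflected forms that share the stem "connect".
    print(levenshtein("connection", "connected"))
    print(lcs_length("connection", "connected"))
```

A small LD together with a long LCS suggests that two word forms are candidates for conflation to the same stem; any threshold for that decision would be a further assumption.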
In this paper we attempt to build a parametric model for word frequencies. To make the relation applicable, we arranged word lengths according to their normalized frequencies. We investigated the pattern of occurrence of words containing different numbers of letters on the basis of their Zipf's order, applying a power law to Zipf's order and frequencies. The applicability of the resulting mathematical model for word-length frequencies was verified on different texts. We also resolved the problem of relating the frequencies of words of higher Zipf's order to text length.
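As a rough illustration of fitting a power law to rank and frequency, the sketch below estimates the exponent s and constant C in f(r) ≈ C / r^s by ordinary least squares on log-log values. This is the generic Zipf form, assumed here for illustration only; the abstract does not give the authors' parametric model, and the names `rank_frequencies` and `fit_power_law` are hypothetical.

```python
import math
from collections import Counter


def rank_frequencies(text: str) -> list[int]:
    """Word counts sorted in descending order, so index 0 is rank 1."""
    counts = Counter(text.lower().split())
    return sorted(counts.values(), reverse=True)


def fit_power_law(freqs: list[float]) -> tuple[float, float]:
    """Fit f(r) = C / r**s to positive frequencies at ranks 1..n.

    Returns (s, C), estimated by least squares on log f versus log r.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a line")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic frequencies that follow 1200 / r, so s should be close to 1.
    s, c = fit_power_law([1200, 600, 400, 300, 240])
    print(round(s, 3), round(c, 1))
```

On real text, the fit is usually restricted to a range of ranks, since the lowest and highest ranks tend to deviate from a pure power law.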
Suffix stripping is the problem of removing morphological suffixes from a word to obtain its stem. We present suffix stripping as an unconstrained optimization problem and develop a simple algorithm that requires no linguistic or morphological knowledge. We demonstrate the superiority of the algorithm over an established technique for the English language. Keywords: information retrieval, stemming, conflation, edit distance. International Journal of Applied Linguistics
A language-independent stemmer has long been sought. The single N-gram tokenization technique works well, but it often generates stems that start with intermediate characters rather than initial ones. We present a novel technique that takes the concept of N-gram stemming one step further and compare our method with an established algorithm in the field, Porter's stemmer. Results indicate that our N-gram stemmer is not inferior to Porter's linguistic stemmer.
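The limitation described above can be seen in a minimal sketch of character N-gram tokenization. The helper `char_ngrams` is a hypothetical name, and this shows only the plain sliding-window baseline, not the authors' improved technique, which the abstract does not detail.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """All contiguous character substrings of length n, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    grams = char_ngrams("stemming", 4)
    print(grams)
    # Only the first gram begins at the word's initial character;
    # every other gram starts at an intermediate position.
    print(grams[0], grams[1:])
```

If a single N-gram is chosen to represent the word (for example by corpus statistics, which is an assumption here), nothing in the sliding window forces that choice to be the word-initial gram, which is the behaviour the abstract criticizes.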