A new word-based compression model allowing compressed pattern matching

Buluş, Halil Nusret; Carus, Aydin; Mesut, Altan

doi:10.3906/elk-1601-92

Cited by 7 publications

(2 citation statements)

References 19 publications


“…The corresponding pattern matching algorithms for different compression units and compression algorithms are also different. Usually, the short text [21], the suffix [22,23], the word [24], and the character string [25,26] are used as the pattern matching unit of the compressed content. Some studies have used the BM algorithm as a pattern matching algorithm in compression format [27,28].…”

Section: Related Research (mentioning)

confidence: 99%

Research on Uyghur Pattern Matching Based on Syllable Features

Abliz, Maimaiti, Wu et al. 2020

Information


Pattern matching is widely used in fields such as information retrieval, natural language processing (NLP), data mining, and network security. Research on pattern matching is also ongoing for Uyghur, a typical agglutinative, low-resource language with complex morphology, spoken by the ethnic Uyghur group in Xinjiang, China. Because of the characteristics of the language, pattern matching that uses characters or words as basic units performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes a Boyer–Moore-U (BM-U) algorithm and a retrievable syllable coding format, both based on the syllable features of the Uyghur language and on an improved Boyer–Moore (BM) algorithm. The algorithm uses syllable features to perform pattern matching, which effectively solves the problem of weakened vowels and better matches words whose stems change shape. In pattern matching experiments on character-encoded text and syllable-encoded text for vowel-weakened words, the precision, recall, F1-measure, and accuracy of the BM-U algorithm improve by 4%, 55%, 33%, and 25% and by 10%, 52%, 38%, and 38%, respectively, compared to the BM algorithm.
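To make the idea of a "matching unit" concrete, the following is a minimal sketch of Boyer–Moore–Horspool search, a simplified variant of BM that keeps only the bad-character shift. It is not the BM-U algorithm from the abstract. The function name `horspool_search` and the sample tokens are hypothetical, and the sketch assumes the text has already been split into units (characters, syllables, or words) by some earlier step.

```python
from typing import Dict, List, Sequence


def horspool_search(text: Sequence[str], pattern: Sequence[str]) -> List[int]:
    """Return the start index of every occurrence of pattern in text.

    Both arguments are sequences of matching units; a plain string works
    (units are characters), and so does a list of syllables or words.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad-character table: for each unit in the pattern (except the last
    # position), the distance from its rightmost occurrence to the pattern end.
    shift: Dict[str, int] = {unit: m - 1 - i for i, unit in enumerate(pattern[:-1])}

    matches: List[int] = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, as Boyer-Moore does.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift according to the text unit aligned with the pattern's last position;
        # units that never occur in the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return matches


# Character units.
print(horspool_search("abracadabra", "abra"))

# Hypothetical pre-segmented units (placeholder tokens, not real Uyghur syllabification).
print(horspool_search(["ki", "tab", "lar", "ni", "ki", "tab"], ["ki", "tab"]))
```

Because the shift table is keyed by unit rather than by character, the same routine runs unchanged over a syllable sequence; the larger the unit, the fewer comparisons per position, at the cost of needing a segmentation step first.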


“…MWCA [4], a word-based compression algorithm that we developed in a previous study, sorts all words according to their frequencies, adds the most frequent 255 words to the D1 dictionary and encodes them as 1 byte and the next 65536 words to the D2 dictionary and encodes them as 2 bytes. Although there have been many different studies in the field of text compression in recent years [5]-[9], the fact that MWCA stores dictionaries and data in different streams provides an important advantage for this study in which only dictionaries are indexed. The advantages of indexing only the word dictionaries created by MWCA instead of indexing the entire documents will be explained in the fourth section.…”

Section: Introduction (mentioning)

confidence: 99%

A method to improve full-text search performance of MongoDB

Mesut, Öztürk 2022

Pamukkale J Eng Sci


B-Tree based text indexes used in MongoDB are slow compared to other structures such as inverted indexes. This study shows that full-text search speed can be increased significantly by indexing a structure in which each distinct word in the text appears only once. The Multi-Stream Word-Based Compression Algorithm (MWCA), developed in our previous work, stores word dictionaries and data in different streams. While the documents were being added to a MongoDB collection, they were encoded with MWCA and separated into six different streams. Each stream was stored in a different field, and the three streams containing unique words were used when creating a text index. In this way, the index could be created in a shorter time and took up less space. The Snappy and Zlib block compression methods used by MongoDB also reached higher compression ratios on data encoded with MWCA. Search tests on text indexes created on collections using different compression options show that our method provides a 19- to 146-fold speed increase and 34% to 40% less memory usage. Tests on regex searches that do not use the text index also show that the MWCA model provides a 7- to 13-fold speed increase and 29% to 34% less memory usage.
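The frequency-based dictionary split described in the citation statement above (the 255 most frequent words in D1 with 1-byte codes, the next 65536 in D2 with 2-byte codes) can be sketched roughly as follows. This is an illustration only, not the MWCA implementation: the function names, the alphabetical tie-breaking, and the use of the dictionary position as the code value are assumptions, and the sketch does not cover how the real format lays out its streams or handles words outside both dictionaries.

```python
from collections import Counter
from typing import Dict, List, Tuple

D1_SIZE = 255    # most frequent words, 1-byte codes (per the citation statement)
D2_SIZE = 65536  # next most frequent words, 2-byte codes (per the citation statement)


def build_dictionaries(words: List[str]) -> Tuple[List[str], List[str]]:
    """Split the distinct words into D1 and D2 by descending frequency."""
    counts = Counter(words)
    # Ties are broken alphabetically (an assumption) to keep the result deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:D1_SIZE], ranked[D1_SIZE:D1_SIZE + D2_SIZE]


def code_table(d1: List[str], d2: List[str]) -> Dict[str, bytes]:
    """Map each dictionary word to a hypothetical code: its index within D1 or D2."""
    table = {word: bytes([i]) for i, word in enumerate(d1)}
    table.update({word: i.to_bytes(2, "big") for i, word in enumerate(d2)})
    return table


# 300 distinct placeholder words with strictly decreasing frequency.
sample = [f"w{i:03d}" for i in range(300) for _ in range(300 - i)]
d1, d2 = build_dictionaries(sample)
codes = code_table(d1, d2)
print(len(d1), len(d2), len(codes["w000"]), len(codes["w255"]))
```

Each distinct word appears exactly once across D1 and D2, which is the property the MongoDB study relies on when it indexes the dictionaries instead of the full documents.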
