2017 Data Compression Conference (DCC)
DOI: 10.1109/dcc.2017.58

Making Compression Algorithms for Unicode Text

Abstract: The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes. While this approach works well for the single-byte ASCII encoding, it works poorly for UTF-8, where characters often span multiple bytes. Previous research has focused on developing Unicode compressors from scratch, which often failed to outperform established algorithms such a…
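
The mismatch between bytes and characters described in the abstract can be made concrete with a small Python sketch. It is an illustration only, not code from the paper: the helper name `token_counts` and the sample strings are assumptions chosen for this example. It compares how many tokens a compressor would see when reading the same text as Unicode code points versus as UTF-8 bytes.

```python
def token_counts(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical sample strings: ASCII, Cyrillic (2 bytes per character in UTF-8),
# and Japanese kanji (3 bytes per character in UTF-8).
samples = {
    "English": "compression",
    "Russian": "сжатие",
    "Japanese": "圧縮",
}

for language, text in samples.items():
    code_points, utf8_bytes = token_counts(text)
    print(f"{language}: {code_points} code points, {utf8_bytes} UTF-8 bytes")
```

For ASCII text the two counts coincide, so a byte-oriented model effectively sees characters. For the other samples each character is split across several bytes, so a byte-oriented model has to learn multi-byte patterns before it can model the characters themselves.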

Cited by 7 publications (4 citation statements)
References 6 publications (12 reference statements)
“…In this method of compression, the initial part is to read data from the input stream present in the form of UTF-8 characters. The only known work which reads data in the form of UTF-8 characters belongs to Gleave et al (2017). In their work, they had investigated the effectiveness of different token distributions while being used as a base distribution for LZW.…”
Section: Methods (mentioning)
confidence: 99%
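
The idea in the citation statement above, reading the input as UTF-8 characters and feeding them to LZW, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Gleave et al.: the function names `lzw_encode` and `lzw_decode` are hypothetical, and the initial dictionary is simply seeded with the distinct characters of the input (which a real compressor would have to transmit or replace with a base distribution over tokens, as the statement describes).

```python
def lzw_encode(text: str) -> tuple[list[str], list[int]]:
    """LZW over Unicode code points; returns (alphabet, list of dictionary codes)."""
    # Assumption: the dictionary starts with the distinct characters of the input.
    alphabet = sorted(set(text))
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes: list[int] = []
    current = ""
    for ch in text:  # iterating a str yields code points, not bytes
        candidate = current + ch
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new phrase
            current = ch
    if current:
        codes.append(table[current])
    return alphabet, codes


def lzw_decode(alphabet: list[str], codes: list[int]) -> str:
    """Invert lzw_encode, rebuilding the same dictionary on the fly."""
    if not codes:
        return ""
    table = list(alphabet)
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the phrase the encoder had only just added.
            entry = previous + previous[0]
        out.append(entry)
        table.append(previous + entry[0])
        previous = entry
    return "".join(out)


alphabet, codes = lzw_encode("абракадабра абракадабра")
assert lzw_decode(alphabet, codes) == "абракадабра абракадабра"
```

Because the dictionary phrases are sequences of whole characters, no dictionary entry can end in the middle of a multi-byte UTF-8 sequence, which is the property a byte-oriented LZW does not guarantee.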
“…Barua et al (2017) projected an enhanced LZW compression technique for Bangla dialect considering the unique features of that language. Gleave et al (2017) represented modified techniques with escaping on LZW and PPM (Prediction with Partial Matching). An abridged bit representation in the dictionary is an indicative for each Unicode character.…”
Section: Introduction (mentioning)
confidence: 99%
“…It promises ratio improvements of around 16% over state of the art compression tools. Text compression beyond ASCII, applicable to the human-readable log messages, has been explored by modifications to existing byte-level compressors such as bzip2, with significant effectiveness improvements reported [10], and semantic compression for text has been investigated as well [11].…”
Section: Related Work (mentioning)
confidence: 99%
“…Linkon et al projected a changed LZW dictionary based index compression technique for Bangle dialect in [5]. Gleave et al represent a new modified technique of byte-oriented compressors to work straight on Unicode characters [6]. In [8], the system is configured to maintain a set of character tables and a cluster table in memory.…”
Section: Related Work (mentioning)
confidence: 99%