Abstract-Keyword retrieval requires both speed and compactness. A trie is one of the data structures used to retrieve keywords, and the double array is one method of implementing a trie. Retrieval with the double array is fast, and its data structure is compact. In previous research on the double array, an edge of the trie represents a single character; no research has discussed representing an edge by an n-gram. Therefore, this paper proposes a data structure and a retrieval algorithm for a double array in which each edge represents n bytes. This paper also proposes a method to compress the CODE array. In experiments comparing the proposed method with the original double array on single-byte and multi-byte character sets, the size of the proposed method was 62-64% of the original, and its retrieval speed was 1.18-1.3 times that of the original. When the CODE array was compressed, the size of the proposed method was 41-59% of the original.

Index Terms-Compression, double array, n-gram, trie.
Manuscript received September 23, 2013; revised November 25, 2013. This work was supported by JSPS KAKENHI Grant Number 24500118.

The authors are with the Department of Information Science and Intelligent, University of Tokushima, Tokushima, Japan (e-mail: fuketa@is.tokushima-u.ac.jp, kam@is.tokushima-u.ac.jp, aoe@is.tokushima-u.ac.jp).

I. INTRODUCTION
In ubiquitous environments such as smartphones and PDAs, storage capacity is often limited. Keyword retrieval, which is used in many applications, therefore requires both speed and compactness. A trie is one of the data structures used to retrieve keywords. In a trie, the common prefixes of the stored keys are merged, and each edge is labeled with a character of a key. Because a trie supports common-prefix search and predictive search, it is used in information retrieval systems [1], natural language processing [2], IP address routing tables [3], and packet filtering [4]. Moreover, the trie is often used as an associative array [5], like a map class in C++, to improve on associative arrays implemented with hash tables.

A double array is one method of implementing retrieval with a trie. It uses two arrays called BASE and CHECK, and it is both fast and compact [6], [7]. In previous research on the double array, an edge of the trie represents a single character. For compressing the double array, there are methods that divide the trie [8], [9] and a method that removes the BASE array [10], but no research has discussed representing an edge by an n-gram. Therefore, this paper proposes a double-array method in which each edge represents n bytes. This paper also proposes a method that uses the double array to compress the CODE array, because the CODE array grows as n increases.

Section II describes the trie and the double array. Section III describes the proposed data structures and retrieval algorithms. Experimental evaluations are given in Section IV. Finally, Section V concludes the paper and describes future work.
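As an illustration of the two trie queries mentioned above, the following is a minimal sketch in Python. It uses nested dictionaries rather than a double array, and it is not the method proposed in this paper. The function names (`build_trie`, `common_prefix_search`, `predictive_search`) and the end-of-key marker `END` are hypothetical, and the sketch assumes the marker character never occurs in a key.

```python
from typing import Dict, List

END = "$"  # hypothetical end-of-key marker; assumes "$" never appears in a key


def build_trie(keys: List[str]) -> Dict:
    """Merge common prefixes: each nested dict is a node, each dict key an edge label."""
    root: Dict = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})  # reuse the edge if the prefix is shared
        node[END] = {}  # mark that a stored key ends at this node
    return root


def common_prefix_search(root: Dict, text: str) -> List[str]:
    """Return every stored key that is a prefix of `text`."""
    found, node = [], root
    for i, ch in enumerate(text):
        if ch not in node:
            break
        node = node[ch]
        if END in node:
            found.append(text[: i + 1])
    return found


def predictive_search(root: Dict, prefix: str) -> List[str]:
    """Return every stored key that starts with `prefix`, in sorted order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results: List[str] = []
    stack = [(node, prefix)]
    while stack:  # depth-first walk of the subtree below `prefix`
        current, word = stack.pop()
        for label, child in current.items():
            if label == END:
                results.append(word)
            else:
                stack.append((child, word + label))
    return sorted(results)


trie = build_trie(["bad", "bag", "bagel", "be"])
print(common_prefix_search(trie, "bagels"))
print(predictive_search(trie, "ba"))
```

In this sketch the keys "bad", "bag", and "bagel" share the edges for "b" and "a", which is the prefix merging described above. A dictionary per node is convenient for illustration but costs far more memory than an array-based representation.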
II. DOUBLE ARRAY
A trie is a tree structure to store some keys. T...