With the advent of cyberspace and the development of communication tools such as social networks, the digital society is continually enriched by new content, especially human images, which represent more than 50% of the information available on the web. Hence, an effective instrument for the automatic classification of this gigantic image base has become essential. This work presents a novel approach, called clustering of Human Gesture Images using a 3D cellular automaton (3D-CA), which consists of four steps: image vectorisation using a new image representation technique called n-gram pixels; weighting with a normalised term frequency to calculate the importance of each term in the image; clustering based on the principle of 3D-CA, using a set of properties (a transition function and the 3D Moore neighbourhood); and experimentation on the MuHAVi dataset with a variety of validation measures. The performance of our approach was compared with that of conventional methods in terms of representation (naive representation) and clustering strategy (k-means).
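To make the building blocks named above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes that an n-gram pixel term is a run of n consecutive values in a flattened (row-major) image, and that the normalised term frequency is a term's count divided by the total number of terms; the abstract does not specify either. The function names `pixel_ngrams`, `normalised_tf`, and `moore_neighbours_3d` are hypothetical. The 3D Moore neighbourhood is the standard one: the up to 26 cells that differ by at most 1 in each coordinate.

```python
from collections import Counter
from itertools import product


def pixel_ngrams(pixels, n):
    """Return every run of n consecutive pixel values.

    Assumption: the image is already flattened in row-major order and a
    "term" is a tuple of n adjacent values in that sequence.
    """
    return [tuple(pixels[i:i + n]) for i in range(len(pixels) - n + 1)]


def normalised_tf(terms):
    """Map each term to count / total number of terms.

    Assumption: this is one common normalisation; the paper may use another.
    """
    total = len(terms)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(terms).items()}


def moore_neighbours_3d(cell, shape):
    """Return the in-bounds 3D Moore neighbours of cell in a grid of given shape.

    An interior cell has 26 neighbours; cells on the border have fewer
    because out-of-range coordinates are dropped (no wrap-around).
    """
    x, y, z = cell
    neighbours = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue  # skip the cell itself
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < shape[0] and 0 <= ny < shape[1] and 0 <= nz < shape[2]:
            neighbours.append((nx, ny, nz))
    return neighbours


if __name__ == "__main__":
    image = [0, 0, 255, 0, 0, 255]  # toy flattened grey-level image
    print(normalised_tf(pixel_ngrams(image, 2)))
    print(len(moore_neighbours_3d((1, 1, 1), (3, 3, 3))))
```

In a cellular-automaton clustering scheme, a transition function would typically update each cell from the states of the cells returned by `moore_neighbours_3d`; the rule itself is not described in the abstract, so it is left out here.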