Cellular Genetic Algorithms (cGAs) exhibit a natural parallelism that makes them interesting candidates for hardware implementation, as several processing elements can operate simultaneously on subpopulations shared among them. This paper presents a scalable architecture for a cGA, suitable for FPGA implementation. A regular array of custom designed processing elements (PEs) works on a population of solutions that is spread into dual-port memory blocks locally shared by adjacent PEs. A travelling salesman problem with 150 cities was used to verify the implementation of the proposed cGA on a Virtex-6 FPGA, using a population of 128 solutions with different levels of parallelism (1, 4, 16 and 64 PEs). Results have shown that an increase of the number of PEs does not degrade the quality of the convergence of the iterative process, and that the throughput increases almost linearly with the number of PEs. Comparing with a software implementation running in a PC, the cGA with 64 PEs has shown a 45x speedup.