Many modern applications are modeled as graphs. Graph coloring is the problem of assigning labels (usually called colors) to the vertices of a graph so that no two vertices connected by an edge share the same color. Graph coloring has essential applications in many fields, and many scalable algorithms have been proposed to solve it efficiently; recently, researchers have started experimenting with coloring on many-core GPU devices as well. In our work, we selected, analyzed, implemented, and compared state-of-the-art algorithms suited for multi-core CPU and many-core GPU architectures. Our analysis allowed us to identify the advantages and disadvantages of each algorithm and enabled us to implement new strategies for those algorithms on CPU and GPU devices. We propose a new technique based on "value permutation" and "index shifting" that, once applied to the Jones-Plassmann-Luby algorithm, can reduce both the runtime and the number of colors. We compare our code on standard graph benchmarks with the two most widely used state-of-the-art applications, cuSparse's csrColor and Gunrock's implementations, and with one innovative approach named Atos. We present extensive results in terms of computation time and solution quality. We show that our fastest implementation achieves high average speedups on mesh-like graphs, with a geometric mean (harmonic mean) of 3.16x (3.05x) against Gunrock, 4.09x (3.06x) against cuSparse, and 4.45x (2.21x) against Atos. Nonetheless, it proves to be significantly less effective on scale-free graphs, where it wins consistently only against Gunrock: the geometric mean (harmonic mean) speedups are 2.76x (2.71x) against Gunrock, 0.13x (0.11x) against cuSparse, and 0.03x (0.01x) against Atos. Moreover, it produces 47% fewer colors than cuSparse, 7% fewer colors than Gunrock, and 63% more colors than Atos.
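
For readers unfamiliar with the baseline, the following is a minimal sequential sketch in the general spirit of Jones-Plassmann-Luby coloring. It is not our implementation and does not include the "value permutation" or "index shifting" technique. The function names `jpl_coloring` and `is_valid_coloring`, the adjacency-dictionary input, and the choice of the smallest free color per vertex are assumptions made for illustration only. In each round, every uncolored vertex whose random priority exceeds that of all its uncolored neighbors is colored; such vertices form an independent set, so on a parallel device they could be processed concurrently.

```python
import random


def jpl_coloring(adj, seed=0):
    """Color a graph given as {vertex: set_of_neighbors}; return {vertex: color}.

    Assumes a simple undirected graph (symmetric adjacency, no self-loops).
    Illustrative sketch only, simulating the parallel rounds sequentially.
    """
    rng = random.Random(seed)
    # One random priority per vertex; the vertex id breaks (unlikely) ties.
    priority = {v: (rng.random(), v) for v in adj}
    colors = {}
    uncolored = set(adj)
    while uncolored:
        # Vertices that beat every still-uncolored neighbor are pairwise
        # non-adjacent, so they can be colored independently in this round.
        winners = [
            v for v in uncolored
            if all(priority[v] > priority[u] for u in adj[v] if u in uncolored)
        ]
        for v in winners:
            used = {colors[u] for u in adj[v] if u in colors}
            c = 0
            while c in used:  # smallest color not taken by a colored neighbor
                c += 1
            colors[v] = c
        uncolored.difference_update(winners)
    return colors


def is_valid_coloring(adj, colors):
    """Return True if no edge connects two vertices of the same color."""
    return all(colors[u] != colors[v] for u in adj for v in adj[u])


if __name__ == "__main__":
    # A 4-cycle: 0-1-2-3-0.
    cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
    result = jpl_coloring(cycle)
    print(result, is_valid_coloring(cycle, result))
```

Because each vertex takes the smallest color unused by its already colored neighbors, this sketch never uses more colors than the maximum degree plus one; the number of rounds, and hence the parallel runtime, depends on how the random priorities are distributed.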