Code smells refer to any symptom in the source code of a program that possibly indicates a deeper problem, hindering software maintenance and evolution. Detection of code smells is challenging for developers and their informal definition leads to the implementation of multiple detection techniques and tools. This paper evaluates and compares four code smell detection tools, namely inFusion, JDeodorant, PMD, and JSpIRIT. These tools were applied to different versions of the same software systems, namely MobileMedia and Health Watcher, to calculate the accuracy and agreement of code smell detection tools. We calculated the accuracy of each tool in the detection of three code smells: God Class, God Method, and Feature Envy. Agreement was calculated among tools and between pairs of tools. One of our main findings is that the evaluated tools present different levels of accuracy in different contexts. For MobileMedia, for instance, the average recall varies from 0 to 58% and the average precision from 0 to 100%, while for Health Watcher the variations are 0 to 100% and 0 to 85%, respectively. Regarding the agreement, we found that the overall agreement between tools varies from 83 to 98% among all tools and from 67 to 100% between pairs of tools. We also conducted a secondary study of the evolution of code smells in both target systems and found that, in general, code smells are present from the moment of creation of a class or method in 74.4% of the cases of MobileMedia and 87.5% of Health Watcher.