Background: Phishing URLs are a critical security threat to internet users. Attackers use them to carry out cyberattacks such as phishing, scams, and drive-by downloads, which cause losses to businesses and their users. Recently, Machine Learning Phishing URL classification (MLPU) systems have gained tremendous popularity for detecting phishing URLs proactively. However, the security vulnerabilities of MLPU systems remain mostly unknown.
Aim: To address this concern, we conducted a study to understand the test-time security vulnerabilities of state-of-the-art MLPU systems and to provide guidelines for the future development of these systems.
Method: We propose an evasion attack framework against MLPU systems. We first develop an algorithm to generate adversarial phishing URLs. We then reproduce 41 MLPU systems and record their baseline performance. Finally, we simulate an evasion attack to evaluate these MLPU systems against our generated adversarial URLs.
Results: Compared with previous work, our attack is (i) effective, as it evades all the models with an average success rate of 66% for famous phishing targets (such as Netflix and Google) and 85% for less popular ones (e.g., Wish, JBHIFI, Officeworks); and (ii) realistic, as it requires only 23 ms to produce a new adversarial URL variant that is available for registration at a median cost of only $11.99/year. Popular online services such as Google SafeBrowsing and VirusTotal are also unable to detect these URLs. In addition, (iii) we find that adversarial training (a defence considered successful against evasion attacks) does not significantly improve the robustness of these systems, as it decreases the success rate of our attack by only 6% on average across all the models; and (iv) we identify the security vulnerabilities of the considered MLPU systems. Our findings can lead to promising directions for future research.
Conclusion: Our study not only illustrates vulnerabilities in MLPU systems but also highlights implications for future research on assessing and improving these systems.