Background: Large Language Models (LLMs) like GPT-3.5 and GPT-4 are increasingly entering the healthcare domain as a proposed means to assist with administrative tasks. To ensure safe and effective use with billing coding tasks, it is crucial to assess these models' ability to generate the correct International Classification of Diseases (ICD) codes from text descriptions. Objectives: We aimed to evaluate GPT-3.5 and GPT-4's capability to generate correct ICD billing codes, using the ICD-9-CM (2014) and ICD-10-CM and PCS (2023) systems. Methods: We randomly selected 100 unique codes from each of the most recent versions of the ICD-9-CM, ICD-10-CM, and ICD-10-PCS billing code sets published by the Centers for Medicare and Medicaid Services. Using the ChatGPT interface (GPT-3.5 and GPT-4), we prompted for the ICD codes that corresponding to each provided code description. Outputs were compared with the actual billing codes across several performance measures. Errors were qualitatively and quantitatively assessed for any underlying patterns. Results: GPT-4 and GPT-3.5 demonstrated varied performance across each ICD system. In ICD-9-CM, GPT-4 and GPT-3.5 achieved an exact match rate of 22% and 10%, respectively. 13% (GPT-4) and 10% (GPT-3.5) of generated ICD-10-CM codes were exact matches. Notably, both models struggled considerably with the procedurally focused ICD-10-PCS, with neither GPT-4 or GPT-3.5 producing any exactly matched codes. A substantial number of incorrect codes had semantic similarity with the actual codes for ICD-9-CM (GPT-4: 60.3%, GPT-3.5: 51.1%) and ICD-10-CM (GPT-4: 70.1%, GPT-3.5: 61.1%), in contrast to ICD-10-PCS (GPT-4: 30.0%, GPT-3.5: 16.0%). Conclusion: Our evaluation of GPT-3.5 and GPT-4's proficiency in generating ICD billing codes from ICD-9-CM, ICD-10-CM and ICD-10-PCS code descriptions reveals an inadequate level of performance. While the models appear to exhibit a general conceptual understanding of the codes and their descriptions, they have a propensity for hallucinating key details, suggesting underlying technological limitations of the base LLMs. This suggests a need for more rigorous LLM augmentation strategies and validation prior to their implementation in healthcare contexts, particularly in tasks such as ICD coding which require significant digit-level precision.