Objectives
Phenotyping is a core task in observational health research using electronic health records (EHRs). Developing an accurate phenotyping algorithm typically demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential of large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating drafts of high-quality algorithms.

Materials and Methods
We prompted four LLMs—ChatGPT-4, ChatGPT-3.5, Claude 2, and Bard—in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model for three clinical phenotypes (type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms from each LLM and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network.

Results
ChatGPT-4 and ChatGPT-3.5 received significantly higher overall expert evaluation scores for instruction following, algorithmic logic, and SQL executability than Claude 2 and Bard. Although ChatGPT-4 and ChatGPT-3.5 effectively identified relevant clinical concepts, they showed limited capability in organizing phenotyping criteria with appropriate logic, yielding algorithms that were either excessively restrictive (low recall) or overly broad (low positive predictive value).

Conclusion
Both ChatGPT-3.5 and ChatGPT-4 demonstrate the capability to enhance EHR phenotyping efficiency by drafting algorithms of reasonable quality. However, achieving optimal performance with these algorithms still requires refinement by domain experts.
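
To illustrate the kind of output requested in Materials and Methods, the sketch below shows a minimal SQL query of the sort the LLMs were asked to produce, assuming an OMOP-style common data model with standard vocabulary tables (condition_occurrence, drug_exposure, concept_ancestor). The concept IDs and the simple diagnosis-plus-drug logic are assumptions for illustration only; they are not the prompts, the LLM-generated algorithms, or the eMERGE definitions evaluated in the study.

    -- Illustrative sketch only: flag candidate type 2 diabetes mellitus cases
    -- by requiring both a T2DM diagnosis and exposure to an antidiabetic drug.
    -- A validated phenotype would use curated concept sets plus laboratory,
    -- temporal, and exclusion criteria.
    SELECT DISTINCT co.person_id
    FROM condition_occurrence co
    JOIN drug_exposure de
      ON de.person_id = co.person_id
    JOIN concept_ancestor ca
      ON ca.descendant_concept_id = de.drug_concept_id
    WHERE co.condition_concept_id = 201826    -- assumed standard concept: type 2 diabetes mellitus
      AND ca.ancestor_concept_id = 1503297;   -- assumed ingredient concept: metformin

Such a query captures the two components the expert reviewers assessed: selection of relevant clinical concepts and the logic combining them into inclusion criteria.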