Background: Deep phenotyping describes the use of formal and standardised terminologies to create comprehensive phenotypic descriptions of biomedical phenomena. While most often employed to describe patients, phenotype models may also be developed to characterise diseases. These characterisations facilitate secondary analysis, evidence synthesis, and practitioner awareness, thereby guiding patient care. The vast majority of this knowledge is derived from sources that describe an academic understanding of disease, including academic literature and experimental databases. Previous work has revealed a gulf between the priorities, perspectives, and perceptions held by healthcare researchers and providers and those held by the users of clinical services. A comparison between canonical disease descriptions and phenotype models developed from public discussions of disease offers the prospect of discovering new phenotypes, stratifying patient populations, and targeting mitigation at the symptoms most damaging to patients' quality of life.

Methods: Using a dataset representing disease and phenotype co-occurrence in social media text, we employ semantic techniques to identify phenotype associations for a set of common and rare diseases, constituting a phenotype model for those diseases that represents the public perspective. We create an integrated resource for biomedical database and literature-derived disease-phenotype associations by aligning data from several previous studies. We then explore differences between the disease-phenotype associations derived from writing in social media and those from the clinical literature and biomedical databases, with a focus on identification of differential themes and novel phenotypes.
We also perform an evaluation of associations for several diseases, with specialist clinicians reviewing associations for validity, feasibility, and involvement in clinical care.

Findings: We identified 35,782 significant disease-phenotype associations from social media across 311 diseases, of which 304 could be linked to a combined resource of associations derived from academic sources. Social media-derived disease profiles recapitulated those from academic sources (AUC = 0.874, 95% CI 0.858-0.891). We further identified 26,081 novel phenotype associations that were not contained in the academic sources, of which 15,084 were considered significant. Constitutional symptoms, those holistic manifestations of disease affecting quality of life, were strongly over-represented in the social media phenotype, contributing more associations especially to endocrine, digestive, and reproductive diseases. An expert clinical review found that social media-derived associations were considered similarly well established to those derived from the literature, and were seen significantly more often in clinical encounters with patients.

Interpretation: The phenotype model recovered from social media presents a significantly different perspective from existing resources derived from biomedical databases and literature, providing a large number of associations that are absent from those resources. We propose that the integration and interrogation of these public perspectives on disease can inform clinical awareness, improve secondary analysis, and bridge understanding across healthcare stakeholders.