Objectives
We aimed to examine agreement between common mental disorders (CMDs) recorded in primary care and repeated CMD questionnaire data from ALSPAC (the Avon Longitudinal Study of Parents and Children) over adolescence and young adulthood, to explore factors affecting CMD identification in primary care records, and to construct models predicting ALSPAC-derived CMDs using only primary care data.

Design and setting
Prospective cohort study (ALSPAC) in Southwest England with linkage to electronic primary care records.

Participants
Primary care records were extracted for 11 807 participants (80% of 14 731 eligible). Between 31% (3633; age 15/16) and 11% (1298; age 21/22) of participants had both primary care and ALSPAC CMD data.

Outcome measures
ALSPAC outcome measures were diagnoses of suspected depression and/or CMDs. Primary care outcome measures were Read codes for diagnosis, symptoms and treatment of depression/CMDs. For each time point, sensitivities and specificities of primary care CMD diagnoses for predicting ALSPAC-derived measures of CMDs were calculated, and the factors associated with identification of primary care-based CMDs among those with suspected ALSPAC-derived CMDs were explored. Lasso (least absolute shrinkage and selection operator) models were used at each time point to predict ALSPAC-derived CMDs using only primary care data, with internal validation by randomly splitting the data into 60% training and 40% validation samples.

Results
Sensitivities for primary care diagnoses were low for CMDs (range: 3.5%–19.1%) and depression (range: 1.6%–34.0%), while specificities were high (nearly all >95%). The strongest predictors of identification in the primary care data for those with ALSPAC-derived CMDs were symptom severity indices. The lasso models had relatively low prediction rates, especially in the validation sample (deviance ratio range: −1.3% to 12.6%), but improved with age.

Conclusions
Primary care data underestimate CMDs compared to population-based studies. Improving general practitioner identification and using free-text or secondary care data are needed to improve the accuracy of models using clinical data.
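
As an illustration of the two agreement metrics reported above, the following minimal Python sketch computes sensitivity and specificity of a primary care flag against a reference flag, and a deviance ratio for predicted probabilities. It is not the study's code. The function names and the toy data are hypothetical, and the deviance ratio is assumed to be defined as one minus the model deviance divided by the null (intercept-only) deviance, a common convention for penalised logistic models that the abstract itself does not spell out. Under that definition a negative value, as seen at the lower end of the reported validation range, means the model fits worse than predicting the overall prevalence for everyone.

```python
import math


def sensitivity_specificity(reference, test):
    """Sensitivity and specificity of `test` flags against `reference` flags.

    Both arguments are equal-length sequences of 0/1 values, one per participant.
    """
    pairs = list(zip(reference, test))
    tp = sum(1 for r, t in pairs if r == 1 and t == 1)  # true positives
    fn = sum(1 for r, t in pairs if r == 1 and t == 0)  # false negatives
    tn = sum(1 for r, t in pairs if r == 0 and t == 0)  # true negatives
    fp = sum(1 for r, t in pairs if r == 0 and t == 1)  # false positives
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def binomial_deviance(outcomes, probs, eps=1e-12):
    """Binomial deviance (-2 * log-likelihood) for 0/1 outcomes."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2 * total


def deviance_ratio(outcomes, probs):
    """Assumed definition: 1 - D_model / D_null, with an intercept-only null."""
    prevalence = sum(outcomes) / len(outcomes)
    d_null = binomial_deviance(outcomes, [prevalence] * len(outcomes))
    if d_null == 0:
        raise ValueError("null deviance is zero; outcomes are all identical")
    return 1 - binomial_deviance(outcomes, probs) / d_null


# Hypothetical toy data: 1 = CMD flagged, 0 = not flagged.
questionnaire_cmd = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # reference (cohort-derived)
primary_care_cmd = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # primary care record flag

sens, spec = sensitivity_specificity(questionnaire_cmd, primary_care_cmd)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

In the toy data only one of the four reference cases is flagged in primary care, which mirrors the pattern of low sensitivity alongside high specificity described in the Results.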