In this paper, we propose to estimate tree defoliation from ground-level RGB photos with convolutional neural networks (CNN). Tree defoliation is usually assessed in field campaigns, where experts estimate multiple tree health indicators per sample site. Campaigns span entire countries to build a holistic, nation-wide picture of forest health. These surveys are laborious, expensive, time-consuming and require a large number of experts. We aim to make the monitoring process more efficient by casting tree defoliation estimation as an image interpretation problem. What makes this task challenging is the strong variation in lighting, viewpoint, scale, tree species, and defoliation types. Instead of accounting for each factor separately through explicit modelling, we learn a joint distribution directly from a large set of annotated training images, following the end-to-end learning paradigm of deep learning. The proposed workflow is as follows: (i) human experts visit individual trees in forests distributed all over Switzerland, (ii) acquire one photo per tree with an off-the-shelf, hand-held RGB camera and (iii) assign a defoliation value. The CNN is (iv) trained on a subset of the images with expert defoliation assessments and (v) tested on a hold-out set to compare predicted values against the ground truth. We evaluate our supervised method on three data sets of different levels of difficulty acquired in Swiss forests and achieve an average mean absolute error (avgMAE) of 7.6% for the joint data set after cross-validation. Comparison to a group of human experts on one of the data sets shows that our CNN approach performs only 0.9 percentage points worse. We show that tree defoliation estimation from ground-level RGB images with a CNN works well and achieves performance close to that of human experts.
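To make the supervised setup concrete, the following is a minimal sketch of such a CNN regression model and its evaluation. The specific backbone (an ImageNet-pretrained ResNet-18 from torchvision), the L1 training loss, and the optimizer settings are illustrative assumptions not specified in the abstract; the sketch only mirrors the general recipe of predicting a scalar defoliation percentage per photo and scoring predictions with a mean absolute error against expert labels.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed backbone: pretrained ResNet-18 with a single-output regression
# head predicting defoliation in percent (0-100). The actual architecture
# used in the paper may differ.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)

criterion = nn.L1Loss()  # L1 loss corresponds to the MAE-style evaluation metric
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed hyperparameters

def train_step(images: torch.Tensor, defoliation: torch.Tensor) -> float:
    """One optimisation step on a batch of tree photos and expert labels (in %)."""
    optimizer.zero_grad()
    preds = model(images).squeeze(1)
    loss = criterion(preds, defoliation)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def mean_absolute_error(images: torch.Tensor, defoliation: torch.Tensor) -> float:
    """MAE between predicted and expert defoliation on a hold-out batch."""
    preds = model(images).squeeze(1)
    return (preds - defoliation).abs().mean().item()
```

In this sketch, cross-validation as described in the abstract would amount to repeating training and hold-out evaluation over several splits of the annotated photos and averaging the resulting MAE values.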