Applying convolutional neural networks (CNNs) to high-resolution images produces very large intermediate feature maps, which dominate the memory traffic. Classical layer-by-layer processing requires storing each complete feature map before moving on to the next layer. Since feature maps of this size realistically fit only in off-chip memory, this execution order incurs high off-chip bandwidth, which comes at great energy cost. The DepFiN processor chip, presented in this paper, overcomes this cost by running CNNs in a deep layer-fusion mode, dubbed depth-first execution, enabled by a control flow that supports frequent switching between layers. To also tackle the computational cost, the computationally efficient depthwise+pointwise layer pairs are explicitly supported in DepFiN by a novel accelerator core that can dynamically change its configuration to manage the low computational intensity of the depthwise layers. Benchmarking measurements show the 12nm DepFiN chip reaching up to 20 TOPS/W peak, 8.2 TOPS/W on the MC-CNN-fast stereo-matching network excluding I/O power (at 8-bit, 0.6 V Vdd), and, crucially, 3.95 TOPS/W with I/O power included on the same network, with an up to 18× improvement realized by depth-first execution (MC-CNN-fast at 8-bit, 0.65 V Vdd).
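The memory advantage of depth-first execution over layer-by-layer processing can be sketched with a toy footprint model. All layer shapes, channel counts, and the line-buffer assumption below are illustrative only, not DepFiN's actual buffer sizing:

```python
# Toy model: peak intermediate storage (in elements) for a stack of
# 3x3 convolution layers, comparing classical layer-by-layer execution
# with depth-first (line-buffered) execution. Shapes are hypothetical.

def layer_by_layer_peak(h, w, channels):
    """Layer-by-layer: the largest full feature map must be held
    in its entirety between two consecutive layers."""
    return max(h * w * c for c in channels)

def depth_first_peak(w, channels, k=3):
    """Depth-first: each fused layer keeps only the k-1 input rows
    needed to produce its next output row (a simple line buffer)."""
    return sum((k - 1) * w * c for c in channels)

if __name__ == "__main__":
    h, w = 1080, 1920              # high-resolution input, e.g. full HD
    channels = [32, 64, 64, 32]    # hypothetical per-layer channel counts
    full = layer_by_layer_peak(h, w, channels)
    fused = depth_first_peak(w, channels)
    print(f"layer-by-layer peak: {full / 1e6:.1f} M elements")
    print(f"depth-first peak:    {fused / 1e6:.2f} M elements")
    print(f"reduction:           {full / fused:.0f}x")
```

Under these toy assumptions the line-buffered footprint is orders of magnitude smaller than the largest full feature map, which is what allows the fused intermediates to stay on-chip and avoids the costly off-chip round trips.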