Automatic Speaker Analysis has largely focused on single aspects of a speaker, such as her ID, gender, emotion, personality, or health state. This broadly ignores the interdependency of the many different states and traits that all impact the one single voice production mechanism available to a human speaker. In other words, we may sometimes sound depressed when we simply have the flu and can hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this gap gave rise to an increasingly holistic speaker analysis that assesses the 'larger picture' in one pass, such as by multi-target learning. For a robust assessment, however, this requires large amounts of speech and language resources labelled in rich ways to learn such interdependencies, as well as architectures able to cope with multi-target learning on massive amounts of speech data. In this light, this contribution discusses efficient mechanisms such as large-scale social-media pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task labelling of these data in a wider range of attributes to reach 'big & rich' speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks in efficient ways. The ultimate goal is to enable machines to hear the 'entire' person, her condition, and her whereabouts behind the voice and words, rather than aiming at a single aspect blind to the overall individual and her state, thus leading to the next level of Automatic Speaker Analysis.
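
To illustrate the multi-target end-to-end idea mentioned above, the following is a minimal sketch: one shared encoder operating directly on the raw waveform feeds several task-specific heads, trained jointly with a summed loss so that a single rich representation serves all target attributes. The task names, class counts, and layer sizes (emotion, gender, health, a small 1-D CNN) are illustrative assumptions, not the architecture proposed in this contribution.

```python
import torch
import torch.nn as nn

class MultiTargetSpeakerNet(nn.Module):
    """Shared waveform encoder with one classification head per target attribute."""
    def __init__(self, tasks={"emotion": 4, "gender": 2, "health": 2}):
        super().__init__()
        # Shared end-to-end encoder: 1-D convolutions applied directly to the waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # utterance-level pooling
            nn.Flatten(),
        )
        # One lightweight head per target (task names/classes are assumed for illustration).
        self.heads = nn.ModuleDict(
            {task: nn.Linear(64, n_classes) for task, n_classes in tasks.items()}
        )

    def forward(self, waveform):                 # waveform: (batch, 1, samples)
        shared = self.encoder(waveform)          # (batch, 64) shared representation
        return {task: head(shared) for task, head in self.heads.items()}

# Joint training step: losses over all targets are summed, so the shared
# representation is shaped by every attribute at once.
model = MultiTargetSpeakerNet()
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

waveforms = torch.randn(8, 1, 16000)             # 8 dummy one-second clips at 16 kHz
labels = {"emotion": torch.randint(0, 4, (8,)),
          "gender": torch.randint(0, 2, (8,)),
          "health": torch.randint(0, 2, (8,))}

outputs = model(waveforms)
loss = sum(criterion(outputs[t], labels[t]) for t in outputs)
loss.backward()
optimiser.step()
```

In such a setup, cross-task labelling enlarges the set of attributes annotated per utterance, and missing labels for a given task can simply be excluded from that task's loss term, which is one common way to train on 'big & rich' but only partially labelled data.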