Nowadays, especially with the upswing of neural networks, speech synthesis is almost totally data driven. The goal of this thesis is to provide methods for automatic and unsupervised learning from data for expressive speech synthesis. In comparison to "ordinary" synthesis systems, it is more difficult to find reliable expressive training data, despite huge availability on sources like Internet. The main difficulty consists in the highly speaker-and situation-dependent nature of expressiveness, causing many and acoustically substantial variations. The consequences are, first, it is very difficult to define labels which reliably identify expressive speech with all nuances. The typical definition of 6 basic emotions, or alike, is a simplification which will have inexcusable consequences dealing with data outside the lab. Second, even if a label set is defined, apart of the enormous manual effort, it is difficult to gain sufficient training data for the models respecting all the nuances and variations.