The ongoing evolution of hardware leads to a steady increase in the amount of data that is processed, transmitted, and stored. Data compression is an essential tool for keeping this amount of data manageable. Furthermore, techniques from data compression have many applications beyond compression itself, for instance data clustering, classification, and time series prediction.

In terms of empirical performance, statistical data compression algorithms rank among the best. A statistical data compressor processes an input text letter by letter and performs compression in two stages: modeling and coding. During modeling, a model estimates a probability distribution on the next letter based on the past input. During coding, an encoder translates this probability distribution and the next letter into a codeword. Decoding reverses this process. Note that the model is exchangeable, and its actual choice determines a statistical data compression algorithm. All major models use a mixer to combine multiple simple probability estimators, so-called elementary models.

In statistical data compression there is an increasing gap between theory and practice. On the one hand, the "theoretician's approach" emphasizes models that allow for a mathematical code length analysis to evaluate their performance, but it neglects running time and space considerations as well as empirical improvements. On the other hand, the "practitioner's approach" focuses on the reverse. The PAQ family of statistical compressors has demonstrated the superiority of the practitioner's approach in terms of empirical compression rates.

With this thesis we attempt to bridge the aforementioned gap between theory and practice, with special focus on PAQ. To achieve this, we apply the theoretician's tools to the practitioner's approaches: we provide a code length analysis for several common and practical modeling and mixing techniques. The analysis covers modeling by relative frequencies with frequency discount and modeling by exponential smoothing of probabilities. For mixing, we consider linear and geometrically weighted averaging of probabilities with Online Gradient Descent for weight estimation (illustrative sketches of these techniques follow below). Our results show that the models and mixers we consider perform nearly as well as idealized competitors that may adapt to the input. Experiments support our analysis. Moreover, our results add a theoretical justification to modeling and mixing as done in PAQ, and they generalize methods from PAQ.
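To make the two kinds of elementary models named above concrete, the following Python sketch gives minimal versions of both: relative frequencies with frequency discount, and exponential smoothing of probabilities. This is an illustrative sketch only; the restriction to a binary alphabet, the class names, and the parameters alpha, discount, and threshold are assumptions, not the thesis's exact formulations.

```python
class DiscountedFrequencyModel:
    """Elementary model: relative frequencies with frequency discount
    (illustrative sketch, binary alphabet assumed)."""

    def __init__(self, discount=0.5, threshold=64.0):
        self.counts = [1.0, 1.0]     # Laplace-style initial counts
        self.discount = discount     # factor applied when discounting
        self.threshold = threshold   # total count that triggers a discount

    def predict(self):
        # Relative frequency of the bit 1 so far.
        return self.counts[1] / (self.counts[0] + self.counts[1])

    def update(self, bit):
        self.counts[bit] += 1.0
        # Shrink counts once they grow large, so the estimate
        # can track changing statistics in the input.
        if self.counts[0] + self.counts[1] > self.threshold:
            self.counts = [c * self.discount for c in self.counts]


class ExponentialSmoothingModel:
    """Elementary model: exponential smoothing of probabilities
    (illustrative sketch, binary alphabet assumed)."""

    def __init__(self, alpha=0.02, p_init=0.5):
        self.alpha = alpha   # smoothing rate in (0, 1), an assumption
        self.p = p_init      # current estimate of P(next bit = 1)

    def predict(self):
        return self.p

    def update(self, bit):
        # Blend the old estimate with the new observation:
        # p <- (1 - alpha) * p + alpha * bit
        self.p = (1.0 - self.alpha) * self.p + self.alpha * bit
```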
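Similarly, the sketch below illustrates geometrically weighted averaging of binary probabilities (logistic mixing, as popularized by PAQ) with an Online Gradient Descent step on the instantaneous code length. The learning rate, the clamping constant, and the update's exact form are assumptions of this sketch rather than the thesis's analyzed variant; linear mixing would instead form a convex combination of the probabilities themselves.

```python
import math

def logit(p):
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class GeometricMixer:
    """Geometric (logistic) mixing of binary probabilities with an
    Online Gradient Descent weight update (illustrative sketch)."""

    def __init__(self, n_models, lr=0.02):
        self.w = [0.0] * n_models   # mixing weights
        self.lr = lr                # learning rate, an assumption
        self.stretched = [0.0] * n_models

    def mix(self, probs):
        # Geometric weighted averaging of binary probabilities is a
        # weighted sum in the logit ("stretched") domain.
        self.stretched = [logit(p) for p in probs]
        return sigmoid(sum(w * s for w, s in zip(self.w, self.stretched)))

    def update(self, p_mixed, bit):
        # OGD on the instantaneous code length -log p(bit); for
        # logistic mixing the gradient reduces to (bit - p) * logit(p_i).
        err = bit - p_mixed
        self.w = [w + self.lr * err * s
                  for w, s in zip(self.w, self.stretched)]
```

A hypothetical usage, combining the elementary models sketched earlier with the mixer, might look as follows; a real compressor would feed each mixed prediction into an arithmetic coder, which emits roughly -log2 p(bit) bits per symbol.

```python
models = [DiscountedFrequencyModel(), ExponentialSmoothingModel()]
mixer = GeometricMixer(n_models=len(models))
for bit in [1, 0, 1, 1, 0, 1]:          # toy input
    p = mixer.mix([m.predict() for m in models])
    mixer.update(p, bit)
    for m in models:
        m.update(bit)
```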