Co-evolutionary learning is a training process in which the training samples are themselves instances of solutions that interact strategically to guide the evolutionary (learning) process. A main research issue is generalization performance, i.e., the search for solutions (e.g., input-output mappings) that best predict the required output for any new input not seen during the evolutionary process. However, although the notion of generalization is well understood in machine learning, there is currently no framework for determining generalization performance in co-evolutionary learning. In this paper, we introduce a theoretical framework to address this issue. We present the framework in terms of game-playing, although our results are more general. Here, a strategy's generalization performance is its average performance against all test strategies. Because the true value generally cannot be obtained analytically from a closed-form formula and is computationally prohibitive to compute directly, we propose an estimation procedure that instead computes the average performance against a small sample of random test strategies. We perform a mathematical analysis to provide a statistical claim on the accuracy of our estimation procedure, which can be further improved by performing a second estimation of the variance of the random variable. For game-playing, it is well known that one is more interested in the generalization performance against a biased and diverse sample of "good" test strategies. We introduce a simple approach for obtaining such a test sample through multiple partial enumerative searches of the strategy space; the approach does not require human expertise and is generally applicable to a wide range of domains. We illustrate the generalization framework on the co-evolutionary learning of the iterated prisoner's dilemma (IPD) game.
We investigate two definitions of generalization performance for the IPD game based on different performance criteria, namely the number of wins based on individual game outcomes and the average payoff. We show that a small sample of test strategies can be used to estimate the generalization performance. We also show that, for the IPD game, the generalization performance measured against a biased and diverse set of "good" test strategies is lower than in the unbiased case. This is the first time that generalization has been defined and analyzed rigorously in co-evolutionary learning. The framework allows the generalization performance of any co-evolutionary learning system to be evaluated quantitatively.
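
As an illustration only, the estimation idea can be sketched in Python on a strategy space small enough to enumerate, so that the sampled estimate can be compared with the true average. Everything specific here is an assumption made for the sketch rather than a detail taken from the paper: the deterministic memory-one strategy encoding, the payoff values (5, 3, 1, 0), the 50-round game length, and the function names. The accuracy bound shown is a generic Chebyshev bound for a bounded random variable, which is one standard way to make a statistical claim of this kind and is not necessarily the analysis used in the paper.

```python
import itertools
import random

# Assumed IPD payoffs to the first player: (own move, opponent move) -> payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Joint outcomes of the previous round, as (own move, opponent move).
HISTORIES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def all_strategies():
    """Enumerate all 32 deterministic memory-one strategies.

    A strategy is a 5-tuple: its first move, then its response to each
    entry of HISTORIES (hypothetical encoding chosen for this sketch).
    """
    return list(itertools.product("CD", repeat=5))


def average_payoff(s, t, rounds=50):
    """Average per-round payoff that strategy s earns against strategy t."""
    a, b = s[0], t[0]
    total = 0
    for _ in range(rounds):
        total += PAYOFF[(a, b)]
        # Each player responds to the previous round from its own viewpoint.
        a, b = s[1 + HISTORIES.index((a, b))], t[1 + HISTORIES.index((b, a))]
    return total / rounds


def true_generalization(s, rounds=50):
    """Average performance of s against every test strategy (exhaustive)."""
    tests = all_strategies()
    return sum(average_payoff(s, t, rounds) for t in tests) / len(tests)


def estimated_generalization(s, n_tests, rng, rounds=50):
    """Average performance of s against n_tests uniformly random strategies."""
    tests = all_strategies()
    sample = [rng.choice(tests) for _ in range(n_tests)]
    return sum(average_payoff(s, t, rounds) for t in sample) / n_tests


def chebyshev_bound(n_tests, eps, value_range=5.0):
    """Upper bound on P(|estimate - true value| >= eps).

    Uses Chebyshev's inequality with the worst-case variance
    value_range**2 / 4 of a random variable confined to an interval
    of width value_range (payoffs here lie in [0, 5]).
    """
    return min(1.0, value_range ** 2 / (4 * n_tests * eps ** 2))


if __name__ == "__main__":
    tit_for_tat = ("C", "C", "D", "C", "D")
    rng = random.Random(0)
    print("true     :", true_generalization(tit_for_tat))
    print("estimate :", estimated_generalization(tit_for_tat, 200, rng))
    print("P(error >= 0.5) <=", chebyshev_bound(200, 0.5))
```

In this toy setting the exhaustive average is cheap, which is what makes the comparison possible; the estimate becomes the only practical option once the strategy space is too large to enumerate. Replacing the worst-case variance in `chebyshev_bound` with a variance estimated from the sample would tighten the bound, in the spirit of the second estimation mentioned above.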