Gender classification (GC) has achieved high accuracy in different experimental evaluations based mostly on inner facial details. However, these results do not generalize well in unrestricted datasets and particularly in cross-database experiments, where the performance drops drastically. In this paper, we analyze the state-of-the-art GC accuracy on three large datasets: MORPH, LFW and GROUPS. We discuss their respective difficulties and bias, concluding that the most challenging and wildest complexity is present in GROUPS. This dataset covers hard conditions such as low resolution imagery and cluttered background. Firstly, we analyze in depth the performance of different descriptors extracted from the face and its local context on this dataset. Selecting the bests and studying their most suitable combination allows us to design a solution that beats any previously published results for GROUPS with the Dago's protocol, reaching an accuracy over 94.2%, reducing the gap with other simpler datasets. The chosen solution based on local descriptors is later evaluated in a cross-database scenario with the three mentioned datasets, and full dataset 5-fold cross validation. The achieved results are compared with a Convolutional Neural Network approach, achieving rather similar marks. Finally, a solution is proposed combining both focuses, exhibiting great complementarity, boosting GC performance to beat previously published results in GC both cross-database, and full in-database evaluations.