In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand-binding pockets, we developed a standardized pocket Pfam-based clustering (Pfam-cluster) approach to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). We then evaluated 11 typical MLSFs using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV). Surprisingly, the performance of every tested model decreased from Random-CV to Seq-CV to Pfam-CV, and none showed satisfactory generalization capacity. Our interpretability analysis suggested that MLSF predictions on novel targets depended on features related to the buried solvent-accessible surface area (SASA) of the complex structures, with larger binding affinities predicted for complexes with larger protein-ligand interfaces. By combining buried SASA-related features with complex-specific patterns that were shared only among structurally similar compounds in the same cluster, random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and treating the features learned by MLSFs with caution.
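The difference between Random-CV and the cluster-based schemes (Seq-CV, Pfam-CV) lies in how complexes are assigned to folds: in a cluster-based split, all complexes sharing a cluster label are held out together, so a model is never tested on a cluster it saw during training. The following is a minimal Python sketch of that idea, not the implementation used in this work. The function name `cluster_kfold`, the greedy fold-balancing rule, and the cluster labels are illustrative assumptions, and the Pfam-cluster assignment itself is not reproduced here.

```python
from collections import defaultdict


def cluster_kfold(cluster_ids, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no cluster spans both sides.

    cluster_ids: one label per complex (hypothetical labels; for Pfam-CV they
    would come from the pocket Pfam-based clustering, for Seq-CV from
    sequence-similarity clustering).
    """
    # Group complex indices by their cluster label.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of clusters")

    # Greedy balancing (an assumption of this sketch): place the largest
    # clusters first, each into the fold that currently holds the fewest items.
    folds = [[] for _ in range(n_splits)]
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[cid])

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, test_idx


# Hypothetical cluster labels for eight complexes (illustrative only).
pocket_clusters = ["A", "A", "A", "B", "B", "C", "C", "D"]

for train_idx, test_idx in cluster_kfold(pocket_clusters, n_splits=3):
    held_out = {pocket_clusters[i] for i in test_idx}
    seen = {pocket_clusters[i] for i in train_idx}
    # The defining property of a cluster-based split.
    assert held_out.isdisjoint(seen)
```

Under Random-CV, by contrast, indices would be shuffled without regard to cluster labels, so structurally similar complexes from the same cluster can land in both the training and test sets, which is consistent with the higher Random-CV performance reported above.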