Abstract. We propose a low-complexity audio-visual person authentication framework based on multiple features and multiple nearest-neighbor classifiers that, instead of a single template per user, use a set of codebooks, i.e., a collection of templates. Several novel, highly discriminatory speech and face-image features are introduced, along with a novel "text-conditioned" speaker recognition approach. Powered by discriminative scoring and a novel fusion method, the proposed MCCN method delivers not only excellent performance (0% equal error rate, EER) but also a significant separation between client and impostor scores, as observed in trials on a unique multilingual 120-user audio-visual biometric database created for this research.