This chapter defines the term "speaker recognition" and looks at this technology from a high-level perspective. Then, it introduces the fundamental concepts of speaker recognition, including feature extraction, speaker modeling, scoring, and performance measures.
Fundamentals of Speaker Recognition

From time to time we hear that the information of millions of customers of remote services has been compromised. These security breaches raise concerns about the security of the remote services that everyone uses on a daily basis. While these remote services bring convenience and benefit to users, they are also gold mines for criminals to carry out fraudulent acts. The conventional approach to user authentication, such as usernames and passwords, is no longer adequate for securing these services. A number of companies have now introduced voice biometrics as a complement to the conventional username-password approach. With this new authentication method, it is much harder for criminals to impersonate legitimate users. Voice biometrics can also reduce the risk of leaking customers' information through social engineering fraud. Central to voice biometrics authentication is speaker recognition.

Another application domain of voice biometrics is to address the privacy issues of smartphones, home assistants, and smart speakers. With the increasing intelligence of these devices, we can interact with them as if they were human. Because these devices are typically used solely by their owners or their family members and speech is the primary means of interaction, it is natural to use the voice of the owners for authentication, i.e., a device can only be used by its owner.

Speaker recognition is a technique to recognize the identity of a speaker from a speech utterance. As shown in Figure 1.1, in terms of recognition tasks, speaker recognition can be categorized into speaker identification, speaker verification, and speaker diarization. In all of these tasks, the number of speakers involved can be fixed (closed set) or varied (open set).

Speaker identification determines whether the voice of an unknown speaker matches one of the N speakers in a dataset, where N could be very large (thousands).
It is a one-to-many mapping and it is often assumed that the unknown voice