Speaker diarization is the process of segmenting a speech signal and grouping the segments into homogeneous regions according to speaker identity. The central idea is to distinguish the individual speakers within the signal and to assign a label to each speaker's segments. As mass communication and meetings grow rapidly, speaker diarization is increasingly needed to improve the readability of speech transcripts. To address this problem, the proposed speaker diarization approach extracts features using the tangent weighted mel-frequency cepstral coefficient (TMFCC) and the holoentropy with extended linear prediction with autocorrelation snapshot (HXLPS), and uses a deep convolutional neural network (DCNN), optimized with the sailfish optimizer, for clustering. The HXLPS extraction method is a new development that combines holoentropy with the extended linear prediction with autocorrelation snapshot. TMFCC improves the efficiency and effectiveness of the proposed scheme by using both lower-energy and higher-energy frames. Once the features are extracted, a voice activity detection method separates speech from non-speech signals. Each segmented signal is then represented by a d-vector, and the speaker signals are clustered according to the speaker labels used in the DCNN. The effectiveness of the approach is evaluated using metrics such as tracking distance, false alarm rate, and diarization error rate.
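As an illustration of two of the evaluation metrics named above, the following is a minimal sketch of a frame-level diarization error rate (DER) and false alarm rate. It uses the common definition of DER as the sum of false alarm, missed speech, and speaker confusion divided by the total reference speech. The function name `diarization_metrics`, the convention that label 0 marks non-speech, and the assumption that hypothesis labels have already been mapped onto the reference labels are all illustrative choices, not details taken from the proposed method.

```python
def diarization_metrics(reference, hypothesis):
    """Frame-level DER and false alarm rate.

    Assumptions (illustrative, not from the paper):
    - each list holds one integer label per frame;
    - label 0 means non-speech, any other value is a speaker ID;
    - hypothesis speaker IDs are already mapped to reference IDs.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    false_alarm = missed = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != 0:
            speech += 1              # frame counts toward reference speech
            if hyp == 0:
                missed += 1          # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1       # speech assigned to the wrong speaker
        elif hyp != 0:
            false_alarm += 1         # non-speech labelled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "der": (false_alarm + missed + confusion) / speech,
        "false_alarm_rate": false_alarm / speech,
    }


# Toy example: 6 reference speech frames, with one false alarm,
# one missed frame, and one speaker confusion.
reference = [0, 0, 1, 1, 1, 2, 2, 2, 0, 0]
hypothesis = [0, 1, 1, 1, 2, 2, 2, 0, 0, 0]
print(diarization_metrics(reference, hypothesis))
```

In practice, scoring tools also find the optimal mapping between hypothesis and reference speakers and often apply a forgiveness collar around segment boundaries; both are omitted here for brevity.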