Deep learning has gained widespread adoption in forensic voice comparison in recent years. It is mainly used to learn speaker representations, known as embedding features or vectors. In this work, the effect of identical twins on two state-of-the-art deep speaker embedding methods was investigated, with a special focus on the metrics used in forensic voice comparison. Speaker verification performance was assessed within the likelihood-ratio (LR) framework using the likelihood-ratio cost and the equal error rate (EER). Experiments were carried out on the AVTD twin speech dataset. The results show a significant reduction in speaker verification performance when twin samples are present. Neither adapting the LR score calculation to twin samples nor fine-tuning the pre-trained speaker embedding models appeared to overcome this limitation. Distinguishing same-speaker from different-speaker pairs remained possible even for identical twins, but performance dropped greatly. The lowest EER of the best-performing model was 3.4% for non-twin speakers, whereas the EER was 25.3% when twins were present. This does not mean that the presented methods are useless for identical twins; rather, when a high likelihood-ratio score is obtained (indicating that the tested samples come from the same speaker), the possibility that the samples come from twins must also be considered in real casework.
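
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how the likelihood-ratio cost and the EER are commonly computed from trial scores. It is not the evaluation code used in this work: the function names `cllr` and `eer`, the assumption that scores are natural-log likelihood ratios, and the toy score values are all illustrative assumptions.

```python
import numpy as np


def cllr(target_llrs, nontarget_llrs):
    """Likelihood-ratio cost from natural-log LRs (assumed input format).

    target_llrs: log-LRs of same-speaker trials
    nontarget_llrs: log-LRs of different-speaker trials
    """
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    # log2(1 + 1/LR) for same-speaker trials, computed stably via logaddexp
    cost_target = np.mean(np.logaddexp(0.0, -target_llrs)) / np.log(2.0)
    # log2(1 + LR) for different-speaker trials
    cost_nontarget = np.mean(np.logaddexp(0.0, nontarget_llrs)) / np.log(2.0)
    return 0.5 * (cost_target + cost_nontarget)


def eer(target_scores, nontarget_scores):
    """Equal error rate estimated by a simple threshold sweep."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is tried as a decision threshold
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: same-speaker trials scored below the threshold
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: different-speaker trials scored at or above it
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])


if __name__ == "__main__":
    # Hypothetical toy scores, for illustration only
    same_speaker = [2.1, 3.0, 0.4, 1.7]
    different_speaker = [-2.5, -1.0, 0.6, -3.2]
    print("Cllr:", cllr(same_speaker, different_speaker))
    print("EER:", eer(same_speaker, different_speaker))
```

A system that outputs LR = 1 for every trial carries no information and yields a likelihood-ratio cost of 1, so values well below 1 indicate informative, reasonably calibrated scores. Under this sketch, a twin condition would be evaluated by passing twin-pair trials as the different-speaker scores.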