In recent years, there has been increasing awareness and acceptance among forensic speech scientists of using Bayesian reasoning and the likelihood ratio (LR) framework for forensic voice comparison (FVC) and for expressing expert conclusions. Numerous studies have explored overall performance using numerical LRs. However, although the data used for validation are only a sample from an unknown distribution, little attention has been paid to the effect of sampling variability or to individuals' behaviour. This thesis investigates these issues using linguistic-phonetic variables. First, it investigates how different configurations of training, test and reference speakers affect overall performance. The results show that variability in overall performance is mostly caused by varying the test speakers, while sampling variability in the reference and training speakers causes less variability. Second, this thesis explores the effect of sampling variability on overall performance and individuals' behaviour in relation to the use of linguistic-phonetic features. Results show that sampling variability affects overall performance to different extents depending on the features used, and that combining more features does not always improve overall performance. Sampling variability has limited effects on individuals in same-speaker comparisons, and most speakers are less affected by sampling variability in different-speaker comparisons when four or more features are used. Third, this thesis explores the effect of sampling variability on overall performance in relation to score distributions. Results reveal that system validity and reliability are more affected by different-speaker score skewness and less affected by same-speaker score skewness. Different calibration methods reduce the effect of sampling variability to different extents.
The results in this thesis have implications both for FVC using numerical LRs and for FVC in general, because experts need to make pragmatic decisions whether or not numerical LRs are used, and every decision has implications for the final evaluation results. Further, the results on score skewness and calibration methods could contribute to improving FVC performance using automatic systems.