Domain-relevant data and an adequate number of samples are necessary to properly evaluate the robustness of Machine Learning (ML) models. This is the case for ML models used in software localization translation. Neural Machine Translation (NMT) models are commonly used in software localization to automate the translation of textual content while accounting for specific linguistic and cultural aspects. However, unlike general machine translation, which can readily rely on large translation corpora for model training and testing, domain-specific machine translation faces a major obstacle: the scarcity of domain-specific translation data. To address this lack of data, this paper first presents a method to generate test samples using a text-generation Large Language Model (LLM). We then use the generated samples to assess the robustness of an NMT model. The evaluation indicates that human judgment remains important for checking whether the generated text is robust and coherent under different conditions. It also demonstrates that the generated samples were crucial for exposing limitations in the model’s effectiveness for software localization translation. In particular, we discuss issues with locale-specific elements such as date and time formats, numeric representations, and measurement units.
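As a minimal sketch of this workflow, assuming the LLM-generated test samples are already available as plain English strings and using the Hugging Face transformers translation pipeline with a Marian English-to-German checkpoint (chosen here purely for illustration; the paper's actual models, prompts, and language pairs are not specified in this abstract), one could probe whether locale-sensitive tokens such as dates and numbers survive translation:

```python
# Minimal sketch: probe an NMT model with LLM-generated localization test samples.
# Assumptions (not from the paper): the samples below stand in for LLM output
# covering dates, times, numbers, and units; the NMT model is the
# Helsinki-NLP Marian English->German checkpoint served via transformers.
import re
from transformers import pipeline

# Hypothetical LLM-generated test samples covering locale-sensitive elements.
samples = [
    "The meeting is scheduled for 03/04/2024 at 5:30 PM.",
    "The file size is 1,234.56 MB and the download takes 2.5 minutes.",
    "Set the margin to 0.75 inches and the temperature to 98.6 degrees Fahrenheit.",
]

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# Naive robustness check: numeric tokens in the source should reappear in the
# translation; tokens that do not match verbatim are flagged for human review,
# since they may have been correctly reformatted (e.g. 1,234.56 -> 1.234,56)
# or actually dropped or corrupted by the model.
number_pattern = re.compile(r"\d+(?:[.,:/]\d+)*")

for source in samples:
    target = translator(source)[0]["translation_text"]
    source_numbers = number_pattern.findall(source)
    flagged = [n for n in source_numbers if n not in target]
    print(f"SRC: {source}")
    print(f"TGT: {target}")
    print(f"Tokens needing review: {flagged}\n")
```

The verbatim-matching check is deliberately naive: correct localization often requires reformatting dates and numbers rather than copying them, which is precisely why the evaluation relies on human judgment to decide whether a flagged sample reflects a genuine robustness failure.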