Safety Assessments and Multiplicity Adjustment: Comments on a Recent Paper

Many end points representing possible non-target effects have to be evaluated in safety assessments that compare new and accepted products. This is known as the multiple-comparison or multiplicity problem. One popular method to adjust statistical testing procedures for multiplicity is the false discovery rate (FDR) method,¹ often implemented via adjustment of p values, as provided, for example, by the SAS procedure MULTTEST. FDR-adjusted p values are obtained by multiplying the raw p values by factors between 1 and m, where m is the number of tested hypotheses. Let $p_{(1)} \le \dots \le p_{(m)}$ be the ordered p values for m end points. Then, the FDR-adjusted p values according to a linear step-up algorithm are calculated sequentially as $\tilde{p}_{(m)} = p_{(m)}$; $\tilde{p}_{(j)} = \min\bigl(\tilde{p}_{(j+1)},\, (m/j)\,p_{(j)}\bigr)$, for $j = m-1, \dots, 1$.

Recently, Hong et al.² published an evaluation of the European Food Safety Authority (EFSA) framework for safety assessment of genetically modified (GM) crops using a rat 90 day feeding study,³ which is a compulsory part of the safety assessment according to current European Union (EU) legislation.⁴ The appropriateness of these animal studies and the EFSA framework on how to conduct such studies are both under discussion. For example, the EU research project GRACE (http://www.grace-fp7.eu) has performed and evaluated four 90 day studies and one 1 year study, contributing to this discussion (see the study by Schmidt et al.⁵ and references therein). Another currently ongoing EU research project is G-TwYST (https://www.g-twyst.eu), which is evaluating two 90 day studies and one combined chronic/carcinogenicity (2 year) study. Hong et al.² also assessed the appropriateness and applicability of the EFSA recommendations using a 90 day study and a battery of statistical approaches, including retrospective and prospective power analyses.
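To make the linear step-up adjustment described above concrete, the following minimal Python sketch applies the recursion to a list of raw p values. The function name `fdr_adjust` and the example p values are hypothetical and chosen only for illustration; this is not the code used by Hong et al. or by SAS procedure MULTTEST, although it is intended to follow the same recursion.

```python
def fdr_adjust(p_values):
    """Linear step-up FDR adjustment of raw p values (illustrative sketch).

    Returns the adjusted p values in the same order as the input.
    """
    m = len(p_values)
    # Indices that sort the raw p values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    # Starting at 1.0 makes the first step give adjusted p_(m) = p_(m).
    running_min = 1.0
    # Walk from the largest p value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = (m / rank) * p_values[idx]
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for m = 4 end points.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, fdr_adjust(raw)):
    # The ratio q / p is the effective multiplication factor, between 1 and m.
    print(f"raw={p:.3f}  adjusted={q:.3f}  factor={q / p:.2f}")
```

In this sketch, the smallest raw p value is multiplied by at most m and the largest is left unchanged, which illustrates why the effective factors range from 1 to m.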
This comment is not the place to give a full appraisal of all aspects of this discussion. The discussion here is restricted to just one element of the statistical approach used: the treatment of the multiplicity that results from many end points. Hong et al. evaluated a very large number of end points and adjusted the p values of their tests according to the FDR method. The maximum number of end points for each of the sexes was m = 146; therefore, FDR-adjusted p values are between 1 and up to 146 times as large as the raw p values (FDR adjustment was performed for the set of all end points that were reported across sex and separately for the male- or female-specific comparisons, so the effective m may have been lower in practice, but exact values are not given). The main result of Hong et al. regarding the comparisons between test and control groups is that "no treatment-related differences were observed". This can be contrasted with the detailed comparisons in Appendix D of the paper, where 32 out of 816 of the 95% confidence intervals for observed differences do not contain the value 0 and, therefore, indicate significant...