“…The documents are tokenized using WordPiece and are chopped into spans no longer than 150 tokens on 20NG and 256 tokens on the other datasets. Hyper-parameters: For our method, we use $\lambda_{\mathrm{on}} = \lambda_{\mathrm{off}} = 1$, $\delta_{\mathrm{on}} = 10^{-4}$, $\delta_{\mathrm{off}} = 10^{-3}$, and $\delta_y = 0.1$ for all the datasets. We then conduct an extensive hyper-parameter search for the baselines: for label smoothing, we search the smoothing parameter over $\{0.05, 0.1\}$ as in (Müller et al., 2019); for ERL, the penalty weight is chosen from $\{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5\}$; for VAT, we search the perturbation size over $\{10^{-3}, 10^{-4}, 10^{-5}\}$ as in (Jiang et al., 2020); for Mixup, we search the interpolation parameter over $\{0.1, 0.2, 0.3, 0.4\}$ as suggested in (Zhang et al., 2018; Thulasidasan et al., 2019); for Manifold-mixup, we search over $\{0.2, 0.4, 1, 2, 4\}$. We perform 10 stochastic forward passes for MCDP at test time.…”
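For concreteness, the baseline search spaces above amount to an exhaustive grid search per method. Below is a minimal Python sketch under that reading; `train_and_evaluate` is a hypothetical hook standing in for the actual training and validation pipeline, which the excerpt does not specify:

```python
import itertools

# Baseline hyper-parameter grids, copied from the search spaces quoted above.
search_spaces = {
    "label_smoothing": {"smoothing": [0.05, 0.1]},
    "erl":             {"penalty_weight": [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]},
    "vat":             {"perturbation_size": [1e-3, 1e-4, 1e-5]},
    "mixup":           {"alpha": [0.1, 0.2, 0.3, 0.4]},
    "manifold_mixup":  {"alpha": [0.2, 0.4, 1, 2, 4]},
}

def grid_search(method, train_and_evaluate):
    """Try every configuration in one baseline's grid; keep the best."""
    space = search_spaces[method]
    keys, values = zip(*space.items())
    best_score, best_cfg = float("-inf"), None
    for combo in itertools.product(*values):
        cfg = dict(zip(keys, combo))
        score = train_and_evaluate(method, **cfg)  # hypothetical pipeline hook
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```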
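The MCDP evaluation refers to Monte Carlo Dropout: dropout is kept active at test time and the predictive distribution is averaged over the 10 stochastic forward passes. A minimal PyTorch sketch of that procedure, assuming `model` is a standard classifier whose forward pass returns logits:

```python
import torch

@torch.no_grad()
def mcdp_predict(model, inputs, num_passes=10):
    """Monte Carlo Dropout at test time: keep dropout layers stochastic
    and average the softmax outputs over several forward passes."""
    model.eval()
    # Re-enable only the dropout layers, leaving layers such as
    # batch normalization in deterministic eval mode.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    probs = torch.stack(
        [torch.softmax(model(inputs), dim=-1) for _ in range(num_passes)]
    )
    return probs.mean(dim=0)  # averaged predictive distribution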