Background: Influenza A virus (IAV) poses threats to human health and life. Many individual studies have been carried out in mice to uncover the viral factors responsible for the virulence of IAV infections. Virus adaptation through serial lung-to-lung passaging, as well as reverse genetic engineering and mutagenesis approaches, has been widely used in these studies. Nonetheless, a single study may not provide enough confidence about virulence factors; hence, combining several studies in a meta-analysis is desirable to provide a better view.
Methods: Virulence information of IAV infections and the corresponding virus and mouse strains were documented from the literature. Using the mouse lethal dose 50, time series of weight loss or percentage of survival, the virulence of the infections was classified as avirulent or virulent for two-class problems, and as low, intermediate or high for three-class problems. In parallel, protein sequences were decoded from the corresponding IAV genomes or reconstructed manually from other proteins according to mutations mentioned in the related literature. IAV virulence models were then learned from various datasets containing the amino acids at each aligned position of the IAV proteins and the corresponding two-class or three-class virulence labels. Three proven rule-based learning approaches, i.e., OneR, JRip and PART, and additionally random forest were used for modelling, and top protein sites and synergy between protein sites were identified from the models.
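To make the modelling setup concrete, the simplest of the rule learners named above, OneR, can be sketched as follows. This is a minimal illustration for nominal attributes only, not the implementation used in the study: the function names `one_r` and `predict`, the site names `site_A` and `site_B`, and the toy records are all hypothetical and were invented purely for demonstration. Each record maps an aligned protein site to the amino acid observed there, and each label is a two-class virulence label.

```python
from collections import Counter, defaultdict


def one_r(records, labels):
    """Learn a one-attribute rule: pick the single site whose
    residue -> majority-label mapping has the best training accuracy.

    Returns (best_site, rule, default_label, training_accuracy).
    """
    # Fallback label for residues never seen at the chosen site.
    default_label = Counter(labels).most_common(1)[0][0]
    best = None
    for site in sorted(records[0]):
        # Count labels per residue observed at this site.
        counts = defaultdict(Counter)
        for record, label in zip(records, labels):
            counts[record[site]][label] += 1
        # Each residue predicts its majority label.
        rule = {res: c.most_common(1)[0][0] for res, c in counts.items()}
        correct = sum(c[rule[res]] for res, c in counts.items())
        accuracy = correct / len(labels)
        if best is None or accuracy > best[3]:
            best = (site, rule, default_label, accuracy)
    return best


def predict(model, record):
    """Apply a learned OneR model to one record."""
    site, rule, default_label, _ = model
    return rule.get(record[site], default_label)


# Hypothetical toy data (NOT from the study): two aligned sites, five infections.
TOY_RECORDS = [
    {"site_A": "E", "site_B": "T"},
    {"site_A": "E", "site_B": "A"},
    {"site_A": "K", "site_B": "T"},
    {"site_A": "K", "site_B": "A"},
    {"site_A": "K", "site_B": "T"},
]
TOY_LABELS = ["avirulent", "avirulent", "virulent", "virulent", "virulent"]

if __name__ == "__main__":
    model = one_r(TOY_RECORDS, TOY_LABELS)
    print("chosen site:", model[0])
    print("rule:", model[1])
    print("training accuracy:", model[3])
```

On the toy data, the learner selects the one site whose residues separate the two labels without error. JRip and PART build multi-condition rule lists, which is what allows combinations of sites to appear together in a rule; they are not covered by this sketch.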
Results: More than 500 records of IAV infections in mice whose viral proteins could be retrieved were documented. The BALB/c and C57BL/6 mouse strains and the H1N1, H3N2 and H5N1 viruses dominated the infection records. PART models learned from full datasets or subsets of datasets achieved the best performance, with moderate averaged model accuracies ranging from 65.0% to 84.4% and from 54.0% to 66.6% for two-class and three-class datasets that utilized all records of aligned IAV proteins, respectively. Their averaged accuracies were comparable to or even better than those of random forest models, so PART models should be preferred based on the Occam's razor principle. Interestingly, models based on a dataset that included all IAV strains achieved a better averaged accuracy when host information was taken into account. For model interpretation, we observed that although many sites in HA were highly correlated with virulence, PART models based on sites in PB2 could compete against, and were often better than, PART models based on sites in HA. Moreover, PART had a strong preference for including sites in PB2 when models were learned from datasets containing concatenated alignments of all IAV proteins. Several sites with a known contribution to virulence were found among the top protein sites, and site pairs that may synergistically influence virulence were also uncovered.
Conclusion: Modelling the virulence of IAV infections is a challenging problem. Rule-bas...