Cutting-edge technologies such as genome editing and synthetic biology allow us to produce novel foods and functional proteins. However, their toxicity and allergenicity must be accurately evaluated. Allergic reactions are caused by specific amino-acid sequences in proteins (allergen-specific patterns, ASPs), many of which remain undiscovered. In this study, we introduce a data-driven approach and a machine-learning (ML) method to find undiscovered ASPs. The proposed method enables an exhaustive search for amino-acid sub-sequences whose frequencies are statistically significantly higher in allergenic proteins than in non-allergenic ones. As a proof of concept (PoC), we created a database of 21,154 proteins for which the presence or absence of allergic reactions is already known, and applied the proposed method to it. The ASPs detected in the PoC study were consistent with known biological findings, and allergenicity prediction using the detected ASPs was more accurate than with existing approaches.

Teaser: We propose a computational method for finding statistically significant allergen-specific amino-acid sequences in proteins.
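To make the idea of searching for sub-sequences that are significantly more frequent in allergenic proteins concrete, the following is a minimal sketch and not the authors' method. It assumes fixed-length sub-sequences (k-mers), a one-sided Fisher's exact test per candidate, and a Bonferroni correction for multiple testing. The abstract specifies none of these choices, and the function names (`kmers`, `fisher_upper_tail`, `enriched_patterns`) and the toy sequences are hypothetical.

```python
from collections import Counter
from math import comb


def kmers(seq, k):
    """Return the set of distinct length-k sub-sequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fisher_upper_tail(a, n_pos, m, n_total):
    """One-sided Fisher's exact test p-value, P(X >= a).

    X is hypergeometric: m of n_total proteins contain the pattern,
    n_pos proteins are allergenic, and a allergenic proteins contain it.
    """
    upper = min(m, n_pos)
    tail = sum(comb(m, x) * comb(n_total - m, n_pos - x)
               for x in range(a, upper + 1))
    return tail / comb(n_total, n_pos)


def enriched_patterns(allergens, non_allergens, k=3, alpha=0.05):
    """List (pattern, p-value) pairs enriched in allergens.

    Assumption: Bonferroni correction over all k-mers seen in allergens.
    """
    pos = Counter(p for s in allergens for p in kmers(s, k))
    neg = Counter(p for s in non_allergens for p in kmers(s, k))
    n_pos = len(allergens)
    n_total = n_pos + len(non_allergens)
    hits = []
    for pattern, a in pos.items():
        m = a + neg[pattern]  # proteins of either class containing the pattern
        p_value = fisher_upper_tail(a, n_pos, m, n_total)
        if p_value * len(pos) <= alpha:  # Bonferroni-adjusted threshold
            hits.append((pattern, p_value))
    return sorted(hits, key=lambda t: t[1])


# Toy data (hypothetical sequences, not taken from the study's database).
toy_allergens = ["MKWYAA", "GGKWYC", "KWYTTT", "AKWYLL", "PPKWYP", "KWYQQQ"]
toy_non_allergens = ["MSTNDE", "GHIRVE", "LLSDNE", "TTSRVA", "PQRSTV", "AEDGHI"]
toy_hits = enriched_patterns(toy_allergens, toy_non_allergens, k=3)
```

On a real database, Bonferroni correction over every candidate sub-sequence would likely be very conservative, and a practical implementation would also need to handle variable-length patterns.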