Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy: Q3 = 81%-87%); the second were per-protein predictions of protein sub-cellular localization (10-state accuracy: Q10 = 81%) and membrane vs. water-soluble (2-state accuracy: Q2 = 91%). For the per-residue predictions, the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state of the art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.
Motivation: Natural Language Processing (NLP) continues to improve substantially through auto-regressive (AR) and auto-encoding (AE) Language Models (LMs). These LMs require expensive computing resources for self-supervised or unsupervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Computational biology and bioinformatics provide vast gold-mines of structured and sequentially ordered text data, leading to extraordinarily successful protein sequence LMs that promise new frontiers for generative and predictive tasks at low inference cost. As recent NLP advances link corpus size to model size and accuracy, we addressed two questions: (1) To which extent can High-Performance Computing (HPC) up-scale protein LMs to larger databases and larger models? (2) To which extent can LMs extract features from single proteins to get closer to the performance of methods using evolutionary information?

Methodology: Here, we trained two auto-regressive language models (Transformer-XL and XLNet) and two auto-encoder models (BERT and Albert) on 80 billion amino acids from 200 million protein sequences (UniRef100), and one language model (Transformer-XL) on 393 billion amino acids from 2.1 billion protein sequences taken from the Big Fat Database (BFD), today's largest set of protein sequences (corresponding to 22 and 112 times, respectively, the size of the entire English Wikipedia). The LMs were trained on the Summit supercomputer, using 936 nodes with 6 GPUs each (5616 GPUs in total), and on one TPU Pod (V3) with 512 cores.

Results: We validated the feasibility of training big LMs on proteins and the advantage of up-scaling LMs to larger models supported by more data. The latter was assessed by predicting secondary structure in three and eight states (Q3 = 75-83%, Q8 = 63-72%), localization for 10 cellular compartments (Q10 = 74%) and whether a protein is membrane-bound or water-soluble (Q2 = 89%). Dimensionality reduction revealed that the LM-embeddings from unlabelled data (only protein sequences) captured important biophysical properties of the protein alphabet, namely the amino acids, and their well-orchestrated interplay in governing the shape of proteins. In the analogy of NLP, this implied having learned some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC slightly reduced the gap between models trained on evolutionary information and LMs. Additionally, our results highlighted the importance of bi-directionality when processing proteins, as the uni-directional Transformer-XL was outperformed by its bi-directional counterparts.

Availability: ProtTrans: https://github.com/agemagician/ProtTrans
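To make the transfer setup described in these two abstracts concrete, below is a minimal sketch of how per-residue and per-protein embeddings can be extracted from the released ProtT5 encoder via the HuggingFace transformers library. The checkpoint name refers to the model published on the HuggingFace hub; the example sequence is a hypothetical placeholder, and the exact preprocessing details are assumptions for illustration rather than the paper's pipeline.

```python
# Sketch: extracting ProtT5 embeddings as exclusive input for downstream tasks.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_name = "Rostlab/prot_t5_xl_uniref50"  # public hub checkpoint (assumption)
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MSEQWENCE"  # hypothetical example protein sequence
# ProtT5 expects space-separated residues; map rare amino acids to X.
seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, len+1, 1024)

per_residue = hidden[0, : len(sequence)]  # one 1024-d vector per residue
per_protein = per_residue.mean(dim=0)     # mean-pooled per-protein vector
```

The per-residue vectors would feed a secondary-structure predictor, while the mean-pooled per-protein vector would feed the localization and membrane/soluble classifiers.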
JOREK is a massively parallel, fully implicit, non-linear extended magneto-hydrodynamic (MHD) code for realistic tokamak X-point plasmas. It has become a widely used, versatile simulation code for studying large-scale plasma instabilities and their control, and is continuously developed by an international community with strong involvement in the European fusion research programme and the ITER Organization. This article gives a comprehensive overview of the physics models implemented, the numerical methods applied for solving the equations, and the physics studies performed with the code. A dedicated section highlights some of the verification work done for the code. A hierarchy of different physics models is available, including a free-boundary and resistive-wall extension and hybrid kinetic-fluid models. The code allows for flux-surface aligned, iso-parametric finite element grids in single and double X-point plasmas, which can be extended to the true physical walls, and uses robust fully implicit time stepping. Particular focus is placed on plasma edge and scrape-off layer (SOL) physics as well as disruption-related phenomena. Among the key results obtained with JOREK regarding the plasma edge and SOL are deep insights into the dynamics of edge localized modes (ELMs), ELM cycles, and ELM control by resonant magnetic perturbations, pellet injection, and vertical magnetic kicks. ELM-free regimes, detachment physics, the generation and transport of impurities during an ELM, and electrostatic turbulence in the pedestal region are also investigated. Regarding disruptions, the focus is on the dynamics of the thermal quench (TQ) and current quench triggered by massive gas injection and shattered pellet injection, runaway electron (RE) dynamics, the RE interaction with MHD modes, and vertical displacement events. The seeding and suppression of tearing modes (TMs), the dynamics of naturally occurring TQs triggered by locked modes, and radiative collapses are also being studied.
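The fully implicit time stepping mentioned above can be illustrated generically: a backward-Euler step advances the solution by solving a nonlinear system for the new state. The sketch below is a minimal toy version of that idea, not JOREK's actual flux-surface-aligned finite-element discretization; the right-hand side and step size are assumptions for illustration.

```python
# Generic sketch of fully implicit (backward-Euler) time stepping with a
# nonlinear solve at each step; illustrative only, not JOREK's scheme.
import numpy as np
from scipy.optimize import fsolve

def rhs(u):
    # Hypothetical stiff nonlinear right-hand side du/dt = f(u).
    return np.array([-1000.0 * u[0] + u[1] ** 2, -u[1] + u[0]])

def step_backward_euler(u_old, dt):
    # Solve the implicit residual u_new - u_old - dt * f(u_new) = 0.
    residual = lambda u_new: u_new - u_old - dt * rhs(u_new)
    return fsolve(residual, u_old)  # hybrid Newton-type nonlinear solve

u = np.array([1.0, 0.0])
for _ in range(10):
    u = step_backward_euler(u, dt=0.1)  # remains stable despite the stiff term
print(u)
```

The payoff of the implicit formulation is exactly what the abstract exploits: the step size is not constrained by the fastest (stiffest) time scale in the system.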
Disruptions in a large tokamak can cause serious damage to the device and should be avoided or mitigated. Massive gas or killer-pellet injection are possible ways to obtain a controlled fast plasma shutdown before a natural disruption occurs. In this work, plasma shutdown scenarios with different types of impurities are studied for an ITER-like plasma. Plasma cooling, runaway generation and the associated electric field diffusion are calculated with a 1D code taking the Dreicer, hot-tail and avalanche runaway generation processes into account. Thin, radially localised high-temperature sheets can be created after the thermal quench, and the Dreicer and avalanche processes produce a high runaway current inside these sheets. At high impurity concentration the Dreicer process is suppressed, but hot-tail runaways are created. Favourable thermal and current quench times can be achieved with a mixture of deuterium and neon or argon. However, to prevent the avalanche process from creating a significant runaway current fraction, it is found to be necessary to include runaway losses in the model.
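As a toy illustration of the avalanche process mentioned above, one can integrate a simplified exponential growth law of the Rosenbluth-Putvinski type, dn_RE/dt ≈ n_RE (E/E_c − 1)/(τ ln Λ). The sketch below uses this simplified rate with made-up post-thermal-quench parameter values; it is an assumption-laden illustration, not the paper's 1D model with electric field diffusion.

```python
# Toy integration of a simplified avalanche growth law for the runaway
# electron density: dn_RE/dt = n_RE * (E/Ec - 1) / (tau * lnLambda).
# All parameter values are illustrative assumptions, not the paper's setup.

def integrate_avalanche(n0, E_over_Ec, tau, ln_lambda, t_end, dt):
    """Forward-Euler integration of the simplified exponential growth law."""
    n = n0
    growth = max(E_over_Ec - 1.0, 0.0) / (tau * ln_lambda)  # no growth below Ec
    for _ in range(int(t_end / dt)):
        n += dt * growth * n
    return n

# Hypothetical post-thermal-quench numbers: strong E-field, cold plasma.
seed = 1e10  # seed runaway density from Dreicer/hot-tail [m^-3]
n_final = integrate_avalanche(seed, E_over_Ec=50.0, tau=1e-2,
                              ln_lambda=15.0, t_end=0.05, dt=1e-4)
print(f"runaway density after 50 ms: {n_final:.2e} m^-3")
```

Even this crude sketch shows the paper's central concern: the avalanche amplifies a small seed population exponentially, so without loss terms in the model the final runaway current fraction is easily overestimated.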
Fast particles in fusion plasmas may drive Alfvén modes unstable, leading to fluctuations of the internal electromagnetic fields and potential loss of particles. Such instabilities can have an impact on the performance and the wall load of machines with burning plasmas such as ITER. A linear benchmark for a toroidal Alfvén eigenmode (TAE) was performed with 11 participating codes spanning a broad variation in both the physical and the numerical models. Reasonable agreement, within around 20%, was found for the growth rates. The agreement of the eigenfunctions and mode frequencies is also satisfying; however, they are found to depend strongly on the complexity of the model used.
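For orientation, the TAE sits in the gap of the shear-Alfvén continuum at roughly ω ≈ v_A/(2qR), i.e. f_TAE = v_A/(4πqR) with v_A = B/√(μ₀ρ). The sketch below evaluates this standard order-of-magnitude estimate; the plasma parameters are illustrative assumptions, not the benchmark case of the paper.

```python
# Standard order-of-magnitude estimate of the TAE gap frequency,
# f_TAE = v_A / (4*pi*q*R), with Alfven speed v_A = B / sqrt(mu0 * rho).
# The parameter values below are illustrative assumptions.
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum permeability [H/m]
M_D = 3.34e-27      # deuteron mass [kg]

B = 3.0             # magnetic field [T]
n_i = 2e19          # ion density [m^-3]
q = 1.5             # safety factor at the gap location
R = 3.0             # major radius [m]

rho = n_i * M_D                      # mass density [kg/m^3]
v_A = B / np.sqrt(MU0 * rho)         # Alfven speed [m/s]
f_TAE = v_A / (4.0 * np.pi * q * R)  # TAE gap frequency [Hz]
print(f"v_A = {v_A:.3e} m/s, f_TAE = {f_TAE / 1e3:.1f} kHz")
```

For these assumed values the estimate lands in the ~100-200 kHz range typical of TAEs in medium-sized tokamaks, which is why benchmark comparisons focus on growth rates and eigenfunctions rather than the gap frequency itself.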