Background
Single-domain antibodies, or nanobodies, have recently attracted much attention in research and applications because of their great potential and advantages over conventional antibodies. However, isolating candidate nanobodies in the lab is costly and time-consuming. Screening synthetic libraries for lead nanobody candidates is a promising alternative, but it requires prior knowledge to control the diversity of the complementarity-determining regions (CDRs) while still maintaining functionality. In this work, we identified sequence characteristics that could contribute to nanobody functionality by analyzing three datasets: CDR1, CDR2, and CDR3.
Results
By classification of amino acids based on physicochemical properties, we found that two different amino acid groups were sufficient for CDRs. The nonpolar group accounted for half of the total amino acid composition in these sequences. Observation of the highest occurrence of each amino acid revealed that the usage of some important amino acids such as tyrosine and serine was highly correlated with the length of the CDR3. Amino acid repeat motifs were also under-represented and highly restricted as 3-mers. Inspecting the crystallographic data also demonstrated conservation in structural coordinates of dominant amino acids such as methionine, isoleucine, valine, threonine, and tyrosine and certain positions in the CDR1, CDR2, and CDR3 sequences.
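The two sequence-level measurements described above, group composition and 3-mer repeats, can be illustrated with a minimal sketch. The grouping below is a common textbook classification and is only an assumption; the abstract does not state which classification the study used. Likewise, a "repeat motif" is taken here to mean a k-mer immediately followed by an identical copy, which may differ from the study's definition. The function names are hypothetical.

```python
from collections import Counter

# Assumed physicochemical grouping (textbook-style); not taken from the study.
GROUPS = {
    "nonpolar": set("GAVLIMFWP"),
    "polar": set("STCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def group_composition(seq):
    """Return the fraction of residues that fall into each assumed group."""
    counts = Counter()
    for residue in seq.upper():
        for name, members in GROUPS.items():
            if residue in members:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in GROUPS}


def tandem_repeats(seq, k=3):
    """List (position, k-mer) pairs where a k-mer is directly followed by itself."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * k + 1):
        if seq[i:i + k] == seq[i + k:i + 2 * k]:
            hits.append((i, seq[i:i + k]))
    return hits
```

Applied to a set of CDR sequences, `group_composition` would give the per-group fractions that a statement such as "the nonpolar group accounted for half of the composition" summarizes, and `tandem_repeats(seq, k=3)` would flag candidate 3-mer repeats for counting.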
Conclusions
We identified sequence characteristics that contributed to functional nanobodies, including amino acid groups, the occurrence of each amino acid, and repeat patterns. These results provide a simple set of rules that make it easier to generate desired candidates computationally; they can also serve as a reference for evaluating synthetic nanobodies.
Background
Antibodies and antigens play indispensable roles in targeted diagnosis, therapy, and biomedical discovery. In addition, a massive number of new scientific articles about antibodies and/or antigens is published each year. This literature is a valuable knowledge resource, but it has yet to be exploited to its full potential. We therefore aim to develop a biomedical natural language processing tool that can automatically identify antibody and antigen entities in articles.
Results
We first annotated an antibody-antigen corpus of 3210 relevant PubMed abstracts using a semi-automatic approach. The inter-annotator agreement scores among the three annotators range from 91.46% to 94.31%, indicating that the annotations are consistent and the corpus is reliable. We then used the corpus to develop and optimize BiLSTM-CRF-based and BioBERT-based models. The models achieved overall F1 scores of 62.49% and 81.44%, respectively, indicating potential for recognizing newly studied entities. The two models served as the foundation for a named entity recognition (NER) tool that automatically recognizes antibody and antigen names in biomedical literature.
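For readers unfamiliar with how an NER F1 score is typically computed, the following is a minimal sketch of exact-match, entity-level scoring over BIO tags. It assumes BIO tagging and exact span matching; the abstract does not specify the tagging scheme or matching criterion used in the study. The labels `AB` and `AG` and the function names are hypothetical.

```python
def extract_entities(tags):
    """Convert a BIO tag sequence into a set of (start, end, label) spans."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Return (precision, recall, F1) using exact span-and-label matches."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: one antibody span matches, one antigen span is misplaced.
gold = ["B-AB", "I-AB", "O", "B-AG", "O"]
pred = ["B-AB", "I-AB", "O", "O", "B-AG"]
precision, recall, f1 = entity_f1(gold, pred)
```

Under exact matching, a predicted entity counts as correct only if both its boundaries and its label agree with the gold annotation, so partially overlapping predictions are scored as errors.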
Conclusions
Our antibody-antigen NER models enable users to automatically extract antibody and antigen names from scientific articles without manually scanning vast amounts of literature. The NER output can be used to automatically populate antibody-antigen databases, support antibody validation, and help researchers find the most appropriate antibodies of interest. The packaged NER model is available at https://github.com/TrangDinh44/ABAG_BioBERT.git.