Third-generation semiconductor materials (TGSMs) are a frontier scientific domain in which researchers must consult extensive literature for entity information on materials, devices, preparation methods, and experimental performance, and sort out the complex relations among these entities. However, the volume of relevant papers now far exceeds what researchers can read. In this article, automatic information extraction for the TGSM field is carried out using entity recognition (ER) and relation extraction (RE) techniques. First, corpora for ER and RE in this field are created. Second, to address the complexity of the entities, a neural network using domain knowledge (DKNet) is proposed to improve ER performance. It uses the keyword sequence of each entity type as prior knowledge, adds a dedicated embedding to encode entity categories, and then combines the prior knowledge and encoded vectors with the context through a gated information fusion module to assist recognition. Third, because entity relations depend heavily on indicative words, a multi-aspect attention-based network model (MANet) is proposed to strengthen attention to relation-indicative words, thereby improving RE performance. Finally, F1 scores of 74.5 and 85.9 are achieved on the created ER and RE test sets, respectively, outperforming other advanced models by 3.4 to 10.1 points, the best performance reported for automatic information extraction in the TGSM field.
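The abstract does not give the equations of the gated information fusion module, so the following is only a minimal sketch of a generic element-wise gate, not DKNet itself. It assumes a context vector and a prior-knowledge vector of equal length, and it simplifies the learned gate weights to one scalar per dimension. The names `gated_fusion`, `w_c`, `w_p`, and `bias` are hypothetical and chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(context: Vector, prior: Vector,
                 w_c: Vector, w_p: Vector, bias: Vector) -> Vector:
    """Fuse a context vector with a prior-knowledge vector.

    Assumed form (not taken from the paper), per dimension i:
        g_i   = sigmoid(w_c[i] * context[i] + w_p[i] * prior[i] + bias[i])
        out_i = g_i * context[i] + (1 - g_i) * prior[i]
    A gate near 1 keeps the context; a gate near 0 keeps the prior.
    """
    if not (len(context) == len(prior) == len(w_c) == len(w_p) == len(bias)):
        raise ValueError("all vectors must have the same length")
    fused = []
    for c, p, wc, wp, b in zip(context, prior, w_c, w_p, bias):
        gate = sigmoid(wc * c + wp * p + b)
        fused.append(gate * c + (1.0 - gate) * p)
    return fused
```

With all gate parameters set to zero, the gate is 0.5 in every dimension and the output is the plain average of the two inputs. In a trained model the gate parameters would be learned so that the network decides, per dimension, how much prior knowledge to mix into the context representation.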