Analysis of the effect of data properties in automated patent classification

Gómez, Juan Carlos

doi:10.1007/s11192-019-03246-1

Cited by 10 publications

(3 citation statements)

References 44 publications


“…Semantic similarity is useful in various NLP tasks such as information retrieval, machine translation, question answering, and entity resolution (Ebraheem et al, 2018; Li et al, 2019; Varelas et al, 2005; Yih et al, 2014; Zou et al, 2013). In practice, it has a wide range of real‐life applications such as patent class prediction, scientific text comparison, newspaper article similarity analysis, topic discovery for United Nations speeches, and concept derivation for encyclopedias (Geum & Kim, 2020; Gomez, 2019; Gong et al, 2019; Shalaby & Zadrozny, 2017; Watanabe & Zhou, 2020). Going beyond determining the semantic similarity as a binary decision (1 or 0, as similar or not), generating non‐binary scores of text pairs followed by ranking has become a common goal of semantic similarity analysis.…”

Section: Literature Review (mentioning)

confidence: 99%

Semantic similarity measure of natural language text through machine learning and a keyword‐aware cross‐encoder‐ranking summarizer—A case study using UCGIS GIS&T body of knowledge

Tian, Wang, et al. 2023

Transactions in GIS


Initiated by the University Consortium of Geographic Information Science (UCGIS), the GIS&T Body of Knowledge (BoK) is a community-driven endeavor to define, develop, and document geospatial topics related to geographic information science and technologies (GIS&T). In recent years, GIS&T BoK has undergone rigorous development in terms of its topic re-organization and content updating, resulting in a new digital version of the project. While the BoK topics provide useful materials for researchers and students to learn about GIS, the semantic relationships among the topics, such as semantic similarity, should also be identified so that better, automated topic navigation can be achieved. Currently, the related topics are defined manually by editors or authors, which may result in an incomplete assessment of topic relationships. To address this challenge, our research evaluates the effectiveness of multiple natural language processing (NLP) techniques in extracting semantics from text, including both deep neural networks and traditional machine learning approaches. In addition, a novel text summarizer, KACERS (Keyword-Aware Cross-Encoder-Ranking Summarizer), is proposed to generate a semantic summary of scientific publications.


“…Data analysis tools, such as text classification models, can be used to put the data source selected into proper layers of data-driven TRM. Classification models such as support vector machine (SVM) [86], k-nearest neighbor (KNN) [87,88], Hidden Markov [89], and Bayesian [44,90] can be employed.…”

Section: Bidirectional Encoder Representations For Transformers With ... (mentioning)

confidence: 99%

Data-Driven Technology Roadmaps to Identify Potential Technology Opportunities for Hyperuricemia Drugs

et al. 2022


Hyperuricemia is a metabolic disease with an increasing incidence in recent years. It is critical to identify potential technology opportunities for hyperuricemia drugs to assist drug innovation. A technology roadmap (TRM) can efficiently integrate data analysis tools to track recent technology trends and identify potential technology opportunities. Therefore, this paper proposes a systematic data-driven TRM approach to identify potential technology opportunities for hyperuricemia drugs. This data-driven TRM includes the following three aspects: layer mapping, content mapping and opportunity finding. First, we deal with layer mapping. The BERT model is used to map the collected literature, patents and commercial hyperuricemia drugs data into the technology layer and market layer in TRM. The SAO model is then used to analyze the semantics of technology and market layer for hyperuricemia drugs. We then deal with content mapping. The BTM model is used to identify the core SAO component topics of hyperuricemia in technology and market dimensions. Finally, we consider opportunity finding. The link prediction model is used to identify potential technological opportunities for hyperuricemia drugs. This data-driven TRM effectively identifies potential technology opportunities for hyperuricemia drugs and suggests pathways to realize these opportunities. The results indicate that resurrecting the pseudogene of human uric acid oxidase and reducing the toxicity of small molecule drugs will be potential opportunities for hyperuricemia drugs. Based on the identified potential opportunities, comparing the DNA sequences from different sources and discovering the critical amino acid site that affects enzyme activity will be helpful in realizing these opportunities. Therefore, this research provides an attractive option analysis technology opportunity for hyperuricemia drugs.


“…Some focused on the best way to represent the patent text and how to extract semantic features from it (D'hondt et al, 2013; Shalaby et al, 2018; Hu et al, 2018a; Hu et al, 2018b; Li et al, 2018) while others focused on designing more effective classification algorithms (Fall et al, 2003; Al Shamsi & Aung, 2016; D'hondt et al, 2017; Wu et al, 2010, 2016; Song et al, 2019). Furthermore, some attempts have been made to find which part of the patent text can be more representative and provide better classification results (Gomez, 2019; Hu et al, 2018a; Wu et al, 2010; D'hondt & Verberne, 2010). Gomez & Moens (2014) did a comprehensive survey of several previous works that tackled the automated patent classification problem in the IPC hierarchy.…”

Section: Related Work (mentioning)

confidence: 99%

PatentNet: multi-label classification of patent documents using deep learning based language understanding

et al. 2021


Patent classification is an expensive and time-consuming task that has conventionally been performed by domain experts. However, the increase in the number of filed patents and the complexity of the documents make the classification task challenging. The text used in patent documents is not always written in a way to efficiently convey knowledge. Moreover, patent classification is a multi-label classification task with a large number of labels, which makes the problem even more complicated. Hence, automating this expensive and laborious task is essential for assisting domain experts in managing patent documents, facilitating reliable search, retrieval, and further patent analysis tasks. Transfer learning and pre-trained language models have recently achieved state-of-the-art results in many Natural Language Processing tasks. In this work, we focus on investigating the effect of fine-tuning the pre-trained language models, namely, BERT, XLNet, RoBERTa, and ELECTRA, for the essential task of multi-label patent classification. We compare these models with the baseline deep-learning approaches used for patent classification. We use various word embeddings to enhance the performance of the baseline models. The publicly available USPTO-2M patent classification benchmark and M-patent datasets are used for conducting experiments. We conclude that fine-tuning the pre-trained language models on the patent text improves the multi-label patent classification performance. Our findings indicate that XLNet performs the best and achieves a new state-of-the-art classification performance with respect to precision, recall, F1 measure, as well as coverage error, and LRAP.
