Inventor Name Disambiguation for a Patent Database Using a Random Forest and DBSCAN

Kim, Kunho; Khabsa, Madian; Giles, C. Lee

doi:10.1145/2910896.2925465

Cited by 13 publications

(18 citation statements)

References 6 publications


“…Patent data are an important resource for monitoring and evaluating scientific and technical work across a range of areas (Wang et al., 2015; Zhu and Porter, 2002) such as evaluating the productivity of an inventor or an institution (Czarnitzki et al., 2007; Rahal and Rabelo, 2006), analyzing inventor migration patterns (Doherr, 2017), exploring the economic issues associated with innovation (Miguélez and Gómezmiguélez, 2011) or assessing the influence of collaborative networks on innovation (Fleming et al., 2007). However, most patent databases do not allocate a unique identifier to each inventor (Kim et al., 2016; Li et al., 2014). As a result, distinguishing between inventors with the same name is a highly challenging task.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, by 2013, there were more than 32 trillion pairs of records and 8 million patents in the United States Patent and Trademark Office (USPTO) database, an impossible number to disambiguate manually. According to the US census data, common names, such as John Smith, are used by about 53 million people, which is equal to 16.4 per cent of the US population (Kim et al., 2016). Furthermore, 51.1 per cent of the inventors do not include a middle name (Akinsanmi et al., 2011).…”

Section: Introduction (mentioning)

confidence: 99%
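The pair count quoted above is in line with comparing every record against every other record: n records give n(n-1)/2 unordered pairs. The check below is only an illustration. It assumes a round figure of 8 million records, borrowed from the patent count in the quoted statement; the actual number of inventor records may differ. `count_pairs` is a hypothetical helper name.

```python
def count_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_records = 8_000_000  # assumed round figure, not taken from the cited paper
pairs = count_pairs(n_records)
print(f"{pairs:.3e}")  # prints 3.200e+13, i.e. about 32 trillion
```

At this scale, exhaustive pairwise comparison is impractical, which is why disambiguation methods typically restrict comparisons to candidate pairs.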


Disambiguating USPTO inventor names with semantic fingerprinting and DBSCAN clustering

Han

Wang

et al. 2019


Purpose: The aim of this study is to present a novel approach based on semantic fingerprinting and a clustering algorithm called density-based spatial clustering of applications with noise (DBSCAN), which can be used to convert inventor records into 128-bit semantic fingerprints. Inventor disambiguation is a method used to discover a unique set of underlying inventors and map a set of patents to their corresponding inventors. Resolving the ambiguities between inventors is necessary to improve the quality of the patent database and to ensure accurate entity-level analysis. Most existing methods are based on machine learning and, while they often show good performance, this comes at the cost of time, computational power and storage space.

Design/methodology/approach: Using DBSCAN, the meta and textual data in inventor records are converted into 128-bit semantic fingerprints. However, rather than using a string comparison or cosine similarity to calculate the distance between pair-wise fingerprint records, a binary number comparison function was used in DBSCAN. DBSCAN then clusters the inventor records based on this distance to disambiguate inventor names.

Findings: Experiments conducted on the PatentsView campaign database of the United States Patent and Trademark Office show that this method disambiguates inventor names with recall greater than 99 per cent in less time and with substantially smaller storage requirement.

Research limitations/implications: A better semantic fingerprint algorithm and a better distance function may improve precision. Setting of different clustering parameters for each block or other clustering algorithms will be considered to improve the accuracy of the disambiguation results even further.

Originality/value: Compared with the existing methods, the proposed method does not rely on feature selection and complex feature comparison computation. Most importantly, running time and storage requirements are drastically reduced.
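The clustering step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the "binary number comparison function" is a Hamming distance over fingerprints stored as integers, it does not show how fingerprints are generated, and the values of `eps` and `min_pts` are arbitrary. The names `hamming` and `dbscan` are hypothetical.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints stored as ints."""
    return bin(a ^ b).count("1")


def dbscan(points: List[int], eps: int, min_pts: int) -> List[int]:
    """Minimal DBSCAN; returns one cluster label per point, -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i: int) -> List[int]:
        # The neighborhood includes the point itself, as in standard DBSCAN.
        return [j for j, q in enumerate(points) if hamming(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels


# Made-up 128-bit fingerprints: two groups of near-duplicates and one outlier.
base_a = 0xFFFF0000FFFF0000FFFF0000FFFF0000
base_b = 0x0000FFFF0000FFFF0000FFFF0000FFFF
outlier = 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
fingerprints = [base_a, base_a ^ 0b1, base_a ^ 0b110, base_b, base_b ^ 0b1, outlier]

print(dbscan(fingerprints, eps=3, min_pts=2))  # prints [0, 0, 0, 1, 1, -1]
```

Each label stands for one inferred inventor. Because the distance is a bitwise comparison of two integers, no string or vector similarity is computed, which matches the abstract's claim of lower running time and storage.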


“…In these models, documents are represented as bag‐of‐words with traditional features, including term frequency, distribution over terms, and time feature. Based on the bag‐of‐words model, many models that typically compute similarities among documents have been developed for topic detection, such as CLARANS and DBSCAN. The next generation of topic detection models extended the analysis from directly clustering documents to clustering keywords.…”

Section: Introduction (mentioning)

confidence: 99%

Topic detection model in a single‐domain corpus inspired by the human memory cognitive process

Zhao

Luo

Wei

et al. 2018

Concurrency and Computation


A corpus (eg, patents or news texts) is an important knowledge resource that contains various topics, such as specific technologies or social events. Topic detection models of corpus, eg, Latent Dirichlet Allocation and KeyGraph, provide an important basis for exploring the status quo and trends in science, technology, or social events. However, these models suffer from low retrieval performance as they only consider text own explicit semantics in a single-domain corpus. In addition, many incremental models, such as online-LDA, are based on time slices. In this paper, a new topic detection model is proposed to improve the topic detection performance of a single-domain corpus, which is inspired by a human memory cognitive process (THC). First, to improve the accuracy, distributions over words and inter-word relations across a corpus are utilized as background knowledge, which is a type of implicit semantics, and we can find a more semantic-sensitive part of texts. Second, to realize online topic detection without time slices, we introduce a probability gain-based dynamic probabilistic model to detect latent topics by learning a model based on the dynamic human memory cognitive process. These two steps constitute the framework of our model. The experimental results for four public datasets (Reuters-R8, Reuters-R52, WebKB, and Cade12) reveal that our model is approximately ten percent higher than other baselines (eg, KeyGraph and LDA) on the Adjusted Rand Index (ARI).

KEYWORDS: memory cognitive process, probability gain, topic detection

INTRODUCTION

Topic modeling broadly refers to the identification of trends or themes in a curated document collection. A cluster of similar technologies refer to patent topics,[1] controversial events in news topics[2] and user's attitudes toward Twitter topics.[3] Patents topics can help researchers quickly analyze the status quo and trends of referred technologies, and news topics can help people fully understand controversial social events. Twitter topics can help governments supervise online public opinions.

Many topic detection models were proposed to help people access topics in a corpus. Initial models for topic detection typically relied on clustering documents.[4] In these models, documents are represented as bag-of-words with traditional features, including term frequency, distribution over terms, and time feature. Based on the bag-of-words model, many models that typically compute similarities among documents have been developed for topic detection, such as CLARANS[5] and DBSCAN.[6] The next generation of topic detection models extended the analysis from directly clustering documents to clustering keywords. With extensive use of the Latent Dirichlet Allocation (LDA) model,[7] the Probabilistic Topic Model (PTM) has attracted considerable attention.[8] Several extended versions of PTM, which treat a topic as a distribution over keywords, have been employed for topic detection.[9] Recent research has addressed relations among keywords[10] because the sole use of keywords will lose a cons...


“…Then we analyze matches among ambiguous pairs. Instead of using a simple heuristic, we use a binary Random Forest (RF) classifier, which has been used for evaluating matches in the author and inventor name disambiguation [3], [4], [5]. Features are extracted from common attributes from the data sources.…”

Section: Introduction (mentioning)

confidence: 99%

Financial Entity Record Linkage with Random Forests

Kim

Giles

2016

Proceedings of the Second International Workshop on Data Science for Macro-Modeling

Self Cite


Record linkage refers to the task of finding the same entity across different databases. We propose a machine-learning-based record linkage algorithm for financial entity databases. Record linkage on financial databases is essential for integrating information on a given financial entity, since those databases do not share a common unified identifier. Our algorithm works in two steps to determine whether a pair of records refers to the same entity. First, we check with proposed rules whether the record pair can be exactly matched after cleaning the entity name and address. Second, inspired by earlier work on author name disambiguation, we train a binary Random Forest classifier to decide the linkage. To reduce and scale the computation, this process is done only for candidate pairs within a proposed heuristic. Initial evaluation of precision, recall and F1 measures on two different linking tasks in the Financial Entity Identification and Information Integration (FEIII) Challenge shows promising results.
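The two-step decision described in this abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's code: the cleaning rules in `clean` are invented for the example, the record fields `name` and `address` are hypothetical, and the trained Random Forest is represented by a caller-supplied `classifier` callable rather than an actual model. Candidate-pair selection is not shown.

```python
import re
from typing import Callable, Dict

Record = Dict[str, str]


def clean(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed cleaning rules)."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def same_entity(a: Record, b: Record,
                classifier: Callable[[Record, Record], bool]) -> bool:
    """Decide whether two records refer to the same entity in two steps."""
    # Step 1: rule-based exact match on the cleaned name and address.
    if (clean(a["name"]) == clean(b["name"])
            and clean(a["address"]) == clean(b["address"])):
        return True
    # Step 2: defer to a trained binary classifier (a Random Forest in the
    # abstract; any callable returning True/False works in this sketch).
    return classifier(a, b)


# Stand-in classifier that always rejects, so only step 1 can return True here.
def reject_all(x: Record, y: Record) -> bool:
    return False


r1 = {"name": "Acme Bank, N.A.", "address": "1 Main St."}
r2 = {"name": "ACME BANK N.A", "address": "1 main st"}
r3 = {"name": "Other Corp", "address": "2 Elm Ave"}

print(same_entity(r1, r2, reject_all))  # prints True (matched by the rule)
print(same_entity(r1, r3, reject_all))  # prints False (classifier decides)
```

Putting the cheap exact-match rule first means the classifier is only consulted for pairs the rule cannot settle.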
