Text Representations for Patent Classification (2013)
DOI: 10.1162/coli_a_00149

Abstract: With the increasing rate of patent application filings, automated patent classification is of rising economic importance. This article investigates how patent classification can be improved by using different representations of the patent documents. Using the Linguistic Classification System (LCS), we compare the impact of adding statistical phrases (in the form of bigrams) and linguistic phrases (in two different dependency formats) to the standard bag-of-words text representation on a subset of 532,264 Engli…

Cited by 41 publications (12 citation statements, published 2014–2023). References 22 publications.
“…The bag-of-words (BOW) model [8,18] is a typical statistically based text representation approach, which is almost always used in patent analysis studies [1,18]. After stemming, filtering, and stop-word removal, BOW represents each document by its word occurrences, ignoring their ordering and grammar in the original document.…”
Section: Feature Extraction From Text (mentioning)
confidence: 99%
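
The bag-of-words pipeline described in this statement (stemming, stop-word removal, then unordered word counts) can be sketched as follows. This is a minimal illustration assuming scikit-learn and NLTK are available; the sample abstracts are hypothetical rather than taken from the paper's corpus.

```python
import re
from nltk.stem import PorterStemmer                               # pip install nltk
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def bow_tokens(text):
    # lowercase, keep alphabetic tokens, drop stop words, then stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

# Word order and grammar are discarded: each document becomes a vector of counts.
vectorizer = CountVectorizer(tokenizer=bow_tokens, token_pattern=None)

abstracts = [  # hypothetical patent abstracts
    "A method for classifying patent documents using textual features.",
    "An apparatus for document classification based on word statistics.",
]
X = vectorizer.fit_transform(abstracts)   # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```

Each row of X is a sparse count vector; word order and grammar are lost, which is precisely the limitation that the phrase-based representations studied in the paper are meant to address.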
“…The functionality of n-gram graphs was provided by the open-source JInsect library. 9 To evaluate the performance of the representation models, we applied them to two established classification algorithms that are typically used for TC in conjunction with the bag models: Naive Bayes Multinomial (NBM) and Support Vector Machines (SVM) [29,44]. The former classifies instances based on the conditional probabilities of their feature values, while the latter uses optimization techniques to identify the maximum-margin decision hyperplane.…”
Section: Setup (mentioning)
confidence: 99%
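
As a rough sketch of the setup this statement describes, the two classifiers can be trained on bag features as below. It assumes scikit-learn's MultinomialNB and LinearSVC as stand-ins for the NBM and SVM implementations cited there; the texts, labels, and class codes are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["a method for coating a metal substrate",
         "a neural network for image recognition",
         "an alloy coating process for steel",
         "training a classifier on image data"]
labels = ["C23", "G06", "C23", "G06"]   # hypothetical IPC-like class codes

# Naive Bayes Multinomial: classifies from conditional probabilities of feature counts.
nbm = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

# Linear SVM: finds a maximum-margin separating hyperplane in the same feature space.
svm = make_pipeline(CountVectorizer(), LinearSVC()).fit(texts, labels)

print(nbm.predict(["a coating method for metal parts"]))
print(svm.predict(["an image recognition network"]))
```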
“…In [9], the authors examine four content-based representation models: lemmatized token unigrams, lemmatized token bigrams, and lemmatized “dependency triples” 15 obtained from the Stanford and the AEGIR parsers. The comparative analysis was performed over a set of curated documents containing patent abstracts in English.…”
Section: Related Work (mentioning)
confidence: 99%
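
The four representation models reduce to three feature types (unigrams, bigrams, and dependency triples, the last produced by two different parsers). A minimal sketch of extracting them is given below; it uses spaCy purely for illustration rather than the Stanford or AEGIR parsers used in the paper, and the model name en_core_web_sm is an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this spaCy model is installed
doc = nlp("The device measures the temperature of the coated substrate.")

# lemmatized token unigrams
lemmas = [t.lemma_.lower() for t in doc if t.is_alpha]
unigrams = lemmas

# lemmatized token bigrams
bigrams = [f"{a}_{b}" for a, b in zip(lemmas, lemmas[1:])]

# dependency triples: (head lemma, relation, dependent lemma)
triples = [(t.head.lemma_.lower(), t.dep_, t.lemma_.lower())
           for t in doc if t.dep_ != "ROOT" and t.is_alpha]

print(unigrams)
print(bigrams)
print(triples)
```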
“…Other parts are narrative text providing information about the patent and are given under the headings Title, Abstract, Field, Background, Detailed Description, and Claims. Many ways of representing the whole patent document have been used in previous automatic patent classification and retrieval applications. 15,20,21 Some researchers believe that the human-generated abstracts of patent documents are very precise and are the most important section for patent classification.…”
Section: Automatic Patent Classification (mentioning)
confidence: 99%
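
A sketch of the section-selection idea mentioned here (classifying on the abstract alone, or on a chosen combination of sections) might look as follows; the field names and the example patent record are hypothetical.

```python
# Build a classifier input from selected patent sections, e.g. title + abstract only.
def build_text(patent: dict, sections=("title", "abstract")) -> str:
    """Concatenate the chosen narrative sections into one classification input."""
    return " ".join(patent.get(s, "") for s in sections)

patent = {
    "title": "Coating apparatus for metal substrates",
    "abstract": "An apparatus that applies a protective alloy coating ...",
    "claims": "1. An apparatus comprising ...",
}

print(build_text(patent))                                             # title + abstract
print(build_text(patent, sections=("title", "abstract", "claims")))   # fuller representation
```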