OCR-based image features for biomedical image and article classification

Shatkay, Hagit; Narayanaswamy, Ramya; Nagaral, Santosh S.; Harrington, Na; Rohith, M.; Somanath, Gowri; Tarpine, Ryan; Schutter, Kyle; Johnstone, Timothy G; Blostein, Dorothea; Istrail, Sorin; Kambhamettu, Chandra

doi:10.1145/2382936.2382949

Cited by 5 publications

(3 citation statements)

References 17 publications


“…Our document representation is based on a variation on the bag-of-terms model that we have introduced and used in our earlier work (39, 40, 41). The representation uses a set of terms consisting of both unigrams (single words) and bigrams (pairs of two consecutive words).…”

Section: Methods (mentioning)

confidence: 99%
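The term set described in the statement above (unigrams plus bigrams of consecutive words) can be illustrated with a minimal sketch. The tokenization rule (lowercased alphanumeric runs) and the function name `bag_of_terms` are assumptions made for illustration; the quoted passage does not say how the authors tokenize text or select terms.

```python
import re
from collections import Counter


def bag_of_terms(text):
    """Count unigrams and bigrams (pairs of consecutive words) in a text.

    Hypothetical helper: the tokenization below is an assumption,
    not the cited authors' preprocessing.
    """
    # Assumed tokenization: lowercase, keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Bigrams pair each token with the token that follows it.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


counts = bag_of_terms("Gene expression during mouse development")
print(sorted(counts))
```

For a five-word title this yields five unigram terms and four bigram terms, each of which would become one feature in the document vector.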

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

et al. 2017

Self Cite


The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org


“…Our initial document representation is based on the bag-of-words model, used in our earlier work (40, 41). The set of terms consists of both unigrams (single words) and bigrams (pairs of two consecutive words).…”

Section: Methods (mentioning)

confidence: 99%

“…The set of terms consists of both unigrams (single words) and bigrams (pairs of two consecutive words). Using a limited number of meaningful terms as features for document representation has been proven effective in our earlier work (40, 41). To reduce the number of features, we first annotate documents using two readily available biomedical NER tools, Pubtator (42–44) and BeCAS (45).…”

Section: Methods (mentioning)

confidence: 99%

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

et al. 2019

Self Cite


Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory’s Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.


Integrating image caption information into biomedical document classification in support of biocuration

Jiang, Kadin, et al. 2020

Database


Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
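As a quick consistency check on the figures reported in this abstract, the f-measure can be recomputed from the stated precision and recall. This sketch assumes the reported f-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's f-measure is F1 with equal weighting.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
f_2020 = f_measure(0.698, 0.784)
print(round(f_2020, 4))
```

Under that assumption the recomputed value is about 0.7385, in line with the reported 0.738; the small difference is expected because the inputs here are already rounded to three decimals.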

