BibPro: A Citation Parser Based on Sequence Alignment Techniques

Yang

IEEE Trans. Knowl. Data Eng.

et al. 2012

Self Cite

The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' needs. As a result, a number of network services have compiled databases from public resources scattered over the Internet. However, publications from different conferences and journals adopt different citation styles. Accurately extracting metadata from a citation string formatted in one of thousands of different styles is an interesting problem that has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even with citations and associated metadata retrieved from the web as training data, our experiments show that BibPro still achieves reasonable performance.
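The abstract builds on sequence alignment without giving details here. As a rough illustration only, the sketch below shows textbook global (Needleman-Wunsch) alignment over two sequences of symbols. The function name `align` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example; this is not BibPro's actual alignment procedure or scoring scheme.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two symbol sequences (Needleman-Wunsch).

    Returns (score, pairs), where each pair is (symbol_a, symbol_b)
    and None marks a gap. Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align both symbols
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

For instance, aligning `["A", "C"]` against `["A", "B", "C"]` pairs the two shared symbols and places a gap opposite `"B"`. In a citation-parsing setting, the aligned positions of a new string against a known template are what would let field labels be transferred, though how BibPro does this is not described in this text.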

Section: Related Work (mentioning)

confidence: 99%



2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

“…BibPro [13] is a template-based citation parser, and the key idea of BibPro is using the order of punctuation marks and reserved words in a citation string to represent its citation style. For a given citation string, BibPro encodes it as a protein sequence, which preserves citation style information.…”

Section: Bibpro (mentioning)

confidence: 99%
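The statement above describes representing a citation's style by the order of its punctuation marks and reserved words. A minimal sketch of that idea follows, under stated assumptions: the `RESERVED` word list, the symbol alphabet (`W` for ordinary words, `R` for reserved words, `N` for numbers), and the function name `style_sequence` are all hypothetical. BibPro's real encoding, described in the quote as a protein sequence, is not specified in this text.

```python
import re

# Hypothetical reserved words; the actual list used by BibPro is not given here.
RESERVED = {"in", "pp", "vol", "no", "proc", "eds"}

# Tokens: runs of letters, runs of digits, or a single punctuation character.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def style_sequence(citation):
    """Reduce a citation string to a list of style symbols.

    Punctuation is kept verbatim; words become 'W' (or 'R' if reserved)
    and numbers become 'N'. Consecutive 'W' or 'N' symbols are collapsed
    so that the sequence reflects layout rather than content length.
    """
    symbols = []
    for tok in TOKEN_RE.findall(citation):
        if tok.isalpha():
            sym = "R" if tok.lower() in RESERVED else "W"
        elif tok.isdigit():
            sym = "N"
        else:
            sym = tok  # punctuation mark, kept as-is
        if sym in ("W", "N") and symbols and symbols[-1] == sym:
            continue  # collapse runs of plain words or numbers
        symbols.append(sym)
    return symbols
```

Two citations written in the same style but with different authors and titles would then map to similar symbol sequences, which is what makes a style-level comparison possible.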

“…By using these two properties, our approach first analyzes the DOM tree and finds a tree level where nodes are most likely to represent citation records. To estimate whether a node represents a citation record, our previous work "BibPro" [13], which was designed for parsing a citation record into several fields (e.g., author, title, venue), is applied to calculate the probability. Given the string of a node, BibPro outputs the probability that the string is a citation string; hence we can find the tree level in the DOM tree where citation records exist.…”

Section: Introduction (mentioning)

confidence: 99%

Parsing Publication Lists on the Web

Yang

2010

Self Cite

Researchers usually present their publication records (which we call citation records in this paper) in publication lists on the Web. These lists can be an important data source for many applications, providing more publication records than some digital libraries such as DBLP. However, it is still not easy to design an algorithm to extract citation records from publication lists because of the diversity of page layouts and citation formats. In this paper, we propose an automatic approach to extract citation records from publication list pages by utilizing two properties. First, citation records are usually represented as nodes at the same level in the DOM tree. Second, citation records on the same page are presented with similar HTML tags. Extensive experiments are conducted to measure the effects of all parameters and the system performance. Experimental results show that our approach performs stably and well (with an F-measure of 86.2% on average).
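The two properties named in the abstract (records sit at the same DOM level and share similar HTML tags) can be illustrated with a much simplified heuristic: find the largest group of same-tag siblings on the page. This is only a sketch using Python's standard `html.parser`; the class `SiblingCounter`, the function `likely_record_level`, and the `VOID_TAGS` set are hypothetical names, and the sketch leaves out the BibPro-based probability that the paper uses to judge whether a node is a citation record.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SiblingCounter(HTMLParser):
    """Counts same-tag children under each parent while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open elements as (tag, node_id)
        self.groups = Counter()  # (depth, parent_id, tag) -> number of children
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        parent_id = self.stack[-1][1] if self.stack else -1
        self.groups[(len(self.stack), parent_id, tag)] += 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def likely_record_level(html):
    """Return (depth, tag, count) for the largest same-tag sibling group,
    or None if the document contains no elements."""
    parser = SiblingCounter()
    parser.feed(html)
    parser.close()
    if not parser.groups:
        return None
    (depth, _parent, tag), count = max(parser.groups.items(), key=lambda kv: kv[1])
    return depth, tag, count
```

On a page whose publications are `<li>` items inside one `<ul>`, this returns the depth of those items. A real page would also need a per-node check that the text looks like a citation, since navigation menus and other lists form large sibling groups too.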

Concurrency and Computation

Citation entity recognition method using multi‐feature semantic fusion based on deep learning

Gao

Zhang

Cao

et al. 2021

An effective entity recognition method can quickly and accurately identify citation entities to facilitate citation comparison, thereby reducing the occurrence of academic fraud and similar misconduct. However, no highly effective solution to this problem exists so far. In recent years, neural network models for named entity recognition (NER) have shown better performance on general-domain datasets. After creating a multi-feature citation dataset, the article proposes a contextual multi-feature embedding (CMFE) method for word embedding, which uses multiple features to enhance semantics and a CNN to obtain multi-level features. Based on CMFE, a multi-feature semantic fusion model (MFSFM) is proposed. It uses a multi-convolution-kernel mixed residual CNN module to obtain local attention information and to increase sensitivity to entity boundary information. BiLSTM and LSTM layers are used for sequence learning. Experimental results on Chinese citation datasets and Chinese-English mixed citation datasets show that CMFE represents semantics better and that MFSFM performs citation entity recognition well. Finally, experimental results on the CoNLL-2003 dataset show that the approach also generalizes to general-domain NER.