Proceedings of the 24th International Conference on World Wide Web 2015
DOI: 10.1145/2740908.2741736

Big Scholarly Data in CiteSeerX

Abstract: We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the World Wide Web. From the perspective of automatic information extraction and modes of alternative search, we examine various functional aspects of this complex system in order to investigate and explore ongoing and future research developments.

citations
Cited by 8 publications
(8 citation statements)
references
References 30 publications
“…Fan [16]: 147 documents (metadata); Färber [17]: 90K documents (references); Grennan [21]: 1B references (references); Saier [51,50]: 1M documents (references); Ley [30,52]: 6M documents (metadata, references); Mccallum [39]: 935 documents (metadata, references); Kyle [34]: 8.1M documents (metadata, references); Ororbia [43]: 6M documents (metadata, references); Bast [6]: 12,098 documents (body text, sections, title); Li [31]: 500K pages (captions, equations, figures, footers, lists, metadata, paragraphs, references, sections, tables)…”
Section: Labeled Datasets and Prior Benchmarks
mentioning
confidence: 99%
“…Publications in chronological order; the labels indicate the first author only. 2 (B) Body text, (M) Metadata, (R) References, (RO) Reading order, (ToC) Table of contents 3 Domain-specific datasets: Computer Science: CiteSeer[43], CORA[39], DBLP[52], FLUX-CiM[10,11], ManCreat[42]; Health Science: PubMed[40], PMC[41] …”
mentioning
confidence: 99%
“…Citation extraction can be performed using ParsCit [92], FLUX-CiM [93] and a CRF-based system [94]. Ororbia et al [47] compared the three methods and concluded that the performance of ParsCit and CRF-based system is comparable. Besides this, it outperforms FLUX-CiM.…”
Section: Citations
mentioning
confidence: 99%
“…Duplicates can be exact duplicates or near-duplicates. In order to eliminate exact duplicates, the SHA1 values of newly extracted documents are matched with that of existing documents and key mapping algorithm can be used for getting rid of near duplicates [47]. The Key Mapping algorithm is also used to align papers to citations.…”
Section: Elimination Of Duplicates
mentioning
confidence: 99%
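
The exact-duplicate step described in the statement above (matching SHA1 values of newly extracted documents against those of existing documents) can be illustrated with a minimal sketch. This is not CiteSeerX's actual code: the function names `sha1_of` and `split_exact_duplicates`, and the representation of documents as `(doc_id, bytes)` pairs, are assumptions made for illustration. Near-duplicate detection with the key mapping algorithm is not covered here.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Return the hex SHA1 digest of a document's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def split_exact_duplicates(existing_hashes, new_documents):
    """Separate new documents into unique ones and exact duplicates.

    existing_hashes: iterable of SHA1 hex digests already in the collection
                     (hypothetical storage format).
    new_documents:   iterable of (doc_id, content_bytes) pairs
                     (hypothetical input format).
    """
    seen = set(existing_hashes)
    unique, duplicates = [], []
    for doc_id, content in new_documents:
        digest = sha1_of(content)
        if digest in seen:
            # Byte-identical to a stored document or to an earlier new one.
            duplicates.append(doc_id)
        else:
            seen.add(digest)
            unique.append(doc_id)
    return unique, duplicates


if __name__ == "__main__":
    existing = {sha1_of(b"paper A")}
    incoming = [("d1", b"paper A"), ("d2", b"paper B"), ("d3", b"paper B")]
    print(split_exact_duplicates(existing, incoming))
```

Because a hash comparison only catches byte-identical files, two PDFs of the same paper that differ in any byte would pass this check, which is why the statement assigns near-duplicates to a separate algorithm.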
“…[14] Khan [15] 4 PDF ParsCit [16,17] CiteSeerX [18] [19] ParsCit FLUX-CiM [20] [21] [22] FLUX-CiM [24] [25] Kikkawa [26] Halfaker [9] [8]…”
mentioning
confidence: 99%