2014 IEEE International Conference on Big Data (Big Data) 2014
DOI: 10.1109/bigdata.2014.7004457

Dealing with heterogeneous big data when geoparsing historical corpora

Abstract: It has long been known that 'variety' is one of the key challenges and opportunities of big data. This is especially true when we consider the variety of content in historical corpora resulting from large-scale digitisation activities. Collections such as Early English Books Online (EEBO) and the British Library 19th Century Newspapers are extremely large and heterogeneous data sources containing a variety of content in terms of time, location, topic, style and quality. The range of geographical locations refe…

Cited by 13 publications
(11 citation statements)
References 7 publications
“…Conducting this process in an entirely automated manner was found not to be satisfactory for the complex place-names found in the CLDW (see Butler et al 2017). The process was enhanced using concordance geoparsing - where a small subset of the text is geoparsed, the results are checked, and any corrections fed into processing subsequent subsets (Rupp et al 2014) - and a considerable amount of manual checking.…”
Section: From Text To GIS Database
mentioning
confidence: 99%
“…These place-names then need to be matched to coordinates taken from a gazetteer to provide locations. When performed automatically, this process is known as geo-parsing [12,25]. At the end of this process, the coordinates can be used as spatial data and the cotext (the text immediately surrounding the place-name) can be used as attribute data, along with other information from either the text, its metadata or other datasets, such as gazetteers.…”
Section: Background 21 the Lake District Deep Mapping Project
mentioning
confidence: 99%
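
The matching step described in the statement above (place-names looked up in a gazetteer, with the surrounding cotext kept as attribute data) can be illustrated with a minimal sketch. It is not the citing authors' pipeline: the `GAZETTEER` dictionary, its approximate coordinates, the `geoparse` function and the three-word cotext window are all assumptions made for illustration, and exact string lookup stands in for a real geoparser's named-entity recognition and disambiguation.

```python
import re

# Hypothetical toy gazetteer: place-name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Keswick": (54.60, -3.13),
    "Ambleside": (54.43, -2.96),
}

def geoparse(text, gazetteer, window=3):
    """Return one record per gazetteer match, with coordinates and cotext."""
    tokens = re.findall(r"\w+", text)
    records = []
    for i, token in enumerate(tokens):
        if token in gazetteer:
            lat, lon = gazetteer[token]
            # Cotext: up to `window` words on either side of the place-name.
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            records.append({
                "place": token,
                "lat": lat,
                "lon": lon,
                "cotext": " ".join(left + [token] + right),
            })
    return records

records = geoparse("We walked from Keswick to Ambleside before dark.", GAZETTEER)
for record in records:
    print(record)
```

Each record pairs spatial data (the coordinates) with attribute data (the cotext), which is the shape a GIS layer would need.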
“…In previous research [24,25] we employed an Early Modern English variant spelling detector (VARD) to match historical to modern forms, and the DEEP 'Historical Gazetteer of England's Place-Names' 4 to improve coverage for the off-the-shelf geoparser. Based on this prior work we were able to determine that the Edinburgh Geoparser identified only 1277 of the 3718 place-name entities included in the Gold Standard subset.…”
Section: Gold Standard Corpus
mentioning
confidence: 99%
“…PNCs also provide us with a simpler, and often more accurate, way of geoparsing a text. By just geoparsing the text surrounding a search term we can restrict geoparsing to relevant parts of the corpus making the process quicker and easier to check for errors, a process known as concordance geoparsing (Rupp et al 2014). Figure 1a shows the PNCs created where toponyms are found within ten words of the search terms "tourist" and "tourists".…”
mentioning
confidence: 99%
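
The windowing idea in the statement above, keeping only toponyms found within ten words of a search term, can be sketched as follows. This is a hedged illustration rather than the method of Rupp et al.: the function name `place_name_cooccurrences`, the sample sentence and the hand-supplied toponym set are assumptions, and in practice the toponyms would come from a geoparser run over the concordance lines only.

```python
import re

def place_name_cooccurrences(text, search_terms, toponyms, span=10):
    """List (search term, toponym, distance in words) for toponyms
    that occur within `span` words of a search term."""
    tokens = re.findall(r"\w+", text)
    terms = {t.lower() for t in search_terms}
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() not in terms:
            continue
        # Restrict the search to the concordance window around the term.
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i and tokens[j] in toponyms:
                hits.append((token, tokens[j], abs(j - i)))
    return hits

sample = "Every tourist who reaches Grasmere should also visit Rydal in fine weather."
pncs = place_name_cooccurrences(
    sample, {"tourist", "tourists"}, {"Grasmere", "Rydal"}
)
print(pncs)
```

Because only the windows around the search terms are examined, the output is small enough to check by hand, which is the practical benefit the citing text attributes to concordance geoparsing.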