2021
DOI: 10.1108/jd-02-2021-0032

Named-entity recognition for early modern textual documents: a review of capabilities and challenges with strategies for the future

Abstract: Purpose: By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward. Design/methodology/approach: Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER…

Cited by 15 publications (11 citation statements)
References 33 publications
“…To improve the accuracy of the non-NIL candidates provided by the EL system, we use a post-processing filter 14 based on heuristics and data provided by Wikidata and DBpedia. The goals of the filter are to: (1) Remove candidates which are unlikely such as disambiguation pages or people born after the document publication; (2) Verify that the tokens of a particular named entity are linked to the same candidates;…”
Section: Filtering | Citation type: mentioning
confidence: 99%
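
The two filter goals quoted above can be illustrated with a minimal sketch. It is not the citing authors' implementation: the `Candidate` class, its field names, the `E1`-style identifiers and the reading of goal (2) as a set intersection across tokens are all assumptions made for illustration. In practice the disambiguation flag and birth year would be looked up in Wikidata or DBpedia rather than passed in by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical entity-linking candidate; all field names are assumptions."""
    entity_id: str                       # placeholder id, not a real knowledge-base id
    is_disambiguation_page: bool = False
    birth_year: Optional[int] = None     # None when unknown or not a person


def is_plausible(candidate: Candidate, publication_year: int) -> bool:
    """Goal (1): drop disambiguation pages and people born after publication."""
    if candidate.is_disambiguation_page:
        return False
    if candidate.birth_year is not None and candidate.birth_year > publication_year:
        return False
    return True


def filter_entity(token_candidates: list[list[Candidate]],
                  publication_year: int) -> list[str]:
    """Goal (2), read here as: keep only ids shared by every token of the entity."""
    shared: Optional[set[str]] = None
    for candidates in token_candidates:
        ids = {c.entity_id for c in candidates if is_plausible(c, publication_year)}
        shared = ids if shared is None else shared & ids
    return sorted(shared or set())


# Two tokens of one named entity, each with its own candidate list (made-up data).
tokens = [
    [Candidate("E1", birth_year=1600), Candidate("E2", is_disambiguation_page=True)],
    [Candidate("E1", birth_year=1600), Candidate("E3", birth_year=1950)],
]
print(filter_entity(tokens, publication_year=1700))  # ['E1']
```

An empty result would correspond to treating the mention as NIL. Whether the original filter does this, or falls back to some other candidate, is not stated in the quoted passage.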
“…However, this has not been the case for historical documents, due to certain characteristics, which make their processing particularly difficult. A few exceptions exist [9,13,14], but in far smaller numbers than for contemporary documents. Among the challenges, such tools need to be able to deal with errors produced by OCR systems, to manage some specific vocabulary, and also to handle spelling variations with respect to modern standards.…”
Section: Introduction | Citation type: mentioning
confidence: 99%
“…For example, in this case, the GIS system had mislabeled the coordinates of Thean Hou Temple in Kuala Lumpur as the Taichung Mazu Temple in Taiwan. In Europe and the USA, this problem is mostly solved by combining the annotations on geographical coordinates with the authoritative tables provided by a Gazetteer, such as Getty Thesaurus of Geographic Names, Pleiades and so forth (Humbel et al., 2021). Thus, it is suggested that the geographic information regarding the Malaysian Chinese organization should also be established with the same kind of authoritative table from sources, such as the Roster of Malaysian Chinese Associations, to reduce the errors in geographic information.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…The researcher pointed out that although CNER can successfully identify most of the character entities, it is difficult to directly identify the characters’ real names as some of them are aliases or honorifics. In the future, a much more reliable glossary of names should be established by making use of the organizational registers and donation registers of local Chinese societies to enhance the recognition accuracy of named entities (Humbel et al., 2021; Xu et al., 2020). Furthermore, future research may focus on collecting more external information, such as character representations and cross-lingual information, and maintaining more and higher quality CNER data sets like ENER, to improve the model performance of CNER (Liu et al., 2022).…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…There are four approaches to NER: i) a rule-based approach, which does not require annotated data because it relies on artificial rules; ii) an unsupervised learning approach; iii) a feature-based supervised learning approach that relies on supervised learning algorithms with careful feature engineering; and iv) a deep-learning-based approach, which automatically finds the required representation for detecting or classifying raw input in an end-to-end manner [3], [4]. NER is a straightforward process for humans because many named entities are self-names, and most of them have initial capital letters and can be easily recognized, but for machines, it is very difficult [5]. Information extraction often uses data available on social media, online news, and e-commerce [3].…”
Section: Introduction | Citation type: mentioning
confidence: 99%