We propose the χ-index as a bibliometric indicator that generalises the h-index. While the h-index is determined by the maximum square that fits under the citation curve of an author when plotting the number of citations in decreasing order, the χ-index is determined by the maximum area rectangle that fits under the curve. The height of the maximum rectangle is the number of citations c_k to the kth most-cited publication, where k is the width of the rectangle. The χ-index is then defined as √(k · c_k), the square root of the rectangle's area, for convenience of comparison with the h-index and other similar indices. We present a comprehensive empirical comparison between the χ-index and other bibliometric indices, focusing on a comparison with the h-index, by analysing two datasets: a large set of Google Scholar profiles and a small set of Nobel prize winners. Our results show that, although the χ and h indices are strongly correlated, they do exhibit significant differences. In particular, we show that, for these datasets, there are a substantial number of profiles for which χ is significantly larger than h. Furthermore, restricting these profiles to the cases when c_k > k or c_k < k corresponds to, respectively, classifying researchers as either tending to be influential, i.e. having many more than h citations, or tending to be prolific, i.e. having many more than h publications.
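The rectangle construction described in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `h_index` and `chi_index` are hypothetical, and the χ-index is taken here to be the square root of the maximum rectangle area k · c_k, which is an assumption based on the abstract's wording.

```python
import math


def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # In a decreasing list, c_k >= k holds exactly for the first h ranks.
    return sum(1 for k, c_k in enumerate(ranked, start=1) if c_k >= k)


def chi_index(citations):
    """Return (chi, k, c_k) for the maximum area rectangle under the curve.

    Assumption: chi = sqrt(k * c_k), where k is the rectangle's width and
    c_k (citations to the kth most-cited publication) is its height.
    """
    ranked = sorted(citations, reverse=True)
    best_area, best_k, best_ck = 0, 0, 0
    for k, c_k in enumerate(ranked, start=1):
        if k * c_k > best_area:
            best_area, best_k, best_ck = k * c_k, k, c_k
    return math.sqrt(best_area), best_k, best_ck


# Made-up citation counts, purely for illustration.
profile = [50, 3, 2, 1]
chi, k, c_k = chi_index(profile)
print(h_index(profile), round(chi, 2), k, c_k)
```

For this made-up profile the best rectangle has width k = 1 and height c_k = 50, so c_k > k, which the abstract would read as tending to be influential. Because the h-square is itself a rectangle under the curve, χ can never be smaller than h.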
It is of great interest to researchers and scholars in many disciplines (particularly those working on cultural heritage projects) to study parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this paper, we present in detail how we build a digital infrastructure that can facilitate the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimised suffix tree, and a generalised edit distance. A comprehensive evaluation through crowd-sourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.
We propose a two-dimensional bibliometric index that strikes a balance between quantity (as measured by the number of publications of a researcher) and quality (as measured by the number of citations to those publications). While the square of the h-index is determined by the maximum area square that fits under the citation curve of an author when plotting the number of citations in decreasing order, the rec-index is determined by the maximum area rectangle that fits under the curve. In this context we may distinguish between authors with a few very highly-cited publications, who may have carried out some influential research, and prolific authors, who may have many publications but fewer citations per publication. The influence of a researcher may be measured via a restricted version of the rec-index, the rec_I-index, which is the maximum area vertical rectangle that fits under the citation curve. Similarly, the prolificity of a researcher may be measured via the rec_P-index, which is the maximum area horizontal rectangle that fits under the citation curve. This leads to the proposal of the two-dimensional bibliometric index (rec_I, rec_P), which captures both aspects of a researcher's output. We present a comprehensive empirical analysis of this two-dimensional index on two datasets: a large set of Google Scholar profiles (representing "typical" researchers) and a small set of Nobel prize winners. Our results demonstrate the potential of this two-dimensional index, since for both datasets there is a statistically significant number of researchers for whom rec_I is greater than rec_P. In particular, for nearly 25% of the Google Scholar researchers and for nearly 60% of the Nobel prize winners, rec_I is greater than rec_P.
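A minimal Python sketch of the two restricted indices follows. The abstract does not spell out what makes a rectangle "vertical" or "horizontal", so this sketch assumes a vertical rectangle is at least as tall as it is wide and a horizontal rectangle is at least as wide as it is tall. The function name `rec_indices` is hypothetical, and the paper's exact definitions may differ.

```python
def rec_indices(citations):
    """Return (rec_I, rec_P) as rectangle areas under the citation curve.

    Assumptions (not taken from the paper's text):
      * vertical rectangle: width k, height c_k, with c_k >= k (so k <= h)
      * horizontal rectangle: width k, height min(c_k, k); for k <= h this
        is at best the h-square, and for k > h it is k * c_k
    """
    ranked = sorted(citations, reverse=True)
    # h-index: number of ranks k with c_k >= k in the decreasing list.
    h = sum(1 for k, c_k in enumerate(ranked, start=1) if c_k >= k)

    # Influence: best rectangle among the h most-cited publications.
    rec_i = max(
        (k * c_k for k, c_k in enumerate(ranked, start=1) if k <= h),
        default=0,
    )
    # Prolificity: the h-square, or a wider and flatter rectangle beyond rank h.
    rec_p = max(
        [h * h] + [k * c_k for k, c_k in enumerate(ranked, start=1) if k > h]
    )
    return rec_i, rec_p


# Made-up profiles, purely for illustration.
few_highly_cited = [50, 3, 2, 1]
many_modestly_cited = [4, 4, 4, 4, 3, 3, 3, 3, 3, 3]
print(rec_indices(few_highly_cited))
print(rec_indices(many_modestly_cited))
```

Under these assumptions the first profile has rec_I greater than rec_P and the second has rec_P greater than rec_I, matching the influential versus prolific distinction drawn in the abstract.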
Purpose: The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives. Design/methodology/approach: The authors explore a novel and relatively simple approach, using a character-based statistical language model combined with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents. Findings: The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents arising from differences in dialect, authorship and language change over time, as well as from errors introduced during digitisation through inaccurate transcription and optical character recognition. Research limitations/implications: A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, domains and of varying quality. Practical implications: The approach described can be applied to any digital archive of modern and contemporary texts. This makes the approach applicable to digital archives recording historic texts, but also to those composed of more recent news articles, for example. Social implications: The analysis of “parallel passages” enables researchers to quantify the presence and extent of text reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts. Originality/value: The approach is novel and addresses a need among humanities researchers for tools that can identify similar documents and local similarities, represented by shared text sequences, in a potentially vast archive of documents. As far as the authors are aware, no tools currently exist that provide the same level of tolerance to the language of the documents.