Keyword extraction is an important task in several text analysis endeavours. In this paper, we present a critical discussion of the issues and challenges in graph-based keyword extraction methods, along with a comprehensive empirical analysis. We propose a parameterless method for constructing a graph of text that captures the contextual relations between words. We also propose a novel word scoring method based on the connections between concepts. We demonstrate that both proposals are individually superior to those followed by state-of-the-art graph-based keyword extraction algorithms. Combining the proposed graph construction and scoring methods leads to a novel, parameterless keyword extraction method (sCAKE) based on the semantic connectivity of words in the document. Motivated by the limited availability of NLP tools for several languages, we also design and present a language-agnostic keyword extraction (LAKE) method. We eliminate the need for NLP tools by using a statistical filter to identify candidate keywords before constructing the graph. We show that the resulting method is a competent solution for extracting keywords from documents in languages that lack sophisticated NLP support.
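The abstract does not spell out the sCAKE construction, so the following is only a minimal sketch of the general pipeline it describes: build a word graph from local co-occurrence context, then rank nodes with a connectivity-based score. The window size and the weighted-degree score are illustrative stand-ins, not the paper's parameterless method.

```python
# Illustrative sketch only: a generic co-occurrence word graph plus a
# connectivity-based node score. The window size is an assumption; the
# actual sCAKE construction is parameterless.
import networkx as nx

def build_word_graph(tokens, window=3):
    g = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if word != other:
                w = g.get_edge_data(word, other, {"weight": 0})["weight"]
                g.add_edge(word, other, weight=w + 1)
    return g

def score_words(g):
    # Stand-in connectivity score: weighted degree of each word node.
    return {n: sum(d["weight"] for _, _, d in g.edges(n, data=True)) for n in g}

tokens = "graph based keyword extraction builds a graph of words".split()
graph = build_word_graph(tokens)
print(sorted(score_words(graph).items(), key=lambda kv: -kv[1])[:3])
```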
In this paper, we present a supervised framework for automatic keyword extraction from a single document. We model the text as a complex network and construct the feature set by extracting selected node properties from it. Several node properties have been exploited by unsupervised, graph-based keyword extraction methods to discriminate keywords from non-keywords; we exploit the complex interplay of these properties to design a supervised keyword extraction method. The training set is created from the feature set by assigning a label to each candidate keyword depending on whether the candidate is listed as a gold-standard keyword or not. Since the number of keywords in a document is much smaller than the number of non-keywords, the curated training set is naturally imbalanced. We apply the pre-trained model to Hindi and Assamese language documents and observe that it performs equally well on cross-language text even though it was trained only on English language documents. This shows that the proposed method is independent of the domain, collection, and language of the training corpora.
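The abstract names neither the node properties nor the learner, so here is a hedged sketch under assumptions: a few common centrality measures as features, and a class-weighted random forest to offset the keyword/non-keyword imbalance mentioned above.

```python
# Hypothetical sketch: node-property features per candidate word, with a
# class-weighted classifier for the imbalanced training set. The paper's
# actual feature set and learner are not specified in this abstract.
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

def node_features(g):
    pagerank = nx.pagerank(g)
    clustering = nx.clustering(g)
    degree = dict(g.degree())
    return {n: [degree[n], pagerank[n], clustering[n]] for n in g}

def train(graphs, gold_keywords):
    X, y = [], []
    for g, gold in zip(graphs, gold_keywords):
        for word, feats in node_features(g).items():
            X.append(feats)
            y.append(1 if word in gold else 0)
    # class_weight="balanced" compensates for far fewer positive labels.
    clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    return clf.fit(X, y)
```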
In the era of heavy emphasis on deep neural architectures with well-proven and impressive competencies, albeit with massive carbon footprints, we present a simple and inexpensive solution founded on the Bidirectional Encoder Representations from Transformers Next Sentence Prediction (BERT NSP) task to localize sequential discourse coherence in a text. We propose the Fast and Frugal Coherence Detection (FFCD) method, an effective tool that lets an author pinpoint regions of weak coherence at the sentence level and reveals the extent of overall coherence of the document in near real-time. We leverage the pre-trained BERT NSP model for sequential coherence detection at sentence-to-sentence transitions and evaluate the performance of the proposed FFCD approach on coherence detection tasks using publicly available datasets. The mixed performance of our solution compared to state-of-the-art methods for coherence detection invigorates efforts to design explainable and inexpensive solutions downstream of existing large-scale language models.
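Since the approach builds on the standard pre-trained BERT NSP head, the core scoring step can be sketched with the Hugging Face transformers API. The 0.5 flagging threshold is an illustrative assumption, not FFCD's decision rule.

```python
# Sketch: sentence-to-sentence coherence scoring with pre-trained BERT NSP.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def nsp_probability(sent_a, sent_b):
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label 0 means "sent_b follows sent_a"; softmax over the two NSP labels.
    return torch.softmax(logits, dim=-1)[0, 0].item()

def weak_transitions(sentences, threshold=0.5):
    # Flag consecutive pairs whose NSP probability falls below the threshold.
    flagged = []
    for i in range(len(sentences) - 1):
        p = nsp_probability(sentences[i], sentences[i + 1])
        if p < threshold:
            flagged.append((i, p))
    return flagged
```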
We describe our approach for the 1st Computational Linguistics Lay Summary Shared Task (CL-LaySumm20). The task is to produce non-technical summaries of scholarly documents; the summary should be within easy grasp of a layman who may not be well versed in the domain of the research article. We propose a two-step divide-and-conquer approach. First, we judiciously select segments of the documents that are not overly pedantic and are likely to be of interest to the laity, and over-extract sentences from each segment using an unsupervised, network-based method. Next, we perform abstractive summarization on these extractions and systematically merge the abstractions. We run ablation studies to establish that each step in our pipeline is critical for improving the quality of the lay summary. Our approach leverages state-of-the-art pre-trained deep neural network based models as zero-shot learners to achieve high scores on the task.
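The abstract does not name the pre-trained model, so the sketch below assumes facebook/bart-large-cnn as a zero-shot abstractive stand-in, and a simple PageRank-over-sentence-similarity graph for the unsupervised over-extraction step; both choices are illustrative assumptions rather than the shared-task system itself.

```python
# Sketch of the extract-then-abstract pipeline under stated assumptions.
import networkx as nx
from transformers import pipeline

abstractive = pipeline("summarization", model="facebook/bart-large-cnn")

def over_extract(sentences, k):
    # Network-based extraction: PageRank over a word-overlap similarity
    # graph, deliberately keeping more sentences than strictly needed.
    g = nx.Graph()
    for i, a in enumerate(sentences):
        for j, b in enumerate(sentences[i + 1:], start=i + 1):
            overlap = len(set(a.lower().split()) & set(b.lower().split()))
            if overlap:
                g.add_edge(i, j, weight=overlap)
    ranks = nx.pagerank(g)
    top = sorted(ranks, key=ranks.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

def lay_summary(segments, k=8):
    parts = []
    for sentences in segments:  # one sentence list per selected segment
        extract = " ".join(over_extract(sentences, k))
        out = abstractive(extract, max_length=80, min_length=20)
        parts.append(out[0]["summary_text"])
    return " ".join(parts)  # merge the per-segment abstractions
```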
In this paper, we propose a semi-automatic system for title construction from scientific abstracts. The system extracts and recommends impactful words from the text, which the author can creatively use to construct an appropriate title for the manuscript. The work is based on the hypothesis that keywords are good candidates for title construction. We extract important words from the document by inducing a supervised keyword extraction model. The model is trained on novel features extracted from a graph-of-text representation of the document. We empirically show that these graph-based features are capable of discriminating keywords from non-keywords. We further establish empirically that the proposed approach can be applied to any text irrespective of the training domain and corpus. We evaluate the proposed system by computing the overlap between extracted keywords and the title words of the documents, and we observe a macro-averaged precision of 82%.
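The evaluation described (overlap of extracted keywords with title words, macro-averaged precision over documents) can be stated precisely in a few lines. This sketch assumes simple lowercased whitespace tokenization of titles, which the abstract does not specify.

```python
# Sketch of the evaluation above: per-document precision of extracted
# keywords against title words, macro-averaged over the corpus.
def precision(keywords, title):
    title_words = set(title.lower().split())
    hits = sum(1 for k in keywords if k.lower() in title_words)
    return hits / len(keywords) if keywords else 0.0

def macro_precision(docs):
    # docs: list of (extracted_keywords, title) pairs
    return sum(precision(kws, title) for kws, title in docs) / len(docs)

docs = [(["keyword", "extraction", "title"],
         "Title construction via keyword extraction")]
print(macro_precision(docs))  # 1.0 for this toy example
```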