2019
DOI: 10.3390/mti3030058

Unsupervised Keyphrase Extraction for Web Pages

Abstract: Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often applied only to clean corpora such as abstracts and articles from academic journals or sets of scraped texts from a single domain. However, textual data from web pages differ from …
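The abstract's point that web-page text differs from clean corpora can be illustrated with a minimal sketch (not the paper's pipeline; the class name and sample page are illustrative): raw HTML must first be stripped of markup, scripts, and styles before any keyphrase candidates can be formed, and boilerplate such as navigation text still survives the stripping. Python standard library only.

# Minimal sketch, assuming nothing about the paper's actual preprocessing.
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text that appears outside <script>/<style>/<noscript> tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<nav>Home | About | Login</nav>"
        "<p>Keyphrase extraction for web pages.</p>"
        "<script>var x = 1;</script></body></html>")
extractor = VisibleTextExtractor()
extractor.feed(page)
print(extractor.text())  # navigation noise ("Home | About | Login") survives, unlike in clean corpora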

Cited by 6 publications (5 citation statements)
References 22 publications
“…As an extension to the index navigation application, a further goal is to develop and implement 'see also' functionality, so that when a user selects a particular file, the application suggests and provides links to related files in the indexed corpus. This kind of functionality, now widespread and familiar, is one outcome of the half-century of fundamental research into information retrieval to which an extensive literature attests (e.g., in relation to the present investigations, refs [1][2][3][4][5][6][7][8][9]). Increasingly, such functionality is delivered by AI-based approaches such as those developed by, for example, UNSILO (https://unsilo.ai) and Yewno (https://www.yewno.com).…”
Section: Aims and Context
confidence: 88%
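A minimal illustrative sketch of the 'see also' lookup described in the citation above, assuming a precomputed table of pairwise similarity scores between files (the function name and data are hypothetical, not the cited system):

# Hypothetical 'see also' helper: given precomputed pairwise similarity scores,
# suggest the k most related files for the file a user has selected.
def see_also(selected_file, similarity_scores, k=3):
    # similarity_scores: {(file_a, file_b): score, ...}, each unordered pair stored once
    related = []
    for (a, b), score in similarity_scores.items():
        if a == selected_file:
            related.append((b, score))
        elif b == selected_file:
            related.append((a, score))
    related.sort(key=lambda pair: pair[1], reverse=True)
    return related[:k]

scores = {("intro.txt", "methods.txt"): 0.82,
          ("intro.txt", "appendix.txt"): 0.15,
          ("methods.txt", "results.txt"): 0.64}
print(see_also("intro.txt", scores))  # [('methods.txt', 0.82), ('appendix.txt', 0.15)]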
“…As an extension to the index navigation application, a further goal is to develop and implement 'see also' functionality, so that when a user selects a particular file, the application suggests and provides links to related files in the indexed corpus. This kind of functionality, which is now widespread and familiar, represents the fruition of half a century of fundamental research into information retrieval, to which an extensive literature attests (e.g., in relation to the present investigations, refs [1][2][3][4][5][6][7][8][9]). Increasingly, such functionality is delivered by AI-based technologies such as those developed by, for example, UNSILO (https://unsilo.ai) and Yewno (https://www.yewno.com).…”
Section: Aims and Context
confidence: 96%
“…
def find_intersection(dictA, dictB):
    # Return the elements that are common to dictA and dictB
    # Inputs are now dictionaries of the form {phr_id: signif, ... }
    intersect = []
    sum_A_sigs = 0.0
    sum_B_sigs = 0.0
    for key in dictA:                # key is a phrase ID
        sig1 = dictA[key]            # sig1: significance in file A of phrase with ID value = key
        if key in dictB:
            sig2 = dictB[key]        # sig2: significance in file B of phrase with ID value = key
            intersect.append([key])  # add to intersection if phrase is common to A and B
            sum_A_sigs = sum_A_sigs + sig1
            sum_B_sigs = sum_B_sigs + sig2
    # Compute the significance-based weighting by which to multiply the A-B link strength
    # - the product of the average significance values for A and B:
    intersect_size = len(intersect)
    av_A_sig = sum_A_sigs / (1 + intersect_size)
    av_B_sig = sum_B_sigs / (1 + intersect_size)
    tot_sig = av_A_sig * av_B_sig
    # Return the intersection size and combined significance:
    return (intersect_size, tot_sig)

(vi) The overall similarity between files A and B is calculated as the product of an adjusted version of the total significance, taking into account phrase frequency statistics and the number of phrase components as outlined above, and the Overlap Multiplication (OM) term already discussed:

AB_score = (math.sqrt(tot_sig) * 100) * (intersect**2 / ((1 + A_only) * (1 + B_only)))
…”
Section: Topical Relationships
confidence: 99%
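A short usage sketch of the quoted fragment, with hypothetical phrase-significance dictionaries. A_only and B_only are assumed here to be the counts of phrases unique to each file, which the excerpt itself does not define:

import math

# Hypothetical inputs: phrase ID -> significance for two files (illustrative values only).
dictA = {101: 0.9, 102: 0.4, 103: 0.7, 104: 0.2}
dictB = {102: 0.5, 103: 0.6, 200: 0.8}

intersect, tot_sig = find_intersection(dictA, dictB)

# Assumed interpretation of A_only / B_only: phrases found in only one of the two files.
A_only = len(set(dictA) - set(dictB))  # 2
B_only = len(set(dictB) - set(dictA))  # 1

AB_score = (math.sqrt(tot_sig) * 100) * (intersect**2 / ((1 + A_only) * (1 + B_only)))
print(intersect, round(tot_sig, 4), round(AB_score, 2))  # 2 0.1344 24.44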
“…Various keyphrase extraction methods have been developed to support the aforementioned applications [8], [9], [7], [10], [11], [12]. Domain-specific strategies [9], for example, need knowledge of the application domain, whereas linguistic approaches [9] demand language proficiency.…”
Section: Introduction
confidence: 99%
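None of the cited extraction methods is reproduced here; purely as an illustration of what a minimal unsupervised, frequency-based keyphrase extractor looks like (RAKE-style candidate selection with a small assumed stop-word list, not one of the approaches cited above):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "for", "to", "on", "with", "from", "such", "as"}

def extract_keyphrases(text, max_len=3, top_k=5):
    # Split at punctuation and stop words to form candidate phrases, then score
    # candidates by raw frequency (kept deliberately minimal for illustration).
    candidates = []
    for chunk in re.split(r"[.,;:!?()\[\]\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", chunk):
            if word in STOPWORDS:
                if current:
                    candidates.append(" ".join(current[:max_len]))
                    current = []
            else:
                current.append(word)
        if current:
            candidates.append(" ".join(current[:max_len]))
    return Counter(candidates).most_common(top_k)

sample = ("Keyphrase extraction is an important part of natural language processing. "
          "Unsupervised keyphrase extraction for web pages differs from keyphrase "
          "extraction on clean corpora such as academic abstracts.")
print(extract_keyphrases(sample))  # top candidate: ('keyphrase extraction', 2)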