WebRank: Language-Independent Extraction of Keywords from Webpages

Shah, Himat; Mariescu-Istodor, Radu; Fränti, Pasi

doi:10.1109/pic53636.2021.9687047

2021 IEEE International Conference on Progress in Informatics and Computing (PIC) 2021

Cited by 1 publication

References 23 publications

Combining statistical, structural, and linguistic features for keyword extraction from web pages

Shah, Fränti

2022

ACI

Keywords are commonly used to summarize text documents. In this paper, we perform a systematic comparison of methods for automatic keyword extraction from web pages. The methods are based on three different types of features: statistical, structural and linguistic. Statistical features are the most common, but there are other clues in web documents that can also be used. Structural features utilize styling codes like header tags and links, but also the structure of the web page. Linguistic features can be based on detecting synonyms, semantic similarity of the words and part-of-speech tagging, but also concept hierarchy or a concept graph derived from Wikipedia. We compare different types of features to find out the importance of each of them. One of the key results is that stop word removal and other pre-processing steps are the most critical. The most successful linguistic feature was a pre-constructed list of words that had no synonyms in WordNet. A new method called ACI-rank is also compiled from the best working combination.
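To make the distinction between statistical and structural features concrete, here is a minimal Python sketch. It is not ACI-rank or WebRank; it only illustrates the general idea of counting words after stop word removal (a statistical feature) and weighting words that appear inside header and link tags (a structural feature). The stop word list, the tag weights, and the names `WeightedTextParser` and `extract_keywords` are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, tiny stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}

# Hypothetical weights: words inside these tags count more than body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "a": 1.5}


class WeightedTextParser(HTMLParser):
    """Accumulates a score per word, using the innermost weighted tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open weighted tags
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Body text outside any weighted tag gets weight 1.0.
        weight = TAG_WEIGHTS[self.stack[-1]] if self.stack else 1.0
        # Match runs of Unicode letters, so the tokenizer is not tied to English.
        for word in re.findall(r"[^\W\d_]+", data.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                self.scores[word] += weight


def extract_keywords(html, top_k=5):
    """Return the top_k highest-scoring words from an HTML string."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return [word for word, _ in parser.scores.most_common(top_k)]
```

Calling `extract_keywords(page_html, top_k=10)` on a page's HTML string returns a ranked list of candidate keywords. Linguistic features such as synonym detection or part-of-speech tagging are deliberately left out of this sketch, since they need external resources.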
