Document clustering: TF-IDF approach

Bafna, Prafulla B.; Pramod, Dhanya; Vaidya, Anagha

doi:10.1109/iceeot.2016.7754750

Cited by 179 publications

(67 citation statements)

References 13 publications


“…Secondly, the data also contained many stop words which is never meaningful or useful in this context as explained in Section 2.2. Hence in order to filter those stop words, a list of 500 stop words was used first which filtered the data and removed all stop words from it [1], [4], [5]. A large list of stop words can easily be obtained from many blogs and websites where it is available for free for general public to consume.…”

Section: Data Preprocessing (mentioning)

confidence: 99%

“…Third, one need to count total number of words and their occurrences in all documents. Once these steps are performed, one can apply Term Frequency formula to calculate TF as discussed in Section 2.1 [1], [4], [6].…”

Section: Design (mentioning)

confidence: 99%
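The two citation statements above describe filtering stop words and then counting word occurrences to obtain term frequency. A minimal sketch of those steps follows, assuming the common normalized definition TF(t, d) = (occurrences of t in d) / (total terms in d); the cited Section 2.1 is not reproduced here, so the exact formula may differ. The stop-word set and the names `tokenize` and `term_frequency` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# the citing study reports using a list of 500 stop words.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def term_frequency(tokens):
    """Assumed definition: count of a term divided by total terms in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}

doc = "The cat sat on the mat, and the cat slept."
tf = term_frequency(tokenize(doc))
```

Because the counts are divided by the document length, the values in `tf` sum to 1 for any non-empty document, which makes documents of different lengths comparable.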

“…The processing of structured or semi-structured data in all organizations is becoming very difficult as the data has been increased tremendously [1], [2]. There are many techniques or algorithms that can be used to process data but this study is focused on one of those, known as TF-IDF.…”

Section: Introduction (mentioning)

confidence: 99%


Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents

Qaiser, Ali (2018). IJCA

501

155


This paper discusses the use of TF-IDF (term frequency-inverse document frequency) in examining the relevance of keywords to documents in a corpus. The study focuses on how the algorithm can be applied to a number of documents. First, the working principle of TF-IDF and the steps required to implement it are explained. Second, to verify the findings from executing the algorithm, results are presented, and the strengths and weaknesses of the TF-IDF algorithm are compared. The paper also discusses how such weaknesses can be tackled. Finally, the work is summarized and future research directions are discussed.
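As a rough illustration of the weighting this abstract refers to, the sketch below computes TF-IDF for a toy corpus. It assumes one common variant, weight(t, d) = TF(t, d) × ln(N / DF(t)), with no smoothing; the paper's own formula is not shown on this page and may differ. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: normalized TF times ln(N / DF), without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights

# Toy corpus, already tokenized and stop-word filtered.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
scores = tf_idf(corpus)
```

In this variant a term that occurs in every document receives a weight of zero, while a term unique to one document (such as "mat" above) is weighted most heavily within that document.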


“…In addition, silhouette coefficient, which was proposed by Rousseeuw [28], has been widely used to evaluate clustering results [29,30]. In this study, we employed the mean silhouette value to evaluate the clustering results, which depended on the similarities between one document and both of the other documents in the same cluster and that in the most similar cluster.…”

Section: Experimental Design and Evaluation Index (mentioning)

confidence: 99%
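The mean silhouette value mentioned in the statement above can be sketched as follows. For each item, a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster; the silhouette is (b - a) / max(a, b). This sketch assumes Euclidean distance between numeric vectors, which is an illustrative choice; the citing study works with document similarities. The function name `mean_silhouette` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette over all points, using Euclidean distance (an assumption)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # By convention, a singleton cluster contributes a silhouette of 0.
            scores.append(0.0)
            continue
        # dist(p, p) is 0, so dividing by len(own) - 1 averages over the others.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters that mix distant points
```

Values close to 1 indicate that items sit well inside their own cluster, and negative values suggest that items would fit better in a neighbouring cluster.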

Research Front Detection and Topic Evolution Based on Topological Structure and the PageRank Algorithm

Zhang et al. (2019). Symmetry


Research front detection and topic evolution has for a long time been an important direction for research in the informetrics field. However, most previous studies either simply use a citation count for scientific document clustering or assume that each scientific document has the same importance in detecting the clustering theme in a cluster. In this study, utilizing the topological structure and the PageRank algorithm, we propose a new research front detection and topic evolution approach based on graph theory. This approach is made up of three stages: (1) Setting a time window with appropriate length according to the accuracy of scientific documents clustering results and the time delay of a scientific document to be cited, dividing scientific documents into several time windows according to their years of publication, calculating similarities between them according to their topological structure, and clustering them in each time window based on the fast greedy algorithm; (2) combining the PageRank algorithm and keywords’ frequency to detect the clustering theme, which assumes that the more important a scientific document in the cluster is, the greater the possibility that it is cited by the other documents in the same cluster; and (3) reconstructing the cluster graph where nodes represent clusters and edges’ strengths represent the similarities between different clusters, then detecting research front and identifying topic evolution based on the reconstructed cluster graph. To evaluate the performance of our proposed approach, the scientific documents related to data mining and covered by Science Citation Index Expanded (SCI-EXPANDED) or Social Science Citation Index (SSCI) in Web of Science are collected as a case study. The experiment’s results show that the proposed approach can obtain reasonable clustering results, and it is effective for research front detection and topic evolution.
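Stage (2) of the abstract relies on PageRank to estimate how important each document is within its cluster. The sketch below shows only the generic PageRank power iteration on a small within-cluster citation graph; it is not the authors' full method and omits the keyword-frequency component. The damping factor of 0.85, the iteration count, the function name `pagerank`, and the sample graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {document: [documents it cites]}, with no duplicate targets.

    Returns {document: score}; scores sum to 1.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by documents that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {}
        for node in nodes:
            # Each citing document passes on an equal share of its rank.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in graph.items()
                if node in targets
            )
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

# Hypothetical cluster: A, B and D cite C; D also cites A.
citations = {"A": ["C"], "B": ["C"], "C": [], "D": ["A", "C"]}
ranks = pagerank(citations)
```

Here document C, which is cited by every other document in the cluster, ends up with the highest score, matching the abstract's assumption that more important documents are more likely to be cited within the same cluster.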


“…Their experimentation results shows that the optimal weights of features computed by the algorithm improvises the retrieval results significantly. One of the numerical statistic "tf-idf" helps in placing the weightage of a particular word's importance in a document for text retrieval [18]. We extend the similar metric for image retrieval as well by giving weightage to a particular feature based on the feedback from the end user and adjust this accordingly on every iteration. Yu Suzuki, Masahiro Mitsukawa and Kyoji Kawagoe [19] have used tf-idf approach to find the importance degree of features and by using their method, the CBIR system can find results matching the user query to the closest possible.…”

Section: Related Work (mentioning)

confidence: 99%

Feature Extraction in JPEG domain along with SVM for Content Based Image Retrieval

Hussain, Surendran, Begum (2018). IJET


Content Based Image Retrieval (CBIR) applies computer vision methods to retrieve images from databases. It is based mainly on the user query, which is in visual form rather than the traditional text form. CBIR is applied in fields ranging from surveillance, remote sensing, e-purchase, medical image processing, and security systems to historical research and many others. JPEG, a very commonly used method of lossy compression, reduces the size of an image before it is stored or transmitted. Almost every digital camera on the market stores captured images in JPEG format. The storage industry has seen many major transformations in the past decades, while retrieval technologies are still developing. Although there have been some breakthroughs in text retrieval, the same is not true for image and other multimedia retrieval. Image retrieval in particular has seen many algorithms in the spatial or raw domain, but since the majority of images are stored in JPEG format, it takes time to decode the compressed image before extracting features and retrieving. Hence, in this research work, we focus on extracting the features from the compressed domain itself and then use support vector machines (SVM) to improve the retrieval results. Our proof of concept shows that features extracted in the compressed domain help retrieve images 43% faster than the same set of images in the spatial domain, and that accuracy improves to 93.4% through an SVM-based feedback mechanism.
