2018
DOI: 10.5120/ijca2018917395
Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents

Abstract: In this paper, the use of TF-IDF (term frequency-inverse document frequency) in examining the relevance of keywords to documents in a corpus is discussed. The study focuses on how the algorithm can be applied to a number of documents. First, the working principle and the steps that should be followed to implement TF-IDF are elaborated. Secondly, in order to verify the findings from executing the algorithm, results are presented, and then the strengths and weaknesses of the TF-IDF algorithm are compared. Th…
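For reference, the weighting the abstract refers to is commonly defined as tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the number of times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. Variants differ in normalization and smoothing, so this is the textbook form rather than necessarily the exact formulation used in the paper.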

Cited by 504 publications (227 citation statements); References 6 publications
“…The VectorModel.zip archive contains the following files: codes.txt - codes for the Huffman tree [23]; config.json - settings for the Doc2Vec algorithm; frequencies.txt - tf-idf [24] and bag-of-words [25] metrics; huffman.txt - coordinates of the Huffman tree points; labels.txt - a list of documentation page identifiers in base64 format; syn0.txt - connection weights between the input and hidden layers of the neural network; syn1.txt - connection weights between the hidden and output layers of the neural network.…”
Section: Software and Hardware
unclassified
“…The inverse-document-frequency component scales each word's value based on how many documents within the corpus contain it, attributing more importance to words that show up in a smaller subset of the overall corpus. This serves as a method of reducing the relevancy attached to words which appear very commonly in every document, and may not be useful in distinguishing between them [31]. Table 3 presents notional bag-of-words and TF-IDF matrices stemming from the same data.…”
Section: Non-encoded Data
mentioning
confidence: 99%
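To make the down-weighting concrete, here is a minimal pure-Python sketch of the idea described in the statement above. The toy corpus, the whitespace tokenization, and the unsmoothed idf = log(N / df) form are assumptions made for illustration, not details taken from the cited paper.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from the cited paper).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

# Inverse document frequency: a term found in every document gets idf = 0,
# so its tf-idf weight vanishes no matter how often it occurs.
idf = {term: math.log(N / df[term]) for term in df}

# TF-IDF for the first document: raw term count scaled by idf.
tf = Counter(tokenized[0])
tfidf = {term: tf[term] * idf[term] for term in tf}

print(idf["the"])    # 0.0   -> "the" appears in all three documents
print(tfidf["cat"])  # ~0.41 -> "cat" appears in two of the three documents
print(tfidf["mat"])  # ~1.10 -> "mat" is unique to this document
```

A word such as "the", which occurs in every document, receives idf = 0 and therefore contributes nothing to the tf-idf vector, no matter how many times it appears; this is exactly the reduction in relevancy for ubiquitous words that the statement describes.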
“…TF is used to measure the number of times a word (term) appears in a document. IDF is used to give lower weight to words that occur frequently and larger weight to words that occur rarely [27]. At this stage, TF-IDF weighting is applied to each word that appears in the comment text.…”
Section: Stemming
mentioning
confidence: 99%
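For the weighting stage described above, a common way to compute per-word TF-IDF scores in practice is scikit-learn's TfidfVectorizer. The snippet below is only an illustrative sketch with made-up comment strings; the cited study does not say which implementation it used, and TfidfVectorizer applies idf smoothing and L2 normalization by default, so its values differ slightly from the raw tf × idf definition.

```python
# Library-based sketch (assumes scikit-learn is installed; the cited papers
# do not state which implementation they used).
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "great product works well",
    "product broke after one week",
    "great service great product",
]

vectorizer = TfidfVectorizer()                 # default: smoothed idf, L2-normalized rows
weights = vectorizer.fit_transform(comments)   # sparse (n_docs x n_terms) matrix
terms = vectorizer.get_feature_names_out()

# Weight of each term occurring in the first comment.
row = weights[0]
for col in row.nonzero()[1]:
    print(terms[col], round(row[0, col], 3))
```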