2022
DOI: 10.7717/peerj-cs.914

Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms

Abstract: The Internet Movie Database (IMDb), being one of the popular online databases for movies and personalities, provides a wide range of movie reviews from millions of users. This provides a diverse and large dataset to analyze users’ sentiments about various personalities and movies. Despite being helpful to provide the critique of movies, the reviews on IMDb cannot be read as a whole and require automated tools to provide insights on the sentiments in such reviews. This study provides the implementation of vari…

Cited by 30 publications
(14 citation statements)
References 45 publications
“…The algorithm is easily affected by the skew of the data set, such as a large number of documents in a certain category, which leads to the underestimation of IDF. IDF improvement algorithms such as TFIDF-FL ( Zhang et al, 2019 ) have been proposed, and some scholars have also suggested combining TF-IDF with Word2Vec to solve the shortcomings of TF-IDF ( Naeem et al, 2022 ); in short, simply using the TF-IDF algorithm to calculate semantic similarity leads to the problem of low accuracy.…”
Section: Results | Citation type: mentioning
confidence: 99%
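The skew effect described in this statement can be illustrated with a minimal sketch. It assumes the plain textbook form IDF = log(N / df), which may differ from the weighting used in the cited papers; the toy corpora and the `idf` helper are hypothetical and exist only for illustration.

```python
import math


def idf(term, docs):
    """Plain inverse document frequency, log(N / df).

    Assumes `docs` is a list of token sets and that `term`
    occurs in at least one of them (otherwise df would be 0).
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


# Hypothetical balanced corpus: two documents per category.
balanced = [
    {"plot", "actor"},
    {"plot", "score"},
    {"goal", "match"},
    {"goal", "team"},
]

# Hypothetical skewed corpus: one category dominates the collection.
skewed = balanced[:2] + [{"goal", "match"}] * 8

# "goal" is equally characteristic of its category in both corpora,
# but its IDF drops once that category makes up most of the documents.
idf_balanced = idf("goal", balanced)  # log(4 / 2)
idf_skewed = idf("goal", skewed)      # log(10 / 8)
```

In this sketch the term's weight falls only because its category grew, not because the term became less informative, which is the underestimation the statement refers to.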
“…The algorithm is easily affected by the skew of the data set, such as a large number of documents in a certain category, which leads to the underestimation of IDF. IDF improvement algorithms such as TFIDF-FL (Zhang et al, 2019) have been proposed, and some scholars have also suggested combining TF-IDF with Word2Vec to solve the shortcomings of TF-IDF (Naeem et al, 2022); in short, While the features of the SimHash algorithm are as mentioned above, its text similarity calculation is suitable for low-precision and high-speed scenarios. This calculation has lower requirements for speed but higher requirements for accuracy, which proves that SimHash is unsuitable for studying long texts or for high-precision similarity calculations.…”
Section: Analysis Of Calculation | Citation type: mentioning
confidence: 99%
“…It's also a lexicon-based technique to perform sentiment analysis on social media posts as we used it to annotate the dataset as negative, positive, and neutral in comparison with the TextBlob [35]. VADER also generates a compound score between −1 and 1 and a score greater than 0.05 represents the positive sentiment, less than −0.05 represents negative sentiment, and between these indicate the neutral sentiment.…”
Section: Vader | Citation type: mentioning
confidence: 99%
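The thresholding rule quoted in this statement can be sketched as follows. This is a minimal illustration, assuming the compound score has already been computed elsewhere; the function name `label_from_compound` and its `threshold` parameter are hypothetical and are not part of any VADER library. The comparisons are strict, following the wording of the statement, so a score of exactly ±0.05 is treated as neutral here.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Follows the rule as quoted: greater than the threshold is
    positive, less than the negated threshold is negative, and
    anything in between is neutral.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Example scores (made up for illustration, not taken from any dataset).
labels = [label_from_compound(s) for s in (0.6, -0.3, 0.0, 0.05)]
```

Other implementations may treat the boundary values as inclusive, so the handling of scores at exactly ±0.05 should be checked against whichever convention a given study used.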
“…It has also been analyzed by different regressions [18] to predict popularity of the movies based on the genre information of the Kaggle dataset. Naeem et al applied gradient boosting classifiers, support vector machines (SVM), Naïve Bayes classifier, and random forest [19], while Sourav M. and Tanupriya C. applied Naïve Bayes and SVM [20] and both found that SVM is better than any other classifier for sentiment analysis of IMDB movie review text. Hasan B. and Serdar K. showed clustering based on the genre of a movie to compare the genres with respect to other features like rating, release year, and gross income [21].…”
Section: Related Work | Citation type: mentioning
confidence: 99%
“…According to Zhao [37] and Tarannum [38], web-scraping is cheaper, cleaner, and more automatic than web crawling. Data scientists also prefer HTTP protocol data collection methods for data retrieval from web pages [17,19,20]. It is popular in consultancy management, insurance, banking, online media, internet, network security, marketing, IT sectors, and computer software [39].…”
Section: Web Data Scraping | Citation type: mentioning
confidence: 99%