Empirical insights into promising commercial sentiment analysis solutions that go beyond their vendors' claims are rare. Moreover, due to the field's constant evolution, previous studies are far from reflecting the current situation. The goal of this article is to evaluate and compare current solutions in two experimental studies. In the first part of the study, based on tweets about airline service quality, we test the solutions of six vendors with different market power, namely Amazon, Google, IBM, Microsoft, Lexalytics, and MeaningCloud, and report their accuracy, precision, recall, macro-F1, time performance, and service level agreements (SLAs). In the second part, we compare two of the services, the Google Cloud Natural Language API and the MeaningCloud Sentiment Analysis API, in depth across multiple data sets and over time. To evaluate the results over time, we reuse the data set from November 2020; in addition, further topic-specific and general Twitter data sets are used. The experiments show that IBM Watson NLU and the Google Cloud Natural Language API may be preferred when detecting negative text is the primary concern. When retested in July 2022, the Google Cloud Natural Language API was still the clear winner over the MeaningCloud Sentiment Analysis API, but only on the airline service quality data set; on the other data sets, each service offered specific benefits and drawbacks. Furthermore, we detected changes in the sentiment classification of both services over time. Our results demonstrate that an independent, critical, and longitudinal experimental analysis of sentiment analysis services can provide valuable insights into their overall reliability and classification accuracy beyond marketing claims, enabling practitioners to compare solutions on real data and to analyze potential weaknesses and margins of error before making an investment.
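To make the reported metrics concrete, the following minimal sketch computes per-class precision, recall, and F1 and averages them into macro-F1, the headline metric named above. The three-class sentiment labels and the toy predictions are purely hypothetical illustrations, not data from the study.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro-F1)."""
    f1_scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives per class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical three-class sentiment example (not from the paper's data sets)
y_true = ["neg", "neg", "neu", "pos", "pos", "neg"]
y_pred = ["neg", "neu", "neu", "pos", "neg", "neg"]
print(round(macro_f1(y_true, y_pred, ["neg", "neu", "pos"]), 3))  # → 0.667
```

Macro-averaging weights each class equally regardless of its frequency, which is why it is a common choice for sentiment data sets where one polarity (often negative) dominates.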