Introduction

Streaming data is increasingly important for online services. Facebook [1] and LinkedIn [2] have analysed event-related data to understand usage in their ecosystems. Twitter has built a big data streaming architecture that can serve and process thousands of tweets per second [3][4][5]. In addition, several systems [6], methods [7,8], and benchmarking tools [9][10][11] have been created to help third parties implement tweet-related processing and analysis. In particular, new stream processing technologies (Spark [12], AsterixDB [13]) have been created, which could be selected for implementing stream extraction, storage, and analysis functionality. Although stream processing performance has been studied [12,14], a comparative feasibility analysis of these technologies has not been performed extensively in the context of semi-structured data processing.
Abstract

To obtain up-to-date insight into online services, extracted data has to be processed in near real time. For example, major big data companies (Facebook, LinkedIn, Twitter) analyse streaming data to develop new services. Several technologies have been developed that could be selected for implementing stream processing functionality. The contribution of this paper is a feasibility analysis of technologies for stream-based processing of semi-structured data. In particular, the feasibility of a Big Data management system for semi-structured data (AsterixDB) is compared to Spark streaming, which has been integrated with the Cassandra NoSQL database for persistence. The study focuses on stream processing in a simulated social media use case (tweet analysis), which has been implemented in a Eucalyptus cloud computing environment on a distributed shared memory multiprocessor platform. The results indicate that AsterixDB provides significantly better performance in terms of both throughput and latency when the data feed functionality of AsterixDB is used and stream processing is implemented with Java. AsterixDB also scaled as well or better when the number of nodes on the cloud platform was increased. However, stream processing in AsterixDB was delayed by batching of data when tweets were streamed into the database with data feeds.

Pääkkönen J Big Data (2016) 3:6

This article focuses on performance analysis of Spark streaming, Cassandra, and AsterixDB technologies for stream processing of semi-structured social media data (tweets). Specifically, Spark streaming has been integrated with Cassandra for data persistence, and this combination has been compared to AsterixDB in a Eucalyptus cloud environment on a DSM multiprocessor platform. The results indicated that AsterixDB achieved significantly higher throughput and lower latency when data feeds were utilized and stream processing was implemented with Java.
Performance of AsterixDB also scaled better when the number of nodes on the cloud platform was increased. However, stream processing in AsterixDB was delayed by batching of streamed tw...