2019 International Conference on Advanced Information Technologies (ICAIT)
DOI: 10.1109/aitc.2019.8921392
Coordinate Checkpoint Mechanism on Real-Time Messaging System in Kafka Pipeline Architecture

Cited by 9 publications (2 citation statements)
References 3 publications
“…The advantage of this design is that when the volume of log data spikes at a particular time, Kafka acts as a buffer that shaves the peak, preventing the denial of service and network congestion that too much instantaneous data would cause with Flume alone. (2) The data transfer process is optimized in several ways against cluster crashes, data loss, data duplication, and other transfer-time problems. (3) Hive uses Hadoop's MapReduce computation engine by default, so every HQL statement is translated into a MapReduce job for execution, which is inefficient.…”
Section: Discussion
confidence: 99%
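The excerpt does not say which "various optimizations" the citing work applies during data transfer; as an illustration only, the following minimal Java sketch shows the standard Kafka producer settings commonly deployed against exactly the failure modes it names (cluster crash, data loss, data duplication). The broker address and topic name are placeholders, not values from the paper.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Wait for all in-sync replicas before acknowledging a write:
        // guards against loss when a broker in the cluster crashes.
        props.put("acks", "all");
        // Retry transient send failures instead of silently dropping records.
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        // Idempotent writes let the broker deduplicate retried batches,
        // preventing the duplication that retries would otherwise introduce.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "log-events" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("log-events", "{\"level\":\"INFO\"}"));
        }
    }
}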
“…As part of an in-depth analysis of the Hadoop big data ecosystem, we investigate and design a Hive-based big data platform covering the entire processing and analysis pipeline, together with a practical demonstration of how big data is used in real-world production environments. For peak shaving and decoupling, Flume and Sqoop collect the log data and business data in a unified way, while Kafka serves as a buffer for Flume [2]. A custom interceptor in the first Flume layer performs simple data cleansing, intercepting malformed JSON strings so that they do not break Hive's subsequent parsing.…”
Section: Research Process Direction
confidence: 99%
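The custom interceptor itself is not shown in the excerpt; the following is a minimal sketch, assuming the standard Flume 1.x Interceptor API and Jackson on the classpath, of how a JSON-validating interceptor of this kind is typically written. The class name JsonValidatingInterceptor is hypothetical.

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonValidatingInterceptor implements Interceptor {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void initialize() {
        // no state to set up
    }

    @Override
    public Event intercept(Event event) {
        try {
            // Attempt to parse the event body; malformed JSON throws.
            mapper.readTree(event.getBody());
            return event;  // well-formed: pass the event through
        } catch (Exception e) {
            return null;   // malformed: drop the event before it reaches Hive
        }
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event kept = intercept(e);
            if (kept != null) {
                out.add(kept);
            }
        }
        return out;
    }

    @Override
    public void close() {
        // nothing to release
    }

    /** Builder through which Flume instantiates the interceptor. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonValidatingInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configurable options in this sketch
        }
    }
}

Flume wires interceptors in through the nested Builder class, so an agent configuration would reference the fully qualified name JsonValidatingInterceptor$Builder as the interceptor type.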