Big Data systems are typically implemented in object-oriented languages such as Java and Scala due to the quick development cycle they provide. These systems are executed on top of a managed runtime such as the Java Virtual Machine (JVM), which requires each data item to be represented as an object before it can be processed. This representation is the direct cause of many kinds of severe inefficiencies. We developed Gerenuk, a compiler and runtime that aims to enable a JVM-based data-parallel system to achieve near-native efficiency by transforming a set of statements in the system for direct execution over inlined native bytes. The key insight leading to Gerenuk's success is twofold: (1) Analytics workloads often use immutable and confined data types. If we speculatively optimize the system and user code with this assumption, the transformation can be made tractable. (2) The flow of data starts at a deserialization point where objects are created from a sequence of native bytes and ends at a serialization point where they are turned back into a byte sequence to be sent to the disk or network. This flow naturally defines a speculative execution region (SER) to be transformed. Gerenuk compiles a SER speculatively into a version that can operate directly over native bytes that come from the disk or network. The Gerenuk runtime aborts the SER execution upon violations of the immutability and confinement assumption and switches to the slow path by deserializing the bytes and re-executing the original SER. Our evaluation on Spark and Hadoop demonstrates promising results.
CCS Concepts • Information systems → Data management systems; • Software and its engineering → Compilers.
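The fast-path/slow-path idea in this abstract can be illustrated with a minimal Java sketch. It is not Gerenuk's actual compiler output or API: the `Point` record type, its 8-byte layout, the `SpeculationAbort` exception, and the `assumptionHolds` flag (a stand-in for the runtime's violation detection) are all hypothetical names introduced here for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class SerSketch {
    // Hypothetical record type: two ints laid out back to back (8 bytes per record).
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Hypothetical signal that the immutability/confinement assumption was violated.
    static final class SpeculationAbort extends RuntimeException {}

    // Serialization point: objects are turned into a byte sequence.
    static byte[] serialize(List<Point> points) {
        ByteBuffer buf = ByteBuffer.allocate(points.size() * 8);
        for (Point p : points) {
            buf.putInt(p.x).putInt(p.y);
        }
        return buf.array();
    }

    // Deserialization point: objects are created from native bytes.
    static List<Point> deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        List<Point> out = new ArrayList<>();
        while (buf.remaining() >= 8) {
            out.add(new Point(buf.getInt(), buf.getInt()));
        }
        return out;
    }

    // Original region: operates on heap objects.
    static long sumSlow(List<Point> points) {
        long sum = 0;
        for (Point p : points) {
            sum += (long) p.x + p.y;
        }
        return sum;
    }

    // Transformed region: reads the same fields directly from the bytes,
    // creating no Point objects. The flag is an assumed stand-in for a
    // runtime check that detects an assumption violation.
    static long sumFast(byte[] bytes, boolean assumptionHolds) {
        if (!assumptionHolds) {
            throw new SpeculationAbort();
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long sum = 0;
        while (buf.remaining() >= 8) {
            sum += (long) buf.getInt() + buf.getInt();
        }
        return sum;
    }

    // Try the fast path; on abort, deserialize and re-execute the original region.
    static long run(byte[] bytes, boolean assumptionHolds) {
        try {
            return sumFast(bytes, assumptionHolds);
        } catch (SpeculationAbort e) {
            return sumSlow(deserialize(bytes));
        }
    }
}
```

Both paths produce the same result for the same input bytes; the sketch only shows that the fallback re-executes the object-based code after deserializing, as the abstract describes.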
Managed languages such as Java and Scala are prevalently used in development of large-scale distributed systems. Under the managed runtime, when performing data transfer across machines, a task frequently conducted in a Big Data system, the system needs to serialize a sea of objects into a byte sequence before sending them over the network. The remote node receiving the bytes then deserializes them back into objects. This process is both performance-inefficient and labor-intensive: (1) object serialization/deserialization makes heavy use of reflection, an expensive runtime operation and/or (2) serialization/deserialization functions need to be handwritten and are error-prone. This paper presents Skyway, a JVM-based technique that can directly connect managed heaps of different (local or remote) JVM processes. Under Skyway, objects in the source heap can be directly written into a remote heap without changing their formats. Skyway provides performance benefits to any JVM-based system by completely eliminating the need (1) of invoking serialization/deserialization functions, thus saving CPU time, and (2) of requiring developers to hand-write serialization functions.
CCS Concepts • Information systems → Data management systems; • Software and its engineering → Memory management.
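For context, the conventional round trip that this abstract describes as the cost to be eliminated can be sketched with Java's built-in object serialization. This shows only the baseline, not Skyway itself; the `Record` class and the helper names `toBytes` and `fromBytes` are hypothetical and chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class RoundTripSketch {
    // Hypothetical data item that a sender wants to ship to a remote node.
    static final class Record implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;
        Record(int id, String name) { this.id = id; this.name = name; }
    }

    // Sender side: the object is converted into a byte sequence.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Receiver side: the bytes are turned back into a fresh object.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Every transferred object pays for both conversions in this scheme, which is the CPU cost the abstract says Skyway avoids by writing objects into the remote heap without changing their format.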