Abstract: Different types of high-volume traffic affect big data applications and impose stringent requirements on networks. We investigate the drawbacks of segregating big data and elephant flows and propose ways to address the problem using optical networking.
OCIS codes: (060.4253) General; (060.0060) General [Fiber optics and optical communications]
Introduction
Big data analysis relies on distributed frameworks such as Hadoop® and Spark™ to manage datasets of unprecedented volume for business analytics. The strength of these frameworks depends on the network infrastructure over which their nodes communicate, working in concert with CPU and I/O resources. The behavior and performance of Hadoop® clusters in datacenters are affected by the size of nodes, data size and workload types as well as by networking characteristics. Network speed and latency play an important role in Hadoop® job completion times, but more importantly those times are affected by availability and resiliency features, the bursty nature of the traffic and the subscription ratio.
Several studies [1], [2] point out the benefits of Software Defined Networking (SDN) for handling the network impact on big data workloads. Typically they address the network impact of the communication patterns of Hadoop®, which is the most popular big data framework owing to its availability and reliability advantages. These Hadoop®-based SDN studies analyze the communication pattern either via network devices or via application awareness. Once the pattern has been analyzed, end-to-end flows are set up to optimize the network, thereby reducing Hadoop® job completion times. In both cases an SDN controller uses a protocol such as OpenFlow to make intelligent routing decisions, configure flows on queues or apply packet scheduling schemes to improve performance.
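As a rough illustration of the controller-side decision described above, the following Python sketch classifies flows by the number of bytes they have carried and maps each one to an egress queue. The 10 MiB threshold, the queue numbers and the names `FlowStats`, `is_elephant` and `assign_queues` are assumptions made for illustration only. They are not taken from the cited studies or from any OpenFlow library, and a real controller would obtain the byte counters from switch flow statistics.

```python
from dataclasses import dataclass

# Hypothetical threshold: a flow that has carried at least this many bytes
# is treated as an elephant flow.
ELEPHANT_BYTES = 10 * 1024 * 1024

# Hypothetical queue identifiers on a switch egress port.
QUEUE_LATENCY_SENSITIVE = 0
QUEUE_BULK = 1


@dataclass(frozen=True)
class FlowStats:
    """Per-flow counters, as a controller might collect them from a switch."""
    src: str
    dst: str
    dst_port: int
    byte_count: int


def is_elephant(flow, threshold=ELEPHANT_BYTES):
    """Return True if the flow has carried at least `threshold` bytes."""
    return flow.byte_count >= threshold


def assign_queues(flows, threshold=ELEPHANT_BYTES):
    """Map each flow key (src, dst, dst_port) to a queue identifier."""
    return {
        (f.src, f.dst, f.dst_port): (
            QUEUE_BULK if is_elephant(f, threshold) else QUEUE_LATENCY_SENSITIVE
        )
        for f in flows
    }
```

A byte-count threshold is only one possible classification rule; the application-aware approach mentioned above would instead take hints from the big data framework itself.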
Furthermore, the communication patterns of big data have renewed interest in optics for datacenters [3], [4], [5]. These studies describe new datacenter architectures with SDN for addressing the incast (many-to-one), multicast (one-to-many) and all-to-all cast patterns observed in big data datacenters. They use low-power, high-bandwidth circuit switches combined with low-cost passive optical devices (splitters, combiners, etc.) to handle long-lived, high-volume elephant traffic flows, providing better performance for latency-sensitive flows.
Fewer studies address the combined effect of big data and other datacenter traffic flows. Since datacenters carry traffic from different applications, such as web front ends, VM migration and large data transfers, alongside Hadoop®, it is important to investigate how these other flows behave together with big data flows. Of particular interest is the influence of elephant flows, which may degrade the performance of Hadoop® as well as their own.
In this paper we look at the effect of elephant flows on Hadoop® traffic and how it increases job completion times. We demonstrate how existing methods of sharing packet queues via the SDN control plane lead to increased latencies and resource utilization, and how this affects all traffic types. We then make a case for packet-optical hyb...