Sophisticated middlebox services (such as network monitoring and intrusion detection, DDoS mitigation, worm scanning, XML parsing, and protocol transformation) are becoming increasingly popular in today's Internet. To support high throughput, these services are often deployed on Distributed Memory, Multi-processor (DM-MP) hardware platforms such as a cluster of network processors. Scaling the throughput of such platforms, however, is challenging because of the difficulties and overheads of accessing persistent, shared state maintained by the services.
In this paper, we describe the design and implementation of Oboe, a run-time system for DM-MP platforms that addresses the above challenge through two foundations: (1) category-specific management of shared state, and (2) adaptive flow-level load distribution for addressing persistent processor overload. Our simulations demonstrate that Oboe can achieve performance within 0-5% of an ideal adaptive system. Our prototype implementation of Oboe on a cluster of IXP2400 network processors demonstrates that it scales as the number of processors, the number of flows, and the state size increase.
I. INTRODUCTION
The designs of most modern network services follow a canonical computational model: a sequence of packets arrives at the service, is classified as belonging to a flow, undergoes service-specific computation, and is forwarded to a subsequent service, a client, or a server (see Figure 1). Examples of such middlebox services include network monitoring and intrusion detection [4], DDoS mitigation [24] and worm scanning [34], XML parsing [1], Enterprise Service Bus (ESB) protocol transformation, and ESB policy enforcement [1].
To support high throughput (thousands to millions of packets per second), these services are often deployed on a distributed memory multi-processor (DM-MP) platform (e.g., a cluster of network processors, general-purpose processors, and co-processors [17]). The design of such platforms is guided by the following requirements: (1) scale the service throughput linearly with the number of parallel processors; (2) minimize the increase in per-packet processing time resulting from the DM-MP architecture; (3) maintain intra-flow packet order even though multiple packets are processed in parallel; and (4) minimize the programming effort required to exploit parallelism on such platforms.
Meeting these requirements has proved challenging for three reasons. First, most of the modern network