How “hard” is thread partitioning and how “bad” is a list scheduling based partitioning algorithm?

Tang, Xinan; Gao, Guang R.

doi:10.1145/277651.277679

Cited by 11 publications

(5 citation statements)

References 23 publications


“…Since connection-affinity based parallelizing approach divides global connection state table into independent sub-tables, and each subtable is private to its corresponding CPU core, it makes the sequential code one step closer to run in parallel by transforming a global table access with an extra pointer based dereferences. 7) In general, automatically parallelizing a sequential application is a hard problem with limited success [23] [24]. We believe that domain knowledge could help lead to a viable solution in automatically parallelizing a sequential application.…”

Section: Design Principles (mentioning)

confidence: 99%

Practice of parallelizing network applications on multi-core architectures

Wang, Cheng, Hua, et al. 2009

Proceedings of the 23rd International Conference on Supercomputing

Self Cite


The industry-wide shift to multi-core architectures has aroused great interest in parallelizing sequential applications. However, it is very difficult to parallelize fine-grained applications for multi-core architectures due to insufficient hardware support for fast communication and synchronization. Fortunately, network applications can be decomposed into pipelined structures that are amenable to streaming-based parallel processing. Realizing the potential of pipelining on multi-core architectures requires reevaluating the basic tradeoffs in parallel processing, including those between load balance and data locality and between general lock mechanisms and special lock-free data structures. This paper presents the practice of building a high-performance multi-core based network processing platform in which connection-affinity and lock-free design principles are applied effectively for better data locality and faster core-to-core synchronization and communication. We parallelize a complete Layer 2 to Layer 7 (L2-L7) network processing system on an Intel Core 2 Quad processor, including a TCP/IP stack based on Libnids (L2-L4) and a port-independent protocol identification engine using deep packet inspection (L7+). Furthermore, we develop a compiling method to transform sequential network applications into parallel ones so that those applications can run on multi-core architectures. Our experience suggests that (1) fine-grained pipelining can be a good software solution for parallelizing network applications on multi-core architectures if connection-affinity and lock-free design are used as the first design principles; (2) a delicate partitioning scheme is required to map pipelined structures onto a specific multi-core architecture; (3) an automatic parallelization approach can work if domain knowledge is considered in the parallelizing process. Our multi-core based network processing platform can deliver not only 6 Gbps processing speed for large packet sizes but also a more challenging 2 Gbps for smaller packets.

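The connection-affinity idea in the citation statement and abstract above (a global connection state table split into per-core private sub-tables) can be illustrated with a minimal sketch. This is not the cited platform's code: the core count, the CRC32 hash, the names `core_for` and `record_packet`, and the byte-counter state are all assumptions chosen for illustration.

```python
import zlib

NUM_CORES = 4  # hypothetical core count, not taken from the paper


def connection_key(src_ip, src_port, dst_ip, dst_port):
    # Order the two endpoints so both directions of a connection share one key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b)


def core_for(src_ip, src_port, dst_ip, dst_port, num_cores=NUM_CORES):
    # Hash the direction-independent key to pick the owning core.
    (ip1, p1), (ip2, p2) = connection_key(src_ip, src_port, dst_ip, dst_port)
    raw = f"{ip1}:{p1}-{ip2}:{p2}".encode()
    return zlib.crc32(raw) % num_cores


# One private sub-table per core instead of a single shared global table.
tables = [dict() for _ in range(NUM_CORES)]


def record_packet(src_ip, src_port, dst_ip, dst_port, nbytes):
    # Only the owning core's sub-table is touched, so no cross-core lock is needed
    # as long as each core processes only the connections mapped to it.
    core = core_for(src_ip, src_port, dst_ip, dst_port)
    key = connection_key(src_ip, src_port, dst_ip, dst_port)
    tables[core][key] = tables[core].get(key, 0) + nbytes
    return core
```

Because the key is direction-independent, a request and its reply land in the same sub-table, which is the property that lets each sub-table stay private to one core.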


“…In [43], we have also proved that when $ is zero the run length produced by any list-scheduling-based partitioning algorithm is not twice longer than that of an optimal solution. These two results form a foundation to apply heuristic algorithms in practice.…”

Section: -Partition (mentioning)

confidence: 84%

Automatically Partitioning Threads for Multithreaded Architectures

Tang, Gao 1999

Journal of Parallel and Distributed Computing

Self Cite

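For readers unfamiliar with list scheduling, the sketch below shows the classical greedy form for independent tasks on identical workers (Graham's list scheduling), whose makespan is known to be at most (2 - 1/m) times the optimum for m workers. It is offered only as background for the "not twice longer than optimal" statement quoted above; it is not Tang and Gao's thread partitioning algorithm, and the function names and task durations are hypothetical.

```python
import heapq


def list_schedule(durations, num_workers):
    # Greedy list scheduling: take tasks in list order and give each one
    # to the worker that currently has the smallest total load.
    loads = [(0, w) for w in range(num_workers)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(num_workers)]
    for task, duration in enumerate(durations):
        load, worker = heapq.heappop(loads)
        assignment[worker].append(task)
        heapq.heappush(loads, (load + duration, worker))
    makespan = max(load for load, _ in loads)
    return makespan, assignment


def lower_bound(durations, num_workers):
    # No schedule can beat the longest task or the average load per worker.
    return max(max(durations), sum(durations) / num_workers)


durations = [1, 1, 1, 1, 4]  # hypothetical task lengths
makespan, assignment = list_schedule(durations, 2)
bound = lower_bound(durations, 2)
print(makespan, bound)  # greedy result and a lower bound on the optimum
```

With this input the greedy order places the long task last, so the schedule finishes at 6 while the lower bound is 4: worse than optimal, but still within a factor of two.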

“…Then at most t tasks can be executed concurrently on this processor. Multi-threading is important for hiding communication latency and overlapping computation with communication [Tang and Gao, 1998]. We have shown how to build parameterized polyhedrons that describe starting tasks.…”

Section: Multi-threaded Execution Of Clusters (mentioning)

confidence: 99%

Automatic Parallelization Techniques Based on Compact Dag Extraction and Symbolic Scheduling

Cosnard, Jeannot 2001

Parallel Process. Lett.


Symbolic allocation and dynamic scheduling of tasks on a distributed memory machine for coarse-grained applications represented by parameterized task graphs (PTG) are presented in this paper. A PTG is a new computation model for symbolically representing directed acyclic task graphs (DAGs). The size of a PTG is independent of the problem size and its parameters can be instantiated at run time. Parameter-independent optimization is important for exploiting non-static parallelism in scientific computing programs with varying problem sizes. Previous DAG scheduling algorithms are not able to handle such cases. We present and study a symbolic scheduling algorithm called SLC (Symbolic Linear Clustering) which derives task clusters from a PTG using affine piecewise mapping functions and then evenly assigns clusters to processors. Thus a complete automatic parallelization method is presented.
