We show how to extend classical work-stealing to handle data-parallel tasks that may require any number of threads r ≥ 1 for their execution. As threads become idle, they attempt to join a team of threads designated for a task requiring r > 1 threads. Team building follows a deterministic pattern involving log p possibly randomized steal attempts, where p is the number of started hardware threads. Deterministic work-stealing often exhibits good locality properties that are desirable to preserve. Threads attempting to join the team for a task requiring a large team may help smaller teams instead of waiting for the large team to form. We explain in detail this idea of work-stealing with deterministic team-building, which naturally generalizes classical work-stealing. The implementation uses standard lock-free data structures and, beyond these, requires only a single extra compare-and-swap (CAS) operation per thread as a team is being built. Once formed, a team can stay together to process further tasks requiring the same (or a smaller) number of threads; this requires no further coordination. In the degenerate case where all tasks require only a single thread, the implementation coincides with a (deterministic) work-stealing implementation, incurs no extra overhead, and therefore has similar theoretical properties. We establish correctness of the generalized work-stealing algorithm by arguing for deadlock freedom and completeness (all tasks are eventually executed, regardless of their resource requirement r ≤ p), discuss its load-balancing, task execution order, and memory-consumption properties, and consider a number of algorithmic and implementation variations. A prototype C++ implementation of the generalized work-stealing algorithm is briefly described.
Building on this, we have implemented a serious, well-known contender for a best parallel Quicksort algorithm, which naturally relies on both task and data parallelism. On an 8-core Intel Nehalem system, a 16-core AMD Opteron system, a 16-core Sun T2+ system supporting up to 128 hardware threads, and a 32-core Intel Nehalem EX system, we compare our implementation of the published Quicksort algorithm using fork-join parallelism to a mixed-mode parallel implementation with a data-parallel partitioning step using our deterministic team-building work-stealer. Results are consistently better, often by a significant fraction. For instance, sorting 2^27 − 1 randomly generated integers, we could improve the speed-up from 5.1 to 8.7 on the large 32-core Intel system, on which our implementation is consistently better than the tuned, task-parallel Cilk++ system.

* The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013).