Task-parallel systems are widely used to parallelize programs. They provide automatic load balancing, and programmers can easily parallelize sequential programs, including irregular ones, without considering the placement of tasks on physical processors.

Despite the success of shared-memory task parallelism, task parallelism in large-scale distributed-memory environments remains challenging. Our work focuses on the flexibility of the task model and the scalability of inter-node load balancing. General task models allow tasks to be suspended and resumed at any program point, which enables flexible task scheduling for goals such as higher processor utilization and locality-aware task placement. To realize such a task model, we have to represent a task as a thread (an execution context containing register values and stack frames) and implement thread migration for inter-node load balancing. However, an existing thread migration scheme, iso-address, has a scalability limitation: it requires, in each node, virtual memory proportional to the number of processors. In large-scale distributed-memory environments, this results in virtual memory usage beyond the virtual address space limit of current 64-bit CPUs. Furthermore, this huge virtual memory consumption makes it impossible to implement one-sided work stealing with Remote Direct Memory Access (RDMA) operations. Because one-sided work stealing is a popular approach to efficient load balancing, this also limits the scalability of distributed-memory task parallelism.

In this paper, we propose uni-address, a new thread management scheme for distributed-memory task parallelism. It significantly reduces the virtual memory used for thread migration and enables RDMA-based work stealing.
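To make the iso-address limitation described above concrete, the following back-of-the-envelope sketch estimates the virtual address range a process must reserve when every worker in the system needs its own globally unique stack region. The per-worker region size, the worker counts, and the 48-bit address-space width are illustrative assumptions, not values taken from this work, and the function names are hypothetical.

```python
def iso_address_reservation(total_workers: int, region_bytes_per_worker: int) -> int:
    """Bytes of virtual address space one process reserves when each worker
    in the whole system owns a globally unique stack region (assumed model)."""
    return total_workers * region_bytes_per_worker


def fits_in_address_space(required_bytes: int, address_bits: int = 48) -> bool:
    """True if the reservation fits in an address space of `address_bits` bits.
    The 48-bit default is an assumption about the usable virtual address width."""
    return required_bytes <= 2 ** address_bits


GIB = 2 ** 30

# Hypothetical figures: 1 GiB of stack region reserved per worker.
small = iso_address_reservation(3840, 1 * GIB)        # 3840 GiB, about 3.75 TiB
large = iso_address_reservation(1_000_000, 1 * GIB)   # 10^6 GiB, beyond 2^48 bytes

print(fits_in_address_space(small))
print(fits_in_address_space(large))
```

Under these assumed numbers the reservation grows linearly with the total worker count, so a scheme whose per-process reservation is independent of that count avoids the limit altogether.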
We implement a lightweight multithread library that supports RDMA-based work stealing on top of the uni-address scheme, and demonstrate its lightweight thread operations and scalable work stealing on the Fujitsu FX10 supercomputing system with three benchmarks: Binary Task Creation, Unbalanced Tree Search, and an NQueens solver. We confirmed that all the benchmarks work with less than 144 KB of virtual memory for thread migration in each processor, and achieved more than 95% parallel efficiency on 3840 processing cores relative to the results on 480 processing cores.

HPDC'15, June 15-20, 2015
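The efficiency figure above is stated relative to a 480-core baseline rather than to a single core. As a minimal sketch of how such a relative parallel efficiency can be computed, the snippet below divides the measured speedup over the baseline by the ideal speedup given by the core ratio. The throughput values are hypothetical placeholders, not measurements from this work, and the function name is an assumption introduced for illustration.

```python
def relative_parallel_efficiency(
    base_cores: int,
    base_throughput: float,
    scaled_cores: int,
    scaled_throughput: float,
) -> float:
    """Efficiency of the scaled run relative to the baseline run.

    Throughput is any rate that should scale linearly with cores
    (for example, tasks processed per second).
    """
    measured_speedup = scaled_throughput / base_throughput
    ideal_speedup = scaled_cores / base_cores
    return measured_speedup / ideal_speedup


# Hypothetical throughputs: 100 units/s on 480 cores, 768 units/s on 3840 cores.
eff = relative_parallel_efficiency(480, 100.0, 3840, 768.0)
print(f"{eff:.2f}")
```

With 8 times as many cores, a relative efficiency above 95% corresponds to a throughput more than 7.6 times that of the baseline.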