In this work, we utilize dynamic dataflow/data-driven techniques to improve the performance of high performance computing (HPC) systems. The proposed techniques are implemented and evaluated through an efficient, portable, and robust programming framework that enables data-driven concurrency on HPC systems. The framework is based on data-driven multithreading (DDM), a hybrid control-flow/dataflow model that schedules threads based on data availability on sequential processors. The framework was evaluated using several benchmarks with different characteristics on two systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The performance evaluation shows that the proposed framework scales well and tolerates scheduling overheads and memory latencies effectively. We also compare our framework with MPI, DDM-VM, and OmpSs@Cluster; the results show that the proposed framework achieves comparable or better performance.

Systems based on dynamic dataflow/data-driven execution [28], such as DDM, have several advantages over the sequential execution model: (i) they allow asynchronous data-driven execution of fine-grain tasks/threads, and fine-grain programming models have great potential to use the underlying hardware efficiently [5,33,39,61]; (ii) they can expose the maximum degree of parallelism in a program, since the dataflow model enforces only true data dependencies [31]; and (iii) they can handle concurrency and tolerate memory and synchronization latencies efficiently [10]. Thus, systems based on dynamic dataflow can be used to efficiently exploit the computing power of current and future HPC systems [22,33,39,40,61].

In this work, we extend the functionalities of DDM to enable efficient and portable distributed data-driven concurrency on HPC systems. The proposed functionalities are implemented in the FREDDO system [45], an efficient C++ implementation of DDM that until recently supported data-driven execution only on single-node multicore systems. Distributed DDM applications introduce remote memory accesses, since producer and consumer DThreads may run on different nodes. The distributed FREDDO implementation provides implicit data forwarding [36] to the node where the consumer DThread is scheduled to run. In particular, a consumer DThread can be scheduled for execution only when all of its input data are available in the main memory, which helps to reduce memory latencies [36]. FREDDO is publicly available for download at [42].

Distributed FREDDO provides implicit data forwarding through a distributed shared memory (DSM) implementation [54] with a shared global address space (GAS). Coherence operations implemented in typical DSM systems [53] are not required between the nodes, because the produced data are forwarded to the consumers that will execute on remote nodes. The DSM eases the development of distributed FREDDO/DDM applications that use shared objects/data structures (scalar values, arrays, etc.). The programmer onl...
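To illustrate the data-driven firing rule on which DDM and the proposed framework rely, the following is a minimal, single-process C++ sketch. The names used here (Scheduler, DThread, addDThread, update) are hypothetical and do not reflect FREDDO's actual API; the sketch only shows the core idea that each DThread carries a ready count of pending inputs, producers forward their results together with the update, and a DThread is enqueued for execution only once all of its inputs have arrived.

```cpp
// Minimal sketch of data-driven (ready-count based) scheduling.
// Hypothetical names; not FREDDO's actual API.
#include <cstdio>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

struct DThread {
    int readyCount;                                   // inputs still missing (true dependencies)
    std::vector<int> inputs;                          // buffer for forwarded input data
    std::function<void(const std::vector<int>&)> body;
};

class Scheduler {
public:
    void addDThread(int id, int numInputs,
                    std::function<void(const std::vector<int>&)> body) {
        threads[id] = {numInputs, std::vector<int>(numInputs), std::move(body)};
        if (numInputs == 0) readyQueue.push(id);      // no dependencies: ready immediately
    }
    // A producer forwards one input value to DThread `id` (input slot `slot`).
    void update(int id, int slot, int value) {
        DThread& t = threads[id];
        t.inputs[slot] = value;                       // data arrives before the thread fires
        if (--t.readyCount == 0) readyQueue.push(id); // all inputs available: ready to run
    }
    // Execute every DThread whose inputs are all available.
    void run() {
        while (!readyQueue.empty()) {
            int id = readyQueue.front(); readyQueue.pop();
            threads[id].body(threads[id].inputs);
        }
    }
private:
    std::unordered_map<int, DThread> threads;
    std::queue<int> readyQueue;
};

int main() {
    Scheduler sched;
    // The consumer (id 2) waits for two values produced by DThreads 0 and 1.
    sched.addDThread(2, 2, [](const std::vector<int>& in) {
        std::printf("consumer: %d + %d = %d\n", in[0], in[1], in[0] + in[1]);
    });
    sched.addDThread(0, 0, [&](const std::vector<int>&) { sched.update(2, 0, 40); });
    sched.addDThread(1, 0, [&](const std::vector<int>&) { sched.update(2, 1, 2); });
    sched.run();
    return 0;
}
```

In the distributed setting described above, the update would additionally carry the produced data across the network to the node where the consumer DThread is scheduled, so that the consumer fires only when all of its inputs are already in its local main memory.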