Workstation clusters equipped with high-performance interconnects that have programmable network processors offer interesting opportunities to enhance the performance of parallel applications run on them. In this paper, we propose schemes in which certain application-level processing in parallel database query execution is performed on the network processor. Using a timed Petri net model, we evaluate the performance of TPC-H queries executing on a high-end cluster where all tuple processing is done on the host processor, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose four schemes in which certain tuple processing activity is offloaded to the network processor. The first two schemes offload the tuple splitting activity, i.e., the computation to identify the node on which each tuple is to be processed; they achieve an execution time speedup of 1.09 relative to the base scheme, but the I/O bus becomes the bottleneck resource. The third scheme, in addition to offloading the tuple splitting activity, combines the disk and network interface to avoid the I/O bus bottleneck, resulting in speedups of up to 1.16, but with high host processor utilization. Our fourth scheme, in which the network processor also performs part of the join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilization. Further, we observe that the proposed schemes perform equally well in a scaled architecture, i.e., when the number of processors is increased from 2 to 64.

A cluster of workstations is a distributed memory machine in which each node is a stand-alone system with a CPU and memory attached to the memory bus, and peripherals such as the disk and network interface attached to the I/O bus. We assume such a cluster of N nodes. Typical fast cluster interconnects found in current-day systems, such as the Myrinet network interface (NI) [13], have NIs with a programmable network processor (NP), on-board memory (SRAM), a host DMA engine (HDMA), a 64-bit EBUS interface, a send DMA engine (SDMA), and a receive DMA engine (RDMA). The HDMA is used to transfer data across the I/O bus to the node memory. The SDMA and RDMA are used to transfer data from the NI SRAM to the communication network (switch), and vice versa. We assume that the switch employs wormhole routing to transfer packets between network interfaces.

Research in communication layers for high-performance scientific applications has led to the development of user-level communication techniques, which have reduced the involvement of the host processor (HP) in communication and delivered better application performance [19]. We therefore assume such a user-level communication layer in our system [7].
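To make the send path through such a network interface concrete, the following is a minimal sketch of the two-step data movement described above: the HDMA moves data from host memory into NI SRAM across the I/O bus, and the SDMA then injects it into the switch. The type and function names are hypothetical and do not correspond to the actual Myrinet firmware interface; the DMA primitives are stubs that stand in for device-specific register programming.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptors for two of the NI's DMA engines; real firmware
 * would expose memory-mapped control registers instead. */
typedef struct { int id; } hdma_t;   /* HDMA: host memory <-> NI SRAM over the I/O bus */
typedef struct { int id; } sdma_t;   /* SDMA: NI SRAM -> network switch                */

/* Placeholder primitives; on real hardware these would program the
 * corresponding DMA engine and wait for completion. */
static void hdma_pull(hdma_t *h, const void *host_addr, void *sram_buf, size_t len)
{
    (void)h;
    memcpy(sram_buf, host_addr, len);       /* stand-in for a host-to-SRAM DMA */
}

static void sdma_push(sdma_t *s, const void *sram_buf, size_t len, int dest_node)
{
    (void)s; (void)sram_buf; (void)len; (void)dest_node;
    /* stand-in for injecting the packet into the wormhole-routed switch */
}

/* Send path as seen by the network processor: first host memory -> NI SRAM
 * across the I/O bus (HDMA), then NI SRAM -> network toward the destination
 * node (SDMA). The receive path mirrors this using the RDMA and HDMA. */
static void np_send(hdma_t *h, sdma_t *s, const void *host_addr,
                    void *sram_buf, size_t len, int dest_node)
{
    hdma_pull(h, host_addr, sram_buf, len);
    sdma_push(s, sram_buf, len, dest_node);
}
```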
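As an illustration of the tuple splitting activity that the first two schemes offload to the NP, the sketch below assumes hash partitioning on a 32-bit join attribute over N nodes; the tuple layout, field names, and partitioning function are assumptions for illustration only and are not taken from the schemes' actual implementation.

```c
#include <stdint.h>

#define N_NODES 64              /* number of nodes in the cluster (assumed) */

/* Hypothetical tuple layout: the join attribute is a 32-bit key followed
 * by an opaque payload. */
typedef struct {
    uint32_t join_key;
    char     payload[128];
} tuple_t;

/* Tuple splitting: identify the node on which this tuple must be processed.
 * This per-tuple computation is what the offloading schemes move from the
 * host processor to the network processor. */
static inline int split_tuple(const tuple_t *t)
{
    return (int)(t->join_key % N_NODES);
}
```

When this computation runs on the NP, the destination node can be determined while the tuple is still in NI SRAM, so the tuple can be forwarded directly to the switch without first crossing the I/O bus to the host.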