Clustering, duplication, and placement are critical steps in a cluster-based FPGA design flow. Clustering has a great impact on the wirelength, timing, and routability of a circuit. Logic duplication is an effective method for improving performance while maintaining the logic equivalence of a circuit. Based on several novel algorithmic contributions, we present an efficient and effective algorithm named SPCD (simultaneous placement with clustering and duplication) which performs clustering and duplication during placement for wirelength and timing minimization. First, we incorporate a path counting-based net weighting scheme for more effective timing optimization. Secondly, we introduce a novel method of moving a fragment of a cluster (called a fragment level move) during placement to optimize the clustering structure. To reduce the critical path detour during legalization from a more global perspective, we also introduce the notions of a monotone region and a global monotone region in which improvement to the local/global path detour is guaranteed. Furthermore, we introduce a notion of a constrained gain graph to embed all complex FPGA clustering constraints, and implement an optimal incremental legalization algorithm under such constraints. Finally, in order to reduce the circuit area, we formulate a timing-constrained global redundancy removal problem and propose a heuristic solution. Our SPCD algorithm outperforms a widely used academic FPGA placement flow, T-VPack + VPR, with an average reduction of 31% in the longest path estimate delay and 18% in the routed delay. We also apply our SPCD algorithm to Altera's Stratix architecture in a commercial FPGA implementation flow (Quartus II 4.0). The routed result achieved by our SPCD algorithm outperforms VPR by 20% and outperforms Quartus II 4.0 by 4%.