The performance of parallel programs often depends critically on barrier performance. On modern NUMA multi-core machines, barrier synchronization performance is significantly affected by cache-coherence communication between cores. On large-scale NUMA systems in particular, complex interconnection networks, memory hierarchies, and cache-coherence protocols make barrier algorithms hard to optimize. We propose a general barrier optimization framework for NUMA multi-core machines. The framework splits the barrier into three stages: arrival within a NUMA node, arrival across NUMA nodes, and wakeup. This split provides an opportunity to optimize the communication pattern and the cache-line placement in each stage. To reduce remote communication traffic, we introduce a coordinator per NUMA node. In addition, we implement two barrier algorithms based on the framework. Finally, we show that the barrier algorithms within our framework outperform other barrier algorithms, and we show how to translate a barrier algorithm into a performance model that helps choose an optimal design tradeoff. Experiments on three NUMA multi-core platforms show that a barrier algorithm optimized within our framework performs as well as or better than state-of-the-art approaches on NUMA multi-core machines.
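The three-stage structure described above can be illustrated with a minimal C++17 sketch. It is not one of the two algorithms implemented in this work. It assumes a sense-reversing counter in each stage, a 64-byte cache line, and that the last thread to arrive on a node acts as that node's coordinator. The names `HierBarrier`, `NodeState`, and `run_rounds` are hypothetical, and the node id is a logical index: no thread pinning or NUMA-aware memory allocation is performed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Per-node state. The arrival counter and the wakeup flag are placed on
// separate cache lines so that arrivals do not invalidate the line that
// waiting threads are spinning on.
struct alignas(kCacheLine) NodeState {
    std::atomic<int> arrived{0};                         // stage 1 counter
    alignas(kCacheLine) std::atomic<bool> sense{false};  // stage 3 flag
};

class HierBarrier {
public:
    HierBarrier(int nodes, int per_node)
        : nodes_(nodes), per_node_(per_node), node_(nodes) {
        assert(nodes > 0 && per_node > 0);
    }

    // Each thread owns a local_sense flag, initialised to false.
    void wait(int node, bool& local_sense) {
        local_sense = !local_sense;
        NodeState& n = node_[node];
        // Stage 1: arrival within the NUMA node.
        if (n.arrived.fetch_add(1, std::memory_order_acq_rel) == per_node_ - 1) {
            // Assumption: the last local arriver becomes the coordinator.
            n.arrived.store(0, std::memory_order_relaxed);
            // Stage 2: arrival across NUMA nodes (coordinators only).
            if (global_arrived_.fetch_add(1, std::memory_order_acq_rel) == nodes_ - 1) {
                global_arrived_.store(0, std::memory_order_relaxed);
                global_sense_.store(local_sense, std::memory_order_release);
            } else {
                while (global_sense_.load(std::memory_order_acquire) != local_sense)
                    std::this_thread::yield();
            }
            // Stage 3: wake up the threads waiting on this node.
            n.sense.store(local_sense, std::memory_order_release);
        } else {
            // Non-coordinators spin only on their node-local flag.
            while (n.sense.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();
        }
    }

private:
    int nodes_;
    int per_node_;
    std::vector<NodeState> node_;
    alignas(kCacheLine) std::atomic<int> global_arrived_{0};
    alignas(kCacheLine) std::atomic<bool> global_sense_{false};
};

// Hypothetical self-check: returns true if no thread ever passed a barrier
// before every thread had arrived at it.
bool run_rounds(int nodes, int per_node, int rounds) {
    HierBarrier barrier(nodes, per_node);
    const int total = nodes * per_node;
    std::atomic<int> counter{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> threads;
    for (int t = 0; t < total; ++t) {
        threads.emplace_back([&, t] {
            const int node = t / per_node;  // logical node id, not a real NUMA binding
            bool sense = false;
            for (int r = 0; r < rounds; ++r) {
                counter.fetch_add(1, std::memory_order_relaxed);
                barrier.wait(node, sense);
                if (counter.load(std::memory_order_relaxed) != total * (r + 1))
                    ok.store(false);
                barrier.wait(node, sense);  // keep the next round's increments out
            }
        });
    }
    for (auto& th : threads) th.join();
    return ok.load();
}
```

In this sketch only one thread per node touches the global counter and flag, so remote traffic grows with the number of nodes rather than the number of threads. A fixed coordinator thread or a tree-shaped second stage would be equally valid choices within the same three-stage split.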