The development of machine learning has revolutionized various applications such as object detection, image/video recognition, and semantic segmentation. Neural networks, a class of machine learning models, play a crucial role in this progress because of their remarkable improvement over traditional algorithms. However, neural networks are becoming deeper and require a significant amount of computation, so they usually perform poorly on edge devices with limited resources and low performance. In this paper, we investigate a solution to accelerate the neural network inference phase on FPGA-based platforms. We analyze neural network models, their mathematical operations, and the inference phase on various platforms, and we profile the characteristics that affect the performance of neural network inference. Based on this analysis, we propose an architecture that accelerates the convolution operation, which is used in most neural networks and accounts for the majority of their computation, by exploiting parallelism, data reuse, and memory management. We conduct various experiments to validate the FPGA-based convolution core architecture and to compare its performance. Experimental results show that the core is platform-independent. The core outperforms a quad-core ARM processor running at 1.2 GHz and a 6-core Intel CPU, with speed-ups of up to 15.69 and 2.78, respectively.