Convolution is effective at extracting per-frame representations but discards temporal order, and most previous Convolutional Neural Network (CNN)-based approaches to video action tracking handle spatial and temporal dynamics separately. To remedy this issue, we extend the convolution kernel with a time dimension and cast the problem as an encoder-decoder structure that generates the bounding boxes of action regions for each frame by leveraging pixel-wise annotations. Under the same evaluation measure, Cubic Tracker achieves mAP of 64.9% (frame IoU threshold = 0.5), 78.6% (video IoU threshold = 0.2), and 78.3% (video IoU threshold = 0.5), with gains of 3.6%, 0.2%, and 1.4% over the best competitors on J-HMDB. More remarkably, on two benchmarks without pixel-level supervision (UCF-Sports and UCF101-24), Cubic Tracker also approaches state-of-the-art performance by simply transferring the model learned on J-HMDB, with no further adaptive fine-tuning applied.
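
As a rough illustration of the idea of extending the convolution kernel with a time dimension inside an encoder-decoder, the sketch below builds a small 3D-convolutional encoder-decoder that maps a clip to a per-frame action-region map. The layer names, channel counts, and kernel sizes are illustrative assumptions, not the actual Cubic Tracker architecture.

```python
# Minimal sketch (illustrative, not the paper's architecture) of a
# 3D-convolutional encoder-decoder: the kernels span space (H, W) and
# time (T), and the decoder keeps the temporal length so that every
# frame receives a prediction map from which a bounding box could be
# derived (e.g. by thresholding and taking the tightest enclosing box).
import torch
import torch.nn as nn


class Cubic3DEncoderDecoder(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        # Encoder: 3D convolutions downsample space while preserving time.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: transposed 3D convolutions restore spatial resolution,
        # producing a one-channel map per frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 1, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        feats = self.encoder(clip)
        return self.decoder(feats)


if __name__ == "__main__":
    model = Cubic3DEncoderDecoder()
    clip = torch.randn(2, 3, 8, 112, 112)   # two 8-frame RGB clips
    maps = model(clip)                       # (2, 1, 8, 112, 112)
    print(maps.shape)
```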