Localizing and interpreting human actions in videos requires understanding the spatial and temporal context of a scene. Beyond accurate detection, many real-world sensing scenarios also demand incremental, instantaneous processing under restricted computational budgets. However, state-of-the-art detectors fail to meet these criteria. The main challenge lies in their heavy architectural designs and detection pipelines for reasoning about pertinent spatiotemporal information, such as incorporating 3D Convolutional Neural Networks (CNNs) or extracting optical flow. With this insight, we propose a lightweight action tubelet detector, coined TEDdet, which unifies complementary feature aggregation and motion modeling modules. Specifically, our Temporal Feature Exchange module induces feature interaction by adaptively aggregating 2D CNN features over successive frames. To address actors' location shifts across the sequence, our Temporal Feature Difference module accumulates approximated pairwise motion among target frames as trajectory cues. These modules can be easily integrated with an existing anchor-free detector to cooperatively model action instances' categories, sizes, and movement for precise tubelet generation. TEDdet exploits larger temporal strides to efficiently infer actions in a coarse-to-fine, online manner. Without relying on 3D CNNs or optical flow, our detector demonstrates competitive accuracy at an unprecedented speed (89 FPS), making it better suited to realistic applications. Code will be available at https://github.com/alphadadajuju/TEDdet.
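To make the two modules concrete, below is a minimal PyTorch sketch of how per-frame 2D CNN features could be aggregated (exchange) and how pairwise differences could approximate motion (difference). It assumes features of shape (B, T, C, H, W); the class names, the 1x1-convolution fusion, and the summation of consecutive differences are illustrative assumptions for exposition, not the paper's actual implementation (see the repository above for that).

```python
import torch
import torch.nn as nn


class TemporalFeatureExchange(nn.Module):
    """Sketch: adaptively fuse per-frame 2D CNN features across time."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        # A 1x1 convolution over the stacked frame features acts as a
        # learned, adaptive aggregation across the temporal dimension.
        self.fuse = nn.Conv2d(channels * num_frames, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -> concatenate frames along channels, then fuse.
        b, t, c, h, w = feats.shape
        return self.fuse(feats.reshape(b, t * c, h, w))


class TemporalFeatureDifference(nn.Module):
    """Sketch: accumulate pairwise feature differences as trajectory cues."""

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W); summing consecutive differences gives a
        # coarse approximation of the actor's displacement over the clip.
        diffs = feats[:, 1:] - feats[:, :-1]
        return diffs.sum(dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 3, 64, 32, 32)  # batch of 2 clips, T=3 frames
    exchange = TemporalFeatureExchange(channels=64, num_frames=3)
    difference = TemporalFeatureDifference()
    print(exchange(x).shape, difference(x).shape)  # both (2, 64, 32, 32)
```

Both outputs keep the spatial resolution of a single frame, so they can feed the classification, size, and movement heads of an anchor-free detector without architectural changes.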