“…To learn activities from video, earlier work emphasized tracking and explicit body-part models (e.g., [19,23,22]). In parallel, many methods to estimate body pose have been developed, including techniques using nonlinear manifolds to represent the complex space of joint configurations [12,32,3,16,28,29]; in contrast to our work, such methods assume silhouette (background-subtracted) inputs and/or derive models from mocap data, and are often intended for motion synthesis applications. More recently, researchers have considered how activity classes can be learned directly from lower-level spatiotemporal appearance and motion features, for example, based on bag-of-words models for video (e.g., [15,31]).…”