Observing, learning, and imitating human skills are intriguing topics in cognitive robotics. The main problem in the imitation learning paradigm is policy development. A policy can be defined as a mapping from an agent's current world state to actions; understanding and performing an observed human skill therefore depends heavily upon the learned policy. So far, naive policies that combine object and hand models with trajectory information have commonly been developed to encode and imitate various types of human manipulation. Such approaches, on the one hand, cannot be general enough, since the models are not learned by the agent itself but are provided by the designer in advance. On the other hand, imitation at the trajectory level is not sufficient for complicated manipulations, since even the same observed manipulation can exhibit high variation in its trajectories from demonstration to demonstration.

Humans, nevertheless, are capable of recognizing and imitating observed manipulations without any problem. In humans, the chain of perception, learning, and imitation of manipulations develops in conjunction with the interpretation of the manipulated objects. To compose a human-like perception-action chain, a cognitive agent needs a generic policy that can extract manipulation primitives as well as the essential (invariant) relations between objects and manipulation actions.

In this thesis, we introduce a novel concept, the so-called "Semantic Event Chain" (SEC), which derives the semantic essence and the invariant spatiotemporal relations of objects and actions in order to acquire such a perception-action chain. We show that SECs are a compact and generic encoding scheme for recognizing, learning, and executing human manipulations by relating them to the manipulated objects. SECs operate on image sequences that have been converted into uniquely trackable segments. The framework first represents each frame of the scene as an undirected, unweighted graph whose nodes and edges correspond to image segments and their spatial relations (e.g., touching or not touching), respectively. The graphs thus become a semantic representation, in the space-time domain, of the segments, i.e., the objects (including the hand) present in the scene. The framework then discretizes the entire graph sequence by extracting only the main graphs, i.e., those graphs whose relational structure differs from that of the preceding one; each main graph represents an essential primitive of the manipulation. All extracted main graphs form the core skeleton of the SEC, a sequence table in which the columns correspond to the main graphs and the rows to the changes in the spatial relation between each object pair in the scene (a minimal computational sketch of this construction is given at the end of this section). SECs consequently extract only the bare spatiotemporal pattern, the "essence of an action", which is invariant to the trajectory followed, the manipulation speed, and the relative object poses.

In the perception phase, SECs allow a cognitive agent not only to recognize and classify different observed manipulations but also to categorize the manipulated objects according to the roles they play in those manipulations.
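To make the SEC construction concrete, the following minimal Python sketch builds a SEC-like table from a sequence of frames, each described only by which segment pairs touch. It is an illustrative simplification under stated assumptions: segment labels such as "hand" and "cup" are hypothetical placeholders for the uniquely tracked image segments of the full framework, and the spatial relations are reduced to binary touching/not-touching values.

    from itertools import combinations

    # Each frame is described by the set of segment pairs that touch.
    # In the full framework, segment labels come from uniquely tracked
    # image segments; here they are hypothetical strings.

    def extract_main_graphs(frames):
        """Keep only frames whose touching-relation set differs from the
        previous frame's; these correspond to the main graphs."""
        main_graphs, previous = [], None
        for relations in frames:
            if relations != previous:
                main_graphs.append(relations)
                previous = relations
        return main_graphs

    def build_sec(frames):
        """Build the SEC table: one row per segment pair, one column per
        main graph; 1 encodes touching, 0 not-touching."""
        main_graphs = extract_main_graphs(frames)
        segments = sorted({s for graph in frames for pair in graph for s in pair})
        table = {}
        for a, b in combinations(segments, 2):
            key = frozenset((a, b))
            table[(a, b)] = [1 if key in graph else 0 for graph in main_graphs]
        return table

    # Toy sequence: a hand approaches a cup, grasps it, then withdraws.
    frames = [
        frozenset(),                              # nothing touches
        frozenset({frozenset({"hand", "cup"})}),  # hand touches cup
        frozenset({frozenset({"hand", "cup"})}),  # still touching
        frozenset(),                              # hand withdraws
    ]
    for pair, row in build_sec(frames).items():
        print(pair, row)   # ('cup', 'hand') [0, 1, 0]

Running the sketch on the toy grasp sequence prints a single row, [0, 1, 0]: the hand and cup start apart, touch during the grasp, and separate again. This row is exactly the kind of naked spatiotemporal pattern a SEC preserves, independent of the trajectory, speed, or object poses of the particular demonstration.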